Closed ludi317 closed 11 months ago
You can try using `go build -gcflags=all=-d=softfloat`, which should make all compiled code use softfloat. There might be some assembly code that uses floating point, which you might need to rewrite.
In triage, we think this needs to be a proposal. Since this isn't explicitly supported (and we don't have hardware for CI to test this configuration, or a test to make sure there aren't any FP instructions when setting the softfloat configuration) we'd have to make a decision to support it.
Change https://go.dev/cl/514907 mentions this issue: all: add GOARMFP env var for ARM floating point mode
In this comment, another user is forced to downgrade to `GOARM=5` on an ARMv7 chip just to get soft floating point (#58686 comment).
An ARMv7 chip should execute ARMv7 instructions. Anything less leaves the CPU underutilized, and is a waste of resources.
To support this proposal, I have submitted a CL that can build `GOARM=7` and `GOARMFP=soft`. Even if this proposal is not approved, I would greatly appreciate a review, or any feedback, on the CL. Thanks.
@ludi317 Have you tried the compiler flag `-gcflags=all=-d=softfloat`? If there is some assembly code that needs to be adjusted we could introduce a macro like `-D softfloat` that you can pass as `-asmflags`.
If we really want an environment variable for the go command, my counter proposal: use an existing variable, either `GOARM=7,softfloat` (see also #60072), or `GOEXPERIMENT=softfloat` (our softfloat implementation is largely architecture independent, except for a small amount of assembly code, so we may as well use an architecture-independent flag).
@cherrymui I did try building with the compiler flag `-gcflags=all=-d=softfloat` (and commenting out this check). Unfortunately, the binary crashed with signal SIGILL. The assembly code does indeed need to be modified in a few places, as seen in my CL.
I am not particular about the API used to specify soft float for ARM, as long as there is one. If I were to choose, I'd suggest that if `GOMIPS64` accepts a comma-separated list of options (as proposed in #60072), then it would make sense for `GOARM` to do the same. Your proposal to use `GOARM=7,softfloat` seems very reasonable.
Can you tell us what this chip is that is armv7 but without floating point? I am curious.
Since you have the change prototyped, what performance differences are you seeing between `GOARM=5` and `GOARM=7,softfloat`? In the compiler at least, the differences I see are mostly bit manipulation instructions (find first bit, etc.). There may be some more in the runtime (memmove?).
> Can you tell us what this chip is that is armv7 but without floating point? I am curious.

The Aspeed AST2500, for example, is a chip that supports the armv6k instruction set but does not have a floating point unit, so we need to fall back to `GOARM=5` for that one. Another is the Broadcom BCM4708A0 armv7 SoC, which lacks floating point hardware. In general, many of the cheaper WiFi/AP/network appliances and deeply embedded SoCs come without an FPU, as it is often not needed for the limited use case of the system.
> Can you tell us what this chip is that is armv7 but without floating point? I am curious.

The chip is a BCM56160, and is found in a network switch. `sysctl` shows that the CPU is an ARM Cortex-A9, without an FPU:
root@martini48t-p2a-sys04:RE:0% sysctl hw.model hw.floatingpoint
hw.model: ARM Cortex-A9 r4p1 (ECO: 0x00000000)
hw.floatingpoint: 0
> Since you have the change prototyped, what performance differences are you seeing between `GOARM=5` and `GOARM=7,softfloat`?

I never measured the performance of our Go program when `GOARM=5`. Since the network switch is already CPU-bound, I was concerned that downgrading would only hurt performance.
> In the compiler at least, the differences I see are mostly bit manipulation instructions (find first bit, etc.). There may be some more in the runtime (memmove?).

Yes, the runtime leverages ARMv7 features. One example is that when `GOARM=7`, the runtime opts for ARM-specific atomic operations (`armCas64`, `armXadd64`, `armXchg64`, `armLoad64`, `armStore64`). https://github.com/golang/go/blob/2d5ce9b729c0edded841301bd73d68d5e95aa28b/src/runtime/internal/atomic/atomic_arm.s#L249-L253
FWIW, the prototype has matured into a feature implementation that takes `GOARM=7,softfloat` as an argument. Using this new option, we have built binaries that work as expected on the switch. Please see the CL for the implementation.
Finally, I came across a comment from Russ Cox indicating that back in 2011, Go supported software floating point for GOARM > 5, by setting the `-F` flag.
> Finally, I came across a comment from Russ Cox indicating that back in 2011, Go supported software floating point for GOARM > 5, by setting the -F flag.

The softfloat support in Go has been reworked since then. We used to handle it in the linker (5l at the time), at instruction level, which means it would also handle (Go) assembly code (but not cgo). Now we handle it in the compiler, with `-gcflags=-d=softfloat`, which means it doesn't handle assembly code. So we need a way to handle that.
I'd really like to see some performance numbers of the difference between `GOARM=5` and `GOARM=7,softfloat`. If there is little or no difference the whole point of this proposal is kind of moot.
It doesn't have to be on these strange chips. Any `GOARM=7` capable chip could run some benchmarks in both modes and see. (You'd need to patch in the proposed CL for `7,softfloat` support.)
@randall77 Please find the requested benchmarks comparing `GOARM=5` and `GOARM=7,softfloat` below. Full source code here.
The benchmarks show many significant performance improvements, and only a few minor degradations. On the `AtomicOperationsInt64` benchmark, `GOARM=7,softfloat` is more than 3x faster than `GOARM=5`.
goarch: arm
pkg: github.com/ludi317/arm-wrestle
│ armv5_1cpu_raw.txt │ armv7soft_1cpu_raw.txt │
│ sec/op │ sec/op vs base │
Float32Arithmetic 4.944µ ± 1% 4.678µ ± 0% -5.37% (p=0.002 n=6)
Int32Arithmetic 15.67n ± 3% 15.65n ± 0% ~ (p=0.318 n=6)
Float64Arithmetic 3.905µ ± 0% 3.876µ ± 0% -0.74% (p=0.002 n=6)
Int64Arithmetic 29.06n ± 0% 29.07n ± 0% +0.03% (p=0.015 n=6)
ANDconstBICconst 52.53n ± 0% 52.55n ± 0% +0.03% (p=0.035 n=6)
Uint64Move 22.35n ± 0% 22.36n ± 0% ~ (p=1.000 n=6)
ADD 1.049µ ± 0% 1.009µ ± 0% -3.81% (p=0.002 n=6)
ADDBICconst 20.12n ± 0% 19.00n ± 0% -5.57% (p=0.002 n=6)
ADDBICconstInt64 29.07n ± 0% 27.94n ± 0% -3.87% (p=0.002 n=6)
WithMulDAndMulF 1029.0n ± 0% 986.2n ± 0% -4.16% (p=0.002 n=6)
BitwiseInt32 8.942n ± 0% 8.942n ± 0% ~ (p=0.773 n=6)
BitwiseInt64 13.42n ± 0% 13.42n ± 0% ~ (p=1.000 n=6)
TrailingZeros 43.59n ± 0% 30.18n ± 0% -30.76% (p=0.002 n=6)
ProducerConsumerBufferedCh 3.894µ ± 0% 3.603µ ± 0% -7.46% (p=0.002 n=6)
ProducerConsumerBufferedChInt64 3.961µ ± 1% 3.631µ ± 0% -8.33% (p=0.002 n=6)
ProducerConsumerUnBufferedCh 5.099µ ± 0% 4.701µ ± 0% -7.81% (p=0.002 n=6)
ProducerConsumerUnBufferedChInt64 5.073µ ± 0% 4.634µ ± 0% -8.65% (p=0.002 n=6)
GetCntxct 3.851µ ± 0% 3.578µ ± 0% -7.10% (p=0.002 n=6)
CASInt32 158.9n ± 0% 160.9n ± 0% +1.26% (p=0.002 n=6)
CASInt64 502.1n ± 0% 157.3n ± 3% -68.66% (p=0.002 n=6)
CASUint64 502.1n ± 0% 157.5n ± 0% -68.64% (p=0.002 n=6)
CASUint32 158.9n ± 0% 166.7n ± 0% +4.91% (p=0.002 n=6)
CASUintptr 158.9n ± 0% 167.8n ± 3% +5.60% (p=0.002 n=6)
AtomicOperationsInt64 931.1n ± 0% 268.6n ± 0% -71.15% (p=0.002 n=6)
AtomicOperationsInt32 306.6n ± 0% 297.6n ± 0% -2.92% (p=0.002 n=6)
AtomicOperationsUint64 928.8n ± 0% 270.8n ± 0% -70.84% (p=0.002 n=6)
AtomicOperationsUint32 306.6n ± 0% 297.6n ± 0% -2.92% (p=0.002 n=6)
AtomicOperationsUintptr 308.8n ± 0% 304.4n ± 0% -1.42% (p=0.002 n=6)
AtomicOperationsBool 537.1n ± 0% 494.5n ± 0% -7.93% (p=0.002 n=6)
geomean 300.4n 245.5n -18.28%
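For readers who want to reproduce this kind of comparison, here is a minimal sketch of what a 64-bit compare-and-swap benchmark might look like. It is not taken from the arm-wrestle repository; the names `casIncrement`, `countTo`, and `BenchmarkCASInt64` are hypothetical and chosen only for illustration.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"testing"
)

// casIncrement bumps *v by one using a compare-and-swap retry loop.
// On 32-bit ARM this exercises the 64-bit atomic path, which is where
// the two builds differ most in the results above.
func casIncrement(v *atomic.Int64) {
	for {
		old := v.Load()
		if v.CompareAndSwap(old, old+1) {
			return
		}
	}
}

// countTo performs n CAS increments and returns the final counter value.
func countTo(n int) int64 {
	var v atomic.Int64
	for i := 0; i < n; i++ {
		casIncrement(&v)
	}
	return v.Load()
}

// BenchmarkCASInt64 is a hypothetical stand-in for a CASInt64-style benchmark.
func BenchmarkCASInt64(b *testing.B) {
	var v atomic.Int64
	for i := 0; i < b.N; i++ {
		casIncrement(&v)
	}
}

func main() {
	fmt.Println(countTo(1000))
	// testing.Benchmark lets the benchmark run from an ordinary binary,
	// which is convenient when cross-compiling for a small target.
	fmt.Println(testing.Benchmark(BenchmarkCASInt64))
}
```

Building the same program once per GOARM setting and feeding the two result files to benchstat would give a table in the format shown above.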
So it looks like `math/bits` and 64-bit atomics are the regressions.
The `math/bits` one is pretty minor, `GOARM=5` is missing the `RBIT` instruction so getting trailing bits takes 2 more instructions. I think `ReverseBytes` is similar. (`Reverse32` should be a lot faster on `GOARM=7`, but no one has optimized that function to use `RBIT`.)
The 64-bit atomic costs are more substantial. The arm atomics already do a runtime check, but they just use the `GOARM` value the binary was built with. If we can detect the presence of the atomic instructions we need (`LDREXD`/`STREXD`, maybe also `DMB`?) at runtime, then we can base the runtime check on the actual hardware we're running on.
`LDREXD`/`STREXD` can be detected using the `lpae` feature bit. (Particularly, detecting that they will be 64-bit atomic.)
It looks like we also need to make sure the `DMB` instruction is available. It is only available starting in v7, so we need a way to detect that the chip is v7. Anyone know how to get that from feature bits? Currently we check `vfp` and `vfpv3`, but of course that's too strict if we're trying to run on fp-less chips.
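As a user-space illustration of checking these feature bits, the sketch below parses the `Features` line of Linux's `/proc/cpuinfo` and reports whether `lpae`, `vfp`, and `vfpv3` are present. This is only a way to inspect a target by hand, not how the runtime performs its check; the helper name `cpuFeatures` is hypothetical.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// cpuFeatures extracts the space-separated flags from the first
// "Features" line of /proc/cpuinfo-style text.
func cpuFeatures(cpuinfo string) map[string]bool {
	feats := make(map[string]bool)
	sc := bufio.NewScanner(strings.NewReader(cpuinfo))
	for sc.Scan() {
		key, val, ok := strings.Cut(sc.Text(), ":")
		if !ok || strings.TrimSpace(key) != "Features" {
			continue
		}
		for _, f := range strings.Fields(val) {
			feats[f] = true
		}
		break // every core normally reports the same flags
	}
	return feats
}

func main() {
	data, err := os.ReadFile("/proc/cpuinfo")
	if err != nil {
		fmt.Println("cannot read /proc/cpuinfo:", err)
		return
	}
	f := cpuFeatures(string(data))
	fmt.Println("lpae:", f["lpae"], "vfp:", f["vfp"], "vfpv3:", f["vfpv3"])
}
```

On an FPU-less ARMv7 part one would expect `vfp` and `vfpv3` to be absent, which is exactly why a check based on those two bits is too strict here.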
@randall77 I thought the performance deltas in the channel-backed `ProducerConsumer` benchmarks (-8%) were also interesting, even though they were not as large as those of the `math/bits` and 64-bit atomic benchmarks.
Based on that finding, I wrote more benchmarks to compare the performance of synchronization primitives between the two builds. Please find the results below. The `Mutex` benchmarks that acquire a mutex lock, do some work, then release the lock are ~2x faster on `GOARM=7,softfloat`.
goos: linux
goarch: arm
pkg: github.com/ludi317/arm-wrestle
│ armv5_1cpu_raw.txt │ armv7soft_1cpu_raw.txt │
│ sec/op │ sec/op vs base │
...
Mutex 44.94µ ± 0% 22.60µ ± 0% -49.70% (p=0.002 n=6)
RWMutex_Read 45.00µ ± 0% 22.65µ ± 0% -49.67% (p=0.002 n=6)
RWMutex_Write 45.22µ ± 0% 22.87µ ± 0% -49.42% (p=0.002 n=6)
WaitGroup 90.63m ± 4% 77.13m ± 4% -14.89% (p=0.002 n=6)
Channel 8.781m ± 0% 8.383m ± 0% -4.54% (p=0.002 n=6)
AtomicAdd 259.40n ± 0% 73.86n ± 0% -71.53% (p=0.002 n=6)
Once 67.11n ± 0% 64.87n ± 0% -3.33% (p=0.002 n=6)
Cond 11.126µ ± 0% 9.781µ ± 0% -12.09% (p=0.002 n=6)
Pool 774.5n ± 1% 723.2n ± 1% -6.62% (p=0.002 n=6)
I suspect that the channel differences are all due to the synchronization primitives that channels use, for which we know there is already a sizable performance difference.
Change https://go.dev/cl/525637 mentions this issue: runtime: on arm32, detect whether we have sync instructions
Is there any additional information needed to move this proposal to "Active" status? To summarize,
- [...] `GOARM=5` to run Go binaries.
- [...] `GOARM=7,softfloat`. The format is consistent with the API defined in the accepted GOMIPS64 proposal, e.g. `GOMIPS64=iii,softfloat`.
- [...] `GOARM=7,softfloat` as compared to `GOARM=5`.
- [...] `GOARM=5` to leverage some ARMv7 features.

It sounds like https://go.dev/cl/525637 is the right thing to try first, since it is not a visible API change and does not require a proposal at all. @ludi317 can you please rerun your GOARM=5 benchmarks with Keith's CL patched in?
@rsc Please find the requested benchmarks comparing `GOARM=5` with Keith's CL applied and `GOARM=7,softfloat`.
`math/bits` benchmarks are up to 1.4x faster in `GOARM=7,softfloat`.
goos: linux
goarch: arm
pkg: github.com/ludi317/arm-wrestle
cpu: ARMv7 Processor rev 5 (v7l)
│ raw/round2/armv5keith_1cpu_raw.txt │ raw/round2/armv7soft_1cpu_raw.txt │
│ sec/op │ sec/op vs base │
Float32Arithmetic 4.761µ ± 0% 4.715µ ± 0% -0.98% (p=0.002 n=6)
Int32Arithmetic 15.70n ± 0% 15.74n ± 0% +0.25% (p=0.002 n=6)
Float64Arithmetic 3.923µ ± 0% 3.898µ ± 0% -0.64% (p=0.002 n=6)
Int64Arithmetic 29.15n ± 0% 29.15n ± 0% ~ (p=0.396 n=6)
ANDconstBICconst 52.69n ± 0% 52.65n ± 0% ~ (p=0.266 n=6)
Uint64Move 22.42n ± 0% 22.42n ± 0% ~ (p=1.000 n=6)
ADD 1.060µ ± 0% 1.007µ ± 0% -5.09% (p=0.002 n=6)
ADDBICconst 20.18n ± 0% 18.95n ± 0% -6.12% (p=0.002 n=6)
ADDBICconstInt64 29.15n ± 0% 27.88n ± 0% -4.36% (p=0.002 n=6)
WithMulDAndMulF 1040.5n ± 0% 984.7n ± 0% -5.36% (p=0.002 n=6)
BitwiseInt32 8.968n ± 0% 8.919n ± 0% -0.55% (p=0.002 n=6)
BitwiseInt64 13.46n ± 0% 13.38n ± 0% -0.56% (p=0.002 n=6)
TrailingZeros 43.72n ± 0% 30.10n ± 0% -31.15% (p=0.002 n=6)
LeadingZeros 42.60n ± 0% 39.14n ± 0% -8.13% (p=0.002 n=6)
RotateLeft 114.4n ± 0% 111.0n ± 0% -2.97% (p=0.002 n=6)
OnesCount 150.2n ± 0% 144.7n ± 0% -3.73% (p=0.002 n=6)
ProducerConsumerBufferedCh 3.602µ ± 0% 3.480µ ± 0% -3.39% (p=0.002 n=6)
ProducerConsumerBufferedChInt64 3.687µ ± 0% 3.529µ ± 0% -4.29% (p=0.002 n=6)
ProducerConsumerUnBufferedCh 4.683µ ± 0% 4.492µ ± 0% -4.09% (p=0.002 n=6)
ProducerConsumerUnBufferedChInt64 4.653µ ± 0% 4.490µ ± 0% -3.50% (p=0.002 n=6)
GetCntxct 3.580µ ± 1% 3.496µ ± 1% -2.36% (p=0.002 n=6)
CASInt32 158.4n ± 0% 160.5n ± 0% +1.33% (p=0.002 n=6)
CASInt64 152.6n ± 2% 154.8n ± 0% +1.44% (p=0.015 n=6)
CASUint64 152.6n ± 1% 151.8n ± 1% ~ (p=0.117 n=6)
CASUint32 159.3n ± 0% 160.3n ± 0% +0.63% (p=0.002 n=6)
CASUintptr 159.3n ± 0% 164.8n ± 0% +3.45% (p=0.002 n=6)
AtomicOperationsInt64 269.4n ± 0% 270.0n ± 0% +0.24% (p=0.002 n=6)
AtomicOperationsInt32 307.5n ± 0% 296.8n ± 0% -3.50% (p=0.002 n=6)
AtomicOperationsUint64 269.3n ± 0% 267.7n ± 0% -0.59% (p=0.002 n=6)
AtomicOperationsUint32 307.5n ± 0% 296.9n ± 0% -3.46% (p=0.002 n=6)
AtomicOperationsUintptr 309.7n ± 0% 301.2n ± 1% -2.74% (p=0.002 n=6)
AtomicOperationsBool 539.7n ± 0% 498.2n ± 0% -7.68% (p=0.002 n=6)
Mutex 45.09µ ± 0% 22.68µ ± 0% -49.69% (p=0.002 n=6)
RWMutex_Read 45.13µ ± 0% 22.71µ ± 1% -49.69% (p=0.002 n=6)
RWMutex_Write 45.37µ ± 0% 22.90µ ± 0% -49.52% (p=0.002 n=6)
WaitGroup 77.64m ± 6% 77.01m ± 5% ~ (p=0.394 n=6)
Channel 8.816m ± 0% 8.344m ± 1% -5.35% (p=0.002 n=6)
AtomicAdd 71.80n ± 0% 73.67n ± 0% +2.61% (p=0.002 n=6)
Once 67.31n ± 0% 64.98n ± 0% -3.45% (p=0.002 n=6)
Cond 10.316µ ± 0% 9.796µ ± 0% -5.04% (p=0.002 n=6)
Pool 787.1n ± 1% 741.6n ± 2% -5.79% (p=0.002 n=6)
MutexContended 233.6n ± 1% 235.1n ± 0% +0.64% (p=0.002 n=6)
RWMutexContendedRead 276.3n ± 0% 278.6n ± 0% +0.80% (p=0.002 n=6)
RWMutexContendedWrite 499.2n ± 0% 521.5n ± 0% +4.48% (p=0.002 n=6)
Semaphore 970.6n ± 0% 967.6n ± 0% -0.30% (p=0.002 n=6)
Mutex2 213.1n ± 0% 210.9n ± 0% -1.03% (p=0.002 n=6)
RWMutex 253.5n ± 0% 242.2n ± 0% -4.46% (p=0.002 n=6)
Channel2 970.7n ± 0% 968.0n ± 0% -0.28% (p=0.002 n=6)
MapRWMutex/Write 807.4n ± 0% 790.3n ± 0% -2.12% (p=0.002 n=6)
MapRWMutex/Read 335.1n ± 0% 334.7n ± 0% -0.10% (p=0.002 n=6)
MapMutex 503.9n ± 0% 520.6n ± 0% +3.32% (p=0.002 n=6)
geomean 586.8n 550.0n -6.27%
This proposal has been added to the active column of the proposals project and will now be reviewed at the weekly proposal review meetings. — rsc for the proposal review group
I came across this comment from @minux about GOARM:

> We made a mistake when defining GOARM: VFP status and ARM architecture version really are two separate property. For example, ARMv5E chips could also have FPU (at least in theory) and there are ARMv6 chips without VFP.

I drew a table to help clarify my understanding of the relationship between a target's ARM architecture version and FPU status, and the right GOARM value to use to generate its binary. (Could be wrong; please correct any errors.) It shows what kind of instructions each GOARM value emits. The ARM architecture version {5, 6, 7} and FPU status {No FPU, VFPv1, VFPv3} are separate properties, on separate axes, as the comment says.
| | No FPU | VFPv1 | VFPv3 |
|---|---|---|---|
| ARMv5 | GOARM=5 | ← | — |
| ARMv6 | ↑ | GOARM=6 | — |
| ARMv7 | ↑ | ↑ | GOARM=7 |
Many ARM targets are located on the main diagonal. Off-diagonal targets fall back to a GOARM that underutilizes their hardware. Arrows point in the direction of the fallback. For example, an ARMv7 device with no FPU drops down to a binary with ARMv5 instructions. (Dashes represent invalid combinations of architecture versions and FPUs; VFPv3 is not implemented on ARM architectures v5 and v6.)
This proposal aims to avoid the 2 fallback cases in the No FPU column, allowing them to leverage the features of their respective architecture versions.

| | No FPU | VFPv1 | VFPv3 |
|---|---|---|---|
| ARMv5 | GOARM=5 | ← | — |
| ARMv6 | GOARM=6,softfloat | GOARM=6 | — |
| ARMv7 | GOARM=7,softfloat | ↑ | GOARM=7 |
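The second table can be read as a lookup from (architecture version, FPU) to a GOARM value. The sketch below encodes it that way, with the arrows resolved to the cell they point at. The `target` type, the FPU labels, and `goarmFor` are hypothetical names used only to illustrate the table.

```go
package main

import "fmt"

// target describes a chip by ARM architecture version and FPU kind.
type target struct {
	arch int    // 5, 6, or 7
	fpu  string // "none", "vfpv1", or "vfpv3"
}

// proposed mirrors the second table. Arrow cells are resolved to the
// GOARM value they fall back to; dash cells are simply absent.
var proposed = map[target]string{
	{5, "none"}:  "5",
	{5, "vfpv1"}: "5", // falls back to software floating point
	{6, "none"}:  "6,softfloat",
	{6, "vfpv1"}: "6",
	{7, "none"}:  "7,softfloat",
	{7, "vfpv1"}: "6", // falls back to the ARMv6 instruction set
	{7, "vfpv3"}: "7",
}

// goarmFor returns the GOARM value for a target, or an error for
// combinations the table marks as invalid.
func goarmFor(arch int, fpu string) (string, error) {
	v, ok := proposed[target{arch, fpu}]
	if !ok {
		return "", fmt.Errorf("no GOARM value for ARMv%d with FPU %q", arch, fpu)
	}
	return v, nil
}

func main() {
	v, err := goarmFor(7, "none")
	fmt.Println(v, err)
}
```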
I wanted to frame this problem in the larger context of all fallbacks, to help guide the selection of a new set of GOARM options. For example, one advantage of the proposed softfloat / hardfloat naming scheme is that it is expressive enough to select `GOARM=5,hardfloat` and redress another fallback case. This is not to say `GOARM=5,hardfloat` ought to be implemented, only that the options generalize well enough to permit the possibility.
Thanks for the numbers showing that 7,softfloat is still better than 5 with checks.
Have all remaining concerns about this proposal been addressed?
GOARM changes to have the form `[567](,attrs)?`.
That is, there is now an optional attribute list.
The only two defined attributes are softfloat and hardfloat, specifying software and hardware floating point (same names as for GOMIPS).
It is an error to specify both softfloat and hardfloat.
The leading number cannot be omitted.
softfloat is the default for GOARM=5 and hardfloat is the default for GOARM=6 and GOARM=7.
When compiled with GOARM=7,softfloat, code will assume ARMv7 non-FP instructions like atomics but will use software floating point.
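The rules above are compact enough to express as a small parser. The following is a sketch under the stated rules only, not the go command's actual implementation; the `goarm` type and `parseGOARM` function are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// goarm is the parsed form of a GOARM value.
type goarm struct {
	Version   int  // 5, 6, or 7
	SoftFloat bool // true for software floating point
}

// parseGOARM parses values of the form [567](,attrs)?.
func parseGOARM(v string) (goarm, error) {
	num, attrs, hasAttrs := strings.Cut(v, ",")
	var g goarm
	switch num {
	case "5":
		g = goarm{Version: 5, SoftFloat: true} // softfloat is the default for 5
	case "6":
		g = goarm{Version: 6} // hardfloat is the default for 6
	case "7":
		g = goarm{Version: 7} // hardfloat is the default for 7
	default:
		return goarm{}, fmt.Errorf("invalid GOARM %q: must start with 5, 6, or 7", v)
	}
	if !hasAttrs {
		return g, nil
	}
	seen := ""
	for _, a := range strings.Split(attrs, ",") {
		if a != "softfloat" && a != "hardfloat" {
			return goarm{}, fmt.Errorf("invalid GOARM %q: unknown attribute %q", v, a)
		}
		if seen != "" && seen != a {
			return goarm{}, fmt.Errorf("invalid GOARM %q: softfloat and hardfloat are mutually exclusive", v)
		}
		seen = a
	}
	g.SoftFloat = seen == "softfloat"
	return g, nil
}

func main() {
	for _, v := range []string{"5", "7", "7,softfloat", "5,hardfloat", "softfloat", "7,softfloat,hardfloat"} {
		g, err := parseGOARM(v)
		fmt.Printf("%-24q %+v %v\n", v, g, err)
	}
}
```

Rejecting a bare attribute such as `softfloat` reflects the rule that the leading number cannot be omitted.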
Looks good. I'm looking forward to creating some real-world benchmarks, as this feature might greatly boost performance by finally making it possible to use the v6 and v7 ISA on non-FP chips :tada: I'm also in favor of the new optional attribute, as it allows AOT compilation with optimized asm instead of autodetection via CPU feature bits, which aren't always reliable. And it keeps code size down.
Based on the discussion above, this proposal seems like a likely accept. — rsc for the proposal review group
GOARM changes to have the form `[567](,attrs)?`.
That is, there is now an optional attribute list.
The only two defined attributes are softfloat and hardfloat, specifying software and hardware floating point (same names as for GOMIPS).
It is an error to specify both softfloat and hardfloat.
The leading number cannot be omitted.
softfloat is the default for GOARM=5 and hardfloat is the default for GOARM=6 and GOARM=7.
When compiled with GOARM=7,softfloat, code will assume ARMv7 non-FP instructions like atomics but will use software floating point.
I assume `GOARM=5,hardfloat` will be an unsupported configuration?
> I assume `GOARM=5,hardfloat` will be an unsupported configuration?

To quote Ludi from earlier:

> For example, one advantage of the proposed softfloat / hardfloat naming scheme is that it is expressive enough to select GOARM=5,hardfloat and redress another fallback case. This is not to say GOARM=5,hardfloat ought to be implemented, only that the options generalize well enough to permit the possibility.

So I'd say `GOARM=5,hardfloat` should generally be supported, as VFP is technically supported on ARMv5, but I have never come across a chip that actually implements this combination (while the other way around, having a higher ISA but no VFP, is more common than one might think). And to streamline the flags and quote Russ:

> GOARM changes to have the form `[567](,attrs)?`. [...] softfloat is the default for GOARM=5 and hardfloat is the default for GOARM=6 and GOARM=7.

So there should not be a difference in the attrs supported for each number, IMO.
I think it's fine to support 5,hardfloat and easier to support it than to reject it. Maybe people on chips with broken atomics will want it.
I updated my CL to support `GOARM=5,hardfloat`. I marked the parts of code that require the eye of a Go compiler team member as `todo`. I assume it's too late for this change to make it into the upcoming 1.22 release?
It is not too late yet. The freeze is Nov 21.
No change in consensus, so accepted. 🎉 This issue now tracks the work of implementing the proposal. — rsc for the proposal review group
GOARM changes to have the form `[567](,attrs)?`.
That is, there is now an optional attribute list.
The only two defined attributes are softfloat and hardfloat, specifying software and hardware floating point (same names as for GOMIPS).
It is an error to specify both softfloat and hardfloat.
The leading number cannot be omitted.
softfloat is the default for GOARM=5 and hardfloat is the default for GOARM=6 and GOARM=7.
When compiled with GOARM=7,softfloat, code will assume ARMv7 non-FP instructions like atomics but will use software floating point.
I think this is done. Thank you!
I want to run a Go binary on an ARMv7 target that doesn't have a hardware floating point unit (FPU). (The ARMv7 specification does not require a hardware FPU; it is optional.) Currently, the only way to use software floating point on ARM targets is to set `GOARM=5`, regardless of the actual ARM version of the target, whether 5, 6, or 7. If the decision of using software or hardware floating point were decoupled from the ARM version, then there would be no need to fall back to the ARMv5 instruction set on ARMv7 chips lacking a hardware FPU.
I request a new go environment variable (perhaps `GOARMFP=soft` or `hard`) that could be used alongside `GOARCH=arm` and either `GOARM=6` or `GOARM=7` to specify software ("soft") or hardware ("hard") floating point. `GOARM=5` would always imply software floating point.
Because this addresses an immediate business need, I have developed a working prototype for `GOARM=7` with software floating point, and could make contributions toward this new setting.