golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License

runtime: software floating point for GOARM=6, 7 (not only GOARM=5) #61588

Closed ludi317 closed 11 months ago

ludi317 commented 1 year ago

I want to run a Go binary on an ARMv7 target that doesn't have a hardware floating point unit (FPU). (The ARMv7 specification does not require a hardware FPU; it is optional.) Currently, the only way to use software floating point on ARM targets is to set GOARM=5, regardless of the actual ARM version of the target, whether 5, 6, or 7. If the choice between software and hardware floating point were decoupled from the ARM version, there would be no need to fall back to the ARMv5 instruction set on ARMv7 chips lacking a hardware FPU.

I request a new Go environment variable (perhaps GOARMFP=soft or hard) that could be used alongside GOARCH=arm and either GOARM=6 or GOARM=7 to specify software ("soft") or hardware ("hard") floating point. GOARM=5 would always imply software floating point.

Because this addresses an immediate business need, I have developed a working prototype for GOARM=7 with software floating point, and could make contributions toward this new setting.

cherrymui commented 1 year ago

You can try using go build -gcflags=all=-d=softfloat, which should make all compiled code use softfloat. There might be some assembly code that uses floating point, which you might need to rewrite.

mknyszek commented 1 year ago

In triage, we think this needs to be a proposal. Since this isn't explicitly supported (and we don't have hardware for CI to test this configuration, or a test to make sure there aren't any FP instructions when setting the softfloat configuration) we'd have to make a decision to support it.

gopherbot commented 1 year ago

Change https://go.dev/cl/514907 mentions this issue: all: add GOARMFP env var for ARM floating point mode

ludi317 commented 1 year ago

In this comment, another user is forced to downgrade to GOARM=5 on an ARMv7 chip just to get soft floating point (#58686 comment).

An ARMv7 chip should execute ARMv7 instructions. Anything less leaves the CPU underutilized, and is a waste of resources.

To support this proposal, I have submitted a CL that can build GOARM=7 and GOARMFP=soft. Even if this proposal is not approved, I would greatly appreciate a review, or any feedback, on the CL. Thanks.

cherrymui commented 1 year ago

@ludi317 Have you tried the compiler flag -gcflags=all=-d=softfloat? If there is some assembly code that needs to be adjusted we could introduce a macro like -D softfloat that you can pass as -asmflags.

If we really want an environment variable for the go command, my counter proposal: use an existing variable, either GOARM=7,softfloat (see also #60072), or GOEXPERIMENT=softfloat (our softfloat implementation is largely architecture independent (except a small amount of assembly code), so may as well use an architecture independent flag).

ludi317 commented 1 year ago

@cherrymui I did try building with the compiler flag -gcflags=all=-d=softfloat (and commenting out this check). Unfortunately, the binary crashed with signal SIGILL. The assembly code does indeed need to be modified in a few places, as seen in my CL.

I am not particular about the API used to specify soft float for ARM, as long as there is one. If I were to choose, I'd suggest that if GOMIPS64 accepts a comma-separated list of options (as proposed in #60072), then it would make sense for GOARM to do the same. Your proposal to use GOARM=7,softfloat seems very reasonable.

randall77 commented 1 year ago

Can you tell us what this chip is that is armv7 but without floating point? I am curious.

Since you have the change prototyped, what performance differences are you seeing between GOARM=5 and GOARM=7,softfloat? In the compiler at least, the differences I see are mostly bit manipulation instructions (find first bit, etc.). There may be some more in the runtime (memmove?).

MDr164 commented 1 year ago

> Can you tell us what this chip is that is armv7 but without floating point? I am curious.

The Aspeed AST2500, for example, is a chip that supports the armv6k instruction set but does not have a floating point unit, so we need to fall back to GOARM=5 for that one. Another is the Broadcom BCM4708A0 armv7 SoC, which lacks floating point hardware. In general, many of the cheaper WiFi/AP/network appliances and deeply embedded SoCs come without an FPU, as it is often not needed for the system's limited use case.

ludi317 commented 1 year ago

> Can you tell us what this chip is that is armv7 but without floating point? I am curious.

The chip is a BCM56160, and is found in a network switch. sysctl shows that the CPU is an ARM Cortex-A9, without an FPU:

root@martini48t-p2a-sys04:RE:0% sysctl hw.model hw.floatingpoint
hw.model: ARM Cortex-A9 r4p1 (ECO: 0x00000000)
hw.floatingpoint: 0

> Since you have the change prototyped, what performance differences are you seeing between GOARM=5 and GOARM=7,softfloat?

I never measured the performance of our Go program when GOARM=5. Since the network switch is already CPU-bound, I was concerned that downgrading would only hurt performance.

> In the compiler at least, the differences I see are mostly bit manipulation instructions (find first bit, etc.). There may be some more in the runtime (memmove?).

Yes, the runtime leverages ARMv7 features. One example is that when GOARM=7, the runtime opts for ARM-specific atomic operations (armCas64, armXadd64, armXchg64, armLoad64, armStore64). https://github.com/golang/go/blob/2d5ce9b729c0edded841301bd73d68d5e95aa28b/src/runtime/internal/atomic/atomic_arm.s#L249-L253

FWIW, the prototype has matured into a feature implementation that takes GOARM=7,softfloat as an argument. Using this new option, we have built binaries that work as expected on the switch. Please see the CL for the implementation.

Finally, I came across a comment from Russ Cox indicating that back in 2011, Go supported software floating point for GOARM > 5, by setting the -F flag.

cherrymui commented 1 year ago

> Finally, I came across a comment from Russ Cox indicating that back in 2011, Go supported software floating point for GOARM > 5, by setting the -F flag.

The softfloat support in Go has been reworked since then. We used to handle it in the linker (5l at the time), at instruction level, which means it would also handle (Go) assembly code (but not cgo). Now we handle it in the compiler, with -gcflags=-d=softfloat, which means it doesn't handle assembly code. So we need a way for that.

randall77 commented 1 year ago

I'd really like to see some performance numbers of the difference between GOARM=5 and GOARM=7,softfloat. If there is little or no difference the whole point of this proposal is kind of moot. It doesn't have to be on these strange chips. Any GOARM=7 capable chip could run some benchmarks in both modes and see. (You'd need to patch in the proposed CL for 7,softfloat support.)

ludi317 commented 1 year ago

@randall77 Please find the requested benchmarks comparing GOARM=5 and GOARM=7,softfloat below. Full source code here.

The benchmarks show many significant performance improvements, and only a few minor degradations. On the AtomicOperationsInt64 benchmark, GOARM=7,softfloat is more than 3x faster than GOARM=5.

goarch: arm
pkg: github.com/ludi317/arm-wrestle
                                  │ armv5_1cpu_raw.txt │       armv7soft_1cpu_raw.txt       │
                                  │       sec/op       │   sec/op     vs base               │
Float32Arithmetic                          4.944µ ± 1%   4.678µ ± 0%   -5.37% (p=0.002 n=6)
Int32Arithmetic                            15.67n ± 3%   15.65n ± 0%        ~ (p=0.318 n=6)
Float64Arithmetic                          3.905µ ± 0%   3.876µ ± 0%   -0.74% (p=0.002 n=6)
Int64Arithmetic                            29.06n ± 0%   29.07n ± 0%   +0.03% (p=0.015 n=6)
ANDconstBICconst                           52.53n ± 0%   52.55n ± 0%   +0.03% (p=0.035 n=6)
Uint64Move                                 22.35n ± 0%   22.36n ± 0%        ~ (p=1.000 n=6)
ADD                                        1.049µ ± 0%   1.009µ ± 0%   -3.81% (p=0.002 n=6)
ADDBICconst                                20.12n ± 0%   19.00n ± 0%   -5.57% (p=0.002 n=6)
ADDBICconstInt64                           29.07n ± 0%   27.94n ± 0%   -3.87% (p=0.002 n=6)
WithMulDAndMulF                           1029.0n ± 0%   986.2n ± 0%   -4.16% (p=0.002 n=6)
BitwiseInt32                               8.942n ± 0%   8.942n ± 0%        ~ (p=0.773 n=6)
BitwiseInt64                               13.42n ± 0%   13.42n ± 0%        ~ (p=1.000 n=6)
TrailingZeros                              43.59n ± 0%   30.18n ± 0%  -30.76% (p=0.002 n=6)
ProducerConsumerBufferedCh                 3.894µ ± 0%   3.603µ ± 0%   -7.46% (p=0.002 n=6)
ProducerConsumerBufferedChInt64            3.961µ ± 1%   3.631µ ± 0%   -8.33% (p=0.002 n=6)
ProducerConsumerUnBufferedCh               5.099µ ± 0%   4.701µ ± 0%   -7.81% (p=0.002 n=6)
ProducerConsumerUnBufferedChInt64          5.073µ ± 0%   4.634µ ± 0%   -8.65% (p=0.002 n=6)
GetCntxct                                  3.851µ ± 0%   3.578µ ± 0%   -7.10% (p=0.002 n=6)
CASInt32                                   158.9n ± 0%   160.9n ± 0%   +1.26% (p=0.002 n=6)
CASInt64                                   502.1n ± 0%   157.3n ± 3%  -68.66% (p=0.002 n=6)
CASUint64                                  502.1n ± 0%   157.5n ± 0%  -68.64% (p=0.002 n=6)
CASUint32                                  158.9n ± 0%   166.7n ± 0%   +4.91% (p=0.002 n=6)
CASUintptr                                 158.9n ± 0%   167.8n ± 3%   +5.60% (p=0.002 n=6)
AtomicOperationsInt64                      931.1n ± 0%   268.6n ± 0%  -71.15% (p=0.002 n=6)
AtomicOperationsInt32                      306.6n ± 0%   297.6n ± 0%   -2.92% (p=0.002 n=6)
AtomicOperationsUint64                     928.8n ± 0%   270.8n ± 0%  -70.84% (p=0.002 n=6)
AtomicOperationsUint32                     306.6n ± 0%   297.6n ± 0%   -2.92% (p=0.002 n=6)
AtomicOperationsUintptr                    308.8n ± 0%   304.4n ± 0%   -1.42% (p=0.002 n=6)
AtomicOperationsBool                       537.1n ± 0%   494.5n ± 0%   -7.93% (p=0.002 n=6)
geomean                                    300.4n        245.5n       -18.28%

randall77 commented 1 year ago

So it looks like math/bits and 64-bit atomics are the regressions.

The math/bits one is pretty minor, GOARM=5 is missing the RBIT instruction so getting trailing bits takes 2 more instructions. I think ReverseBytes is similar. (Reverse32 should be a lot faster on GOARM=7, but no one has optimized that function to use RBIT.)

The 64-bit atomic costs are more substantial. The arm atomics already do a runtime check, but they just use the GOARM value the binary was built with. If we can detect the presence of the atomic instructions we need (LDREXD/STREXD, maybe also DMB?) at runtime, then we can base the runtime check on the actual hardware we're running on.

randall77 commented 1 year ago

LDREXD/STREXD can be detected using the lpae feature bit. (Particularly, detecting that they will be 64-bit atomic.) It looks like we also need to make sure the DMB instruction is available. It is only available starting in v7, so we need a way to detect that the chip is v7. Anyone know how to get that from feature bits? Currently we check vfp and vfpv3, but of course that's too strict if we're trying to run on fp-less chips.

ludi317 commented 1 year ago

@randall77 I thought the performance deltas in the channel-backed ProducerConsumer benchmarks (-8%) were also interesting, even though they were not as large as those of the math/bits and 64-bit atomic benchmarks.

Based on that finding, I wrote more benchmarks to compare the performance of synchronization primitives between the two builds. Please find the results below. The Mutex benchmarks that acquire a mutex lock, do some work, then release the lock are ~2x faster on GOARM=7,softfloat.

goos: linux
goarch: arm
pkg: github.com/ludi317/arm-wrestle
                                  │ armv5_1cpu_raw.txt │       armv7soft_1cpu_raw.txt       │
                                  │       sec/op       │   sec/op     vs base               │
                                  ...
Mutex                                      44.94µ ± 0%   22.60µ ± 0%  -49.70% (p=0.002 n=6)
RWMutex_Read                               45.00µ ± 0%   22.65µ ± 0%  -49.67% (p=0.002 n=6)
RWMutex_Write                              45.22µ ± 0%   22.87µ ± 0%  -49.42% (p=0.002 n=6)
WaitGroup                                  90.63m ± 4%   77.13m ± 4%  -14.89% (p=0.002 n=6)
Channel                                    8.781m ± 0%   8.383m ± 0%   -4.54% (p=0.002 n=6)
AtomicAdd                                 259.40n ± 0%   73.86n ± 0%  -71.53% (p=0.002 n=6)
Once                                       67.11n ± 0%   64.87n ± 0%   -3.33% (p=0.002 n=6)
Cond                                      11.126µ ± 0%   9.781µ ± 0%  -12.09% (p=0.002 n=6)
Pool                                       774.5n ± 1%   723.2n ± 1%   -6.62% (p=0.002 n=6)

randall77 commented 1 year ago

I suspect that the channel differences are all due to the synchronization primitives that channels use, for which we know there is already a sizable performance difference.

gopherbot commented 1 year ago

Change https://go.dev/cl/525637 mentions this issue: runtime: on arm32, detect whether we have sync instructions

ludi317 commented 1 year ago

Is there any additional information needed to move this proposal to "Active" status? To summarize,

rsc commented 1 year ago

It sounds like https://go.dev/cl/525637 is the right thing to try first, since it is not a visible API change and does not require a proposal at all. @ludi317 can you please rerun your GOARM=5 benchmarks with Keith's CL patched in?

ludi317 commented 1 year ago

@rsc Please find the requested benchmarks comparing GOARM=5 with Keith's CL applied and GOARM=7,softfloat.

rsc commented 1 year ago

This proposal has been added to the active column of the proposals project and will now be reviewed at the weekly proposal review meetings. — rsc for the proposal review group

ludi317 commented 1 year ago

I came across this comment from @minux about GOARM:

> We made a mistake when defining GOARM: VFP status and ARM architecture version really are two separate properties. For example, ARMv5E chips could also have FPU (at least in theory) and there are ARMv6 chips without VFP.

I drew a table to help clarify my understanding of the relationship between a target's ARM architecture version and FPU status, and the right GOARM value to use to generate its binary. (Could be wrong; please correct any errors.) It shows what kind of instructions each GOARM value emits. The ARM architecture version {5, 6, 7} and FPU status {No FPU, VFPv1, VFPv3} are separate properties, on separate axes, as the comment says.

        No FPU    VFPv1     VFPv3
ARMv5   GOARM=5   ←         -
ARMv6   ↑         GOARM=6   -
ARMv7   ↑         ↑         GOARM=7

Many ARM targets are located on the main diagonal. Off-diagonal targets fall back to a GOARM that underutilizes their hardware. Arrows point in the direction of the fallback. For example, an ARMv7 device with no FPU drops down to a binary with ARMv5 instructions. (Dashes represent invalid combinations of architecture versions and FPUs; VFPv3 is not implemented on ARM architectures v5 and v6.)

This proposal aims to avoid the 2 fallback cases in the No FPU column, allowing them to leverage the features of their respective architecture versions.

        No FPU              VFPv1     VFPv3
ARMv5   GOARM=5             ←         -
ARMv6   GOARM=6,softfloat   GOARM=6   -
ARMv7   GOARM=7,softfloat   ↑         GOARM=7

I wanted to frame this problem in the larger context of all fallbacks, to help guide the selection of a new set of GOARM options. For example, one advantage of the proposed softfloat / hardfloat naming scheme is that it is expressive enough to select GOARM=5,hardfloat and redress another fallback case. This is not to say GOARM=5,hardfloat ought to be implemented, only that the options generalize well enough to permit the possibility.
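
The fallback behavior described in this comment can also be written as a small lookup. The sketch below is illustrative only: the FPU type and the legacyGOARM and proposedGOARM functions are hypothetical names, and the ARMv7-with-VFPv1 case falling back to GOARM=6 is an assumption based on GOARM=7 requiring VFPv3.

```go
package main

import "fmt"

// FPU describes the floating point hardware on the target.
type FPU int

const (
	NoFPU FPU = iota
	VFPv1
	VFPv3
)

// valid reports whether the combination exists; per the comment above,
// VFPv3 is not implemented on ARM architectures v5 and v6.
func valid(arch int, fpu FPU) bool {
	if arch < 5 || arch > 7 {
		return false
	}
	return fpu != VFPv3 || arch == 7
}

// legacyGOARM returns the setting a target must use when GOARM is a bare
// number: anything without an FPU, and any ARMv5 chip, drops to GOARM=5.
func legacyGOARM(arch int, fpu FPU) (string, bool) {
	if !valid(arch, fpu) {
		return "", false
	}
	switch {
	case fpu == NoFPU || arch == 5:
		return "5", true
	case fpu == VFPv1:
		return "6", true // assumption: ARMv7 with only VFPv1 uses GOARM=6
	default:
		return "7", true
	}
}

// proposedGOARM removes the two fallbacks in the No FPU column.
func proposedGOARM(arch int, fpu FPU) (string, bool) {
	if !valid(arch, fpu) {
		return "", false
	}
	if fpu == NoFPU && arch > 5 {
		return fmt.Sprintf("%d,softfloat", arch), true
	}
	return legacyGOARM(arch, fpu)
}

func main() {
	for arch := 5; arch <= 7; arch++ {
		for _, fpu := range []FPU{NoFPU, VFPv1, VFPv3} {
			old, ok := legacyGOARM(arch, fpu)
			if !ok {
				continue // invalid combination
			}
			now, _ := proposedGOARM(arch, fpu)
			fmt.Printf("ARMv%d fpu=%d: GOARM=%s -> GOARM=%s\n", arch, fpu, old, now)
		}
	}
}
```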

rsc commented 1 year ago

Thanks for the numbers showing that 7,softfloat is still better than 5 with checks.

rsc commented 1 year ago

Have all remaining concerns about this proposal been addressed?

GOARM changes to have the form [567](,attrs)?. That is, there is now an optional attribute list. The only two defined attributes are softfloat and hardfloat, specifying software and hardware floating point (same names as for GOMIPS). It is an error to specify both softfloat and hardfloat. The leading number cannot be omitted. softfloat is the default for GOARM=5 and hardfloat is the default for GOARM=6 and GOARM=7.

When compiled with GOARM=7,softfloat, code will assume ARMv7 non-FP instructions like atomics but will use software floating point.
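
As a rough illustration of the rules stated above, the sketch below parses a GOARM value of the form [567](,attrs)?. It is not the toolchain's implementation: the goarmSetting type and parseGOARM function are hypothetical names, and rejecting unknown attributes is an assumption, since the proposal text only defines softfloat and hardfloat.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// goarmSetting is a hypothetical result type for this sketch.
type goarmSetting struct {
	Version   int  // 5, 6, or 7
	SoftFloat bool // true when floating point is done in software
}

// parseGOARM applies the stated rules: a mandatory leading 5, 6, or 7,
// then an optional attribute list of softfloat or hardfloat, not both.
func parseGOARM(v string) (goarmSetting, error) {
	parts := strings.Split(v, ",")
	var s goarmSetting
	switch parts[0] {
	case "5":
		s = goarmSetting{Version: 5, SoftFloat: true} // softfloat is the default for 5
	case "6":
		s = goarmSetting{Version: 6} // hardfloat is the default for 6
	case "7":
		s = goarmSetting{Version: 7} // hardfloat is the default for 7
	default:
		return goarmSetting{}, fmt.Errorf("invalid GOARM %q: must start with 5, 6, or 7", v)
	}
	var soft, hard bool
	for _, attr := range parts[1:] {
		switch attr {
		case "softfloat":
			soft = true
		case "hardfloat":
			hard = true
		default:
			// Assumption: anything else is rejected.
			return goarmSetting{}, fmt.Errorf("invalid GOARM %q: unknown attribute %q", v, attr)
		}
	}
	if soft && hard {
		return goarmSetting{}, errors.New("invalid GOARM: softfloat and hardfloat are mutually exclusive")
	}
	if soft {
		s.SoftFloat = true
	}
	if hard {
		s.SoftFloat = false
	}
	return s, nil
}

func main() {
	for _, v := range []string{"5", "7", "7,softfloat", "5,hardfloat", "7,softfloat,hardfloat", "softfloat"} {
		s, err := parseGOARM(v)
		fmt.Printf("%q: %+v err=%v\n", v, s, err)
	}
}
```

A build for the FPU-less ARMv7 switch discussed earlier would then correspond to the value 7,softfloat, which parses to version 7 with software floating point.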

MDr164 commented 12 months ago

Looks good. I'm looking forward to creating some real-world benchmarks, as this feature might greatly boost performance by finally making the v6 and v7 ISA usable on non-FP chips 🎉 I'm also in favor of the new optional attribute, as it allows AOT compilation with optimized asm instead of autodetection via CPU feature bits, which aren't always reliable. It also keeps code size down.

rsc commented 11 months ago

Based on the discussion above, this proposal seems like a likely accept. — rsc for the proposal review group

GOARM changes to have the form [567](,attrs)?. That is, there is now an optional attribute list. The only two defined attributes are softfloat and hardfloat, specifying software and hardware floating point (same names as for GOMIPS). It is an error to specify both softfloat and hardfloat. The leading number cannot be omitted. softfloat is the default for GOARM=5 and hardfloat is the default for GOARM=6 and GOARM=7.

When compiled with GOARM=7,softfloat, code will assume ARMv7 non-FP instructions like atomics but will use software floating point.

cherrymui commented 11 months ago

I assume GOARM=5,hardfloat will be an unsupported configuration?

MDr164 commented 11 months ago

> I assume GOARM=5,hardfloat will be an unsupported configuration?

To quote Ludi from earlier:

> For example, one advantage of the proposed softfloat / hardfloat naming scheme is that it is expressive enough to select GOARM=5,hardfloat and redress another fallback case. This is not to say GOARM=5,hardfloat ought to be implemented, only that the options generalize well enough to permit the possibility.

So I'd say GOARM=5,hardfloat should generally be supported, as VFP is technically supported on ARMv5, but I have never come across a chip that actually implements this combination (while the other way around, having a higher ISA but no VFP, is more common than one might think). And to streamline the flags and quote Russ:

> GOARM changes to have the form [567](,attrs)?. [...] softfloat is the default for GOARM=5 and hardfloat is the default for GOARM=6 and GOARM=7.

So there should not be a difference of attrs supported for each number imo.

rsc commented 11 months ago

I think it's fine to support 5,hardfloat and easier to support it than to reject it. Maybe people on chips with broken atomics will want it.

ludi317 commented 11 months ago

I updated my CL to support GOARM=5,hardfloat. I marked the parts of code that require the eye of a Go compiler team member as todo. I assume it's too late for this change to make it into the upcoming 1.22 release?

randall77 commented 11 months ago

It is not too late yet. The freeze is Nov 21.

rsc commented 11 months ago

No change in consensus, so accepted. 🎉 This issue now tracks the work of implementing the proposal. — rsc for the proposal review group

GOARM changes to have the form [567](,attrs)?. That is, there is now an optional attribute list. The only two defined attributes are softfloat and hardfloat, specifying software and hardware floating point (same names as for GOMIPS). It is an error to specify both softfloat and hardfloat. The leading number cannot be omitted. softfloat is the default for GOARM=5 and hardfloat is the default for GOARM=6 and GOARM=7.

When compiled with GOARM=7,softfloat, code will assume ARMv7 non-FP instructions like atomics but will use software floating point.

MDr164 commented 11 months ago

The CL has been merged, I guess this can be marked as resolved then?

cherrymui commented 11 months ago

I think this is done. Thank you!