golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License

cmd/compile: add GOAMD64 architecture variants #25489

Closed: josharian closed this issue 2 years ago

josharian commented 6 years ago

This is a data-gathering bug to collect instances in which a GOAMD64 environment variable would be helpful. Please feel free to add to it. My hope is that over time, if/when these use cases accumulate, we will get a clearer picture of whether to add GOAMD64, and if so, what shape it should take.

Uses:

cc @randall77 @martisch @TocarIP @Quasilyte @dr2chase

TocarIP commented 6 years ago

I'd like to add:

There is also a problem with opting into the 512-bit version of AVX512. By default only the 256-bit version should be used, to avoid the frequency drop, but there should be a mechanism to opt in to the 512-bit version for faster memmove/crypto/etc.

mvdan commented 6 years ago

I thought that the Go runtime and standard library already had ways to check for certain hardware features at run-time. How do these enhancements compare to that? If these were to be made at compile-time, would the run-time checks be removed?

randall77 commented 6 years ago

It is preferable to use runtime detection when possible. For any sizeable chunk of code, that's what we currently do. But especially for individual instructions, the overhead for runtime checking can be large. In that case, it can be preferable to "bake in" the decision during compilation. That's what such an environment variable would do.
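For a single instruction, the shape of that per-call check looks roughly like the following sketch (illustrative only, written against golang.org/x/sys/cpu rather than the compiler's actual output; the compiler's intrinsic for math/bits.OnesCount64 emits an equivalent branch on runtime.x86HasPOPCNT):

package featurecheck

import (
	"math/bits"

	"golang.org/x/sys/cpu"
)

// onesCount guards the single-instruction fast path with a feature-flag
// branch, so a hot loop pays the check on every iteration.
func onesCount(x uint64) int {
	if cpu.X86.HasPOPCNT {
		return bits.OnesCount64(x) // compiles to POPCNT where supported
	}
	// Portable bit-by-bit fallback.
	n := 0
	for ; x != 0; x >>= 1 {
		n += int(x & 1)
	}
	return n
}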

mvdan commented 6 years ago

Apologies for asking more questions - if we add an environment variable, wouldn't it be simpler and faster to just use the optimal code for the hardware, getting rid of all run-time detection?

Unless the point is to do a best-effort combination of both, so that very expensive code like crypto would still perform well on newer hardware even if the environment variable wasn't set up properly?

randall77 commented 6 years ago

We want to use the environment variable as little as possible, as it bakes the optimization decision in at compile time. As a consequence, the binary cannot then adapt at runtime; it would fail on an old machine, or not use optimal code on a new machine. This will lead to people having to distribute multiple copies of their binary, possibly many depending on how many options are available. It's also extra API surface to the compiler which we would have to spec & maintain. So I think we'd need a really good reason to go down this road.

quasilyte commented 6 years ago

Function multi-versioning was proposed some time ago as a compromise between fewer run-time checks and portability across older/newer CPUs.

I can't find the discussion right now, but the approach of checking CPUID features once and then using a specialized version looks good (at least on paper).

I have no specific proposals; moreover, I'm not familiar with the referenced technique in detail. This is just a note that there are other approaches between "full compile-time" and "full run-time" selection, as sketched below. Thank you.
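A minimal sketch of that idea in Go, assuming hypothetical names and using golang.org/x/sys/cpu for feature detection: the implementation is chosen once at package init, and hot callers pay an indirect call instead of a per-call feature test.

package multiversion

import (
	"math/bits"

	"golang.org/x/sys/cpu"
)

// popCount is bound once, at package init, so callers pay an indirect
// call rather than a feature check on every invocation.
var popCount = popCountGeneric

func init() {
	if cpu.X86.HasPOPCNT {
		popCount = popCountFast
	}
}

// popCountGeneric is the portable fallback.
func popCountGeneric(x uint64) int {
	n := 0
	for ; x != 0; x &= x - 1 { // clear the lowest set bit each iteration
		n++
	}
	return n
}

// popCountFast stands in for a CPU-specialized version; here it defers to
// math/bits, which the compiler intrinsifies.
func popCountFast(x uint64) int { return bits.OnesCount64(x) }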

quasilyte commented 6 years ago

For some cases an indirect call would be slower than a call + branch inside it, though. This needs a lot of investigation.

There is also a JIT-like approach where we rewrite the function's code segment during startup if a more optimized version is available. This is not possible on every imaginable platform, but given that this is mostly about AMD64, it seems doable in that context.

agnivade commented 6 years ago

Great! I have been wanting this for a long time - https://groups.google.com/forum/#!topic/golang-nuts/b75xYuORaok :) My thinking was that using the non-destructive three-operand form of VEX-encoded instructions would alleviate register pressure and make the binaries much smaller. (Unfortunately, I don't know enough about compiler internals to do a PoC and confirm this.)

@ianlancetaylor had mentioned -

I don't think it's worth adding a GOAMD64 environment variable for this. I don't think the gain is worth the complexity. Happy to be proven wrong.

TocarIP commented 6 years ago

I've implemented a prototype in https://go-review.googlesource.com/c/go/+/117295 to see how beneficial this can be. @agnivade, can you check it with your binaries?

agnivade commented 6 years ago

With pleasure :)

agnivade commented 6 years ago

I have done some testing.

On binary size, I am seeing a slight decrease :tada:. I tested with 3 production applications: 2 network services and 1 webapp. This is without -ldflags="-s -w". The baseline was current master.

Service1 - 17488784 to 17482928 = 5856
Service2 - 16163230 to 16151311 = 11919
Webapp - 21802366 to 21794461 = 7905

On the perf side, general application benchmarks do not show noticeable improvements. But if I run micro-benchmarks targeting very specific functions - e.g. math.Atan2, which executes a lot of FMA instructions - I can see an improvement of about 2ns.
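For reference, a micro-benchmark of that shape looks roughly like this (an illustrative sketch; the sink variable keeps the compiler from optimizing the call away):

package atan2bench

import (
	"math"
	"testing"
)

// BenchmarkAtan2 exercises math.Atan2, whose FMA-heavy body is affected
// by the VEX encodings discussed above.
func BenchmarkAtan2(b *testing.B) {
	var sink float64
	for i := 0; i < b.N; i++ {
		sink = math.Atan2(3.0, float64(i))
	}
	_ = sink
}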

I looked into the objdump of the binaries to investigate possible reasons for this (with whatever meagre assembly knowledge I have).

One thing I noticed is that the V(ADD|SUB|MUL|DIV)SD operations always act on registers, never on memory operands, even though memory operands are where they are especially useful for reducing register pressure.

For example, I see patterns like this throughout -

vmovsd 0x66323(%rip),%xmm3        # 477a38 <$f64.4059000000000000>
vdivsd %xmm3,%xmm2,%xmm2

vmovsd 0x662bf(%rip),%xmm3        # 4779e0 <$f64.3ff0000000000000>
vsubsd %xmm3,%xmm0,%xmm0

vmovsd 0xc6b83(%rip),%xmm4        # 4d82b0 <runtime.memstats+0x1730>
vsubsd %xmm4,%xmm2,%xmm5

which I believe can be optimized to -

vdivsd 0x66323(%rip),%xmm2,%xmm2

vsubsd 0x662bf(%rip),%xmm0,%xmm0

vsubsd 0xc6b83(%rip),%xmm2,%xmm5

That's all I could glean. Happy to provide more info if required.

dr2chase commented 6 years ago

Benchmarks that I ran Friday, then got busy and forgot to post: https://perf.golang.org/search?q=upload:20180611.1 Quiet machine, using Austin's perflock to make things even more sane. Saw a -0.5% geomean time change; one benchmark got worse for unclear reasons. Binaries tended slightly smaller:

size *_*
   text    data     bss     dec     hex filename
3745235  104712  137584 3987531  3cd84b capnproto2_Tip
3743659  104712  137584 3985955  3cd223 capnproto2_TryCL117925
2380835  101072  136296 2618203  27f35b ethereum_bitutil_Tip
2379013  101072  136296 2616381  27ec3d ethereum_bitutil_TryCL117925
4703293  118072  181832 5003197  4c57bd ethereum_corevm_Tip
4700462  118072  181832 5000366  4c4cae ethereum_corevm_TryCL117925
3568662  116248  137736 3822646  3a5436 ethereum_ecies_Tip
3566582  116248  137736 3820566  3a4c16 ethereum_ecies_TryCL117925
7089434  260153  149576 7499163  726d9b ethereum_ethash_Tip
7087579  260153  149576 7497308  72665c ethereum_ethash_TryCL117925
2408673   95248  135720 2639641  284719 ethereum_sha3_Tip
2407012   95248  135720 2637980  28409c ethereum_sha3_TryCL117925
4680040  108704  139984 4928728  4b34d8 ethereum_storage_Tip
4678379  108704  139984 4927067  4b2e5b ethereum_storage_TryCL117925
5023812  111384  139656 5274852  507ce4 ethereum_trie_Tip
5021601  111384  139656 5272641  507441 ethereum_trie_TryCL117925
9392334  881770  152488 10426592     9f18e0 ethereum_whisperv5_Tip
9390867  881770  152488 10425125     9f1325 ethereum_whisperv5_TryCL117925
3297273  174104  135432 3606809  370919 gonum_blas_native_Tip
3292393  174104  135432 3601929  36f609 gonum_blas_native_TryCL117925
2768825  122864  145672 3037361  2e58b1 gonum_community_Tip
2767891  122864  145672 3036427  2e550b gonum_community_TryCL117925
5086714  287120  135816 5509650  541212 gonum_lapack_native_Tip
5086303  287120  135816 5509239  541077 gonum_lapack_native_TryCL117925
4601987  124912  143528 4870427  4a511b gonum_mat_Tip
4600506  124912  143528 4868946  4a4b52 gonum_mat_TryCL117925
2840813  222312  143400 3206525  30ed7d gonum_path_Tip
2839674  222312  143400 3205386  30e90a gonum_path_TryCL117925
2444690  103504  144296 2692490  29158a gonum_topo_Tip
2442897  103504  144296 2690697  290e89 gonum_topo_TryCL117925
2237492   97360  143080 2477932  25cf6c gonum_traverse_Tip
2235723   97360  143080 2476163  25c883 gonum_traverse_TryCL117925
14978433     400569  153168 15532170     ed008a hugo_helpers_Tip
14976682     400569  153168 15530419     ecf9b3 hugo_helpers_TryCL117925
20526259     436953  186064 21149276    142b65c hugo_hugolib_sitebuilding_Tip
20521440     436953  186064 21144457    142a389 hugo_hugolib_sitebuilding_TryCL117925
20526259     436953  186064 21149276    142b65c hugo_hugolib_Tip
20521440     436953  186064 21144457    142a389 hugo_hugolib_TryCL117925
24354787     323009  162576 24840372    17b08b4 k8s_api_Tip
24343935     323009  162576 24829520    17ade50 k8s_api_TryCL117925
14824181     283489  155440 15263110     e8e586 k8s_schedulercache_Tip
14818560     283489  155440 15257489     e8cf91 k8s_schedulercache_TryCL117925
2534866   96176  133736 2764778  2a2fea semver_Tip
2533177   96176  133736 2763089  2a2951 semver_TryCL117925
2418677   94992  133640 2647309  28650d spexs2_Tip
2416868   94992  133640 2645500  285dfc spexs2_TryCL117925
6954182  226921  147440 7328543  6fd31f uber_zap_Tip
6953004  226921  147440 7327365  6fce85 uber_zap_TryCL117925
3659912  107880  143376 3911168  3bae00 uuid_Tip
3658142  107880  143376 3909398  3ba716 uuid_TryCL117925

agnivade commented 6 years ago

I apologize for my stupidity. I was comparing with 1.10.2. :disappointed: I have updated my comment. The results match @dr2chase's observations.

dr2chase commented 6 years ago

No need for the s-word; benchmarking is picky work that's easy to get wrong. I ended up writing a tool to help me do it, because I could not trust myself to do it all correctly and reproducibly, and I really ought to make that tool automate even more of the steps.

TocarIP commented 6 years ago

@dr2chase what machine are you using? I've tried to reproduce the Growth_MultiSegment regression, but it showed no changes.

dr2chase commented 6 years ago

It is "Model name: Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz" I used Austin's perflock to reduce risk of heat-related performance changes, used a quiet machine.

josharian commented 5 years ago

It'd be nice to be able to use BLSR from BMI1, e.g. to rapidly iterate over set bits, which is something we do in a few hot functions in the runtime (adjustpointers, scanblock). The pattern is sketched below.
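A minimal Go sketch of that iteration pattern (illustrative names): the b &= b - 1 step clears the lowest set bit, which is exactly the operation BLSR performs in a single instruction.

package bitloop

import "math/bits"

// visitSetBits calls visit with the index of each set bit in mask,
// lowest bit first.
func visitSetBits(mask uint64, visit func(i int)) {
	for b := mask; b != 0; b &= b - 1 { // b &= b - 1 is what BLSR computes
		visit(bits.TrailingZeros64(b))
	}
}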

seebs commented 5 years ago

I haven't got data on the performance impact yet, because I have to hack the compiler to do it, but we have a tight loop that's ORing two 8KB blocks together, and we also want to popcount them. Right now, on amd64, adding the popcount more than doubles the runtime of the loop, according to naive benchmarks.

https://godbolt.org/z/1L6f_N
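The loop is shaped roughly like this (a sketch of the pattern, not the actual source; it assumes len(src) >= len(dst)):

package bitmap

import "math/bits"

// unionAndCount ORs src into dst in place and accumulates the popcount
// of the result - the combination described above.
func unionAndCount(dst, src []uint64) int {
	n := 0
	for i := range dst {
		dst[i] |= src[i]
		n += bits.OnesCount64(dst[i])
	}
	return n
}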

The difference from adding a popcount inside the loop is larger than I'd have expected:

        movq    (CX)(DX*8), BX
        movq    (AX)(DX*8), SI
        orq     SI, BX
        movq    BX, (CX)(DX*8)

becomes

        movq    8(CX)(DX*8), DI
        movq    8(AX)(DX*8), R8
        orq    R8, DI
        movq    DI, 8(CX)(DX*8)
        addq    SI, BX
        cmpb    runtime.x86HasPOPCNT(SB), $0
        jeq     unionBitmapBitmapInPlaceTemps_pc349
        xorq    SI, SI
        popcntq DI, SI

Note, the "addq" in the middle is actually adding a previous popcount, this loop's popcount is added to a counter about four instructions later. So this isn't the actual instructions for this step, but it's the same instructions as them.

I don't understand why there's an xorq SI, SI right before the popcount, because the popcount just overwrites that register anyway, I think? But my expectation was that the popcount would have fairly low impact, because the entire operation is trivial enough that I wouldn't expect it to run out of computation before it gets stuck waiting for cache lines. I did not expect it to double the runtime cost. (I expected just the popcntq and addq instructions, basically.)

If time allows, I'm gonna hack my local compiler to ditch that and see what the impact is like.

zeebo commented 5 years ago

For the XORQ SI, SI: if I recall correctly, this is because on some microarchitectures POPCNT has a false dependency on the output register, and zeroing the register first breaks that dependency.

seebs commented 5 years ago

So I made a funny mistake which actually produced informative results: I wrote a benchmark for this in which I omitted any actual use of the results of popcount operations.

As it happens, this has the curious effect of generating the conditional jumps to call the popcount function if necessary, but not generating the actual popcount instructions (as the compiler figured out that the value being set to their sum was unused).

So, for basically similar code, my reported times were:

no popcount: 450ns/op
conditionals but no popcount: 600ns/op
popcount included: 800ns/op

(somewhat approximate values)

I then went ahead and did the fairly obvious thing of just taking away the runtime tests for OnesCount64, and with those gone, I get 444/444/520.

What's interesting about this is that the cost of adding the popcount instructions without the conditional tests is dramatically lower (~80ns) than the cost of adding the popcount instructions with the conditional tests (~200ns more than the conditionals without the popcounts).

So, for this specific function, it looks like the difference would be ~520ns vs. ~800ns. A 35% reduction in runtime for a hot loop is pretty tempting.

I don't really know enough about modern machines to know how common it is to find amd64 hardware without popcount. But I'd say that, yeah, I would be pretty interested in a target that allowed me to get that improvement on something that has been a pain point for us in the past.

Just in case it's relevant: looking in more detail with perf, it looks like the conditionals are noticeably slowing execution down, even though they're virtually never mispredicted.

using 1.12.9:

       3966.613205      task-clock (msec)         #    1.001 CPUs utilized          
               842      context-switches          #    0.212 K/sec                  
                28      cpu-migrations            #    0.007 K/sec                  
             1,233      page-faults               #    0.311 K/sec                  
    14,233,634,600      cycles                    #    3.588 GHz                      (30.71%)
    51,543,275,201      instructions              #    3.62  insn per cycle           (38.51%)
     7,791,177,587      branches                  # 1964.189 M/sec                    (38.43%)
         5,647,268      branch-misses             #    0.07% of all branches          (38.31%)
    15,594,961,940      L1-dcache-loads           # 3931.556 M/sec                    (38.44%)
         2,310,978      L1-dcache-load-misses     #    0.01% of all L1-dcache hits    (38.47%)
            52,691      LLC-loads                 #    0.013 M/sec                    (30.64%)
            17,671      LLC-load-misses           #   33.54% of all LL-cache hits     (30.63%)
   <not supported>      L1-icache-loads                                             
           556,635      L1-icache-load-misses                                         (30.73%)
    15,533,175,254      dTLB-loads                # 3915.979 M/sec                    (30.83%)
             3,938      dTLB-load-misses          #    0.00% of all dTLB cache hits   (30.93%)
             6,249      iTLB-loads                #    0.002 M/sec                    (31.06%)
               856      iTLB-load-misses          #   13.70% of all iTLB cache hits   (30.83%)

Using the same thing, but with the conditional jump for OnesCount64 omitted:

       2615.934480      task-clock (msec)         #    1.001 CPUs utilized          
               770      context-switches          #    0.294 K/sec                  
                34      cpu-migrations            #    0.013 K/sec                  
             1,238      page-faults               #    0.473 K/sec                  
     9,387,417,295      cycles                    #    3.589 GHz                      (30.25%)
    37,574,374,842      instructions              #    4.00  insn per cycle           (38.10%)
     1,338,302,028      branches                  #  511.596 M/sec                    (38.27%)
         5,577,182      branch-misses             #    0.42% of all branches          (38.38%)
    10,322,451,044      L1-dcache-loads           # 3945.990 M/sec                    (38.52%)
         2,035,316      L1-dcache-load-misses     #    0.02% of all L1-dcache hits    (38.58%)
            42,063      LLC-loads                 #    0.016 M/sec                    (30.87%)
            12,212      LLC-load-misses           #   29.03% of all LL-cache hits     (31.17%)
   <not supported>      L1-icache-loads                                             
           379,094      L1-icache-load-misses                                         (31.17%)
    10,403,203,012      dTLB-loads                # 3976.859 M/sec                    (31.17%)
             1,747      dTLB-load-misses          #    0.00% of all dTLB cache hits   (30.85%)
             2,515      iTLB-loads                #    0.961 K/sec                    (30.46%)
               577      iTLB-load-misses          #   22.94% of all iTLB cache hits   (30.31%)

Branch misses don't seem to be a significant factor here, but something about this is causing some stalling, giving us a noticeable drop in instructions/cycle, as well as increasing the instructions executed by about 35%.

YMMV.

mdempsky commented 3 years ago

As another data point, https://go-review.googlesource.com/c/go/+/299491 is a proof-of-concept CL that adds support for a handful of Haswell instructions to the compiler, which I (a not particularly savvy micro-optimizer) was able to use to improve encoding/binary.Uvarint's throughput by 166% (from 616ns to 231ns).

For comparison, the Go protobuf package includes a fully unrolled version of varint decoding (because varint decoding is a very hot code path within protobuf unmarshalling), which takes 347ns (+78% throughput). Applying the extra tricks from CL 299491 (using rotation instead of shifts, having a fast path for len(buf) >= 8 to eliminate bounds checks, early loading of the slice data pointer) bumps that a little further, to about 315ns (+96%). Still measurably short of using the Haswell extensions.
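To make the bounds-check trick concrete, here is a sketch of such a fast path (hypothetical code, not the CL itself): one 8-byte load incurs a single bounds check and then handles the common 1-byte case without further checks on the buffer.

package varint

import "encoding/binary"

// uvarintFast handles the common 1-byte varint with one bounds check by
// loading 8 bytes at once, falling back to the generic decoder otherwise.
func uvarintFast(buf []byte) (uint64, int) {
	if len(buf) >= 8 {
		x := binary.LittleEndian.Uint64(buf) // one bounds check for 8 bytes
		if x&0x80 == 0 {                     // continuation bit clear: 1-byte varint
			return x & 0x7f, 1
		}
	}
	return binary.Uvarint(buf)
}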

I think large-scale cloud deployments would benefit from being able to statically build Go binaries that take advantage of more recent x86 instruction set extensions. Haswell (2013) is older than POWER9 (2017), which is configurable via GOPPC64.

josharian commented 3 years ago

Varint decoding is also very hot code in the runtime, particularly during stack copying. See e.g. https://go-review.googlesource.com/43150.

mdempsky commented 3 years ago

Another data point: https://go-review.googlesource.com/c/go/+/299649 reworks scanobject to use some Haswell instructions, and gets a 4% geomean performance improvement in compilebench. (Not bad for another one-day hack.)

I'm going to see if I can benchmark some longer-lived processes, and then also look into heapBitsSetType.

randall77 commented 3 years ago

FWIW, heapBitsSetType would be much simpler with a 1-bit bitmap, something we've been thinking about. That may obviate the need for the fancy bit manipulation instructions.

mdempsky commented 3 years ago

I asked Google's C++ toolchain team for advice here, and they pointed out that the x86-64 psABI now defines microarchitecture levels with specific feature sets: https://en.wikipedia.org/wiki/X86-64#Microarchitecture_Levels. Also, see this blog post from Red Hat for more context: https://developers.redhat.com/blog/2021/01/05/building-red-hat-enterprise-linux-9-for-the-x86-64-v2-microarchitecture-level/

It seems like the three immediately useful targets are "baseline", "v2", and "v3". Baseline is what we support today. v2 and v3 each offer new instructions that could be used for better optimizing current Go code (e.g., POPCNT in v2, BMI1/BMI2 in v3).

I don't see any immediate benefit from supporting v4 today. Go doesn't have native SIMD support, and AVX512 hardware support still seems comparatively uncommon. (There are also lots of discussions about AVX512 having adverse system-level effects due to CPU down-throttling, etc.)

So my tentative proposal would be GOAMD64=baseline, GOAMD64=v2, and GOAMD64=v3. I'm open to alternatives to "baseline", but the psABI doesn't actually give this level a version name of its own. (But maybe we could suggest they retroactively name it "v1".)

There have also been suggestions that binaries should check at startup whether the CPU supports the required features, so we don't get lots of SIGILL issue reports from users. That seems reasonable to me.
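A minimal sketch of what such a startup check could look like (assumed form; the real check would live in the runtime and be compiled in only for builds that require the features):

package main

import (
	"fmt"
	"os"

	"golang.org/x/sys/cpu"
)

// requireV2Features aborts with a clear message instead of letting some
// later POPCNT or SSE4.2 instruction die with SIGILL on an older CPU.
// (x86-64-v2 requires more features than checked here; this is partial.)
func requireV2Features() {
	if !cpu.X86.HasPOPCNT || !cpu.X86.HasSSE42 {
		fmt.Fprintln(os.Stderr, "fatal: this binary requires an x86-64-v2 capable CPU")
		os.Exit(1)
	}
}

func main() {
	requireV2Features()
	// ... program proper ...
}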

Also, there's been some brainstorming about various ways we could automatically compile code multiple ways and efficiently dispatch to the right code paths at a higher level, to balance binary size against runtime dispatch overhead. I think that's intriguing, but it's not obvious to me that we have a concrete solution, or anyone with time to investigate/implement it today.

CAFxX commented 3 years ago

So my tentative proposal would be GOAMD64=baseline, GOAMD64=v2, and GOAMD64=v3

How about also GOAMD64=native, which would pick the highest level available on the machine that is compiling the code?

ianlancetaylor commented 3 years ago

Historically we've avoided things like GOAMD64=native. Instead we let the bootstrap process set the default value of variables like GOAMD64 based on the machine on which the bootstrap is run.

Which isn't to say that we can't change that behavior, but I think that should be a separate decision, and one that should apply to all the GOARCH environment variables, not just GOAMD64.

mdempsky commented 3 years ago

I submitted #45453 as a proposal for this (GOAMD64 architecture variants), along with related proposals #45454 (build tags for distinguishing architecture variants) and #45455 (new intrinsics included in Haswell's BMI2 instruction set).

gopherbot commented 3 years ago

Change https://golang.org/cl/351191 mentions this issue: runtime: check a microarchitecture level at startup

martisch commented 2 years ago

Since #45453 is resolved, I think we can close this bug too.