golang / go

The Go programming language
https://go.dev

cmd/compile: iter implementations significantly slower than equivalent for loops #69015

Open kscooo opened 3 months ago

kscooo commented 3 months ago

Go version

go version go1.23.0 darwin/arm64 (gotip too)

Output of go env in your module/workspace:

GO111MODULE='on'
GOARCH='arm64'
GOBIN=''
GOCACHE='/Users/admin/Library/Caches/go-build'
GOENV='/Users/admin/Library/Application Support/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='arm64'
GOHOSTOS='darwin'
GOINSECURE=''
GOMODCACHE='/Users/admin/go/pkg/mod'
GONOPROXY=''
GONOSUMDB=''
GOOS='darwin'
GOPATH='/Users/admin/go'
GOPRIVATE=''
GOPROXY=''
GOROOT='/opt/homebrew/Cellar/go/1.23.0/libexec'
GOSUMDB='sum.golang.google.cn'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOTOOLDIR='/opt/homebrew/Cellar/go/1.23.0/libexec/pkg/tool/darwin_arm64'
GOVCS=''
GOVERSION='go1.23.0'
GODEBUG=''
GOTELEMETRY='on'
GOTELEMETRYDIR='/Users/admin/Library/Application Support/go/telemetry'
GCCGO='gccgo'
GOARM64='v8.0'
AR='ar'
CC='cc'
CXX='c++'
CGO_ENABLED='1'
GOMOD='/Users/admin/Developer/test/go.mod'
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -arch arm64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -ffile-prefix-map=/var/folders/7s/9gl4fgf97rlgr4gt47nythtw0000gn/T/go-build1880376490=/tmp/go-build -gno-record-gcc-switches -fno-common'

What did you do?

Related Go files:

iter: https://go.dev/play/p/iRuU4kNXngq
iter_test: https://go.dev/play/p/4C_EbsSnlQH
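The playground links contain the full source. As a rough illustration only (my own sketch, not the code from those files, which groups these as sub-benchmarks under BenchmarkSliceFunctions), each benchmark pairs a hand-written loop with its iter-based counterpart over a small slice, along these lines:

// Hypothetical reconstruction; save as e.g. iter_test.go in a test-only package.
package iterbench

import (
	"slices"
	"testing"
)

var sink int

// Hand-written backward loop over a 10-element slice.
func BenchmarkBackwardForLoop(b *testing.B) {
	s := []int{0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
	for i := 0; i < b.N; i++ {
		sum := 0
		for j := len(s) - 1; j >= 0; j-- {
			sum += s[j]
		}
		sink = sum
	}
}

// The same work expressed with the iter-based slices.Backward.
func BenchmarkBackward(b *testing.B) {
	s := []int{0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
	for i := 0; i < b.N; i++ {
		sum := 0
		for _, v := range slices.Backward(s) {
			sum += v
		}
		sink = sum
	}
}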

go test -bench=. -benchmem
goos: darwin
goarch: arm64
pkg: ksco/test
cpu: Apple M3 Pro
BenchmarkSliceFunctions/AllForLoop-10-12            321253351            3.475 ns/op           0 B/op          0 allocs/op
BenchmarkSliceFunctions/All-10-12                   250128255            4.530 ns/op           0 B/op          0 allocs/op
BenchmarkSliceFunctions/BackwardForLoop-10-12       344788078            3.509 ns/op           0 B/op          0 allocs/op
BenchmarkSliceFunctions/Backward-10-12              87433018            13.84 ns/op        0 B/op          0 allocs/op
BenchmarkSliceFunctions/ValuesForLoop-10-12         344200261            3.476 ns/op           0 B/op          0 allocs/op
BenchmarkSliceFunctions/Values-10-12                263804847            4.544 ns/op           0 B/op          0 allocs/op
BenchmarkSliceFunctions/AppendForLoop-10-12         13531671            87.18 ns/op      248 B/op          5 allocs/op
BenchmarkSliceFunctions/AppendSeq-10-12              8190546           145.6 ns/op       312 B/op          8 allocs/op
BenchmarkSliceFunctions/CollectForLoop-10-12        92614030            13.41 ns/op       80 B/op          1 allocs/op
BenchmarkSliceFunctions/Collect-10-12                8045440           146.8 ns/op       312 B/op          8 allocs/op
BenchmarkSliceFunctions/SortForLoop-10-12           57416722            20.59 ns/op       80 B/op          1 allocs/op
BenchmarkSliceFunctions/Sorted-10-12                 7757234           153.0 ns/op       312 B/op          8 allocs/op
BenchmarkSliceFunctions/ChunkForLoop-10-12          1000000000           0.7995 ns/op          0 B/op          0 allocs/op
BenchmarkSliceFunctions/Chunk-10-12                 231948748            5.167 ns/op           0 B/op          0 allocs/op
BenchmarkMapFunctions/AllForLoopMap-10-12           15668906            76.65 ns/op        0 B/op          0 allocs/op
BenchmarkMapFunctions/AllMap-10-12                  15576559            76.58 ns/op        0 B/op          0 allocs/op
BenchmarkMapFunctions/KeysForLoopMap-10-12          15780648            75.70 ns/op        0 B/op          0 allocs/op
BenchmarkMapFunctions/KeysMap-10-12                 15699544            76.53 ns/op        0 B/op          0 allocs/op
BenchmarkMapFunctions/ValuesForLoopMap-10-12        15928665            75.93 ns/op        0 B/op          0 allocs/op
BenchmarkMapFunctions/ValuesMap-10-12               15532413            76.55 ns/op        0 B/op          0 allocs/op
BenchmarkMapFunctions/InsertForLoopMap-10-12         1803122           668.6 ns/op      1401 B/op          2 allocs/op
BenchmarkMapFunctions/InsertMap-10-12                1655206           728.1 ns/op      1489 B/op          5 allocs/op
BenchmarkMapFunctions/CollectForLoopMap-10-12        5277978           226.0 ns/op       420 B/op          1 allocs/op
BenchmarkMapFunctions/CollectMap-10-12               3718111           321.8 ns/op       716 B/op          5 allocs/op
PASS
ok      ksco/test   34.819s

On Linux and x86 machines the iterator versions are also a bit slower than the for loops. gotip was tested as well, with similar results.

Additionally, when examining the assembly output generated by

go build -gcflags="-S" iter.go

I noticed that certain functions contain additional instructions that appear to be unnecessary, which could be contributing to the observed performance differences.

What did you see happen?

Analysis of the generated assembly revealed that iterator-based implementations (e.g., slices.All, slices.Backward, slices.Chunk) introduce additional overhead compared to traditional for-loops (a rough sketch of the rewrite follows the list):

  1. Additional function calls:

    • Iterator functions themselves
    • Closure function calls
    • Yield function calls
  2. Memory allocations:

    • Heap allocations for closures and iterator states (via runtime.newobject)
    • Larger stack frames
  3. Additional control flow:

    • Iterator state checks
    • Yield function return checks
  4. Indirect function calls:

    • Calls through function pointers (e.g., CALL (R4) observed in the chunk function)
  5. Increased register usage and stack operations:

    • More registers used for managing iterator state
    • More frequent stack operations for saving and restoring state
  6. Additional safety checks:

    • E.g., slice size validation in slices.Chunk
  7. Increased code size:

    • Iterator versions of functions are typically larger than their for-loop counterparts
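To make the closure and yield overhead concrete, here is a rough, hand-written approximation of the rewrite (a sketch of the general shape only, not the compiler's actual output, which also carries extra loop-state checks); it assumes the same hypothetical package and "slices" import as the sketch above:

// Plain loop: no closures, no indirect calls.
func sumForLoop(s []int) int {
	sum := 0
	for _, v := range s {
		sum += v
	}
	return sum
}

// Roughly what `for v := range slices.Values(s)` turns into: the iterator
// function is called with a yield closure, and the loop body runs inside it.
func sumValues(s []int) int {
	sum := 0
	seq := slices.Values(s)
	seq(func(v int) bool {
		sum += v
		return true // "keep iterating"; the result is checked on every yield call
	})
	return sum
}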

These patterns were particularly visible in the assembly generated for slices.Chunk (a rough reconstruction of the two compared shapes follows); similar issues were observed in other iterator-related function implementations.
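As a reconstruction under stated assumptions (a small fixed chunk size; not the issue's exact code), the two Chunk shapes being compared are roughly:

// Manual chunking with index arithmetic only.
func chunkForLoop(s []int, n int) int {
	count := 0
	for i := 0; i < len(s); i += n {
		end := min(i+n, len(s)) // min builtin, Go 1.21+
		_ = s[i:end]
		count++
	}
	return count
}

// The same traversal using the iter-based slices.Chunk.
func chunkIter(s []int, n int) int {
	count := 0
	for range slices.Chunk(s, n) {
		count++
	}
	return count
}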

What did you expect to see?

According to the Go Wiki's Rangefunc Experiment documentation, the optimized code structure in simple cases is almost identical to a manually written for loop.

However, assembly analysis suggests that the current implementations still introduce complexity and performance overhead. While they are already quite usable, I hope further optimizations can bring their performance in line with traditional for loops in most simple scenarios.

cherrymui commented 3 months ago

Could you try Go at the tip of the master branch? There were some recent optimizations that didn't make it into Go 1.23. Thanks!

cc @golang/compiler @dr2chase

cherrymui commented 3 months ago

cpu: Apple M3 Pro

Also, @dr2chase discovered that Apple Silicon chips may have some weird performance behaviors that we don't yet understand. Specifically, the inner loop runs faster if its address range crosses a 4K boundary, which we'd intuitively expect to be slower... Could you try running on a different machine? And try building with -ldflags=-randlayout=N (where N is some nonzero integer) for a randomized address layout? Thanks.
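A hypothetical invocation combining that flag with the benchmark above (any nonzero seed works; 12345 is arbitrary) would be:

go test -bench=. -benchmem -ldflags=-randlayout=12345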

qiulaidongfeng commented 3 months ago

I used go version devel go1.24-4f18477d Thu Aug 22 13:19:44 2024 +0000 windows/amd64 to run the benchmark, and I got:

goos: windows
goarch: amd64
cpu: AMD Ryzen 7 7840HS w/ Radeon 780M Graphics
BenchmarkSliceFunctions/AllForLoop-10-16                464714426                2.538 ns/op
BenchmarkSliceFunctions/All-10-16                       351567169                3.197 ns/op
BenchmarkSliceFunctions/BackwardForLoop-10-16           466168953                2.540 ns/op
BenchmarkSliceFunctions/Backward-10-16                  66170752                16.75 ns/op
BenchmarkSliceFunctions/ValuesForLoop-10-16             468401443                2.541 ns/op
BenchmarkSliceFunctions/Values-10-16                    349017703                3.192 ns/op
BenchmarkSliceFunctions/AppendForLoop-10-16             10627237               104.5 ns/op
BenchmarkSliceFunctions/AppendSeq-10-16                  6952333               167.6 ns/op
BenchmarkSliceFunctions/CollectForLoop-10-16            63369013                19.99 ns/op
BenchmarkSliceFunctions/Collect-10-16                    6831514               173.1 ns/op
BenchmarkSliceFunctions/SortForLoop-10-16               35765378                30.09 ns/op
BenchmarkSliceFunctions/Sorted-10-16                     6520611               180.1 ns/op
BenchmarkSliceFunctions/ChunkForLoop-10-16              1000000000               0.6359 ns/op
BenchmarkSliceFunctions/Chunk-10-16                     201705656                5.958 ns/op
BenchmarkMapFunctions/AllForLoopMap-10-16               17116472                69.28 ns/op
BenchmarkMapFunctions/AllMap-10-16                      16828688                70.17 ns/op
BenchmarkMapFunctions/KeysForLoopMap-10-16              17014784                69.13 ns/op
BenchmarkMapFunctions/KeysMap-10-16                     17204522                69.60 ns/op
BenchmarkMapFunctions/ValuesForLoopMap-10-16            16612399                70.88 ns/op
BenchmarkMapFunctions/ValuesMap-10-16                   17106541                70.92 ns/op
BenchmarkMapFunctions/InsertForLoopMap-10-16             1397740               856.9 ns/op
BenchmarkMapFunctions/InsertMap-10-16                    1279608               956.4 ns/op
BenchmarkMapFunctions/CollectForLoopMap-10-16            3888702               298.7 ns/op
BenchmarkMapFunctions/CollectMap-10-16                   2691692               433.5 ns/op
PASS
zigo101 commented 3 months ago

The "collect"/"sort"/"insert" counterparts are not very equivalent. I removed them: https://go.dev/play/p/3BNJ1pgQNAE

$ gotv :tip version
[Run]: $HOME/.cache/gotv/bra_master/bin/go version
go version devel go1.24-4f18477db6 Thu Aug 22 13:19:44 2024 +0000 linux/amd64

$ gotv :tip test -bench=.
[Run]: $HOME/.cache/gotv/bra_master/bin/go test -bench=.  -benchmem
goos: linux
goarch: amd64
pkg: example.com
cpu: Intel(R) Core(TM) i5-4210U CPU @ 1.70GHz
BenchmarkSliceFunctions/AllForLoop-10-4             125123698            9.555 ns/op           0 B/op          0 allocs/op
BenchmarkSliceFunctions/All-10-4                    120565822           10.03 ns/op        0 B/op          0 allocs/op
BenchmarkSliceFunctions/BackwardForLoop-10-4        125155231            9.575 ns/op           0 B/op          0 allocs/op
BenchmarkSliceFunctions/Backward-10-4               25645136            45.58 ns/op        0 B/op          0 allocs/op
BenchmarkSliceFunctions/ValuesForLoop-10-4          169942302            7.087 ns/op           0 B/op          0 allocs/op
BenchmarkSliceFunctions/Values-10-4                 141663436            8.305 ns/op           0 B/op          0 allocs/op
BenchmarkSliceFunctions/AppendForLoop-10-4           4608756           269.5 ns/op       248 B/op          5 allocs/op
BenchmarkSliceFunctions/AppendSeq-10-4               2763048           432.1 ns/op       312 B/op          8 allocs/op
BenchmarkSliceFunctions/ChunkForLoop-10-4           311692539            3.812 ns/op           0 B/op          0 allocs/op
BenchmarkSliceFunctions/Chunk-10-4                  70037104            15.69 ns/op        0 B/op          0 allocs/op
BenchmarkMapFunctions/AllForLoopMap-10-4             6592899           180.5 ns/op         0 B/op          0 allocs/op
BenchmarkMapFunctions/AllMap-10-4                    6519124           182.7 ns/op         0 B/op          0 allocs/op
BenchmarkMapFunctions/KeysForLoopMap-10-4            6884683           173.1 ns/op         0 B/op          0 allocs/op
BenchmarkMapFunctions/KeysMap-10-4                   6687552           178.2 ns/op         0 B/op          0 allocs/op
BenchmarkMapFunctions/ValuesForLoopMap-10-4          6934557           171.2 ns/op         0 B/op          0 allocs/op
BenchmarkMapFunctions/ValuesMap-10-4                 6493563           182.1 ns/op         0 B/op          0 allocs/op

It is somewhat strange that appendSeq allocates more than appendForLoop; their code looks equivalent.
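For reference, the two append shapes being compared are roughly the following (a reconstruction, not the exact playground code; same hypothetical package and "slices" import as the earlier sketch):

// Hand-written append loop over the source slice.
func appendForLoop(src []int) []int {
	var dst []int
	for _, v := range src {
		dst = append(dst, v)
	}
	return dst
}

// The same result via slices.AppendSeq over a slices.Values iterator.
func appendSeq(src []int) []int {
	var dst []int
	return slices.AppendSeq(dst, slices.Values(src))
}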

cherrymui commented 3 months ago

go build -gcflags="-N -l -S" iter.go

Do not use -N -l to look at optimizations. -N -l is meant to generate slower code, by definition.

kscooo commented 3 months ago

@cherrymui I have used gotip, and the results are similar to 1.23.0. I have also tested on the Linux (x86) platform; the results are similar there as well.

dr2chase commented 3 months ago

The problem with backward is that there's an inlining that doesn't happen; I think the Backward iterator has cost 81.

gopherbot commented 3 months ago

Change https://go.dev/cl/609095 mentions this issue: cmd/compile: tweak inlining to favor PPARAM-callers

dr2chase commented 1 month ago

With three recent CLs, there are some improvements:

BenchmarkSliceFunctions/AllForLoop-10-8             318656155            3.549 ns/op           0 B/op          0 allocs/op
BenchmarkSliceFunctions/All-10-8                    284950826            4.232 ns/op           0 B/op          0 allocs/op
BenchmarkSliceFunctions/BackwardForLoop-10-8        344576744            3.479 ns/op           0 B/op          0 allocs/op
BenchmarkSliceFunctions/Backward-10-8               285418107            4.454 ns/op           0 B/op          0 allocs/op
BenchmarkSliceFunctions/ValuesForLoop-10-8          341475630            3.505 ns/op           0 B/op          0 allocs/op
BenchmarkSliceFunctions/Values-10-8                 279355890            4.203 ns/op           0 B/op          0 allocs/op
BenchmarkSliceFunctions/AppendForLoop-10-8          15310840            76.63 ns/op      248 B/op          5 allocs/op
BenchmarkSliceFunctions/AppendSeq-10-8               9667617           125.3 ns/op       312 B/op          8 allocs/op
BenchmarkSliceFunctions/ChunkForLoop-10-8           1000000000           1.059 ns/op           0 B/op          0 allocs/op
BenchmarkSliceFunctions/Chunk-10-8                  645871693            1.863 ns/op           0 B/op          0 allocs/op
BenchmarkMapFunctions/AllForLoopMap-10-8            15707842            77.75 ns/op        0 B/op          0 allocs/op
BenchmarkMapFunctions/AllMap-10-8                   15588109            77.48 ns/op        0 B/op          0 allocs/op
BenchmarkMapFunctions/KeysForLoopMap-10-8           15874844            76.33 ns/op        0 B/op          0 allocs/op
BenchmarkMapFunctions/KeysMap-10-8                  15532279            77.45 ns/op        0 B/op          0 allocs/op
BenchmarkMapFunctions/ValuesForLoopMap-10-8         15944362            75.98 ns/op        0 B/op          0 allocs/op
BenchmarkMapFunctions/ValuesMap-10-8                15842664            76.84 ns/op        0 B/op          0 allocs/op

I notice that all the benchmarks involve trivial bodies. Microbenchmarks are useful for analyzing behavior, but the benchmarks we wish we had more of are those that model the behavior of real programs that have real costs.

zigo101 commented 1 week ago

It looks like https://go-review.googlesource.com/c/go/+/629195 improves iterator performance a lot for the cases in https://github.com/golang/go/issues/69015#issuecomment-2304832499:

BenchmarkSliceFunctions/AllForLoop-10-4             380834220            9.397 ns/op           0 B/op          0 allocs/op
BenchmarkSliceFunctions/All-10-4                    366582675            9.785 ns/op           0 B/op          0 allocs/op
BenchmarkSliceFunctions/BackwardForLoop-10-4        381681043            9.446 ns/op           0 B/op          0 allocs/op
BenchmarkSliceFunctions/Backward-10-4               364835076            9.794 ns/op           0 B/op          0 allocs/op
BenchmarkSliceFunctions/ValuesForLoop-10-4          511266987            7.327 ns/op           0 B/op          0 allocs/op
BenchmarkSliceFunctions/Values-10-4                 438282193            8.523 ns/op           0 B/op          0 allocs/op
BenchmarkSliceFunctions/AppendForLoop-10-4          15299283           232.5 ns/op       248 B/op          5 allocs/op
BenchmarkSliceFunctions/AppendSeq-10-4              14804580           243.5 ns/op       248 B/op          5 allocs/op
BenchmarkSliceFunctions/ChunkForLoop-10-4           951433105            3.769 ns/op           0 B/op          0 allocs/op
BenchmarkSliceFunctions/Chunk-10-4                  551707360            6.412 ns/op           0 B/op          0 allocs/op
BenchmarkMapFunctions/AllForLoopMap-10-4            16889038           209.1 ns/op         0 B/op          0 allocs/op
BenchmarkMapFunctions/AllMap-10-4                   17028261           209.1 ns/op         0 B/op          0 allocs/op
BenchmarkMapFunctions/KeysForLoopMap-10-4           17537821           202.7 ns/op         0 B/op          0 allocs/op
BenchmarkMapFunctions/KeysMap-10-4                  17326242           203.3 ns/op         0 B/op          0 allocs/op
BenchmarkMapFunctions/ValuesForLoopMap-10-4         17882481           198.4 ns/op         0 B/op          0 allocs/op
BenchmarkMapFunctions/ValuesMap-10-4                17813592           202.3 ns/op         0 B/op          0 allocs/op

BTW, it looks like that CL gives closures (like bar) a big advantage over declared functions (like foo) in inlining cost:

func foo(...) ... // package-level declared function

{ // in a function block
    var bar = func(...) ... // closure
}
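A concrete stand-alone illustration of the two shapes (hypothetical names, matching the sketch above):

package main

import "fmt"

// Package-level declared function.
func foo(x int) int { return x * 2 }

func main() {
	// Closure declared inside a function body; per the observation above, the
	// CL appears to treat this shape more favorably when computing inline cost.
	bar := func(x int) int { return x * 2 }
	fmt.Println(foo(3), bar(3))
}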
dr2chase commented 1 week ago

The intent was to help with rangefunc performance. If you find it necessary to play these games for code that matters (production code, etc), please let us know, because otherwise it's just an implementation detail that might change in a future release without regard for how people decided to depend on it.

That said, if a function is package-private and only called once, it might make sense to boost its inlining-fu. (I just finished refactoring the inliner to help us figure that out, but inlining in general remains unsatisfying.)

dr2chase commented 1 week ago

PS - on a weekend lark, I did a crude test of "if a function is package-private and only called once..." and the crude test is definitely not ready for prime time; the program text got a lot (10%-ish) bigger, and I have no idea whether it does anything for performance at all. The other changes, on existing benchmarks that are backward-looking (contain no rangefunc and are light on generics), were close enough to flat not to worry about.

By the way, if people are using generics or rangefunc in genuinely performance-critical code and there are benchmarks that model that use, those would be good to have, because the lack of real (not micro) benchmarks for the newer language features is a bother.

mvdan commented 1 week ago

Just a thought - if you want realistic benchmarks, perhaps ask through more public channels like the Go mailing list or twitter/bluesky, as I imagine only a handful of people subscribe to issues like this one.

dr2chase commented 1 week ago

As of the 1.24 freeze, the differences observed in this benchmark appear to be caused by whether the benchmarked function gets inlined into its BenchmarkXXX function. This is still a rangefunc-related problem, but it is because the rangefunc loop looks too expensive to the inliner when it calculates the inlining cost.
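For context, a log like the foo.log grepped below can be produced from the compiler's inlining diagnostics; one plausible invocation (the exact command used here is not shown) is:

go test -c -gcflags=-m=2 . > foo.log 2>&1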

grep "cannot inline" foo.log
./test_test.go:24:6: cannot inline all: function too complex: cost 134 exceeds budget 80
./test_test.go:42:6: cannot inline backward: function too complex: cost 134 exceeds budget 80
./test_test.go:60:6: cannot inline values: function too complex: cost 128 exceeds budget 80
./test_test.go:78:6: cannot inline appendSeq: function too complex: cost 86 exceeds budget 80
./test_test.go:96:6: cannot inline chunk: function too complex: cost 136 exceeds budget 80
./test_test.go:115:6: cannot inline allMap: function too complex: cost 134 exceeds budget 80
./test_test.go:134:6: cannot inline keysMap: function too complex: cost 128 exceeds budget 80
./test_test.go:152:6: cannot inline valuesMap: function too complex: cost 128 exceeds budget 80