Open kscooo opened 2 months ago
Could you try Go at the tip of the master branch? There were some recent optimizations that didn't make it into Go 1.23. Thanks!
cc @golang/compiler @dr2chase
cpu: Apple M3 Pro
Also, @dr2chase discovered that Apple Silicon chips may have some weird performance behaviors that we don't yet understand. Specifically, the inner loop runs faster if its address range crosses a 4K boundary, where intuitively we'd expect it to be slower... Could you try running on a different machine? And could you try building with -ldflags=-randlayout=N (where N is some nonzero integer) for some randomized address layout? Thanks.
I used go version devel go1.24-4f18477d Thu Aug 22 13:19:44 2024 +0000 windows/amd64 to run the benchmark, and I got:
goos: windows
goarch: amd64
cpu: AMD Ryzen 7 7840HS w/ Radeon 780M Graphics
BenchmarkSliceFunctions/AllForLoop-10-16 464714426 2.538 ns/op
BenchmarkSliceFunctions/All-10-16 351567169 3.197 ns/op
BenchmarkSliceFunctions/BackwardForLoop-10-16 466168953 2.540 ns/op
BenchmarkSliceFunctions/Backward-10-16 66170752 16.75 ns/op
BenchmarkSliceFunctions/ValuesForLoop-10-16 468401443 2.541 ns/op
BenchmarkSliceFunctions/Values-10-16 349017703 3.192 ns/op
BenchmarkSliceFunctions/AppendForLoop-10-16 10627237 104.5 ns/op
BenchmarkSliceFunctions/AppendSeq-10-16 6952333 167.6 ns/op
BenchmarkSliceFunctions/CollectForLoop-10-16 63369013 19.99 ns/op
BenchmarkSliceFunctions/Collect-10-16 6831514 173.1 ns/op
BenchmarkSliceFunctions/SortForLoop-10-16 35765378 30.09 ns/op
BenchmarkSliceFunctions/Sorted-10-16 6520611 180.1 ns/op
BenchmarkSliceFunctions/ChunkForLoop-10-16 1000000000 0.6359 ns/op
BenchmarkSliceFunctions/Chunk-10-16 201705656 5.958 ns/op
BenchmarkMapFunctions/AllForLoopMap-10-16 17116472 69.28 ns/op
BenchmarkMapFunctions/AllMap-10-16 16828688 70.17 ns/op
BenchmarkMapFunctions/KeysForLoopMap-10-16 17014784 69.13 ns/op
BenchmarkMapFunctions/KeysMap-10-16 17204522 69.60 ns/op
BenchmarkMapFunctions/ValuesForLoopMap-10-16 16612399 70.88 ns/op
BenchmarkMapFunctions/ValuesMap-10-16 17106541 70.92 ns/op
BenchmarkMapFunctions/InsertForLoopMap-10-16 1397740 856.9 ns/op
BenchmarkMapFunctions/InsertMap-10-16 1279608 956.4 ns/op
BenchmarkMapFunctions/CollectForLoopMap-10-16 3888702 298.7 ns/op
BenchmarkMapFunctions/CollectMap-10-16 2691692 433.5 ns/op
PASS
The "collect"/"sort"/"insert" counterparts are not really equivalent, so I removed them: https://go.dev/play/p/3BNJ1pgQNAE
$ gotv :tip version
[Run]: $HOME/.cache/gotv/bra_master/bin/go version
go version devel go1.24-4f18477db6 Thu Aug 22 13:19:44 2024 +0000 linux/amd64
$ gotv :tip test -bench=.
[Run]: $HOME/.cache/gotv/bra_master/bin/go test -bench=. -benchmem
goos: linux
goarch: amd64
pkg: example.com
cpu: Intel(R) Core(TM) i5-4210U CPU @ 1.70GHz
BenchmarkSliceFunctions/AllForLoop-10-4 125123698 9.555 ns/op 0 B/op 0 allocs/op
BenchmarkSliceFunctions/All-10-4 120565822 10.03 ns/op 0 B/op 0 allocs/op
BenchmarkSliceFunctions/BackwardForLoop-10-4 125155231 9.575 ns/op 0 B/op 0 allocs/op
BenchmarkSliceFunctions/Backward-10-4 25645136 45.58 ns/op 0 B/op 0 allocs/op
BenchmarkSliceFunctions/ValuesForLoop-10-4 169942302 7.087 ns/op 0 B/op 0 allocs/op
BenchmarkSliceFunctions/Values-10-4 141663436 8.305 ns/op 0 B/op 0 allocs/op
BenchmarkSliceFunctions/AppendForLoop-10-4 4608756 269.5 ns/op 248 B/op 5 allocs/op
BenchmarkSliceFunctions/AppendSeq-10-4 2763048 432.1 ns/op 312 B/op 8 allocs/op
BenchmarkSliceFunctions/ChunkForLoop-10-4 311692539 3.812 ns/op 0 B/op 0 allocs/op
BenchmarkSliceFunctions/Chunk-10-4 70037104 15.69 ns/op 0 B/op 0 allocs/op
BenchmarkMapFunctions/AllForLoopMap-10-4 6592899 180.5 ns/op 0 B/op 0 allocs/op
BenchmarkMapFunctions/AllMap-10-4 6519124 182.7 ns/op 0 B/op 0 allocs/op
BenchmarkMapFunctions/KeysForLoopMap-10-4 6884683 173.1 ns/op 0 B/op 0 allocs/op
BenchmarkMapFunctions/KeysMap-10-4 6687552 178.2 ns/op 0 B/op 0 allocs/op
BenchmarkMapFunctions/ValuesForLoopMap-10-4 6934557 171.2 ns/op 0 B/op 0 allocs/op
BenchmarkMapFunctions/ValuesMap-10-4 6493563 182.1 ns/op 0 B/op 0 allocs/op
It is somewhat strange that appendSeq allocates more than appendForLoop. Their code looks equivalent.
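For readers without the playground code at hand, here is a minimal sketch of the kind of pair being compared. The names `appendForLoop` and `appendSeq` come from the comment above, but the bodies below are assumptions, not the benchmark's actual code. Assuming both versions start from an empty slice and append 10 ints, the 248 B/op and 5 allocs/op reported for the loop are consistent with a backing array grown through capacities 1, 2, 4, 8, 16 (8+16+32+64+128 bytes on a 64-bit platform). The extra 64 B and 3 allocations on the iterator side would then plausibly come from something other than slice growth, such as closure state escaping to the heap.

```go
package main

import (
	"fmt"
	"slices"
	"testing"
)

// sink keeps results alive so the compiler cannot discard the work.
var sink []int

// appendForLoop is an assumed shape for the hand-written baseline.
func appendForLoop(dst, src []int) []int {
	for _, v := range src {
		dst = append(dst, v)
	}
	return dst
}

// appendSeq is an assumed shape for the iterator-based version.
func appendSeq(dst, src []int) []int {
	return slices.AppendSeq(dst, slices.Values(src))
}

func main() {
	src := []int{0, 1, 2, 3, 4, 5, 6, 7, 8, 9}

	// testing.AllocsPerRun reports the average number of heap allocations
	// per call of the given function.
	loopAllocs := testing.AllocsPerRun(1000, func() { sink = appendForLoop(nil, src) })
	seqAllocs := testing.AllocsPerRun(1000, func() { sink = appendSeq(nil, src) })

	// Counts depend on the Go version and platform, so none are asserted here.
	fmt.Printf("for loop:  %.0f allocs/run\n", loopAllocs)
	fmt.Printf("AppendSeq: %.0f allocs/run\n", seqAllocs)
}
```

Both functions produce the same slice contents; only the allocation behavior is in question.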
go build -gcflags="-N -l -S" iter.go
Do not use -N -l to look at optimizations. -N -l is meant to generate slower code, by definition.
@cherrymui I have used gotip, and the results are similar to 1.23.0. I have also tested on Linux (x86), and the results are similar there too.
The problem with Backward is that an inlining doesn't happen; I think the Backward iterator has an inlining cost of 81, just over the inliner's budget of 80.
Change https://go.dev/cl/609095 mentions this issue: cmd/compile: tweak inlining to favor PPARAM-callers
With three recent CLs, there are some improvements:
BenchmarkSliceFunctions/AllForLoop-10-8 318656155 3.549 ns/op 0 B/op 0 allocs/op
BenchmarkSliceFunctions/All-10-8 284950826 4.232 ns/op 0 B/op 0 allocs/op
BenchmarkSliceFunctions/BackwardForLoop-10-8 344576744 3.479 ns/op 0 B/op 0 allocs/op
BenchmarkSliceFunctions/Backward-10-8 285418107 4.454 ns/op 0 B/op 0 allocs/op
BenchmarkSliceFunctions/ValuesForLoop-10-8 341475630 3.505 ns/op 0 B/op 0 allocs/op
BenchmarkSliceFunctions/Values-10-8 279355890 4.203 ns/op 0 B/op 0 allocs/op
BenchmarkSliceFunctions/AppendForLoop-10-8 15310840 76.63 ns/op 248 B/op 5 allocs/op
BenchmarkSliceFunctions/AppendSeq-10-8 9667617 125.3 ns/op 312 B/op 8 allocs/op
BenchmarkSliceFunctions/ChunkForLoop-10-8 1000000000 1.059 ns/op 0 B/op 0 allocs/op
BenchmarkSliceFunctions/Chunk-10-8 645871693 1.863 ns/op 0 B/op 0 allocs/op
BenchmarkMapFunctions/AllForLoopMap-10-8 15707842 77.75 ns/op 0 B/op 0 allocs/op
BenchmarkMapFunctions/AllMap-10-8 15588109 77.48 ns/op 0 B/op 0 allocs/op
BenchmarkMapFunctions/KeysForLoopMap-10-8 15874844 76.33 ns/op 0 B/op 0 allocs/op
BenchmarkMapFunctions/KeysMap-10-8 15532279 77.45 ns/op 0 B/op 0 allocs/op
BenchmarkMapFunctions/ValuesForLoopMap-10-8 15944362 75.98 ns/op 0 B/op 0 allocs/op
BenchmarkMapFunctions/ValuesMap-10-8 15842664 76.84 ns/op 0 B/op 0 allocs/op
I notice that all the benchmarks involve trivial bodies. Microbenchmarks are useful for analyzing behavior, but the benchmarks we wish we had more of are those that model the behavior of real programs that have real costs.
Go version
go version go1.23.0 darwin/arm64 (gotip too)
Output of go env in your module/workspace:
What did you do?
Related Go files:
iter: https://go.dev/play/p/iRuU4kNXngq
iter_test: https://go.dev/play/p/4C_EbsSnlQH
On Linux and x86 machines, the iterator versions are also a bit slower. Gotip was also used, with similar results.
Additionally, when examining the assembly output generated by go build -gcflags="-N -l -S" iter.go, I noticed that certain functions contain additional instructions that appear to be unnecessary, which could be contributing to the observed performance differences.
What did you see happen?
Analysis of the generated assembly revealed that iterator-based implementations (e.g., slices.All, slices.Backward, slices.Chunk) introduce additional overhead compared to traditional for-loops:
Additional function calls
Memory allocations: runtime.newobject
Additional control flow
Indirect function calls: CALL (R4), observed in the chunk function
Increased register usage and stack operations
Additional safety checks: slices.Chunk
Increased code size
Specifically for slices.Chunk, I observed:
runtime.newobject calls for creating closure objects
slices.Chunk[go.shape.[]int,go.shape.int].func1
Similar issues were observed in other iterator-related function implementations.
What did you expect to see?
According to the Go Wiki's Rangefunc Experiment documentation, the optimized code structure in simple cases is almost identical to a manually written for loop.
However, assembly analysis suggests that the current implementations may introduce complexity and potential performance overhead. While these implementations are already quite effective, there is hope that further optimizations could align their performance with traditional for loops in most simple scenarios.