FWIW, we've noticed this at Uber while doing some benchmarking recently. I've posted two possibly related but more specific issues spawned from Zap's benchmarks - #67857 and #67858.
Moving this to okay-after-rc1 so we can gather more data during RC1 while concurrently investigating these regressions. (I also don't think the regressions, which are primarily in microbenchmarks, are severe enough that they should block an RC. But we should still get to the bottom of this and either fix it or make an explicit decision before the release.)
Hi all,
I spent some time looking at this issue, specifically at one of the benchmarks that has regressed, BenchmarkHistogramAllocation from https://github.com/uber-go/tally.
TLDR: some portion of this slowdown appears to be due to alignment/layout artifacts, since a good chunk of it goes away when I randomize the order of functions in the text section.
Here's what I did as an experiment:
- prepared two Go repos, one at the tip of master and the other at the tip of the Go 1.22 release branch.
- in the Go 1.22 repo, cherry-picked CL 562157, the recently added "-randlayout" linker option, which randomly shuffles the order of functions in the text section.

I then did two sets of runs: one comparing tip vs. 1.22 without any additional compiler or linker flags, and a second set using "-randlayout" for both 1.22 and tip (a minimal example of the flag in use is just below; the full script is further down).
This is on a 2-socket 96-core cloudtop VM.
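For reference, a single randomized run from the second set boils down to something like this (a sketch only: the seed and output file name are illustrative, and the toolchain first on PATH must include the -randlayout linker flag from CL 562157):

# From the github.com/uber-go/tally checkout, with the desired Go toolchain first on PATH.
go test -ldflags=-randlayout=101 -run=nothing -bench=BenchmarkHistogramAllocation -count=10 >> rout.tip.txt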
Here's the benchstat output for the base comparison (not showing "B/op" or "allocs/op" since there is no difference there):
goos: linux
goarch: amd64
pkg: github.com/uber-go/tally
cpu: Intel(R) Xeon(R) CPU @ 2.00GHz
│ out.base.txt │ out.tip.txt │
│ sec/op │ sec/op vs base │
HistogramAllocation-96 2.230µ ± 1% 2.318µ ± 1% +3.95% (p=0.000 n=100)
Here's what I get with randomized layout:
goos: linux
goarch: amd64
pkg: github.com/uber-go/tally
cpu: Intel(R) Xeon(R) CPU @ 2.00GHz
│ rout.base.txt │ rout.tip.txt │
│ sec/op │ sec/op vs base │
HistogramAllocation-96 2.229µ ± 1% 2.261µ ± 1% +1.44% (p=0.006 n=100)
The full script I am using is below. When I get a chance later today I'll try a couple of the other benchmarks.
#!/bin/bash
set -x
set -e
#
VERSIONS="/w/go.master:tip /w/go1.22:base"
#
function doruns() {
local GR=$1
local TAG=$2
local DORAND=$3
SEEDS="101 303 505 707 909 11011 13013 15015 17017 19019"
SAVEPATH="$PATH"
export PATH="${GR}/bin:${PATH}"
BARGS="-run=nothing -bench=BenchmarkHistogramAllocation"
OTAG="out"
if [ $DORAND = "yes" ]; then
OTAG="rout"
fi
# Run benchmark to produce output.
rm -f ${OTAG}.${TAG}.txt
for S in $SEEDS
do
EXTRA=""
if [ $DORAND = "yes" ]; then
EXTRA="-ldflags=-randlayout=${S}"
fi
go test $EXTRA -count=10 $BARGS >> ${OTAG}.${TAG}.txt
done
export PATH="${SAVEPATH}"
}
#
SCENARIOS="no yes"
for DORAND in $SCENARIOS
do
for V in $VERSIONS
do
GR=`echo $V | cut -f1 -d:`
TAG=`echo $V | cut -f2 -d:`
doruns $GR $TAG $DORAND
done
done
benchstat out.base.txt out.tip.txt > cmp.txt
benchstat rout.base.txt rout.tip.txt > rcmp.txt
I repeated my experiment with BenchmarkMultiWriteSyncer from go.uber.org/zap/zapcore. Results below.
Without randomization:
goos: linux
goarch: amd64
pkg: go.uber.org/zap/zapcore
cpu: Intel(R) Xeon(R) CPU @ 2.00GHz
│ out.base.txt │ out.tip.txt │
│ sec/op │ sec/op vs base │
MultiWriteSyncer/2_discarder-96 13.12n ± 2% 14.21n ± 2% +8.30% (p=0.000 n=100)
MultiWriteSyncer/4_discarder-96 14.31n ± 1% 15.07n ± 1% +5.35% (p=0.000 n=100)
MultiWriteSyncer/4_discarder_with_buffer-96 225.5n ± 2% 232.6n ± 1% +3.15% (p=0.004 n=100)
geomean 34.86n 36.80n +5.58%
With randomization:
goos: linux
goarch: amd64
pkg: go.uber.org/zap/zapcore
cpu: Intel(R) Xeon(R) CPU @ 2.00GHz
│ rout.base.txt │ rout.tip.txt │
│ sec/op │ sec/op vs base │
MultiWriteSyncer/2_discarder-96 13.72n ± 2% 13.86n ± 2% ~ (p=0.086 n=100)
MultiWriteSyncer/4_discarder-96 14.61n ± 2% 15.32n ± 1% +4.89% (p=0.000 n=100)
MultiWriteSyncer/4_discarder_with_buffer-96 231.0n ± 2% 230.2n ± 2% ~ (p=0.793 n=100)
geomean 35.91n 36.57n +1.83%
So there is still a ~5% slowdown for MultiWriteSyncer/4_discarder-96 even after layout is taken into account. I will try to dig into that a bit.
For MultiWriteSyncer/4_discarder I am really not seeing much that is actionable. I ran "pprof disasm" on the hot functions, and from what I can tell comparing 1.22 and tip, the code hasn't changed in any meaningful way between the releases. Puzzling.
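For anyone who wants to repeat that comparison, one way to get the per-function disassembly is roughly the following (a sketch: the symbol regex multiWriteSyncer is just an illustrative hot-function name, and the paths assume a go.uber.org/zap checkout):

# Collect a CPU profile; go test leaves the compiled test binary as zapcore.test.
cd zapcore
go test -run=nothing -bench=MultiWriteSyncer/4_discarder -count=1 -cpuprofile=cpu.out
# Annotated disassembly of functions matching the regex, weighted by CPU samples.
go tool pprof -disasm=multiWriteSyncer zapcore.test cpu.out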
I think I understand the root cause of the BenchmarkHistogramAllocation regression. See my comment at https://github.com/golang/go/issues/67857#issuecomment-2163817236. In sum: instances of the old trace map have a large enough scannable-globals footprint to affect microbenchmarks that allocate memory rapidly (256 KiB footprint, 819284 bytes on 64-bit platforms). These globals can basically end up acting as a ballast if the live heap is just high enough to push us over the minimum heap size (say 2.5 MiB, in which case the 256 KiB that disappears with the trace map CL I landed is ~10% of the live heap).
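One way to observe this (a sketch, reusing the benchmark selection from above; exact numbers will differ per machine and toolchain) is to compare GC traces before and after the trace map change:

# Build the test binary first so the compiler's own GC traces don't pollute the output.
go test -c -o tally.test .
# gctrace prints one line per GC cycle. With a live heap of only a few MiB, dropping
# ~256 KiB of scannable globals shrinks the heap goal, so GCs run more often in an
# allocation-heavy loop -- compare the count of 'gc N' lines across the two toolchains.
GODEBUG=gctrace=1 ./tally.test -test.run=nothing -test.bench=BenchmarkHistogramAllocation -test.count=1 2>&1 | grep -c '^gc '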
For Pi/foo=shopspring/prec=100-16, de3a3c9 jumps out as a possible culprit, accounting for around 2 percentage points of the regression. The only thing that CL may have made a little slower is zeroing large pointerful things; perhaps the benchmark is allocating large pointerful things frequently? I can take a closer look.
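A quick way to check that hypothesis (a sketch; it assumes the Pi benchmark can be run via go test from whatever package defines it, which I haven't spelled out here) is to rank allocation sites by bytes allocated:

# Collect an allocation profile for the benchmark in question.
go test -run=nothing -bench='Pi/foo=shopspring/prec=100' -memprofile=mem.out
# Rank by total bytes allocated; large pointer-containing types dominating the top
# entries would support the "frequent large pointerful allocations" theory.
go tool pprof -sample_index=alloc_space -top mem.out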
I spent some time today investigating Pi/foo=shopspring/prec=100-16 at de3a3c9 but nothing is jumping out at me.
After eliminating GC from the equation (GOMEMLIMIT=1GiB GOGC=off, confirmed no GCs run during the timed section), the regression still persisted. I then tried 3 random link-time code layouts before and after de3a3c9, and when I aggregated those runs the difference went away. I'm inclined to believe, then, that this is not worth worrying about.
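For completeness, the GC-free configuration amounts to something like this (a sketch; the benchmark invocation is illustrative and the environment settings are the ones quoted above):

# GOGC=off disables the proportional GC trigger; with the live heap far below the 1 GiB
# memory limit, no collections run during the timed section (GODEBUG=gctrace=1 confirms this).
GOGC=off GOMEMLIMIT=1GiB go test -run=nothing -bench='Pi/foo=shopspring/prec=100' -count=10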
Discussed in the weekly release check-in meeting -- the consensus is that we are in OK shape; just leaving this open for now.
Closing, before first build for 1.23 final, based on discussion above.
The performance dashboard shows a general, across-the-board regression from the last release (1.22) to current tip. The worst regressions are on the 88-core machine.
Unfortunately, the 88-core history doesn't go far enough back to identify the original source. There are a few clear regressions on the 16-core machine, where we can go back further: for example, Pi/foo=shopspring/prec=100-16 and HistogramAllocation-16. It's hard to finger an obvious CL in these cases, but it may give us a narrower window to reproduce on the 88-core machine.
cc @mknyszek @golang/runtime