golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License

runtime: general slowdown from 1.22 to tip (1.23 candidate) #67822

Closed by aclements 3 months ago

aclements commented 6 months ago

The performance dashboard shows a general, across-the-board regression from the last release (1.22) to current tip. The worst regressions are on the 88-core machine. For example: [dashboard screenshot]

Unfortunately, the 88-core history doesn't go far enough back to identify the original source. There are a few clear regressions on the 16-core machine, where we can go back further, for example Pi/foo=shopspring/prec=100-16 and HistogramAllocation-16. It's hard to finger an obvious CL in these cases, but they may give us a narrower window to reproduce on the 88-core machine.

cc @mknyszek @golang/runtime

JacobOaks commented 6 months ago

FWIW, we've noticed this at Uber while doing some benchmarking recently. I've posted two possibly related but more specific issues spawned from Zap's benchmarks - #67857 and #67858.

mknyszek commented 5 months ago

Moving this to okay-after-rc1 so we can gather more data from RC1 while we investigate these regressions. (I also don't think the regressions, primarily in microbenchmarks, are so severe that they should block an RC. But we should still get to the bottom of this and either fix it or make an explicit decision before the release.)

thanm commented 5 months ago

Hi all,

I spent some time looking at this issue, specifically at one of the benchmarks that has regressed, BenchmarkHistogramAllocation from https://github.com/uber-go/tally.

TLDR: some portion of this slowdown appears to be due to alignment/layout artifacts, since a good chunk of it goes away when I randomize the order of functions in the text section.

Here's what I did as an experiment, on a 2-socket 96-core cloudtop VM: build and run the benchmark with Go 1.22 ("base") and with tip, first with the default code layout and then with the linker's function-order randomization across ten seeds, collecting 100 samples per configuration (full script below).
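
The core of each run is just a go test invocation; with randomization it looks roughly like this (a minimal sketch of the per-seed command from the script below; the seed value is arbitrary):

# -ldflags=-randlayout=N asks the linker to randomize function order in the
# text section, which washes out alignment/layout luck between toolchains.
go test -ldflags=-randlayout=101 -run=nothing -bench=BenchmarkHistogramAllocation -count=10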

Here's the benchstat output for the base comparison (not showing "B/op" or "allocs/op" since there is no difference there):

goos: linux
goarch: amd64
pkg: github.com/uber-go/tally
cpu: Intel(R) Xeon(R) CPU @ 2.00GHz
                       │ out.base.txt │             out.tip.txt             │
                       │    sec/op    │   sec/op     vs base                │
HistogramAllocation-96    2.230µ ± 1%   2.318µ ± 1%  +3.95% (p=0.000 n=100)

Here's what I get with randomized layout:

goos: linux
goarch: amd64
pkg: github.com/uber-go/tally
cpu: Intel(R) Xeon(R) CPU @ 2.00GHz
                       │ rout.base.txt │            rout.tip.txt             │
                       │    sec/op     │   sec/op     vs base                │
HistogramAllocation-96     2.229µ ± 1%   2.261µ ± 1%  +1.44% (p=0.006 n=100)

The full script I am using is below. When I get a chance later today I'll try a couple of the other benchmarks.

#!/bin/bash
set -x
set -e
#
# Toolchains to compare, as <goroot>:<tag> pairs.
VERSIONS="/w/go.master:tip /w/go1.22:base"
#
function doruns() {
  local GR=$1      # GOROOT of the toolchain under test
  local TAG=$2     # output tag ("base" or "tip")
  local DORAND=$3  # "yes" to randomize text-section layout

  SEEDS="101 303 505 707 909 11011 13013 15015 17017 19019"

  SAVEPATH="$PATH"
  export PATH="${GR}/bin:${PATH}"

  BARGS="-run=nothing -bench=BenchmarkHistogramAllocation"

  OTAG="out"
  if [ "$DORAND" = "yes" ]; then
    OTAG="rout"
  fi

  # Run the benchmark once per seed (10 seeds x -count=10 = 100 samples).
  rm -f ${OTAG}.${TAG}.txt
  for S in $SEEDS
  do
    EXTRA=""
    if [ "$DORAND" = "yes" ]; then
      EXTRA="-ldflags=-randlayout=${S}"
    fi
    go test $EXTRA -count=10 $BARGS >> ${OTAG}.${TAG}.txt
  done

  export PATH="${SAVEPATH}"
}
#
SCENARIOS="no yes"
for DORAND in $SCENARIOS
do
  for V in $VERSIONS
  do
    GR=`echo $V | cut -f1 -d:`
    TAG=`echo $V | cut -f2 -d:`
    doruns $GR $TAG $DORAND
  done
done
benchstat out.base.txt out.tip.txt > cmp.txt
benchstat rout.base.txt rout.tip.txt > rcmp.txt

thanm commented 5 months ago

I repeated my experiment with BenchmarkMultiWriteSyncer from go.uber.org/zap/zapcore. Results below.

Without randomization:

goos: linux
goarch: amd64
pkg: go.uber.org/zap/zapcore
cpu: Intel(R) Xeon(R) CPU @ 2.00GHz
                                            │ out.base.txt │             out.tip.txt             │
                                            │    sec/op    │   sec/op     vs base                │
MultiWriteSyncer/2_discarder-96                13.12n ± 2%   14.21n ± 2%  +8.30% (p=0.000 n=100)
MultiWriteSyncer/4_discarder-96                14.31n ± 1%   15.07n ± 1%  +5.35% (p=0.000 n=100)
MultiWriteSyncer/4_discarder_with_buffer-96    225.5n ± 2%   232.6n ± 1%  +3.15% (p=0.004 n=100)
geomean                                        34.86n        36.80n       +5.58%

With randomization:

goos: linux
goarch: amd64
pkg: go.uber.org/zap/zapcore
cpu: Intel(R) Xeon(R) CPU @ 2.00GHz
                                            │ rout.base.txt │            rout.tip.txt             │
                                            │    sec/op     │   sec/op     vs base                │
MultiWriteSyncer/2_discarder-96                 13.72n ± 2%   13.86n ± 2%       ~ (p=0.086 n=100)
MultiWriteSyncer/4_discarder-96                 14.61n ± 2%   15.32n ± 1%  +4.89% (p=0.000 n=100)
MultiWriteSyncer/4_discarder_with_buffer-96     231.0n ± 2%   230.2n ± 2%       ~ (p=0.793 n=100)
geomean                                         35.91n        36.57n       +1.83%

So there's still a ~5% slowdown for MultiWriteSyncer/4_discarder-96 even after layout is taken into account. I will try to dig into that a bit.

thanm commented 5 months ago

For MultiWriteSyncer/4_discarder I am really not seeing much that is actionable. I ran "pprof disasm" on the hot functions, and from what I can tell from comparing 1.22 and tip, the code hasn't changed in any meaningful way between the releases. Puzzling.
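
For anyone who wants to repeat that kind of comparison, the shape of it is roughly as follows (a sketch, not the exact commands I ran; it assumes the current directory is go.uber.org/zap/zapcore, and the function regexp is just illustrative):

# Build the test binary and collect a CPU profile with the toolchain on PATH.
go test -run=nothing -bench=MultiWriteSyncer/4_discarder -count=5 -cpuprofile=cpu.out -o=zapcore.test
# Annotated disassembly of the hot functions, weighted by profile samples;
# repeat with the other toolchain on PATH and diff the two outputs.
go tool pprof -disasm=WriteSyncer zapcore.test cpu.out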

mknyszek commented 5 months ago

I think I understand the root cause of the BenchmarkHistogramAllocation regression. See my comment at https://github.com/golang/go/issues/67857#issuecomment-2163817236. In sum: instances of the old trace map have a large enough scannable-globals footprint to affect microbenchmarks that allocate memory rapidly (256 KiB footprint, 819284 bytes on 64-bit platforms). These globals can basically end up acting as a ballast if the live heap is just high enough to push us over the minimum heap size (say, 2.5 MiB, in which case the 256 KiB is ~10% of the live heap, and it disappears with the trace map CL I landed).
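
One way to see that effect directly is to compare GC cycle counts under both toolchains with GODEBUG=gctrace=1 (a sketch, not the methodology from the linked comment; go1.22 is assumed to be a 1.22 toolchain wrapper on PATH):

# gctrace prints one "gc N ..." line per cycle, including live heap and heap
# goal, so the shrunken goal and any extra cycles are visible directly.
GODEBUG=gctrace=1 go1.22 test -run=nothing -bench=BenchmarkHistogramAllocation -count=1 2> gctrace.base.txt
GODEBUG=gctrace=1 go test -run=nothing -bench=BenchmarkHistogramAllocation -count=1 2> gctrace.tip.txt
grep -c '^gc ' gctrace.base.txt gctrace.tip.txt

If the globals really were acting as ballast, the run without them (tip) should show more, smaller cycles for the same amount of allocation.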

mknyszek commented 5 months ago

For Pi/foo=shopspring/prec=100-16, de3a3c9 jumps out as a possible culprit, accounting for around 2 percentage points of the regression. The only thing that CL may have made a little slower is zeroing large pointerful things; perhaps this benchmark allocates large pointerful things frequently? I can take a closer look.
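
A sketch of how one might check that commit in isolation (the go.master path and $BENCHDIR are placeholders, and the invocation is simplified to a plain go test rather than the dashboard's harness):

# Rebuild the toolchain at the suspect commit and at its parent, then compare.
export PATH=/w/go.master/bin:$PATH
cd /w/go.master/src
git checkout de3a3c9~1 && ./make.bash
(cd "$BENCHDIR" && go test -run=nothing -bench=Pi -count=20 > pi.parent.txt)
git checkout de3a3c9 && ./make.bash
(cd "$BENCHDIR" && go test -run=nothing -bench=Pi -count=20 > pi.commit.txt)
benchstat pi.parent.txt pi.commit.txt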

mknyszek commented 5 months ago

I spent some time today investigating Pi/foo=shopspring/prec=100-16 at de3a3c9 but nothing is jumping out at me.

After eliminating GC from the equation (GOMEMLIMIT=1GiB GOGC=off, confirmed no GCs ran during the timed section), the regression still persisted. I then tried 3 random link-time code layouts before and after de3a3c9, and when I aggregated those runs, the difference went away. I'm inclined to believe, then, that this is not worth worrying about.
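
For reference, the GC-free configuration is just environment variables; a minimal sketch against an allocation-heavy benchmark (the benchmark name here is illustrative):

# GOGC=off disables normal GC pacing; GOMEMLIMIT=1GiB keeps a hard backstop so
# the process cannot grow without bound while GC is off. gctrace confirms
# whether any cycles actually ran during the timed region.
GOGC=off GOMEMLIMIT=1GiB GODEBUG=gctrace=1 go test -run=nothing -bench=BenchmarkHistogramAllocation -count=10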

thanm commented 5 months ago

Discussed in the weekly release check-in meeting: the consensus is that we are in OK shape, so just leaving this open for now.

dr2chase commented 3 months ago

Closing before the first build for 1.23 final, based on the discussion above.