
runtime: performance degradation on tip that disappears when traced/profiled #67857

Open JacobOaks opened 4 months ago

JacobOaks commented 4 months ago

Go version

go version go1.22.4 darwin/arm64

Output of go env in your module/workspace:

GO111MODULE='on'
GOARCH='arm64'
GOBIN=''
GOCACHE='/Users/joaks/Library/Caches/go-build'
GOENV='/Users/joaks/Library/Application Support/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='arm64'
GOHOSTOS='darwin'
GOINSECURE=''
GOMODCACHE='/Users/joaks/go/pkg/mod'
GONOPROXY=''
GONOSUMDB=''
GOOS='darwin'
GOPATH='/Users/joaks/go'
GOPRIVATE=''
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/opt/homebrew/Cellar/go/1.22.4/libexec'
GOSUMDB='sum.golang.org'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOTOOLDIR='/opt/homebrew/Cellar/go/1.22.4/libexec/pkg/tool/darwin_arm64'
GOVCS=''
GOVERSION='go1.22.4'
GCCGO='gccgo'
AR='ar'
CC='cc'
CXX='c++'
CGO_ENABLED='1'
GOMOD='/Users/joaks/go/src/scratch/123_benchmarking/standalone_json/go.mod'
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -arch arm64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -ffile-prefix-map=/var/folders/xj/2wbc4_xn293gkz7_6rzxsz5w0000gn/T/go-build3952519913=/tmp/go-build -gno-record-gcc-switches -fno-common'

What did you do?

While doing some initial testing in preparation for Go 1.23 in the coming months, we ran some of Uber’s open-source benchmarks using the tip of Go.

What did you see happen?

We saw surprising degradations of around 20% in a couple of Zap benchmarks. The most concerning was a simple test that measured concurrent calls to json.Marshal (nothing Zap-related). Here's a simplified version of that benchmark:

package standalonejson

import (
    "encoding/json"
    "fmt"
    "testing"
)

const _size = 5000

func BenchmarkEncodingJSONStrings(b *testing.B) {
    s := make([]string, _size)
    for i := 0; i < len(s); i++ {
        s[i] = fmt.Sprint("00000000-0000-0000-0000-0000000000", i)
    }

    b.ResetTimer()
    b.RunParallel(func(pb *testing.PB) {
        for pb.Next() {
            if _, err := json.Marshal(s); err != nil {
                b.Fatal(err)
            }
        }
    })
}

To reproduce:

$ go version
go version go1.22.4 darwin/arm64
$ go test -count 30 -bench . | tee go1224.log

$ ~/go/src/github.com/golang/go3/bin/go version
go version devel go1.23-93bbf719a6 Wed Jun 5 17:30:16 2024 +0000 darwin/arm64
$ ~/go/src/github.com/golang/go3/bin/go test -count 30 -bench . | tee head.log

$ benchstat go1224.log head.log
goos: darwin
goarch: arm64
pkg: standalonejson
cpu: Apple M1 Max
                       │ go1224.log  │              head.log               │
                       │   sec/op    │   sec/op     vs base                │
EncodingJSONStrings-10   43.48µ ± 2%   51.90µ ± 2%  +19.38% (p=0.000 n=30)

Some notes about reproducing:

Our Investigation

Using git bisect, we found that this commit is causing the degradation in the benchmarks.

This is a surprising result, as the offending commit seems entirely unrelated to the benchmarks. We tried to see how this new code could be getting hit at all, but were unable to find a link via either dlv breakpoints or print statements in the runtime package.

We did try to collect CPU profiles while running the benchmarks to investigate further, but interestingly, the degradation disappears when collecting a CPU profile.

$ ~/go/src/github.com/golang/go3/bin/go version
go version devel go1.23-93bbf719a6 Wed Jun 5 17:30:16 2024 +0000 darwin/arm64
$ ~/go/src/github.com/golang/go3/bin/go test -count 30 -cpuprofile headcpu.prof -bench . | tee head-prof.log
goos: darwin
goarch: arm64
pkg: standalonejson
cpu: Apple M1 Max
                       │ go1224.log  │         head-prof.log         │
                       │   sec/op    │   sec/op     vs base          │
EncodingJSONStrings-10   43.48µ ± 2%   43.31µ ± 1%  ~ (p=0.495 n=30)

Since this issue seems to be unrelated to Zap specifically, and reproduces with calls to encoding/json, we wanted to bring it up in case it points to a more general issue.

What did you expect to see?

No practical degradation.

mknyszek commented 4 months ago

We've seen this particular commit as a culprit as well, but as you point out, this code never actually runs during the benchmarks. You need tracing enabled for it to do anything, and it does reduce tracing overheads, so I'm reluctant to roll it back.

In these sorts of situations, I'd usually blame the benchmark for being too small. For example, the CL causes some code to move across cache line boundaries (or something like that), perturbing some super tight loop that is hit very often. Given that you're seeing this on arm64 I'm inclined to say it's not quite that exactly, but some kind of unfortunate microarchitectural effect.

I was meaning to look into this (and still will), but given that you're already doing some work on it, are you able to dig deeper and see what differences show up at the microarchitectural level (say, with Linux perf or something)? A 20% regression should stick out like a sore thumb when comparing Linux perf results (assuming that perf doesn't also eliminate the regression).
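
For concreteness, a perf stat invocation along these lines should surface the counters of interest; the event list is just a suggestion, and standalone_json is a placeholder for whatever binary you end up running:

$ perf stat -e cache-misses,branch-misses,cpu-migrations,page-faults,context-switches ./standalone_json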


JacobOaks commented 3 months ago

Hey @mknyszek - thanks for the response.

We were able to grab an isolated Linux machine with 24 cores and do some more investigation. To reduce any possible noise, we moved the code out of a benchmark environment and into a standalone program that ran the same call to Marshal with the same parallelism factor. The degradation on this particular machine was around 5-10%.
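
Roughly, the standalone program looked like the sketch below. This is an approximation rather than the exact code we ran; the total iteration count and the way work is split across goroutines are placeholders.

package main

import (
    "encoding/json"
    "fmt"
    "runtime"
    "sync"
)

const (
    _size  = 5000
    _iters = 100000 // total number of Marshal calls; placeholder value
)

func main() {
    s := make([]string, _size)
    for i := 0; i < len(s); i++ {
        s[i] = fmt.Sprint("00000000-0000-0000-0000-0000000000", i)
    }

    // Same parallelism factor as b.RunParallel: one worker per GOMAXPROCS.
    workers := runtime.GOMAXPROCS(0)
    var wg sync.WaitGroup
    for w := 0; w < workers; w++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for i := 0; i < _iters/workers; i++ {
                if _, err := json.Marshal(s); err != nil {
                    panic(err)
                }
            }
        }()
    }
    wg.Wait()
}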

Firstly, we looked at runtime/metrics.
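
For reference, the sampling was roughly along these lines, reading runtime/metrics after the workload finished; the metric set shown here is trimmed to the ones discussed in this thread, and the actual set we collected was larger.

package main

import (
    "fmt"
    "runtime/metrics"
)

func main() {
    // ... run the Marshal workload first ...

    // Metric names limited to the ones that come up later in this thread.
    names := []string{
        "/gc/cycles/total:gc-cycles",
        "/gc/heap/goal:bytes",
        "/gc/scan/globals:bytes",
        "/sync/mutex/wait/total:seconds",
    }
    samples := make([]metrics.Sample, len(names))
    for i, name := range names {
        samples[i].Name = name
    }
    metrics.Read(samples)
    for _, s := range samples {
        switch s.Value.Kind() {
        case metrics.KindUint64:
            fmt.Printf("%s = %d\n", s.Name, s.Value.Uint64())
        case metrics.KindFloat64:
            fmt.Printf("%s = %f\n", s.Name, s.Value.Float64())
        }
    }
}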

Secondly, we looked at Linux perf results. When running multiple times, the values that stayed consistently higher in the "bad" commit were cpu-migrations and context-switches. For example:

Running on 73981695a2518b6eae7f8ffe74d224691c60d433 (good):

    104,213,454   cache-misses                           
    575,652,872   branch-misses                       
      16,792      migrations                         
     1,894,707    page-faults                        
      632,675     context-switches                      

    5.268542468 seconds time elapsed

   83.043335000 seconds user
    5.554596000 seconds sys

on 5bba5b256ce287736da60372c5cf634395d7b1a3 (bad):

    127,540,270   cache-misses                            
    558,647,202   branch-misses                       
      25,348      migrations                         
     2,648,707    page-faults                        
      710,730     context-switches                      

    5.459916774 seconds time elapsed

   82.700777000 seconds user
    6.758844000 seconds sys

Collecting a CPU profile did seem to make the migrations go back to "good" levels, but we have yet to fully investigate profiling vs. not profiling. We noticed other commits were seeing better performance when profiling too, so we are partially baffled by this atm.

Finally, running with link-time randomization (as @thanm did in #67858) did not seem to have any effect.

At this point, one theory we have is that for some reason threads (including GC threads) are yielding more frequently, which is causing more frequent GC cycles. This could be due to increased lock contention somewhere (since we see such a large increase in mutex wait time), or maybe the OS is forcing threads to yield more often for some reason. We're still unsure how the bisected commit would cause either, though.

mknyszek commented 3 months ago

Thanks for doing a more in-depth analysis. I think I know what the problem is, and the change in /gc/scan/globals:bytes is what tipped me off.

The old trace map implementation kept a substantial amount of memory in an array. These maps are only defined in globals, and there are about 4 of them if I recall correctly (6 on tip, I think). Even though the pointers in the map refer to not-in-heap memory, they are just regular unsafe pointers, so the big array appears to the GC as scannable memory in globals. Because scannable globals have contributed to the heap goal since Go 1.18, I think their disappearance is likely the source of the effect.

My first guess is that these extra globals in the "good" commit are behaving like a ballast. This theory is consistent with the change in # of GC cycles. What's odd is that the heap goal remains the same, but it may just be a rounding error in the gctrace output since it always rounds down to the nearest MiB, and we're talking about a 300 KiB difference here. I think the change in # of GC cycles explains basically everything else.
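
To illustrate the mechanism rather than the actual trace map layout: a package-level pointer array like the hypothetical one below is scannable global memory from the GC's point of view even when every pointer it holds refers to not-in-heap memory, and since Go 1.18 that memory is added to the heap goal, so its presence behaves a bit like a tiny ballast.

package main

import (
    "fmt"
    "unsafe"
)

// fakeTraceTabs is a stand-in for the old trace maps: a few hundred KiB of
// pointer-shaped globals (4 tables x 8192 slots x 8 bytes = 256 KiB here).
// The GC must treat it as scannable, and scannable globals count toward the
// heap goal, so deleting an array like this shrinks the goal slightly.
var fakeTraceTabs [4][8192]unsafe.Pointer

func main() {
    fmt.Printf("scannable global bytes from fakeTraceTabs: %d\n", unsafe.Sizeof(fakeTraceTabs))
}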

Would you be able to share the STDERR of running your benchmark with GODEBUG=gctrace=1,gcpacertrace=1? That should be enough to help me confirm that the heap is just slightly different in size and confirm the theory.
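
One way to capture that without the toolchain's own gctrace output mixed in is to build the test binary first and run it directly; the file names below are just examples:

$ go test -c -o bench.test
$ GODEBUG=gctrace=1,gcpacertrace=1 ./bench.test -test.bench . 2>gctrace.txt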

If this theory is right, then unfortunately this is exactly one of those artifacts that only affects programs with very small live heaps that allocate a lot. If I am right, I do not think we should do anything about it. (That is, I do not think it is worthwhile to restore the ballast.) What may prove my theory wrong is that 300 KiB is still very little, but I do think this is still in the right general direction.

(This may also explain the go build regressions somewhat, since they are also sensitive to these kinds of ballast effects; lots of very small Go processes are created during go build. However, I thought a recent (Go 1.20?) change would have totally mitigated this. Needs more investigation, but I just wanted to note that it may be related.)

JacobOaks commented 3 months ago

Ah - interesting. I definitely was misinterpreting what /gc/scan/globals:bytes meant. That makes sense.

I was able to trace GC and found evidence that supports your theory, though I'm not too sure if the absolute amount of discrepancy is what is expected, and the consistent 0MB globals throws me off. Seems like during GC for the "bad" commit, the mean (taken of the rounded values in the output) heap size and goal are 3.3 MB and 7.3 MB, respectively, while for the "good" commit it's around 4.5 and 10 MB, respectively. Here's the full output for bad vs. good, if you'd like to take a look. (Just FYI, this was taken on the machine I used to post the original issue, not the Linux machine I used for my most recent comment.)

I'm supposing that when I collected runtime/metrics before, the heap goal at that point in time (after the benchmark had completed) for both good & bad happened to be the minimum heap size of 4 MB.

> I do not think we should do anything about it

Yeah, absolutely. I think the issue, while worth investigating, is definitely a very specific scenario that does not represent real applications with sizeable heaps.

mknyszek commented 3 months ago

> and the consistent 0MB globals throws me off

Yeah, that's likely an artifact of rounding down.

> Seems like during GC for the "bad" commit, the mean (taken of the rounded values in the output) heap size and goal are 3.3 MB and 7.3 MB, respectively, while for the "good" commit it's around 4.5 and 10 MB, respectively.

Yep, that sounds like it's the problem. It makes sense that the live heap size also might be slightly larger as a side-effect, since everything allocated during the mark phase is counted as part of the next cycle's live heap. If there's more for the GC to do, it's a longer mark phase, and thus a bigger live heap.

> Here's the full output for bad vs. good,

Thanks! I'll poke at it.