golang / go

The Go programming language
https://go.dev

runtime: heap target increased by significantly more than GOGC should allow #67592

Closed · MikeMitchellWebDev closed this issue 5 months ago

MikeMitchellWebDev commented 5 months ago

Go version

go version go1.22.0 darwin/amd64

Output of go env in your module/workspace:

GO111MODULE=''
GOARCH='amd64'
GOBIN=''
GOCACHE='/Users/mm/Library/Caches/go-build'
GOENV='/Users/mm/Library/Application Support/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='amd64'
GOHOSTOS='darwin'
GOINSECURE=''
GOMODCACHE='/Users/mm/go/pkg/mod'
GONOPROXY=''
GONOSUMDB=''
GOOS='darwin'
GOPATH='/Users/mm/go'
GOPRIVATE=''
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/usr/local/go'
GOSUMDB='sum.golang.org'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOTOOLDIR='/usr/local/go/pkg/tool/darwin_amd64'
GOVCS=''
GOVERSION='go1.22.0'
GCCGO='gccgo'
GOAMD64='v1'
AR='ar'
CC='/usr/bin/clang'
CXX='clang++'
CGO_ENABLED='1'
GOMOD='/Users/mm/go/src/github.com/MikeMitchellWebDev/go.mod'
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -arch x86_64 -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -ffile-prefix-map=/var/folders/5y/wtzzmjlj5v52pg7wr8ptbg_m0000gp/T/go-build327897342=/tmp/go-build -gno-record-gcc-switches -fno-common'

What did you do?

GODEBUG=gctrace=1,gcpacer=1 GOGC=3 ./myprogram

What did you see happen?

The heap target steadily increased by 3% per cycle and then suddenly jumped by more than 100% at gc 122.

gc 118 @10.223s 1%: 0.11+5.8+0.076 ms clock, 0.44+0.18/4.1/2.8+0.30 ms cpu, 349->350->341 MB, 351 MB goal, 0 MB stacks, 0 MB globals, 4 P
pacer: assist ratio=+1.228282e+000 (scan 1 MB in 350->352 MB) workers=1++0.000000e+000
pacer: 25% CPU (25 exp.) for 1393992+15296+488544 B work (1851744 B exp.) in 367747888 B -> 368612400 B (∆goal -643077, cons/mark +2.721960e-001)
gc 119 @10.411s 1%: 0.70+10+0.10 ms clock, 2.8+0.20/9.9/7.6+0.41 ms cpu, 350->351->342 MB, 352 MB goal, 0 MB stacks, 0 MB globals, 4 P
pacer: assist ratio=+1.225035e+000 (scan 1 MB in 351->352 MB) workers=1++0.000000e+000
pacer: 26% CPU (25 exp.) for 1412112+15912+488544 B work (1897832 B exp.) in 368469776 B -> 368682136 B (∆goal -1336846, cons/mark +2.721960e-001)
gc 120 @10.533s 1%: 0.10+2.7+0.044 ms clock, 0.41+0.13/2.5/3.5+0.17 ms cpu, 351->351->343 MB, 352 MB goal, 0 MB stacks, 0 MB globals, 4 P
pacer: assist ratio=+1.230030e+000 (scan 1 MB in 351->353 MB) workers=1++0.000000e+000
pacer: 26% CPU (25 exp.) for 1429360+14784+488544 B work (1916568 B exp.) in 369003952 B -> 369732384 B (∆goal -829715, cons/mark +2.721960e-001)
gc 121 @10.598s 1%: 0.10+2.6+0.066 ms clock, 0.41+0.17/2.3/3.6+0.26 ms cpu, 351->352->343 MB, 353 MB goal, 0 MB stacks, 0 MB globals, 4 P
pacer: assist ratio=+2.949048e+001 (scan 1 MB in 861->861 MB) workers=1++0.000000e+000
gc 122 @10.670s 1%: 0.085+2.3+0.019 ms clock, 0.34+0.15/2.1/3.3+0.078 ms cpu, 861->861->853 MB, 861 MB goal, 0 MB stacks, 0 MB globals, 4 P
pacer: assist ratio=+1.093284e+000 (scan 4 MB in 877->879 MB) workers=1++0.000000e+000
pacer: 27% CPU (25 exp.) for 2531768+15024+488544 B work (1965256 B exp.) in 920500992 B -> 920976848 B (∆goal -1321715, cons/mark +3.051929e-001)

What did you expect to see?

The heap goal to increase consistently by 3% each cycle.
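
For reference, a back-of-the-envelope check of that expectation (a sketch added for illustration, using the live-heap sizes from the trace above): since stacks and globals are 0 MB in this trace, the heap goal should be roughly live heap × (1 + GOGC/100).

```go
// Simplified heap-goal arithmetic (the real pacer also adds GC roots such
// as stacks and globals, which are 0 MB in this trace): with GOGC=3 the
// goal should be roughly liveHeap * 1.03.
package main

import "fmt"

func main() {
	const gogc = 3.0
	// Live-heap sizes (the Z in X->Y->Z) taken from the trace above, in MB.
	for _, live := range []float64{341, 342, 343, 853} {
		fmt.Printf("live %3.0f MB -> expected goal ~%3.0f MB\n", live, live*(1+gogc/100))
	}
	// Prints ~351, ~352, ~353 and ~879 MB, close to the goals reported for
	// the following cycles in the trace.
}
```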

mknyszek commented 5 months ago

It's hard to tell what's going wrong without a way to reproduce or see what myprogram is doing. Are you able to share the code or some code that consistently reproduces the issue? FWIW, to my knowledge, a dramatic spike like this hasn't been reported before (in production or even a test), which makes me curious about what the program is actually doing.

Note that, generally, it is possible for the program to overrun the heap goal by at least a little bit. However, that overrun is typically visible in the gctrace output (the Y in X->Y->Z), whereas this appears to be a change in the heap goal itself that seems to come from nowhere, as if SetGCPercent had been called between GC 121 and GC 122. (Not saying that's what happened, just giving an example of how such output might show up.)
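
For illustration of that analogy only (not a claim about what myprogram does), a mid-run SetGCPercent call is the kind of change that would move the heap goal between two cycles:

```go
// Minimal sketch of the SetGCPercent analogy above; purely illustrative,
// not an assertion about the reporter's program.
package main

import "runtime/debug"

func main() {
	debug.SetGCPercent(3) // equivalent to starting the process with GOGC=3
	// ... allocate for a while; the heap goal tracks roughly liveHeap * 1.03 ...
	debug.SetGCPercent(100) // a call like this between two cycles would make the
	// next heap goal jump "out of nowhere" in a gctrace, similar to gc 121 -> 122
}
```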

MikeMitchellWebDev commented 5 months ago

I can't share code but I can tell you exactly what I did. I used the go-git library to read the linux repository into memory (it has more than 1 million commits). Furthermore, this issue happened each time I did it. I tried it again several times when I noticed the spike, so it wasn't just a random occurrence. It should be said that it's totally unnecessary to set GOGC=3 to read the repo into memory; it actually performed much better using the default GOGC.
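
For concreteness, here is a minimal sketch of that kind of workload with go-git (an assumption about the general shape of the code, not the actual program; the repository path is a placeholder):

```go
// Sketch of the described workload (assumed shape, not the reporter's code):
// walk every commit of a local linux clone with go-git and keep the commit
// objects reachable, which steadily grows the live heap.
package main

import (
	"fmt"
	"log"

	git "github.com/go-git/go-git/v5"
	"github.com/go-git/go-git/v5/plumbing/object"
)

func main() {
	repo, err := git.PlainOpen("/path/to/linux") // placeholder path
	if err != nil {
		log.Fatal(err)
	}
	iter, err := repo.CommitObjects() // iterates over the >1M commits
	if err != nil {
		log.Fatal(err)
	}
	var commits []*object.Commit
	if err := iter.ForEach(func(c *object.Commit) error {
		commits = append(commits, c) // retaining them keeps the heap growing
		return nil
	}); err != nil {
		log.Fatal(err)
	}
	fmt.Println("commits read:", len(commits))
}
```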

Here's the rest of the gctrace output from gc 123. You can see the heap goal continues incrementing by 3% after that one dramatic spike.

My computer has 8GiB of RAM and GOMAXPROCS was 4.

gc 123 @14.599s 1%: 0.085+4.7+0.045 ms clock, 0.34+0.44/3.7/7.6+0.18 ms cpu, 877->878->856 MB, 879 MB goal, 0 MB stacks, 0 MB globals, 4 P
pacer: assist ratio=+1.095070e+000 (scan 3 MB in 879->881 MB) workers=1++0.000000e+000
pacer: 28% CPU (25 exp.) for 2600392+15024+488544 B work (3035336 B exp.) in 922007824 B -> 922351632 B (∆goal -2428012, cons/mark +3.051929e-001)
gc 124 @14.905s 1%: 0.075+5.2+0.058 ms clock, 0.30+0.71/5.1/3.8+0.23 ms cpu, 879->879->857 MB, 881 MB goal, 0 MB stacks, 0 MB globals, 4 P
pacer: assist ratio=+1.095246e+000 (scan 3 MB in 881->883 MB) workers=1++0.000000e+000
pacer: 25% CPU (25 exp.) for 2668848+21304+488544 B work (3103960 B exp.) in 923814224 B -> 924136768 B (∆goal -2511486, cons/mark +3.051929e-001)
gc 125 @15.199s 1%: 0.083+9.1+0.10 ms clock, 0.33+0.12/8.9/8.3+0.42 ms cpu, 881->881->859 MB, 883 MB goal, 0 MB stacks, 0 MB globals, 4 P
pacer: assist ratio=+1.093189e+000 (scan 3 MB in 882->884 MB) workers=1++0.000000e+000
pacer: 25% CPU (25 exp.) for 2748424+15720+488544 B work (3178696 B exp.) in 924900032 B -> 925290176 B (∆goal -2517584, cons/mark +3.051929e-001)
gc 126 @15.429s 1%: 0.10+5.7+0.064 ms clock, 0.41+0.16/4.8/5.7+0.25 ms cpu, 882->882->860 MB, 884 MB goal, 0 MB stacks, 0 MB globals, 4 P
pacer: assist ratio=+1.094132e+000 (scan 3 MB in 883->886 MB) workers=1++0.000000e+000
pacer: 25% CPU (25 exp.) for 2930792+15736+488544 B work (3252688 B exp.) in 926862544 B -> 927727544 B (∆goal -2107847, cons/mark +3.051929e-001)
MikeMitchellWebDev commented 5 months ago

@seankhliao According to GitHub, it looks like the WaitingForInfo label was added at the same time I provided the requested info.

mknyszek commented 5 months ago

Given the additional logs, I am skeptical that this is a GC bug. My current hypothesis is that your application is making a single ~500 MiB allocation (or maybe a handful of very large, ~100 MiB allocations concurrently). It does appear that the GC thinks your live heap is actually 856 MiB in size. If this was some kind of strange overrun bug, I would expect the following GC to realize that the vast majority of that memory isn't actually needed and drop it immediately. From that perspective, GOGC is completely working as intended. The jump appears to happen precisely because your application just needs that much more memory, period. The single large allocation would also explain why the heap doesn't grow smoothly and instead simply jumps.
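
A minimal way to reproduce the shape of that hypothesis (a sketch, not the reporter's program; sizes and sleeps are arbitrary) is to build up a few hundred MiB of live heap in small pieces and then make one very large allocation while running under GODEBUG=gctrace=1 GOGC=3:

```go
// Sketch reproducing the hypothesized pattern: steady small retained
// allocations, then one ~500 MiB allocation. Run with:
//   GODEBUG=gctrace=1 GOGC=3 go run main.go
// The heap goal should track roughly liveHeap*1.03 and then jump when the
// large slice is allocated, similar to gc 121 -> gc 122 in the trace.
package main

import "time"

var keep [][]byte

func main() {
	// Build up ~350 MiB of live heap in 1 MiB pieces.
	for i := 0; i < 350; i++ {
		keep = append(keep, make([]byte, 1<<20))
		time.Sleep(time.Millisecond)
	}
	// One large allocation, as hypothesized above.
	keep = append(keep, make([]byte, 500<<20)) // ~500 MiB
	time.Sleep(2 * time.Second)                // let a few more GC cycles run
}
```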

As a result, I suspect there isn't actually anything actionable here. I'd recommend taking a look at a heap profile (the inuse_space sample_index) to confirm this, but I think I'm going to leave the WaitingForInfo label up so that the issue automatically closes out unless we can either rule this out or get some more corroborating evidence that this is actually a runtime issue.
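
One way to capture that profile while the program is still running (a sketch; the port and URL are assumptions, not part of the reporter's setup) is to expose net/http/pprof next to the workload and pull the heap profile around the time the goal jumps:

```go
// Sketch: serve net/http/pprof alongside the workload, then while the heap
// is growing run (port is an assumption):
//   go tool pprof -sample_index=inuse_space http://localhost:6060/debug/pprof/heap
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers
)

func main() {
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()
	// ... run the real workload here ...
	select {} // placeholder so this sketch keeps serving
}
```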

MikeMitchellWebDev commented 5 months ago

[images: april-27-gogc=3-memheap-pprof001, pprof001] I took a profile. Here are the inuse_space and alloc_space profiles.

mknyszek commented 5 months ago

That profile appears to have been taken at the end of program execution, since the inuse_space profile has very little in it. You'd need to acquire a profile when the heap actually grows.
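
If hitting an HTTP endpoint at exactly the right moment is awkward, another option (a sketch; the file name is a placeholder) is to write a heap profile from inside the program right after the phase where the heap grows:

```go
// Sketch: snapshot the heap profile programmatically right after the
// growth phase (file name is a placeholder).
package main

import (
	"log"
	"os"
	"runtime"
	"runtime/pprof"
)

func main() {
	// ... the phase that grows the heap runs here ...

	runtime.GC() // run a GC so the profile reflects current live-heap data
	f, err := os.Create("heap-after-growth.pprof")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	if err := pprof.WriteHeapProfile(f); err != nil {
		log.Fatal(err)
	}
	// Inspect with: go tool pprof -sample_index=inuse_space heap-after-growth.pprof
}
```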

MikeMitchellWebDev commented 5 months ago

[image: inuse_profile_may_24] Here you go. I think it just confirms everything you've said thus far. The dramatic spike occurs at the beginning of the application.

gc 119 @3.304s 1%: 0.097+1.9+0.006 ms clock, 0.38+0.095/1.8/2.8+0.026 ms cpu, 350->351->343 MB, 352 MB goal, 0 MB stacks, 0 MB globals, 4 P
gc 120 @3.362s 1%: 0.062+1.6+0.018 ms clock, 0.24+0.092/1.5/3.0+0.072 ms cpu, 350->352->343 MB, 353 MB goal, 0 MB stacks, 0 MB globals, 4 P
gc 121 @3.419s 1%: 0.11+2.3+0.005 ms clock, 0.44+0.37/2.1/3.1+0.021 ms cpu, 351->351->343 MB, 354 MB goal, 0 MB stacks, 0 MB globals, 4 P
gc 122 @3.451s 1%: 0.091+1.7+0.017 ms clock, 0.36+0.096/1.5/2.6+0.069 ms cpu, 859->859->853 MB, 859 MB goal, 0 MB stacks, 0 MB globals, 4 P
gc 123 @8.516s 0%: 0.32+6.5+0.023 ms clock, 1.2+1.6/4.3/8.1+0.093 ms cpu, 875->876->856 MB, 879 MB goal, 0 MB stacks, 0 MB globals, 4 P
gc 124 @8.744s 0%: 0.12+20+0.006 ms clock, 0.49+1.6/19/16+0.025 ms cpu, 876->876->857 MB, 881 MB goal, 0 MB stacks, 0 MB globals, 4 P

seankhliao commented 5 months ago

Sounds like the GC is working correctly.