golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License
124.08k stars 17.68k forks source link

runtime/pprof: GOOS=darwin over-reports CPU usage in mutex-heavy workload #68477

Open rhysh opened 4 months ago

rhysh commented 4 months ago

Go version

$ ~/sdk/go1.23/bin/go version
go version go1.23rc2 darwin/arm64

Output of go env in your module/workspace:

$ ~/sdk/go1.23/bin/go env -changed
$

What did you do?

On an M1 Macbook Air, which I understand to have 8 cores of 1 thread each (4 performance, 4 efficiency), I ran the runtime's ChanContended benchmark while varying GOMAXPROCS, collecting a CPU profile for each.

$ for i in 1 2 4 8; do ~/sdk/go1.23/bin/go test runtime -run=^$ -bench=ChanContended -cpu=$i -cpuprofile=/tmp/cpu$i; done

What did you see happen?

The profile reports using an average of 0.62 threads of on-CPU time when GOMAXPROCS is 1, and 1.21 when GOMAXPROCS is 2.

However, when GOMAXPROCS is 4 and 8, it reports using 125.54 and 403.14 threads of on-CPU time respectively. It claims that the program consumed 930 seconds of on-CPU time over a period of 2.31 seconds of wall-clock time.

That's more than 50 times more hardware threads than the machine really has.

What did you expect to see?

I expected the profile to report no more than 100 CPU profile samples per thread-second.


This may be related to #57722 . In that issue, GOOS=darwin CPU profiles included the right number of samples but skewed them towards a particular portion of the workload. But here, the workload itself is experiencing extreme over-sampling.

$ for i in 1 2 4 8; do ~/sdk/go1.23/bin/go test runtime -run=^$ -bench=ChanContended -cpu=$i -cpuprofile=/tmp/cpu$i; done
goos: darwin
goarch: arm64
pkg: runtime
cpu: Apple M1
BenchmarkChanContended    446238          2638 ns/op
PASS
ok      runtime 2.113s
goos: darwin
goarch: arm64
pkg: runtime
cpu: Apple M1
BenchmarkChanContended-2      354793          3331 ns/op
PASS
ok      runtime 2.241s
goos: darwin
goarch: arm64
pkg: runtime
cpu: Apple M1
BenchmarkChanContended-4      231181          5051 ns/op
PASS
ok      runtime 1.560s
goos: darwin
goarch: arm64
pkg: runtime
cpu: Apple M1
BenchmarkChanContended-8      202797          5902 ns/op
PASS
ok      runtime 2.468s

$ for i in 1 2 4 8; do ~/sdk/go1.23/bin/go tool pprof /tmp/cpu$i </dev/null 2>&1 | grep Duration; done
Duration: 1.95s, Total samples = 1.21s (62.01%)
Duration: 2.09s, Total samples = 2.53s (121.18%)
Duration: 1.40s, Total samples = 175.74s (12553.94%)
Duration: 2.31s, Total samples = 929.99s (40314.39%)

A combined execution trace and CPU profile show a change in the density of the magenta "CPU profile sample" annotations at the time when the test transitions from GOMAXPROCS=2 to GOMAXPROCS=3. (It's like it's the opposite of #35057 , where GOOS=linux profiles under-sampled when the program's CPU usage crossed the 250% boundary.)

$ ~/sdk/go1.23/bin/go test runtime -run=^$ -bench=ChanContended -cpu=1,2,3,4,5,6,7,8 -cpuprofile=/tmp/cpu -trace=/tmp/cct
goos: darwin
goarch: arm64
pkg: runtime
cpu: Apple M1
BenchmarkChanContended        447841          2642 ns/op
BenchmarkChanContended-2      354488          3363 ns/op
BenchmarkChanContended-3      265177          4586 ns/op
BenchmarkChanContended-4      237844          5083 ns/op
BenchmarkChanContended-5      215295          5618 ns/op
BenchmarkChanContended-6      207518          5817 ns/op
BenchmarkChanContended-7      209977          5872 ns/op
BenchmarkChanContended-8      206673          5850 ns/op
PASS
ok      runtime 11.268s

$ ~/sdk/go1.23/bin/go tool trace /tmp/cct
Screenshot 2024-07-16 at 4 20 20 PM
gabyhelp commented 4 months ago

Related Issues and Documentation

(Emoji vote if this was helpful or unhelpful; more detailed feedback welcome in this discussion.)

rhysh commented 4 months ago

My OS version is macOS Sonoma 14.4.1.

The issue is also present with go1.17.13.

mknyszek commented 3 months ago

Wasn't there a bug in Darwin that CPU sampling is biased toward syscalls? That might explain this. #57722 maybe?

rhysh commented 3 months ago

Yes, #57722 was on my mind too.

What I see here is overcounting, which is more extreme than "bias". Maybe in small doses, the overcounting appears as if it were (only) bias?

prattmic commented 3 months ago

This seems like potentially an OS bug. Does it reproduce with a 4 thread C program?

If not trivially, I wonder if it is related to threads sleeping and waking (in lock2 presumably). Maybe each thread wake-up is erroneously receiving a SIGPROF?

rhysh commented 3 months ago

OS bug seems likely, yes. I agree that a C reproducer would be a good next step. I have the next few weeks planned out with other work — thanks in advance if someone beats me to it — otherwise I'll take a stab at that later.

Yes, syscalls are related: In lock_sema.go, lock2 does a progression of 1/ procyield, 2/ osyield, 3/ semasleep. Steps 2 and 3 involve syscalls. If I remove either step 2 or 3, the problem persists. If I remove both of those, so lock2's only form of delay is procyield (no syscalls), then the magnitude of the CPU profile stays within the believable range.

I've updated to macOS Sonoma 14.5 and the issue is still present.