rhysh opened this issue 4 months ago (status: Open)
My OS version is macOS Sonoma 14.4.1.
The issue is also present with go1.17.13.
Wasn't there a bug in Darwin where CPU sampling is biased toward syscalls? That might explain this. #57722, maybe?
Yes, #57722 was on my mind too.
What I see here is overcounting, which is more extreme than "bias". Maybe in small doses, the overcounting appears as if it were (only) bias?
This seems like it could be an OS bug. Does it reproduce with a 4 thread C program?
If not trivially, I wonder if it is related to threads sleeping and waking (in lock2 presumably). Maybe each thread wake-up is erroneously receiving a SIGPROF?
OS bug seems likely, yes. I agree that a C reproducer would be a good next step. I have the next few weeks planned out with other work — thanks in advance if someone beats me to it — otherwise I'll take a stab at that later.
Yes, syscalls are related: in lock_sema.go, `lock2` does a progression of 1/ `procyield`, 2/ `osyield`, 3/ `semasleep`. Steps 2 and 3 involve syscalls. If I remove either step 2 or step 3, the problem persists. If I remove both of them, so that `lock2`'s only form of delay is `procyield` (no syscalls), then the magnitude of the CPU profile stays within the believable range.
I've updated to macOS Sonoma 14.5 and the issue is still present.
Go version

Output of `go env` in your module/workspace:

What did you do?
On an M1 Macbook Air, which I understand to have 8 cores of 1 thread each (4 performance, 4 efficiency), I ran the runtime's ChanContended benchmark while varying GOMAXPROCS, collecting a CPU profile for each.
What did you see happen?
The profile reports using an average of 0.62 threads of on-CPU time when GOMAXPROCS is 1, and 1.21 when GOMAXPROCS is 2.
However, when GOMAXPROCS is 4 and 8, it reports using 125.54 and 403.14 threads of on-CPU time respectively. It claims that the program consumed 930 seconds of on-CPU time over a period of 2.31 seconds of wall-clock time.
That's more than 50 times the number of hardware threads the machine really has.
What did you expect to see?
I expected the profile to report no more than 100 CPU profile samples per thread-second.
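The arithmetic behind those expectations can be checked directly. The figures below come from the report itself; the 100 Hz figure is the default CPU profiling rate used by `runtime/pprof`.

```go
package main

import "fmt"

func main() {
	const (
		wallSeconds = 2.31  // wall-clock duration reported by the profile
		cpuSeconds  = 930.0 // on-CPU time the profile claims
		hwThreads   = 8     // M1: 4 performance + 4 efficiency cores
		hz          = 100   // default runtime/pprof sampling rate
	)

	// Apparent parallelism implied by the profile.
	apparent := cpuSeconds / wallSeconds
	fmt.Printf("apparent threads: %.1f (%.1fx the hardware)\n",
		apparent, apparent/hwThreads)
	// → apparent threads: 402.6 (50.3x the hardware)

	// Upper bound on believable samples: every hardware thread on-CPU
	// for the entire run, sampled at 100 Hz.
	maxSamples := float64(hz) * hwThreads * wallSeconds
	fmt.Printf("max believable samples: %.0f\n", maxSamples)
	// → max believable samples: 1848
}
```

Anything beyond that ~1848-sample budget means samples are being delivered faster than real CPU time is accruing.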
This may be related to #57722. In that issue, GOOS=darwin CPU profiles included the right number of samples but skewed them toward a particular portion of the workload. Here, though, the workload is experiencing extreme over-sampling.
A combined execution trace and CPU profile shows a change in the density of the magenta "CPU profile sample" annotations at the moment the test transitions from GOMAXPROCS=2 to GOMAXPROCS=3. (It looks like the opposite of #35057, where GOOS=linux profiles under-sampled when the program's CPU usage crossed the 250% boundary.)