golang / go

runtime/pprof: TestCPUProfileMultithreadMagnitude failure due to usage too high on linux-arm-aws #53785

Open bcmills opened 2 years ago

bcmills commented 2 years ago

2022-07-06T19:34:57-2f43de6/linux-arm-aws:

--- FAIL: TestCPUProfileMultithreadMagnitude (0.42s)
    pprof_test.go:123: Running on Linux 4.19.0
    --- FAIL: TestCPUProfileMultithreadMagnitude/serial (0.20s)
        pprof_test.go:189: Running with 1 workers
        pprof_test.go:524: total 9 CPU profile samples collected:
            3: 0x15609c (runtime/pprof.cpuHog0:61 runtime/pprof.cpuHog1:55) 0x155fe3 (runtime/pprof.cpuHogger:41) 0x157207 (runtime/pprof.TestCPUProfileMultithreadMagnitude.func3.1.1.1:202) labels: map[]

            6: 0x1560a8 (runtime/pprof.cpuHog0:64 runtime/pprof.cpuHog1:55) 0x155fe3 (runtime/pprof.cpuHogger:41) 0x157207 (runtime/pprof.TestCPUProfileMultithreadMagnitude.func3.1.1.1:202) labels: map[]

        pprof_test.go:595: runtime/pprof.cpuHog1: 9
        pprof_test.go:226: compare 154.991ms vs 90ms
        pprof_test.go:228: compare got CPU usage reports are too different (limit -40.0%, got -41.9%) want nil
    pprof_test.go:126: Failure of this test may indicate that your system suffers from a known Linux kernel bug fixed on newer kernels. See https://golang.org/issue/49065.
FAIL
FAIL    runtime/pprof   7.677s

greplogs -l -e 'FAIL: TestCPUProfileMultithreadMagnitude' --since=2022-03-23
2022-07-06T19:34:57-2f43de6/linux-arm-aws

See previously #50097 (attn @prattmic; CC @golang/runtime).

prattmic commented 2 years ago

I believe this is a case of #49065. That bug is not x86-specific; however, I missed checking whether our ARM builders had updated kernels.

prattmic commented 2 years ago

@golang/release I don't quite follow what is going on in https://cs.opensource.google/go/x/build/+/master:env/linux-arm64/aws/, but I get the sense that if we regenerate the AWS image, it will pick up a newer Debian base image which (presumably) has a newer kernel package including the fix for this issue. Does that sound correct?

bcmills commented 2 years ago

Oh! Maybe we just need to remove or widen the GOARCH condition at https://cs.opensource.google/go/go/+/master:src/runtime/pprof/pprof_test.go;l=133;drc=6ec46f470797ad816c3a5b20eece0995f13d2bc4 ?

prattmic commented 2 years ago

Oops, I didn't look closely enough:

So this should be different after all.

dmitshur commented 2 years ago

@prattmic What you said about the x/build/env/linux-arm64/aws/ directory sounds plausible to me, but it also seems possible that the Docker image reuses the host's kernel, and if so then VMImage may be where an update would need to happen to pick up a newer kernel version. Someone else may know more.

rhysh commented 2 years ago

This failure in 2022-07-06T19:34:57-2f43de6/linux-arm-aws is on release-branch.go1.18.

I think this is the same sort of "short test duration means small sample size means moderate chance of failure when we get unlucky" as we saw in #50232. I fixed that in https://go.dev/cl/393934, "runtime/pprof: rerun magnitude test on failure", but that isn't backported to Go 1.18.

Should we backport that fix to Go 1.18, or live with the noise until it's EOL?