golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License
123k stars 17.54k forks source link

cmd/go,os: build cache checksum errors in x/tools/cmd/callgraph.TestCallgraph on windows/arm64 #50706

Open bcmills opened 2 years ago

bcmills commented 2 years ago
--- FAIL: TestCallgraph (11.99s)
    main_test.go:85: err: exit status 1: stderr: go build pkg: loading compiled Go files from cache: reading srcfiles list: cache entry not found: bad checksum

    main_test.go:100: got:
         <root> --> pkg.init
        pkg.main2 --> (pkg.D).f
        pkg.main --> pkg.main2
        pkg.main --> (pkg.C).f
        <root> --> pkg.main

    main_test.go:100: got:
         (*os.File).setDeadline --> (*os.File).checkValid
        (time.Time).IsZero --> (*time.Time).sec
        (time.Time).IsZero --> (*time.Time).nsec
        internal/poll.setDeadlineImpl --> (time.Time).IsZero
…
FAIL
FAIL    golang.org/x/tools/cmd/callgraph    12.869s

greplogs --dashboard -md -l -e 'reading srcfiles list: cache entry not found: bad checksum' --since=2021-01-01

2022-01-19T20:29:36-7c251d6-9de1ac6/windows-arm64-10 2021-11-02T15:54:27-058ed05-c3cb1ec/windows-arm64-10 2021-11-01T13:50:47-513e3fb-4a84298/windows-arm64-10 2021-10-14T17:38:39-e69ba9d-011fd00/windows-arm64-10 2021-09-14T02:53:17-384e5da-ee91bb8/windows-arm64-10

bcmills commented 2 years ago

The loading compiled Go files from cache error string is a hapax legomenon in the Go project; it definitely comes from here, in cmd/go/internal/work: https://cs.opensource.google/go/go/+/master:src/cmd/go/internal/work/exec.go;l=739;drc=2580d0e08d5e9f979b943758d3c49877fb2324cb The reading srcfiles list comes from here: https://cs.opensource.google/go/go/+/master:src/cmd/go/internal/work/exec.go;l=1006;drc=master

The error appears to indicate file corruption in the cmd/go build cache, but I don't have any theories as to how that corruption is occurring or why it seems to only affect this one test on this one builder, and the test understandably doesn't provide much detail on the sequence or timing of the go invocations it is running.

Given the failure mode, I think the bug is more likely in os, syscall, or cmd/go itself than in x/tools/cmd/callgraph. windows/arm64 is not a first-class port and lacks a longtest builder, so it may be that x/tools/cmd/callgraph is incidentally triggering an underlying bug in an interaction that is being skipped (or isn't covered at all) in the os and/or cmd/go tests.

We are also running a relatively old Windows 10 build (#48946, CC @golang/release, @zx2c4), so I can't rule out a bug in the underlying platform either.

bcmills commented 2 years ago

This is a release-blocker via #11811, but given that this is not a first-class port and appears to be a platform-specific bug affecting only one test, I plan to add a test skip for this specific builder in x/tools/cmd/callgraph.TestCallgraph and then move this issue to the Backlog without investigating further.

If we also observe this failure mode on the new windows-arm64-11 builder once that is up and running, and/or if we upgrade windows/arm64 to a first-class port, we can reprioritize an investigation.

gopherbot commented 2 years ago

Change https://golang.org/cl/379734 mentions this issue: cmd/callgraph: skip TestCallgraph on the windows-arm64-10 builder

rsc commented 2 years ago

The 'bad checksum' means we read a file that was named for a sha256 hash and the content did not match that sha256.

rsc commented 2 years ago

The fact that this is only windows/arm64 and that we've seen absolutely no mentions of it on other systems or in other bug reports makes me feel okay with this not being a release-blocker. If there really is corruption, the content-addressed and checksum-checked nature of the cache means that the system is either failstop or works correctly. So far we are getting no reports of failstop other than this one.

bcmills commented 2 years ago

I agree, but the failure rate for TestCallgraph is high enough that I think we should at least add that skip in the interim to avoid masking other failures on the builders.

rsc commented 2 years ago

t.Skips are always OK in my book.

gopherbot commented 2 years ago

Change https://golang.org/cl/381314 mentions this issue: cmd/go: cache debugging

rsc commented 2 years ago

I uploaded https://go-review.googlesource.com/c/go/+/381314 just to have around if we need to patch it in to investigate this further. No intent to submit it.

bcmills commented 2 years ago

If we also observe this failure mode on the new windows-arm64-11 builder once that is up and running, and/or if we upgrade windows/arm64 to a first-class port, we can reprioritize an investigation.

Now observed on windows-arm64-11 as well:

greplogs --dashboard -md -l -e 'reading srcfiles list: cache entry not found: bad checksum' --since=2022-01-20

2022-02-17T17:36:57-cda4201-eaf0405/windows-arm64-11

bcmills commented 2 years ago

A couple more. Whatever the cause, this is not fixed in Windows 11.

greplogs --dashboard -md -l -e 'reading srcfiles list: cache entry not found: bad checksum' --since=2022-02-24

2022-04-01T20:25:27-153e30b-32ff9b5/windows-arm64-11 2022-04-01T17:19:22-cda13e2-df89f2b/windows-arm64-11

gopherbot commented 2 years ago

Change https://go.dev/cl/397996 mentions this issue: cmd/callgraph: expand windows/arm64 skip to the whole platform

bcmills commented 2 years ago

@qmuntal, this is one of the issues I think should block promoting windows/arm64 to a first class port.

qmuntal commented 2 years ago

@bcmills which is the ~occurrence rate of this issue?

I've been running TestCallgraph for a couple of days on a windows/arm64 VM with plenty of capacity and haven't triggered the issue. I'm using a Go and x/tools version that reproduced this issue in the past inside the official builder.

Will keep it running a couple more days, but I'm leaning towards a HW capacity issue related to #51019.

bcmills commented 2 years ago

@qmuntal, before we started skipping TestCallgraph on arm64 the failure rate was about one per ~month of development. It's a bit tricky to estimate the failure rate per run from that, but a rough back-of-the-envelope is 1 failure / month / (~30 commits/day * 30 days/month), or a failure rate of about 0.1% of commits.

It does seem plausible that this could be a defect (or a bad interaction with a platform bug) somewhere in the virtualization stack used to host the builder.

qmuntal commented 7 months ago

I haven't been able to reproduce this failure locally yet. Let's unskip TestCallgraph and see if it still fails in the new Azure builders, which doesn't use qemu.