Open bcmills opened 2 years ago
The loading compiled Go files from cache
error string is a hapax legomenon in the Go project; it definitely comes from here, in cmd/go/internal/work
:
https://cs.opensource.google/go/go/+/master:src/cmd/go/internal/work/exec.go;l=739;drc=2580d0e08d5e9f979b943758d3c49877fb2324cb
The reading srcfiles list
comes from here:
https://cs.opensource.google/go/go/+/master:src/cmd/go/internal/work/exec.go;l=1006;drc=master
The error appears to indicate file corruption in the cmd/go
build cache, but I don't have any theories as to how that corruption is occurring or why it seems to only affect this one test on this one builder, and the test understandably doesn't provide much detail on the sequence or timing of the go
invocations it is running.
Given the failure mode, I think the bug is more likely in os
, syscall
, or cmd/go
itself than in x/tools/cmd/callgraph
. windows/arm64
is not a first-class port and lacks a longtest
builder, so it may be that x/tools/cmd/callgraph
is incidentally triggering an underlying bug in an interaction that is being skipped (or isn't covered at all) in the os
and/or cmd/go
tests.
We are also running a relatively old Windows 10 build (#48946, CC @golang/release, @zx2c4), so I can't rule out a bug in the underlying platform either.
This is a release-blocker via #11811, but given that this is not a first-class port and appears to be a platform-specific bug affecting only one test, I plan to add a test skip for this specific builder in x/tools/cmd/callgraph.TestCallgraph
and then move this issue to the Backlog without investigating further.
If we also observe this failure mode on the new windows-arm64-11
builder once that is up and running, and/or if we upgrade windows/arm64
to a first-class port, we can reprioritize an investigation.
Change https://golang.org/cl/379734 mentions this issue: cmd/callgraph: skip TestCallgraph on the windows-arm64-10 builder
The 'bad checksum' means we read a file that was named for a sha256 hash and the content did not match that sha256.
The fact that this is only windows/arm64 and that we've seen absolutely no mentions of it on other systems or in other bug reports makes me feel okay with this not being a release-blocker. If there really is corruption, the content-addressed and checksum-checked nature of the cache means that the system is either failstop or works correctly. So far we are getting no reports of failstop other than this one.
I agree, but the failure rate for TestCallgraph
is high enough that I think we should at least add that skip in the interim to avoid masking other failures on the builders.
t.Skips are always OK in my book.
Change https://golang.org/cl/381314 mentions this issue: cmd/go: cache debugging
I uploaded https://go-review.googlesource.com/c/go/+/381314 just to have around if we need to patch it in to investigate this further. No intent to submit it.
If we also observe this failure mode on the new
windows-arm64-11
builder once that is up and running, and/or if we upgradewindows/arm64
to a first-class port, we can reprioritize an investigation.
Now observed on windows-arm64-11
as well:
greplogs --dashboard -md -l -e 'reading srcfiles list: cache entry not found: bad checksum' --since=2022-01-20
A couple more. Whatever the cause, this is not fixed in Windows 11.
greplogs --dashboard -md -l -e 'reading srcfiles list: cache entry not found: bad checksum' --since=2022-02-24
2022-04-01T20:25:27-153e30b-32ff9b5/windows-arm64-11 2022-04-01T17:19:22-cda13e2-df89f2b/windows-arm64-11
Change https://go.dev/cl/397996 mentions this issue: cmd/callgraph: expand windows/arm64 skip to the whole platform
@qmuntal, this is one of the issues I think should block promoting windows/arm64
to a first class port.
@bcmills which is the ~occurrence rate of this issue?
I've been running TestCallgraph
for a couple of days on a windows/arm64
VM with plenty of capacity and haven't triggered the issue. I'm using a Go and x/tools
version that reproduced this issue in the past inside the official builder.
Will keep it running a couple more days, but I'm leaning towards a HW capacity issue related to #51019.
@qmuntal, before we started skipping TestCallgraph
on arm64
the failure rate was about one per ~month of development. It's a bit tricky to estimate the failure rate per run from that, but a rough back-of-the-envelope is 1 failure / month / (~30 commits/day * 30 days/month), or a failure rate of about 0.1% of commits.
It does seem plausible that this could be a defect (or a bad interaction with a platform bug) somewhere in the virtualization stack used to host the builder.
I haven't been able to reproduce this failure locally yet. Let's unskip TestCallgraph
and see if it still fails in the new Azure builders, which doesn't use qemu.
greplogs --dashboard -md -l -e 'reading srcfiles list: cache entry not found: bad checksum' --since=2021-01-01
2022-01-19T20:29:36-7c251d6-9de1ac6/windows-arm64-10 2021-11-02T15:54:27-058ed05-c3cb1ec/windows-arm64-10 2021-11-01T13:50:47-513e3fb-4a84298/windows-arm64-10 2021-10-14T17:38:39-e69ba9d-011fd00/windows-arm64-10 2021-09-14T02:53:17-384e5da-ee91bb8/windows-arm64-10