Open bcmills opened 3 years ago
For most of the plan9-arm failures, the cause is filesystem flakiness on the builder machines (a cluster of 3 x Raspberry Pi 4). I've just reconfigured the build script to stop using ramfs (which has a locking bug) for /tmp, and to stop using the ad hoc directory cache (which still seems to have some flaws). This should bypass most of the failures but will make the test suite a lot slower to run (by Go 1.17 it had already regressed to about 2 hours even with these filesystem tweaks).
There are some other flakes of the "shouldn't be possible" variety, which are rare, perplexing, and impossible to replicate. They occur in general Go runtime code, not code specific to Plan 9. There's a slight whiff of possible memory cache coherence trouble in some of these:
marked free object in span
attempt to clean non-empty span set
unexpected waitm - semaphore out of sync
runtime: cannot allocate memory
not inlined: unknown reason
It would be good to have a server-class ARM platform capable of running plan 9 as a more reliable builder. I don't know of one.
Looks like the filesystem change did resolve at least one class of failure: the '/boot/workdir/go/src/os' does not exist failure mode hasn't occurred at all in November so far.
greplogs --dashboard -md -l -e \''/boot/workdir/go/src/os'\'' does not exist'
Change https://golang.org/cl/362975 mentions this issue: dashboard: omit the website repo on plan9
Many of the failure modes seem to have cleared up around 11 Nov.; it's not clear to me what fixed them.
One of the failures visible on the dashboard today is #49653; the other is the marked free object in span failure mode.
A few more failure modes that remain:
unexpected signal in os.StartProcess
panic during panic
unexpected stale targets
Change https://golang.org/cl/369018 mentions this issue: src/cmd/go/internal/work: lock Builder output mutex consistently
Change https://golang.org/cl/380414 mentions this issue: message/pipeline: skip TestFullCycle on plan9-arm
A sampling from the past week or so. Undiagnosed failures in bold.
greplogs --dashboard -md -l -e \\Aplan9-arm -E . --since=2022-01-20
2022-01-27T05:30:27-be5769c-a991d9d/plan9-arm
panic: test timed out after 10m0s
…
goroutine 6 [select, 9 minutes]:
…
golang.org/x/tools/go/packages.goListDriver(0x1695ab84, {0x175c8c78, 0x1, 0x1})
/boot/workdir/gopath/src/golang.org/x/tools/go/packages/golist.go:200 +0x7b0
…
FAIL golang.org/x/tools/gopls/doc 600.515s
That's stuck running go list:
https://cs.opensource.google/go/x/tools/+/master:go/packages/golist.go;l=447;drc=eb48d3f608bba06c3bb4f5627f9fc2562cc84dd2
2022-01-27T05:30:27-bbe1937-a991d9d/plan9-arm (#50857)
2022-01-27T05:30:27-aa10faf-a991d9d/plan9-arm (#50857)
2022-01-27T00:03:31-bbe1937-f4aa021/plan9-arm (#50857)
2022-01-26T23:43:39-be5769c-db48840/plan9-arm (probably #46520, not specific to plan9)
2022-01-26T23:43:39-bbe1937-db48840/plan9-arm (#50857)
2022-01-26T22:33:26-ef0b09c/plan9-arm (possibly #22227?)
2022-01-26T22:09:36-bbe1937-ca6a5c0/plan9-arm (#50857)
2022-01-26T21:43:32-be5769c-c8b0dce/plan9-arm (#46520, not specific to plan9)
2022-01-26T20:51:54-fe74b5f-c8b0dce/plan9-arm (#46520, not specific to plan9)
2022-01-26T17:58:00-bbe1937-719e989/plan9-arm (#50857)
2022-01-25T22:56:45-c20fd7c-6eb58cd/plan9-arm (#46520, not specific to plan9)
2022-01-25T22:04:10-c20fd7c-38729cf/plan9-arm (#46520, not specific to plan9)
2022-01-25T00:39:08-97de9ec-16d6a52/plan9-arm (#46520, not specific to plan9)
2022-01-24T21:27:20-cdd9e93/plan9-arm
--- FAIL: TestIntendedInlining (15.35s)
inl_test.go:267: exit status: 'go 32037: 2'
FAIL
FAIL cmd/compile/internal/test 119.720s
2022-01-24T16:42:11-2cc1836-19d819d/plan9-arm (#50775, mitigated)
2022-01-21T23:16:33-3c751cd-b7fa0f9/plan9-arm (#50775, mitigated)
2022-01-21T21:58:14-3c751cd-35b0db7/plan9-arm (#50775, mitigated)
2022-01-21T21:58:14-35b0db7/plan9-arm
panic: test timed out after 18m0s
…
goroutine 185 [chan receive, 17 minutes]:
…
os/exec.(*Cmd).CombinedOutput(0x110ae000)
/boot/workdir/go/src/os/exec/exec.go:567 +0x98 fp=0x11105e2c sp=0x11105e18 pc=0x12b1dc
cmd/compile/internal/test.TestCode(0x10c02f00)
/boot/workdir/go/src/cmd/compile/internal/test/ssa_test.go:171 +0x9fc fp=0x11105f98 sp=0x11105e2c pc=0x3af55c
…
FAIL cmd/compile/internal/test 1080.192s
That's running a go test -c subprocess, but without a timeout, so we don't get a useful goroutine dump to debug it 😞:
https://cs.opensource.google/go/go/+/master:src/cmd/compile/internal/test/ssa_test.go;l=171;drc=16a3cefc93d9b896b2053320e387d0e449904aba
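One way to get a goroutine dump out of a hung child like this is to give the subprocess a deadline and signal it before killing it. The following is only a minimal sketch, not the code in ssa_test.go: `runWithDump`, `dumpSignal`, and the `sleep 5` stand-in for the hung child are hypothetical names chosen for illustration. It assumes Go 1.20 or later (for `Cmd.Cancel` and `Cmd.WaitDelay`) and a Unix-like host, where a Go child prints its goroutine stacks on SIGQUIT; which signal or note to send on plan9 is left open here.

```go
package main

import (
	"context"
	"fmt"
	"os"
	"os/exec"
	"syscall"
	"time"
)

// dumpSignal is the signal sent when the deadline expires.
// Assumption: a Unix-like host, where Go programs dump goroutines on SIGQUIT.
var dumpSignal os.Signal = syscall.SIGQUIT

// runWithDump runs name with args. If the child is still running after
// timeout, it is sent sig so that it can print diagnostics; if it has not
// exited within grace after that, os/exec kills it.
func runWithDump(timeout, grace time.Duration, sig os.Signal, name string, args ...string) ([]byte, error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	cmd := exec.CommandContext(ctx, name, args...)
	// Cancel replaces the default behavior (kill immediately) with a signal.
	cmd.Cancel = func() error { return cmd.Process.Signal(sig) }
	// WaitDelay bounds how long we wait for the child after signaling it.
	cmd.WaitDelay = grace

	out, err := cmd.CombinedOutput()
	if err != nil && ctx.Err() != nil {
		err = fmt.Errorf("timed out after %v: %w", timeout, err)
	}
	return out, err
}

func main() {
	// "sleep 5" stands in for a hung "go test -c" child.
	out, err := runWithDump(200*time.Millisecond, time.Second, dumpSignal, "sleep", "5")
	fmt.Printf("err: %v\noutput:\n%s", err, out)
}
```

With a Go child in place of `sleep`, the combined output returned here would include whatever the child printed in response to the signal, which is the part that is missing from the log above.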
2022-01-21T21:27:57-e7c9de2-9eba5ff/plan9-arm (#46520, not specific to plan9)
2022-01-21T21:08:53-3425967-9eba5ff/plan9-arm (#50775, mitigated)
2022-01-21T16:59:19-9f83dd3-9eba5ff/plan9-arm (#50775, mitigated)
2022-01-20T20:23:52-80963bc-2c2e081/plan9-arm (#46520, not specific to plan9)
2022-01-20T14:59:17-ab35822-9284279/plan9-arm (#46520, not specific to plan9)
2022-01-20T14:52:25-ab35822-e7d5857/plan9-arm (possibly #45211? if so, not specific to plan9)
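For anyone tallying these by hand: the log names in the sampling above appear to follow a fixed shape, a timestamp, then one or two abbreviated commit hashes, then a slash and the builder name. A rough sketch of splitting them apart follows. `logName` and `parseLogName` are hypothetical helpers, not part of greplogs, and treating the timestamp as UTC is an assumption.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// stampLayout matches the leading timestamp, e.g. 2022-01-27T05:30:27.
const stampLayout = "2006-01-02T15:04:05"

// logName holds the pieces of one dashboard log name.
type logName struct {
	When    time.Time // parsed as UTC (assumption)
	Commits []string  // one or two abbreviated hashes, in the order listed
	Builder string
}

// parseLogName splits a name such as
// "2022-01-27T05:30:27-be5769c-a991d9d/plan9-arm" into its parts.
// Any trailing annotation, such as "(#50857)", must be removed first.
func parseLogName(s string) (logName, error) {
	rev, builder, ok := strings.Cut(s, "/")
	if !ok || builder == "" {
		return logName{}, fmt.Errorf("no builder in %q", s)
	}
	if len(rev) < len(stampLayout)+2 || rev[len(stampLayout)] != '-' {
		return logName{}, fmt.Errorf("no timestamp and commit in %q", s)
	}
	when, err := time.Parse(stampLayout, rev[:len(stampLayout)])
	if err != nil {
		return logName{}, err
	}
	commits := strings.Split(rev[len(stampLayout)+1:], "-")
	return logName{When: when, Commits: commits, Builder: builder}, nil
}

func main() {
	for _, s := range []string{
		"2022-01-27T05:30:27-be5769c-a991d9d/plan9-arm",
		"2022-01-24T21:27:20-cdd9e93/plan9-arm",
	} {
		ln, err := parseLogName(s)
		if err != nil {
			fmt.Println("skip:", err)
			continue
		}
		fmt.Println(ln.When.Format(time.RFC3339), ln.Commits, ln.Builder)
	}
}
```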
Change https://go.dev/cl/408697 mentions this issue: dashboard: add known issues for plan9-arm
Sorry, but I can't find a watchflakes script at the start of the issue description. See https://go.dev/wiki/Watchflakes for details.
Found new dashboard test flakes for:
#!watchflakes
default <- builder == "plan9-arm"
For most of the plan9-arm failures, the cause is filesystem flakiness on the builder machines (a cluster of 3 x Raspberry Pi 4). I've just reconfigured the build script to stop using ramfs (which has a locking bug) for /tmp, and to stop using the ad hoc directory cache (which still seems to have some flaws). This should bypass most of the failures but will make the test suite a lot slower to run (by Go 1.17 it had already regressed to about 2 hours even with these filesystem tweaks).
I think that 9front has rewritten ramfs, and should not have locking bugs. I'd be interested in finding some reasonable arm machines to test on, and seeing if we can reproduce the flakiness on a recent 9front, without the slowdown. I can probably set up a MNT Reform with local NVMe as a builder for now, until we find some more suitable hardware to run on (Rockchip?)
How would I hook up a builder?
I think that 9front has rewritten ramfs, and should not have locking bugs
@oridb, the quote you're replying to is over a year old. The locking problem has been long fixed, and the plan9-arm builders are using ramfs for /tmp. I would welcome the use of some more solid builder hardware, as I suspect some of the odder flakes like those reported during GC are actually hardware glitches (marginal power supply, non-ECC memory).
The one remaining plan9 builder is failing a significant fraction of build attempts. Many of the failures follow specific known patterns (notably #49337, #46526, and #41952). However, many do not.
@millerresearch, @0intro, @fhs: is there someone who can investigate these failures and bring the plan9 port back up to par? With the 0intro builders missing (#49327, #49328), there is no longer any plan9 builder consistently passing tests.
greplogs --dashboard -md -l -e \\Aplan9. -E . --since=2021-10-01