Open cherrymui opened 8 months ago
Found new dashboard test flakes for:
#!watchflakes
post <- builder == "gotip-linux-ppc64le" && (`test timed out` || `SIGQUIT` || `context deadline exceeded`)
I wonder if LUCI needs to set GO_TEST_TIMEOUT_SCALE=2 for ppc64{,le}? I don't think I see that set in the logs. The POWER8 builders can be particularly slow. I think the VMs are also subject to sharing CPU time with whatever else is running at OSU.
Yeah, GO_TEST_TIMEOUT_SCALE=2 doesn't seem to be set. The old ppc64le builders set GO_TEST_TIMEOUT_SCALE=2, so we probably should do that for LUCI builders as well.
cc @mknyszek
I'll send a CL for this today. Thanks.
Change https://go.dev/cl/557857 mentions this issue: main.star: add a host-based test timeout scale
Found new dashboard test flakes for:
#!watchflakes
post <- builder == "gotip-linux-ppc64le" && (`test timed out` || `SIGQUIT` || `context deadline exceeded` || status == "ABORT")
Closing. Should be fixed by the CL above. (Apparently gopherbot doesn't do it because it is checked in to a branch.)
Found new dashboard test flakes for:
#!watchflakes
post <- builder ~ `(gotip|go1\.\d\d)-linux-ppc64le` && (`test timed out` || `SIGQUIT` || `context deadline exceeded` || status == "ABORT")
Is there more context to be found about what test is suspected of hanging, and what the timeout is?
I am rebuilding the container images to include lsof.
Found new dashboard test flakes for:
#!watchflakes
post <- builder ~ `(gotip|go1\.\d\d)-linux-ppc64le` && (`test timed out` || `SIGQUIT` || `context deadline exceeded` || status == "ABORT")
Hrm, I wonder if this is timing out more frequently because the jobs are not running on a tmpfs. The only tmpfs I have mounted to the containers is /workdir (as was used by the old CI), and it doesn't look to be used.
Is there an option to move the work/caches to a tmpfs? My initial thought is to mount /home/swarming as a tmpfs.
Found new dashboard test flakes for:
#!watchflakes
post <- builder ~ `(gotip|go1\.\d\d)-linux-ppc64le` && (`test timed out` || `SIGQUIT` || `context deadline exceeded` || `running too slowly` || status == "ABORT")
Change https://go.dev/cl/563396 mentions this issue: main.star: apply timeout scale to base builders, scale netbsd-arm64
Reports above seem to be because the GO_TEST_TIMEOUT_SCALE=2 env var wasn't fully propagated to the environment of one of the two builders. That should be fixed for future builds (CL 563396). Closing again optimistically, watchflakes will reopen if there turns out to be more to do after that.
I've also changed the docker configuration to mount /home/swarming as a tmpfs.
Found new dashboard test flakes for:
#!watchflakes
post <- builder ~ `(gotip|go1\.\d\d)-linux-ppc64` && (`test timed out` || `SIGQUIT` || `context deadline exceeded` || `running too slowly` || status == "ABORT")
The latest batch of failures is the same issue as #65725. I only adjusted the RAM limits on the ppc64le LUCI containers. Looking through the logs, the ppc64-power10 LUCI containers also need more RAM.
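The builder pattern in the watchflakes rule was widened from `...-linux-ppc64le` to `...-linux-ppc64`, which is why ppc64-power10 builds now land on this issue too. A small sketch of the difference, assuming the `~` operator behaves like an unanchored Go regexp match. The builder names below are hypothetical examples, and matchesBuilder is a helper invented for illustration, not part of watchflakes.

```go
package main

import (
	"fmt"
	"regexp"
)

// matchesBuilder reports whether pattern matches anywhere in the
// builder name (unanchored, as regexp.MatchString does).
func matchesBuilder(pattern, name string) bool {
	return regexp.MustCompile(pattern).MatchString(name)
}

func main() {
	narrow := `(gotip|go1\.\d\d)-linux-ppc64le` // earlier rule
	wide := `(gotip|go1\.\d\d)-linux-ppc64`     // later rule

	// Hypothetical builder names, chosen only to exercise the patterns.
	names := []string{
		"gotip-linux-ppc64le",
		"go1.22-linux-ppc64le",
		"gotip-linux-ppc64-power10",
		"gotip-linux-amd64",
	}
	for _, n := range names {
		fmt.Printf("%s narrow=%t wide=%t\n",
			n, matchesBuilder(narrow, n), matchesBuilder(wide, n))
	}
}
```

Under that assumption, the wider pattern still matches every ppc64le builder (because "ppc64" is a prefix of "ppc64le") and additionally picks up big-endian ppc64 builders.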
Found new dashboard test flakes for:
#!watchflakes
post <- builder ~ `(gotip|go1\.\d\d)-linux-ppc64` && (`test timed out` || `SIGQUIT` || `context deadline exceeded` || `running too slowly` || status == "ABORT")
The net/http failures should be attributed to #67382; notably, they have not been mentioned on the other issue. Reclosing this issue.
Found new dashboard test flakes for:
#!watchflakes
post <- builder ~ `(gotip|go1\.\d\d)-linux-ppc64` && (`test timed out` || `SIGQUIT` || `context deadline exceeded` || `running too slowly` || status == "ABORT") && test != "TestServerReadAfterWriteHeader100Continue/h1"
I inspected the ppc64-power10 failure directly before this message. It shows many failures, but not those that caused the timeout. Inspecting the go testing log, I see the timing out tests mentioned by gopherbot. Is that expected behavior?
Found new dashboard test flakes for:
#!watchflakes
post <- builder ~ `(gotip|go1\.\d\d)-linux-ppc64` && (`test timed out` || `SIGQUIT` || `context deadline exceeded` || `running too slowly` || status == "ABORT") && test != "TestServerReadAfterWriteHeader100Continue/h1"