Open bcmills opened 3 years ago
"Bryan C. Mills" @.***> writes:
The solaris-amd64-oraclerel builder seems to be failing moderately frequently with "no space left on device" errors: [...]
It's not obvious to me whether the buildlet script is failing to clean something up, the device's disk is getting too full for other reasons, or perhaps the builder is just configured to run too many builds in parallel.
I've looked around and there may be several issues:
In every one of the failing builds, /tmp was full. It's ca. 40 GB on the build host, but resides in tmpfs and thus shares space with swap.
While golang left ca. 931 MB around in /tmp/workdir-host-solaris-oracle-amd64-oraclerel (almost half of that in go/pkg/obj/go-build) even after the build service was stopped, there still was plenty of free space left.
I don't think the parallelism is too high: the golang buildlet uses 4 cores max, and an llvm buildbot running on the same host another 8, while the host has 24 cores.
Given all this, I suspect (but this is just a hunch) that some llvm testcase either exhausts /tmp (unlikely given that the llvm tmp files seem to reside in /var/tmp exclusively) or VM/swap (way more likely: I had runaway llvm testcases like this in the past), and if they are as lazy with resource control as they are with cleaning up tmp files (350k files in /var/tmp), this seems to be the most plausible cause.
For remedy, there are several options:
Increase RAM, swap, or both.
Limit the VM consumption of the services.
I'll look into either of those.
This has started occurring intermittently again.
greplogs --dashboard -md -l -e '(?ms)\Asolaris-amd64-oraclerel.* no space left on device' --since=2021-03-26
2022-04-23T05:38:56-9717e8f/solaris-amd64-oraclerel
2022-04-19T17:05:22-4804c43-689dc17/solaris-amd64-oraclerel
[note 11-month gap!]
2021-05-24T20:15:56-15d9d4a/solaris-amd64-oraclerel
2021-05-10T18:10:43-ecb7392-73d5aef/solaris-amd64-oraclerel
2021-05-03T16:42:22-169155d/solaris-amd64-oraclerel
2021-04-28T19:13:50-ad989c7/solaris-amd64-oraclerel
2021-04-26T21:27:41-9f60169/solaris-amd64-oraclerel
2021-04-08T21:58:35-0243799-d67e739/solaris-amd64-oraclerel
2021-04-08T07:33:58-b261fe9-a7e16ab/solaris-amd64-oraclerel
2021-03-31T14:26:53-4fbd30e-2940614/solaris-amd64-oraclerel
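The gap between failures can be checked mechanically from the log IDs. This is a minimal Python sketch, assuming each ID starts with a 19-character timestamp of the form YYYY-MM-DDTHH:MM:SS as in the listing above. The helper names `parse_log_time` and `largest_gap` are hypothetical and not part of greplogs, and only three of the IDs are reproduced here for brevity.

```python
from datetime import datetime

# A few log IDs copied from the greplogs output in this thread.
LOG_IDS = [
    "2022-04-23T05:38:56-9717e8f/solaris-amd64-oraclerel",
    "2022-04-19T17:05:22-4804c43-689dc17/solaris-amd64-oraclerel",
    "2021-05-24T20:15:56-15d9d4a/solaris-amd64-oraclerel",
]


def parse_log_time(log_id):
    """Parse the leading timestamp of a log ID (assumed to be 19 characters)."""
    return datetime.strptime(log_id[:19], "%Y-%m-%dT%H:%M:%S")


def largest_gap(log_ids):
    """Return (gap, start, end) for the longest interval between consecutive failures.

    Requires at least two log IDs.
    """
    times = sorted(parse_log_time(i) for i in log_ids)
    gaps = [(later - earlier, earlier, later)
            for earlier, later in zip(times, times[1:])]
    return max(gaps)


if __name__ == "__main__":
    gap, start, end = largest_gap(LOG_IDS)
    print(f"longest quiet period: {gap.days} days, from {start} to {end}")
```

Feeding the full greplogs listing into `largest_gap` instead of the three sample IDs gives the same kind of summary for any builder.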
"Bryan C. Mills" @.***> writes:
This has started occurring intermittently again.
greplogs --dashboard -md -l -e '(?ms)\Asolaris-amd64-oraclerel.* no space left on device' --since=2021-03-26
2022-04-23T05:38:56-9717e8f/solaris-amd64-oraclerel
2022-04-19T17:05:22-4804c43-689dc17/solaris-amd64-oraclerel
[note 11-month gap!]
[...]
I was recently forced to migrate the zone hosting the builder to a different machine. In the process, swap was inadvertently reduced from 32 GB to 4 GB. With WORKDIR residing in /tmp (tmpfs), VM shortage could lead to those errors.
I've now restored the previous swap size, which should make the problem vanish, like it did for the last year.
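One way to watch for this condition is to check the free space reported for the work directory's filesystem before a build starts. The following is only a rough Python sketch, not anything the buildlet is known to do: `free_gib` and `has_headroom` are hypothetical helper names, the 2 GiB threshold is an arbitrary assumption, and the system temporary directory stands in for the real WORKDIR. On a tmpfs mount the reported free space would be expected to shrink as swap is consumed by other processes, so a passing check is no guarantee for the rest of the build.

```python
import shutil
import tempfile


def free_gib(path):
    """Free space on the filesystem containing path, in GiB."""
    return shutil.disk_usage(path).free / 2**30


def has_headroom(path, min_free_gib):
    """True if at least min_free_gib GiB are currently free under path."""
    return free_gib(path) >= min_free_gib


if __name__ == "__main__":
    workdir = tempfile.gettempdir()  # stand-in for the builder's WORKDIR
    threshold = 2.0                  # assumed threshold, purely illustrative
    status = "ok" if has_headroom(workdir, threshold) else "LOW"
    print(f"{workdir}: {free_gib(workdir):.1f} GiB free ({status})")
```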
greplogs -l -e '(?ms)\Asolaris-amd64-oraclerel.* no space left on device' --since=2022-04-24
2022-05-03T19:58:15-7c404d5/solaris-amd64-oraclerel
2022-04-25T15:49:44-12763d1/solaris-amd64-oraclerel
The solaris-amd64-oraclerel builder seems to be failing moderately frequently with "no space left on device" errors:
2021-05-24T20:15:56-15d9d4a/solaris-amd64-oraclerel
2021-05-10T15:11:50-ecb7392/solaris-amd64-oraclerel
2021-05-03T16:42:22-169155d/solaris-amd64-oraclerel
2021-04-28T19:13:50-ad989c7/solaris-amd64-oraclerel
2021-04-26T21:27:41-9f60169/solaris-amd64-oraclerel
2021-04-08T20:55:59-0243799/solaris-amd64-oraclerel
2021-04-08T02:08:45-b261fe9/solaris-amd64-oraclerel
2021-03-30T21:06:17-4fbd30e/solaris-amd64-oraclerel
It's not obvious to me whether the buildlet script is failing to clean something up, the device's disk is getting too full for other reasons, or perhaps the builder is just configured to run too many builds in parallel.
CC @golang/release @rorth