nodejs / build

Better build and test infra for Node.
506 stars 166 forks source link

/tmp filling up on CI hosts #3864

Open targos opened 3 months ago

targos commented 3 months ago

https://ci.nodejs.org/job/node-test-commit-smartos/buildTimeTrend

It's slowing down core development.

richardlau commented 3 months ago

/tmp on test-equinix_mnx-smartos20-x64-3 is full:

$ ansible -m command -a "df -h" test-equinix_mnx-smartos20-x64-?
test-equinix_mnx-smartos20-x64-3 | CHANGED | rc=0 >>
Filesystem                                  Size  Used Avail Use% Mounted on
zones/356655a2-12e6-e1d7-ac7b-b5188ad37cb0  100G   30G   71G  30% /
/.zonecontrol                                56T   57G   56T   1% /.zonecontrol
/lib                                        314M  292M   23M  93% /lib
/lib/svc/manifest                            56T  1.5M   56T   1% /lib/svc/manifest
/usr                                        433M  394M   40M  91% /usr
swap                                         16G  4.6G   12G  29% /etc/svc/volatile
swap                                        4.0G  4.0G     0 100% /tmp
swap                                         16G  4.6G   12G  29% /var/run
test-equinix_mnx-smartos20-x64-4 | CHANGED | rc=0 >>
Filesystem                                  Size  Used Avail Use% Mounted on
zones/c6e3d47a-1421-ee11-c52d-c3c80c198e95  100G   25G   76G  25% /
/.zonecontrol                                56T   57G   56T   1% /.zonecontrol
/lib                                        314M  292M   23M  93% /lib
/lib/svc/manifest                            56T  1.5M   56T   1% /lib/svc/manifest
/usr                                        433M  394M   40M  91% /usr
swap                                         16G  2.3G   14G  15% /etc/svc/volatile
swap                                        4.0G  1.8G  2.3G  45% /tmp
swap                                         16G  2.3G   14G  15% /var/run

It's full of node-coverage-* directories. I've removed them:

[root@test-equinix_mnx-smartos20-x64-3 ~]# rm -rf /tmp/node-coverage-*
[root@test-equinix_mnx-smartos20-x64-3 ~]# df -h
Filesystem                                  Size  Used Avail Use% Mounted on
zones/356655a2-12e6-e1d7-ac7b-b5188ad37cb0  100G   30G   71G  30% /
/.zonecontrol                                56T   57G   56T   1% /.zonecontrol
/lib                                        314M  292M   23M  93% /lib
/lib/svc/manifest                            56T  1.5M   56T   1% /lib/svc/manifest
/usr                                        433M  394M   40M  91% /usr
swap                                         16G  955M   16G   6% /etc/svc/volatile
swap                                        4.0G   44K  4.0G   1% /tmp
swap                                         16G  955M   16G   6% /var/run
[root@test-equinix_mnx-smartos20-x64-3 ~]#
richardlau commented 3 months ago

I've cleared out /tmp/node-coverage-* on test-equinix_mnx-smartos20-x64-4 as well:

[root@test-equinix_mnx-smartos20-x64-4 ~]# df -h
Filesystem                                  Size  Used Avail Use% Mounted on
zones/c6e3d47a-1421-ee11-c52d-c3c80c198e95  100G   25G   76G  25% /
/.zonecontrol                                56T   57G   56T   1% /.zonecontrol
/lib                                        314M  292M   23M  93% /lib
/lib/svc/manifest                            56T  1.5M   56T   1% /lib/svc/manifest
/usr                                        433M  394M   40M  91% /usr
swap                                         16G  2.3G   14G  15% /etc/svc/volatile
swap                                        4.0G  1.8G  2.3G  45% /tmp
swap                                         16G  2.3G   14G  15% /var/run
[root@test-equinix_mnx-smartos20-x64-4 ~]# rm -rf /tmp/node-coverage-*
[root@test-equinix_mnx-smartos20-x64-4 ~]# df -h
Filesystem                                  Size  Used Avail Use% Mounted on
zones/c6e3d47a-1421-ee11-c52d-c3c80c198e95  100G   25G   76G  25% /
/.zonecontrol                                56T   57G   56T   1% /.zonecontrol
/lib                                        314M  292M   23M  93% /lib
/lib/svc/manifest                            56T  1.5M   56T   1% /lib/svc/manifest
/usr                                        433M  394M   40M  91% /usr
swap                                         16G  467M   16G   3% /etc/svc/volatile
swap                                        4.0G   44K  4.0G   1% /tmp
swap                                         16G  467M   16G   3% /var/run
[root@test-equinix_mnx-smartos20-x64-4 ~]#
targos commented 3 months ago

Good catch! So there's a bug in our tests. They shouldn't be using the system's tmpdir.

richardlau commented 3 months ago

Good catch! So there's a bug in our tests. They shouldn't be using the system's tmpdir.

FWIW I ran make -j88 test-ci on a dev machine I have access to and it created these directories/files in /tmp:

drwx------.  2 rlau   rlau       94 Aug 14 13:48 node-coverage-09AcaL
drwx------.  2 rlau   rlau      138 Aug 14 13:48 node-coverage-6qYENJ
drwx------.  2 rlau   rlau       94 Aug 14 13:48 node-coverage-afMxbz
drwx------.  2 rlau   rlau       94 Aug 14 13:48 node-coverage-e0jO5f
drwx------.  2 rlau   rlau      138 Aug 14 13:48 node-coverage-F6ZUIw
drwx------.  2 rlau   rlau       50 Aug 14 13:48 node-coverage-hHzLhL
drwx------.  2 rlau   rlau       94 Aug 14 13:48 node-coverage-Lc7I1n
drwx------.  2 rlau   rlau       94 Aug 14 13:48 node-coverage-otzELS
drwx------.  2 rlau   rlau       94 Aug 14 13:48 node-coverage-p2PeI0
drwx------.  2 rlau   rlau       94 Aug 14 13:48 node-coverage-q6CVl4
drwx------.  2 rlau   rlau       50 Aug 14 13:48 node-coverage-RdCe2w
drwx------.  2 rlau   rlau       94 Aug 14 13:48 node-coverage-SnPBDM
drwx------.  2 rlau   rlau       94 Aug 14 13:48 node-coverage-tw4y03
drwx------.  2 rlau   rlau       94 Aug 14 13:48 node-coverage-x0GveC
drwx------.  2 rlau   rlau       50 Aug 14 13:48 node-coverage-zBbuSF
-rw-r--r--.  1 rlau   rlau   138954 Aug 14 13:47 perf-426641.map
-rw-r--r--.  1 rlau   rlau   112792 Aug 14 13:47 perf-426677.map
richardlau commented 3 months ago

I think some of those are from test/parallel/test-runner-coverage.js which has specific tests checking coverage is produced when the NODE_V8_COVERAGE environment variable is not set.

richardlau commented 3 months ago

I think some of those are from test/parallel/test-runner-coverage.js which has specific tests checking coverage is produced when the NODE_V8_COVERAGE environment variable is not set.

If I exclude test/parallel/test-runner-coverage.js from parallel there still at least one more test writing out coverage:

$ rm -rf /tmp/node-coverage-* /tmp/perf-*.map
$ tools/test.py -J --skip=parallel/test-runner-coverage parallel
[00:28|% 100|+ 3508|-   0]: Done

All tests passed.
$ ls -al /tmp
...
drwx------.  2 rlau   rlau       51 Aug 14 17:20 node-coverage-8Oj2sa
drwx------.  2 rlau   rlau       96 Aug 14 17:20 node-coverage-Xwd98L
-rw-r--r--.  1 rlau   rlau   138954 Aug 14 17:20 perf-1822721.map
-rw-r--r--.  1 rlau   rlau   112792 Aug 14 17:20 perf-1822738.map
...
$
targos commented 3 months ago

Can you try to change the permissions so only root is allowed to write to /tmp. Hopefully we can identify the tests by observing failures.

richardlau commented 3 months ago

Not on that machine, but should be able to in a container.

richardlau commented 3 months ago
$ mkdir /tmp/cantwrite
$ chmod u-w /tmp/cantwrite/
$ ls -al /tmp/cantwrite
total 4
dr-xr-xr-x.  2 rlau   rlau      6 Aug 14 19:15 .
drwxrwxrwt. 12 nobody nobody 4096 Aug 14 19:15 ..
$ TMPDIR=/tmp/cantwrite/ tools/test.py
...
[03:29|% 100|+ 4014|-   2]: Done

Failed tests:
out/Release/node /home/rlau/sandbox/github/node/test/parallel/test-runner-output.mjs
out/Release/node /home/rlau/sandbox/github/node/test/parallel/test-runner-coverage.js
$
richardlau commented 3 months ago

With an actual unwritable /tmp:

[03:26|% 100|+ 4012|-   4]: Done

Failed tests:
out/Release/node /home/nodejs/node/test/parallel/test-cli-node-options.js
out/Release/node /home/nodejs/node/test/parallel/test-runner-coverage.js
out/Release/node /home/nodejs/node/test/parallel/test-runner-output.mjs
out/Release/node /home/nodejs/node/test/parallel/test-trace-events-http.js
$
targos commented 3 months ago
jakecastelli commented 3 months ago

For test-runner-coverage.js - When NODE_V8_COVERAGE is set the report will be written into the dest directory, currently for test-runner-coverage.js is set to tmpdir.path maybe what we need to do is to clear the report directory after the test is finished to prevent the report fill up the disk space, will take another when I wake up.

richardlau commented 3 months ago

For test-runner-coverage.js - When NODE_V8_COVERAGE is set the report will be written into the dest directory, currently for test-runner-coverage.js is set to tmpdir.path maybe what we need to do is to clear the report directory after the test is finished to prevent the report fill up the disk space, will take another when I wake up.

I think those cases are working as expected -- but that test has variants that deliberately do not set NODE_V8_COVERAGE and those are causing writes to the system tmp dir which are never cleaned up.

jakecastelli commented 2 months ago

Thanks Richard, you are right about this, when I commented out the test that has NODE_V8_COVERAGE, it still writes to the temp dir and not being cleaned up. I will take some time to investigate why this happened.

jakecastelli commented 2 months ago

I spent some time to further investigate and found out whether or not NODE_V8_COVERAGE is set or not, it will always write to the system temp dir and then if NODE_V8_COVERAGE is set, it will copy the coverage report across to that dir. (see here for the implementation detail)

^ this might explain why when the write access is disabled to the temp dir the test failed immediately.

I have a question - after we've done the copying, shouldn't we remove the temp created dir? @cjihrig

cjihrig commented 2 months ago

after we've done the copying, shouldn't we remove the temp created dir?

We could. I wouldn't expect to need to in the temp directory though. The test could also clean up those files.

jakecastelli commented 2 months ago

It is hard to clean those files after the tests - because the way we created them is using mkdtempSync.

  const coverageDirectory = mkdtempSync(join(tmpdir(), 'node-coverage-'));

which is a unique (random) temporary directory, e.g. node-coverage-060f7B. And I don't know how to obtain the unique (random) part.

jakecastelli commented 2 months ago

I attempted to remove the tmp dir by doing this:

     } finally {
       if (dir) {
         dir.closeSync();
+        rmSync(this.coverageDirectory, {__proto__: null, recursive: true });
       }
     }
   }

in the cleanup function, however even the rmSync was *successful, the temp dir was still there (looks like somehow it was being recreated again)

Any idea? 👀 @cjihrig

cjihrig commented 2 months ago

Not sure without trying to debug it myself.

jakecastelli commented 2 months ago

If you could try to cd into $TMPDIR and rm -rf node-coverage-* then you could add the patch into the cleanup function in coverage.js (make sure you import rmSync at the top) and run ./node test/parallel/test-runner-coverage.js. In theory we shouldn't see any node-covearge-* but in reality they are still there.

jakecastelli commented 2 months ago

Sorry for being annoying to add another quick note here, the whole reason I wanted to fix this issue is because each time we run the above test (test-runner-coverage.js) it will create the rest (which is a few mb) into the system tmp folder:

292K    node-coverage-0q1ib3
1.3M    node-coverage-2N0qE9
460K    node-coverage-LHyiP2
 68K    node-coverage-OMUlHM
524K    node-coverage-Pz009O
792K    node-coverage-RZckL8
556K    node-coverage-WVJ1fu
460K    node-coverage-gnrE0T
556K    node-coverage-hFiDtO
1.3M    node-coverage-mVUPrO
556K    node-coverage-nuKBuD
800K    node-coverage-s1yPWE
 68K    node-coverage-tVAHAL
528K    node-coverage-teJBAK

This is fine for local but on CI it would easily fill up the space.

targos commented 2 months ago

test-digitalocean-fedora39-x64-2 is affected too

[root@test-digitalocean-fedora39-x64-2 ~]# ls /tmp/node-coverage-* | wc -l
18601
targos commented 2 months ago

I cleaned it up, and -1 too.

richardlau commented 2 months ago

/tmp on test-equinix_mnx-smartos20-x64-3 was full again. I have deleted all of those node-coverage-* directories.