Tried to manually run the cleanup script (ref https://github.com/nodejs/build/issues/2453#issuecomment-702690181), but that hasn't worked:
```
[root@3a355104-c5d6-405f-863b-9ce5948ba77b ~]# df -h
Filesystem                                  Size  Used Avail Use% Mounted on
zones/6dbfb615-b1ac-4f9a-8006-2cb45b87e4cb  3.0T  3.0T     0 100% /
/.zonecontrol                               7.6T  9.5G  7.6T   1% /.zonecontrol
/lib                                        290M  247M   43M  86% /lib
/lib/svc/manifest                           7.6T  1.6M  7.6T   1% /lib/svc/manifest
/usr                                        433M  333M  101M  77% /usr
swap                                        128G   33G   96G  26% /etc/svc/volatile
swap                                         32G   32G     0 100% /tmp
swap                                        128G   33G   96G  26% /var/run
```
```
[root@3a355104-c5d6-405f-863b-9ce5948ba77b ~]# /root/backup_scripts/remove_old.sh ci-release.nodejs.org && /root/backup_scripts/remove_old.sh ci.nodejs.org
curl: (60) SSL certificate problem: certificate has expired
More details here: https://curl.haxx.se/docs/sslcerts.html

curl failed to verify the legitimacy of the server and therefore could not
establish a secure connection to it. To learn more about this situation and
how to fix it, please visit the web page mentioned above.
[root@3a355104-c5d6-405f-863b-9ce5948ba77b ~]#
```
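For context, the script takes a Jenkins hostname as its argument and evidently finishes by POSTing to that Jenkins master's `/reload` endpoint (that's the HTTPS call failing above). A minimal sketch of the likely shape of such a script follows; the paths, retention count, and layout are assumptions for illustration, not the real `remove_old.sh`:

```sh
#!/bin/sh
# Hypothetical sketch of remove_old.sh -- not the actual script.
# Assumption: old backed-up build data for the given Jenkins master is
# pruned, then Jenkins is told to reload its state from disk so the
# deletions take effect.
HOST="$1"
KEEP=4                          # assumed number of old entries to keep
BACKUP_ROOT="/backup/$HOST"     # assumed location of the backed-up data

# Remove all but the $KEEP most recent entries (assumed layout).
ls -1t "$BACKUP_ROOT" | tail -n +$((KEEP + 1)) | while read -r old; do
  rm -rf "${BACKUP_ROOT:?}/$old"
done

# Tell Jenkins to reload from disk; this is the HTTPS call that produced
# the expired-certificate and "No valid crumb" errors in this issue.
curl -X POST "https://$HOST/reload"
```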
Looks like it's `/root/backup_scripts/remove_old.sh ci-release.nodejs.org` that's failing with the expired certificate. With `ci.nodejs.org` there's a different error:
```
[root@3a355104-c5d6-405f-863b-9ce5948ba77b ~]# /root/backup_scripts/remove_old.sh ci.nodejs.org
<html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
<title>Error 403 No valid crumb was included in the request</title>
</head>
<body><h2>HTTP ERROR 403 No valid crumb was included in the request</h2>
<table>
<tr><th>URI:</th><td>/reload</td></tr>
<tr><th>STATUS:</th><td>403</td></tr>
<tr><th>MESSAGE:</th><td>No valid crumb was included in the request</td></tr>
<tr><th>SERVLET:</th><td>Stapler</td></tr>
</table>
<hr><a href="https://eclipse.org/jetty">Powered by Jetty:// 9.4.33.v20201020</a><hr/>
</body>
</html>
[root@3a355104-c5d6-405f-863b-9ce5948ba77b ~]#
```
```
[root@3a355104-c5d6-405f-863b-9ce5948ba77b ~]# df -h
Filesystem                                  Size  Used Avail Use% Mounted on
zones/6dbfb615-b1ac-4f9a-8006-2cb45b87e4cb  3.0T  3.0T     0 100% /
/.zonecontrol                               7.6T  9.5G  7.6T   1% /.zonecontrol
/lib                                        290M  247M   43M  86% /lib
/lib/svc/manifest                           7.6T  1.6M  7.6T   1% /lib/svc/manifest
/usr                                        433M  333M  101M  77% /usr
swap                                        128G   33G   96G  26% /etc/svc/volatile
swap                                         32G   32G     0 100% /tmp
swap                                        128G   33G   96G  26% /var/run
[root@3a355104-c5d6-405f-863b-9ce5948ba77b ~]#
```
although grafana shows a drop in disk space usage 😕
Is this resolved? It looks like there have been CI jobs running/yellow.
The CI appears to be working, but I don't know if we need to look at the errors from the cleanup script or the 100% disk usage on the backup server.
Backup server being full means CI machines don't get pruned. Someone needs to budget some time to get on top of the backup strategy and figure out how to make this work better so we don't keep filling up backup storage: do we need to allocate more storage? Do we need to prune old backups more aggressively? Are we backing up the right things?
@rvagg I'm assuming you currently have the most knowledge on that front. Would you have time to pair with somebody to do a knowledge transfer as a kickoff?
Not really, this was @jbergstroem's baby and I've mostly avoided having to dig too deeply into it. It's just rsnapshot so not too complicated, but there are questions of configuration and disk size that need to be investigated. Someone with access should just hop in and explore the rsnapshot config and work that out I suppose.
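(For whoever picks this up, a few starting points for that exploration, assuming a stock rsnapshot setup; the config path below is the conventional default, not verified on `backup`:)

```sh
# Assumed default config location; look for snapshot_root and the
# retain/interval lines that control how many snapshots are kept.
cat /etc/rsnapshot.conf
rsnapshot configtest    # sanity-check the config syntax
rsnapshot du            # per-snapshot disk usage, to see what's eating space
```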
Ok then, 2 questions:

1. @jbergstroem would you be able to do a brain dump with @AshCripps?
2. Would we be ok with giving Ash access (he does not have full infra access yet)?
I guess I'm OK with Ash having access, it's pretty close to crown-jewels in the sense that if you have access to that you have access to everything. Although I can't recall whether that machine can access our other infra or the other way around (IIRC Johan preferred to not have a single server with access to all the servers because it's a single-point-of-compromise). But if we're that short on infra availability then I suppose we don't have much choice. Are @richardlau || @mmarchini candidates for taking some ownership of this perchance?
We have alerting set up via grafana now and the 95% disk full alert triggered this morning for the ci server. I've run the `remove_old.sh` script on the backup machine, which has dropped the disk usage down to 31%.
We got another alert via #nodejs-build-infra-alerts on Slack. I've tried logging into `backup` to run the backup scripts but this is still erroring (https://github.com/nodejs/build/issues/2543#issuecomment-780529053), so presumably the weekly cron is also broken (I'm not sure where cron logs are on this machine) 😞:
```
[root@3a355104-c5d6-405f-863b-9ce5948ba77b ~]# /root/backup_scripts/remove_old.sh ci-release.nodejs.org && /root/backup_scripts/remove_old.sh ci.nodejs.org
curl: (60) SSL certificate problem: certificate has expired
More details here: https://curl.haxx.se/docs/sslcerts.html

curl failed to verify the legitimacy of the server and therefore could not
establish a secure connection to it. To learn more about this situation and
how to fix it, please visit the web page mentioned above.
[root@3a355104-c5d6-405f-863b-9ce5948ba77b ~]#
```
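(On the cron-logs question: the prompt suggests `backup` is a SmartOS/illumos zone, and on illumos cron conventionally logs to `/var/cron/log`, so that's probably the place to look, though I haven't verified it on this machine:)

```sh
# illumos/SmartOS convention: cron logs each job invocation here.
tail -n 50 /var/cron/log
crontab -l   # confirm the weekly cleanup entry is still present
```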
And still (https://github.com/nodejs/build/issues/2543#issuecomment-780533159) "no valid crumb" for ci.nodejs.org (but the space does appear to be reclaimed):
```
[root@3a355104-c5d6-405f-863b-9ce5948ba77b ~]# /root/backup_scripts/remove_old.sh ci.nodejs.org
<html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
<title>Error 403 No valid crumb was included in the request</title>
</head>
<body><h2>HTTP ERROR 403 No valid crumb was included in the request</h2>
<table>
<tr><th>URI:</th><td>/reload</td></tr>
<tr><th>STATUS:</th><td>403</td></tr>
<tr><th>MESSAGE:</th><td>No valid crumb was included in the request</td></tr>
<tr><th>SERVLET:</th><td>Stapler</td></tr>
</table>
<hr><a href="https://eclipse.org/jetty">Powered by Jetty:// 9.4.43.v20210629</a><hr/>
</body>
</html>
[root@3a355104-c5d6-405f-863b-9ce5948ba77b ~]#
```
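The "No valid crumb" response is Jenkins' CSRF protection rejecting the POST to `/reload`: since Jenkins 2.x a POST needs either a crumb from the crumb issuer or API-token authentication. If we want the script to keep using plain `curl`, something along these lines should work (`JENKINS_USER`/`JENKINS_TOKEN` are placeholders for a real account and API token):

```sh
# Fetch a CSRF crumb from the crumb issuer, then include it in the POST.
# Note that when authenticating with an API token, recent Jenkins versions
# skip the crumb check entirely, so auth alone may be enough.
CRUMB=$(curl -s --user "$JENKINS_USER:$JENKINS_TOKEN" \
  'https://ci.nodejs.org/crumbIssuer/api/xml?xpath=concat(//crumbRequestField,":",//crumb)')
curl -X POST --user "$JENKINS_USER:$JENKINS_TOKEN" -H "$CRUMB" \
  https://ci.nodejs.org/reload
```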
I'm kind of confused because I thought we used the same cert on the release and public CI servers. It looks like there is some difference. On `ci`:
```
root@infra-digitalocean-ubuntu14-x64-1:/etc/nginx/ssl# ls -al
total 32
drwxr-xr-x 2 root root 4096 Jun  5  2020 .
drwxr-xr-x 7 root root 4096 Jun  5  2020 ..
-rw-r--r-- 1 root root  769 Nov  8  2015 dhparam.pem
-rw-r--r-- 1 root root 4809 Jun  5  2020 nodejs_chained.crt
-rw-r--r-- 1 root root 6766 Jun  5  2020 nodejs_chained.crt.old_addtrust
-rw-r--r-- 1 root root 3272 Nov 20  2019 nodejs.key
root@infra-digitalocean-ubuntu14-x64-1:/etc/nginx/ssl#
```
compared to `ci-release`:
```
root@infra-ibm-ubuntu1804-x64-1:/etc/nginx/ssl# ls -al
total 24
drwxr-xr-x 2 root root 4096 Apr 19  2021 .
drwxr-xr-x 9 root root 4096 May 27 06:47 ..
-rw-r--r-- 1 root root  769 Apr 19  2021 dhparam.pem
-rw-r--r-- 1 root root 3272 Apr 19  2021 nodejs.key
-rw-r--r-- 1 root root 6766 Apr 19  2021 nodejs_chained.crt
root@infra-ibm-ubuntu1804-x64-1:/etc/nginx/ssl#
```
i.e. `ci-release`'s `nodejs_chained.crt` is the same size as `nodejs_chained.crt.old_addtrust` on `ci`.
I suspect we're running into https://sectigo.com/knowledge-base/detail/Sectigo-AddTrust-External-CA-Root-Expiring-May-30-2020/kA03l00000117LT and `backup` using an OpenSSL 1.0.x-based version of `curl` (ci-release.nodejs.org's cert validates locally for me in a web browser).
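A quick way to confirm that theory would be to dump the chain the server presents and check the validity dates of each certificate in it (the hostname and file path here are the ones from the output above; nothing else is assumed):

```sh
# Dump the full chain nginx presents; an expired "AddTrust External CA Root"
# at the end of it would confirm the Sectigo cross-signing issue with old
# OpenSSL 1.0.x clients.
openssl s_client -connect ci-release.nodejs.org:443 -showcerts </dev/null

# Inspect the first (leaf) certificate in the chained file on the server;
# the later certs in the file can be checked by splitting it first.
openssl x509 -in /etc/nginx/ssl/nodejs_chained.crt -noout -subject -enddate
```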
We can probably copy the smaller `nodejs_chained.crt` from `ci` over to `ci-release`, but I'm not going to do that last thing on a Friday afternoon while feeling a bit sluggish from the COVID-19 booster shot I had yesterday.
I've backed up `nodejs_chained.crt` as `nodejs_chained.crt.old_addtrust` on `ci-release`, then `scp`'ed the smaller `nodejs_chained.crt` over from `ci` and restarted nginx (`systemctl restart nginx.service`). This has now fixed the "certificate has expired" problem, although we now get the same "No valid crumb" error when we attempt to reload Jenkins (but at least that is consistent with `ci`):
```
[root@3a355104-c5d6-405f-863b-9ce5948ba77b ~]# /root/backup_scripts/remove_old.sh ci-release.nodejs.org
<html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
<title>Error 403 No valid crumb was included in the request</title>
</head>
<body><h2>HTTP ERROR 403 No valid crumb was included in the request</h2>
<table>
<tr><th>URI:</th><td>/reload</td></tr>
<tr><th>STATUS:</th><td>403</td></tr>
<tr><th>MESSAGE:</th><td>No valid crumb was included in the request</td></tr>
<tr><th>SERVLET:</th><td>Stapler</td></tr>
</table>
<hr><a href="https://eclipse.org/jetty">Powered by Jetty:// 9.4.43.v20210629</a><hr/>
</body>
</html>
[root@3a355104-c5d6-405f-863b-9ce5948ba77b ~]#
```
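(For the record, the swap described above amounts to roughly the following, run on `ci-release`; the `scp` source is written as ci.nodejs.org for illustration, I don't have the exact invocation saved:)

```sh
# On ci-release: keep the old AddTrust-chained cert around, pull the
# shorter chain over from ci, and restart nginx to pick it up.
cd /etc/nginx/ssl
cp nodejs_chained.crt nodejs_chained.crt.old_addtrust
scp root@ci.nodejs.org:/etc/nginx/ssl/nodejs_chained.crt .
systemctl restart nginx.service
systemctl status nginx.service   # confirm nginx came back up cleanly
```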
grafana: [screenshot]
Jenkins system log: [screenshot]