Tried to manually run the cleanup script (ref https://github.com/nodejs/build/issues/2453#issuecomment-702690181), but that hasn't worked:
```
[root@3a355104-c5d6-405f-863b-9ce5948ba77b ~]# df -h
Filesystem                                  Size  Used Avail Use% Mounted on
zones/6dbfb615-b1ac-4f9a-8006-2cb45b87e4cb  3.0T  3.0T     0 100% /
/.zonecontrol                               7.6T  9.5G  7.6T   1% /.zonecontrol
/lib                                        290M  247M   43M  86% /lib
/lib/svc/manifest                           7.6T  1.6M  7.6T   1% /lib/svc/manifest
/usr                                        433M  333M  101M  77% /usr
swap                                        128G   33G   96G  26% /etc/svc/volatile
swap                                         32G   32G     0 100% /tmp
swap                                        128G   33G   96G  26% /var/run
```
```
[root@3a355104-c5d6-405f-863b-9ce5948ba77b ~]# /root/backup_scripts/remove_old.sh ci-release.nodejs.org && /root/backup_scripts/remove_old.sh ci.nodejs.org
curl: (60) SSL certificate problem: certificate has expired
More details here: https://curl.haxx.se/docs/sslcerts.html

curl failed to verify the legitimacy of the server and therefore could not
establish a secure connection to it. To learn more about this situation and
how to fix it, please visit the web page mentioned above.
[root@3a355104-c5d6-405f-863b-9ce5948ba77b ~]#
```
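For context, the script takes a Jenkins hostname as its argument and evidently finishes by POSTing to that Jenkins master's `/reload` endpoint (that's the HTTPS call failing above). A minimal sketch of the likely shape of such a script follows; the paths, retention count, and layout are assumptions for illustration, not the real `remove_old.sh`:

```sh
#!/bin/sh
# Hypothetical sketch of remove_old.sh -- not the actual script.
# Assumption: old backed-up build data for the given Jenkins master is
# pruned, then Jenkins is told to reload its state from disk so the
# deletions take effect.
HOST="$1"
KEEP=4                          # assumed number of old entries to keep
BACKUP_ROOT="/backup/$HOST"     # assumed location of the backed-up data

# Remove all but the $KEEP most recent entries (assumed layout).
ls -1t "$BACKUP_ROOT" | tail -n +$((KEEP + 1)) | while read -r old; do
  rm -rf "${BACKUP_ROOT:?}/$old"
done

# Tell Jenkins to reload from disk; this is the HTTPS call that produced
# the expired-certificate and "No valid crumb" errors in this issue.
curl -X POST "https://$HOST/reload"
```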
Looks like it's `/root/backup_scripts/remove_old.sh ci-release.nodejs.org` that's failing with the expired certificate. With `ci.nodejs.org` there's a different error:
```
[root@3a355104-c5d6-405f-863b-9ce5948ba77b ~]# /root/backup_scripts/remove_old.sh ci.nodejs.org
<html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
<title>Error 403 No valid crumb was included in the request</title>
</head>
<body><h2>HTTP ERROR 403 No valid crumb was included in the request</h2>
<table>
<tr><th>URI:</th><td>/reload</td></tr>
<tr><th>STATUS:</th><td>403</td></tr>
<tr><th>MESSAGE:</th><td>No valid crumb was included in the request</td></tr>
<tr><th>SERVLET:</th><td>Stapler</td></tr>
</table>
<hr><a href="https://eclipse.org/jetty">Powered by Jetty:// 9.4.33.v20201020</a><hr/>
</body>
</html>
[root@3a355104-c5d6-405f-863b-9ce5948ba77b ~]#
```
```
[root@3a355104-c5d6-405f-863b-9ce5948ba77b ~]# df -h
Filesystem                                  Size  Used Avail Use% Mounted on
zones/6dbfb615-b1ac-4f9a-8006-2cb45b87e4cb  3.0T  3.0T     0 100% /
/.zonecontrol                               7.6T  9.5G  7.6T   1% /.zonecontrol
/lib                                        290M  247M   43M  86% /lib
/lib/svc/manifest                           7.6T  1.6M  7.6T   1% /lib/svc/manifest
/usr                                        433M  333M  101M  77% /usr
swap                                        128G   33G   96G  26% /etc/svc/volatile
swap                                         32G   32G     0 100% /tmp
swap                                        128G   33G   96G  26% /var/run
[root@3a355104-c5d6-405f-863b-9ce5948ba77b ~]#
```
although grafana shows a drop in disk space usage 😕
Is this resolved? It looks like there have been CI jobs running/yellow.
The CI appears to be working, but I don't know if we need to look at the errors from the cleanup script or the 100% disk usage on the backup server.
Backup server being full means CI machines don't get pruned. Someone needs to budget some time to get on top of the backup strategy and figure out how to make this work better so we don't keep filling up backup storage: do we need to allocate more storage? Do we need to prune old backups more aggressively? Are we backing up the right things?
@rvagg I'm assuming you currently have the most knowledge on that front. Would you have time to pair with somebody to do a knowledge transfer as a kickoff?
Not really, this was @jbergstroem's baby and I've mostly avoided having to dig too deeply into it. It's just rsnapshot so not too complicated, but there are questions of configuration and disk size that need to be investigated. Someone with access should just hop in and explore the rsnapshot config and work that out I suppose.
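(For whoever picks this up, a few starting points for that exploration, assuming a stock rsnapshot setup; the config path below is the conventional default, not verified on `backup`:)

```sh
# Assumed default config location; look for snapshot_root and the
# retain/interval lines that control how many snapshots are kept.
cat /etc/rsnapshot.conf
rsnapshot configtest    # sanity-check the config syntax
rsnapshot du            # per-snapshot disk usage, to see what's eating space
```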
Ok then, 2 questions:

1. @jbergstroem would you be able to do a brain dump with @AshCripps?
2. Would we be ok with giving Ash access (he does not have full infra access yet)?
I guess I'm OK with Ash having access, it's pretty close to crown-jewels in the sense that if you have access to that you have access to everything. Although I can't recall whether that machine can access our other infra or the other way around (IIRC Johan preferred to not have a single server with access to all the servers because it's a single-point-of-compromise). But if we're that short on infra availability then I suppose we don't have much choice. Are @richardlau || @mmarchini candidates for taking some ownership of this perchance?
We have alerting set up via grafana now and the 95% disk full alert triggered this morning for the ci server. I've run the `remove_old.sh` script on the backup machine, which has dropped the disk usage down to 31%.
We got another alert via #nodejs-build-infra-alerts on Slack. I've tried logging into `backup` to run the backup scripts but this is still erroring (https://github.com/nodejs/build/issues/2543#issuecomment-780529053), so presumably the weekly cron is also broken (I'm not sure where cron logs are on this machine) 😞:
```
[root@3a355104-c5d6-405f-863b-9ce5948ba77b ~]# /root/backup_scripts/remove_old.sh ci-release.nodejs.org && /root/backup_scripts/remove_old.sh ci.nodejs.org
curl: (60) SSL certificate problem: certificate has expired
More details here: https://curl.haxx.se/docs/sslcerts.html

curl failed to verify the legitimacy of the server and therefore could not
establish a secure connection to it. To learn more about this situation and
how to fix it, please visit the web page mentioned above.
[root@3a355104-c5d6-405f-863b-9ce5948ba77b ~]#
```
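(On the cron-logs question: the prompt suggests `backup` is a SmartOS/illumos zone, and on illumos cron conventionally logs to `/var/cron/log`, so that's probably the place to look, though I haven't verified it on this machine:)

```sh
# illumos/SmartOS convention: cron logs each job invocation here.
tail -n 50 /var/cron/log
crontab -l   # confirm the weekly cleanup entry is still present
```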
And still (https://github.com/nodejs/build/issues/2543#issuecomment-780533159) "no valid crumb" for ci.nodejs.org (but the space does appear to be reclaimed):
```
[root@3a355104-c5d6-405f-863b-9ce5948ba77b ~]# /root/backup_scripts/remove_old.sh ci.nodejs.org
<html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
<title>Error 403 No valid crumb was included in the request</title>
</head>
<body><h2>HTTP ERROR 403 No valid crumb was included in the request</h2>
<table>
<tr><th>URI:</th><td>/reload</td></tr>
<tr><th>STATUS:</th><td>403</td></tr>
<tr><th>MESSAGE:</th><td>No valid crumb was included in the request</td></tr>
<tr><th>SERVLET:</th><td>Stapler</td></tr>
</table>
<hr><a href="https://eclipse.org/jetty">Powered by Jetty:// 9.4.43.v20210629</a><hr/>
</body>
</html>
[root@3a355104-c5d6-405f-863b-9ce5948ba77b ~]#
```
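The "No valid crumb" response is Jenkins' CSRF protection rejecting the POST to `/reload`: since Jenkins 2.x a POST needs either a crumb from the crumb issuer or API-token authentication. If we want the script to keep using plain `curl`, something along these lines should work (`JENKINS_USER`/`JENKINS_TOKEN` are placeholders for a real account and API token):

```sh
# Fetch a CSRF crumb from the crumb issuer, then include it in the POST.
# Note that when authenticating with an API token, recent Jenkins versions
# skip the crumb check entirely, so auth alone may be enough.
CRUMB=$(curl -s --user "$JENKINS_USER:$JENKINS_TOKEN" \
  'https://ci.nodejs.org/crumbIssuer/api/xml?xpath=concat(//crumbRequestField,":",//crumb)')
curl -X POST --user "$JENKINS_USER:$JENKINS_TOKEN" -H "$CRUMB" \
  https://ci.nodejs.org/reload
```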
I'm kind of confused because I thought we used the same cert on the release and public CI servers. It looks like there is some difference. On `ci`:
```
root@infra-digitalocean-ubuntu14-x64-1:/etc/nginx/ssl# ls -al
total 32
drwxr-xr-x 2 root root 4096 Jun  5  2020 .
drwxr-xr-x 7 root root 4096 Jun  5  2020 ..
-rw-r--r-- 1 root root  769 Nov  8  2015 dhparam.pem
-rw-r--r-- 1 root root 4809 Jun  5  2020 nodejs_chained.crt
-rw-r--r-- 1 root root 6766 Jun  5  2020 nodejs_chained.crt.old_addtrust
-rw-r--r-- 1 root root 3272 Nov 20  2019 nodejs.key
root@infra-digitalocean-ubuntu14-x64-1:/etc/nginx/ssl#
```
compared to `ci-release`:
```
root@infra-ibm-ubuntu1804-x64-1:/etc/nginx/ssl# ls -al
total 24
drwxr-xr-x 2 root root 4096 Apr 19  2021 .
drwxr-xr-x 9 root root 4096 May 27 06:47 ..
-rw-r--r-- 1 root root  769 Apr 19  2021 dhparam.pem
-rw-r--r-- 1 root root 3272 Apr 19  2021 nodejs.key
-rw-r--r-- 1 root root 6766 Apr 19  2021 nodejs_chained.crt
root@infra-ibm-ubuntu1804-x64-1:/etc/nginx/ssl#
```
i.e. `ci-release`'s `nodejs_chained.crt` is the same size as `nodejs_chained.crt.old_addtrust` on `ci`.
I suspect we're running into https://sectigo.com/knowledge-base/detail/Sectigo-AddTrust-External-CA-Root-Expiring-May-30-2020/kA03l00000117LT and `backup` using an OpenSSL 1.0.x-based version of `curl` (ci-release.nodejs.org's cert validates locally for me in a web browser).
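A quick way to confirm that theory would be to dump the chain the server presents and check the validity dates of each certificate in it (the hostname and file path here are the ones from the output above; nothing else is assumed):

```sh
# Dump the full chain nginx presents; an expired "AddTrust External CA Root"
# at the end of it would confirm the Sectigo cross-signing issue with old
# OpenSSL 1.0.x clients.
openssl s_client -connect ci-release.nodejs.org:443 -showcerts </dev/null

# Inspect the first (leaf) certificate in the chained file on the server;
# the later certs in the file can be checked by splitting it first.
openssl x509 -in /etc/nginx/ssl/nodejs_chained.crt -noout -subject -enddate
```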
We can probably copy the smaller `nodejs_chained.crt` from `ci` over to `ci-release`, but I'm not going to do that last thing on a Friday afternoon while feeling a bit sluggish from the COVID-19 booster shot I had yesterday.
I've backed up `nodejs_chained.crt` as `nodejs_chained.crt.old_addtrust` on `ci-release`, then `scp`'ed the smaller `nodejs_chained.crt` over from `ci` and restarted nginx (`systemctl restart nginx.service`). This has now fixed the "certificate has expired" problem, although we now get the same "No valid crumb" error when we attempt to reload Jenkins (but at least that is consistent with `ci`):
```
[root@3a355104-c5d6-405f-863b-9ce5948ba77b ~]# /root/backup_scripts/remove_old.sh ci-release.nodejs.org
<html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
<title>Error 403 No valid crumb was included in the request</title>
</head>
<body><h2>HTTP ERROR 403 No valid crumb was included in the request</h2>
<table>
<tr><th>URI:</th><td>/reload</td></tr>
<tr><th>STATUS:</th><td>403</td></tr>
<tr><th>MESSAGE:</th><td>No valid crumb was included in the request</td></tr>
<tr><th>SERVLET:</th><td>Stapler</td></tr>
</table>
<hr><a href="https://eclipse.org/jetty">Powered by Jetty:// 9.4.43.v20210629</a><hr/>
</body>
</html>
[root@3a355104-c5d6-405f-863b-9ce5948ba77b ~]#
```
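(For the record, the swap described above amounts to roughly the following, run on `ci-release`; the `scp` source is written as ci.nodejs.org for illustration, I don't have the exact invocation saved:)

```sh
# On ci-release: keep the old AddTrust-chained cert around, pull the
# shorter chain over from ci, and restart nginx to pick it up.
cd /etc/nginx/ssl
cp nodejs_chained.crt nodejs_chained.crt.old_addtrust
scp root@ci.nodejs.org:/etc/nginx/ssl/nodejs_chained.crt .
systemctl restart nginx.service
systemctl status nginx.service   # confirm nginx came back up cleanly
```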
grafana: [screenshot]
Jenkins system log: [screenshot]