nodejs / build

Better build and test infra for Node.

Disk full on Jenkins CI server #3747

Open targos opened 5 months ago

targos commented 5 months ago

I'm looking into it

targos commented 5 months ago

Similar to https://github.com/nodejs/build/issues/3288

I logged into the backup server and ran /root/backup_scripts/remove_old.sh ci.nodejs.org. It freed 100GB.

targos commented 3 months ago

It happened again.

@ryanaslett Maybe the new backup server is not set up to run the cleanup script regularly?

ryanaslett commented 3 months ago

Hmm. It's set up in the crontab:

40 23 * * 6 /usr/bin/rsnapshot -c /usr/local/etc/rsnapshot.conf weekly && /root/backup_scripts/remove_old.sh ci-release.nodejs.org && /root/backup_scripts/remove_old.sh ci.nodejs.org

It should be clearing it out once a week.

The backup server lacks any kind of monitoring or alerting for when those tasks do not succeed, so we should probably come up with a strategy to be notified if those crons fail.
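One possible shape for such a notification, as a minimal sketch rather than anything that exists on the backup server: wrap each cron task in a helper that captures its output and calls a notifier when the exit status is non-zero. The names run_and_report, notify_default and NOTIFY_CMD are hypothetical. The default notifier only prints to stderr, which relies on cron mailing job output to the crontab owner (or MAILTO), assuming mail delivery works on the host.

```shell
# Hypothetical helper, not part of the existing backup scripts.
# Default notifier: print the message on stderr so cron can mail it.
notify_default() {
    echo "$1" >&2
}

# Run the given command; on failure, pass the exit status and the captured
# output to NOTIFY_CMD (an assumed hook, e.g. a mail or chat script).
run_and_report() {
    status=0
    output=$("$@" 2>&1) || status=$?
    if [ "$status" -ne 0 ]; then
        "${NOTIFY_CMD:-notify_default}" "backup task failed (exit $status): $* | $output"
    fi
    # Preserve the original exit status for the caller.
    return "$status"
}
```

In the crontab this could be called as run_and_report /root/backup_scripts/remove_old.sh ci.nodejs.org, once the helper is sourced from a small wrapper script.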

targos commented 3 months ago

In that case, I think the problem is clear.

Running remove_old.sh ci-release.nodejs.org ends up with an error:

# /root/backup_scripts/remove_old.sh ci-release.nodejs.org
curl: (92) HTTP/2 stream 1 was not closed cleanly before end of the underlying stream
# echo $?
92

Because the crontab entry chains the commands with &&, the script never gets a chance to run for ci.nodejs.org.
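To keep one failing host from blocking the other, the cleanup calls could be run independently instead of being chained with &&. The following is only a sketch under that assumption: cleanup_all is a hypothetical wrapper, and the cleanup command is passed in as the first argument so that nothing is assumed about how remove_old.sh works internally.

```shell
# Hypothetical wrapper: run a cleanup command once per host and keep going
# when one host fails, instead of chaining the calls with &&.
# Usage: cleanup_all CLEANUP_COMMAND HOST...
cleanup_all() {
    cleanup_cmd=$1
    shift
    failed=0
    for host in "$@"; do
        if "$cleanup_cmd" "$host"; then
            echo "cleanup ok: $host"
        else
            # In the else branch, $? still holds the cleanup command's status.
            status=$?
            echo "cleanup FAILED: $host (exit $status)" >&2
            failed=1
        fi
    done
    # Non-zero if any host failed, so cron or a monitor can still notice.
    return "$failed"
}
```

With this in place, the cron entry could call cleanup_all /root/backup_scripts/remove_old.sh ci-release.nodejs.org ci.nodejs.org after rsnapshot, and a curl error on ci-release would no longer skip ci.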

ryanaslett commented 3 months ago

remove_old.sh SSHes into ci and ci-release, blows away any jobs older than 22 days, then triggers a Jenkins reload so that it recognizes the jobs are missing.

The credentials for Jenkins were for a Jenkins user, jbergstroem.

The error given is "jbergstroem is missing the Overall/Read permission".

Not sure when they were removed from the Nodejs/build GitHub team, but that's probably the last time this script executed successfully.

I've replaced the credentials with an API token for my account for now and have run it for ci, but I'm not sure how jbergstroem had one API key that worked with both ci and ci-release (maybe moved it over to release from ci somehow?).

The cron will currently delete the jobs on ci and refresh, then delete the jobs on ci-release and fail to refresh, because it's using the same API token.

This should probably be a service account with proper permissions to access the /reload path.

OTOH, this seems like a brittle way to avoid using Jenkins' own job cleanup mechanism:

image

My recommendation is that we change the jobs on the release server first (since there's only a handful) and remove this cleanup mechanism from ci-release, and then modify the jobs on ci.nodejs.org to also clean up after themselves.