Closed: ljackiewicz closed this issue 4 weeks ago
Figured out what's going on.
ZSTD is slower and more resource-intensive. Additionally, my backup drive is behind a pretty slow NFS link.
Shelling into the backup container and running time bash -f backup.sh 9999
a few times gave me roughly 1.5 to 3 minutes to create a backup of about 1 GiB compressed.
The Jenkins pod has terminationGracePeriodSeconds: 30,
so I assume that when the operator or a user kills the pod, the backup container only has 30s to finish the backup before the pod (with the backup container) is deleted by k8s.
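For context, that grace period is a pod-spec field rather than anything in the Jenkins CR; the relevant fragment looks roughly like this (container name illustrative, value taken from this thread):

```yaml
spec:
  # On pod deletion, Kubernetes sends SIGTERM, waits this long, then SIGKILLs
  # whatever is still running -- including a backup that is mid-write.
  terminationGracePeriodSeconds: 30
  containers:
    - name: backup        # illustrative name
      # ...
```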
Looking at the Jenkins CRD I didn't see a terminationGracePeriodSeconds option, so instead I set makeBackupBeforePodDeletion: false
and interval: 500,
because I also realized that multiple backups were running at the same time with the default interval: 30.
That fixed it for me.
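For anyone hitting the same thing, the workaround above corresponds to something like this in the Jenkins custom resource (a sketch assuming the operator's spec.backup fields; apiVersion and surrounding structure may differ between operator versions):

```yaml
apiVersion: jenkins.io/v1alpha2
kind: Jenkins
metadata:
  name: example
spec:
  backup:
    makeBackupBeforePodDeletion: false   # don't race the 30s grace period
    interval: 500                        # the default of 30 let backups overlap
```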
I haven't tested without limits [Quota Scheduler] on the backup container, so I'm unsure whether that greatly affects ZSTD compression speed.
I also observed that my "interrupted" backups were smaller than "full" backups and that they failed with an "Unexpected EOF" error.
It's actually very easy to interrupt them during normal operation, and it completely breaks the operator. I'm looking for better workarounds.
I was not able to reproduce this issue. Can you send some operator logs and the output of ls -l in the /backup
directory?
We can add another step to the backup script that verifies the backup.
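Such a verification step could look roughly like this (a sketch only, not the operator's code; shown with gzip for portability, where the zstd mode would use zstd -t instead):

```shell
#!/bin/bash
set -eu

# Verify an archive's integrity right after writing it; delete it on failure,
# so a corrupted file is never left around for the restore logic to pick up.
verify_backup() {
  local file="$1"
  if gzip -t "$file" 2>/dev/null; then   # zstd mode would be: zstd -t "$file"
    echo "verified: $file"
  else
    echo "corrupted, removing: $file" >&2
    rm -f "$file"
    return 1
  fi
}
```

Wiring this in as the last step of backup.sh would turn a silently truncated archive into a visible, failed backup run.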
However, you can also use the old gzip mode by pinning an older version of the backup image, as specified here: https://github.com/jenkinsci/kubernetes-operator/releases/tag/v0.8.0-beta2
Also, if corrupted backups can happen, something is not working in this trap: https://github.com/jenkinsci/kubernetes-operator/blob/63e8a76b95d9a1cf25a68151f031e9702fd7d10a/backup/pvc/bin/backup.sh#L9, which should prevent a malformed file from being left behind.
Are you sure that neither the Jenkins master nor the backup container is restarting? Do you have any pod restarts?
I have never seen this error in my instances; any logs or more info would help me understand the issue.
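One caveat worth noting about that trap: a shell trap only fires on catchable signals (SIGTERM, SIGINT) or normal exit. When the pod outlives terminationGracePeriodSeconds, Kubernetes follows up with SIGKILL, which no trap can intercept, so a half-written file can still survive. A minimal demonstration of the catchable case (filenames illustrative, not the operator's layout):

```shell
#!/bin/bash
workdir=$(mktemp -d)

# A "backup" that registers a cleanup trap, starts writing, then fails mid-way.
bash -c '
  tmp="'"$workdir"'/backup.tar.tmp"
  trap "rm -f \"$tmp\"" EXIT INT TERM   # remove the partial file on any catchable exit
  head -c 1000 /dev/zero > "$tmp"       # begin writing the archive
  exit 1                                # simulate a failure mid-backup
' || true

ls -A "$workdir"   # should print nothing: the partial file was cleaned up
```

A kill -9 on the inner process would leave backup.tar.tmp behind, which is exactly the SIGKILL-after-grace-period scenario described earlier in the thread.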
It seems very easy to break Jenkins by deleting the pod while a backup is in progress.
I suppose we need to make backup creation an atomic operation: for example, create the backup outside the backup directory, or in a temporary subdirectory, and use mv
to "commit" the backup.
We also need functionality to clean up files left over from previous failed backups.
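The two suggestions above can be sketched together like this (paths and the archive name are illustrative, not the operator's actual layout):

```shell
#!/bin/bash
set -eu

backup_dir=$(mktemp -d)            # stands in for the mounted /backup volume
tmp_dir="$backup_dir/.tmp"
mkdir -p "$tmp_dir"

# 1. Clean up leftovers from previous failed runs before starting.
find "$tmp_dir" -type f -delete

# 2. Build the archive in the temp subdirectory, NOT in the backup dir itself.
echo "jenkins-home-contents" > "$tmp_dir/payload"
tar -czf "$tmp_dir/42.tar.gz" -C "$tmp_dir" payload
rm "$tmp_dir/payload"

# 3. "Commit" with mv: rename(2) is atomic within one filesystem, so readers
#    see either no backup 42 or a complete one, never a partial file.
mv "$tmp_dir/42.tar.gz" "$backup_dir/42.tar.gz"
```

The atomicity only holds if the temp subdirectory is on the same filesystem as the backup directory; a cross-filesystem mv degrades to copy-then-delete, which reintroduces the partial-file window.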
> It seems very easy to break Jenkins by deleting the pod while a backup is in progress.

You can use this:

> We also need functionality to clean up files left over from previous failed backups.
I'm open to receiving any PRs, the backup script code is here and it's not complex to modify/extend.
Here is the PR to fix corrupted backups: https://github.com/jenkinsci/kubernetes-operator/pull/1000
I suspect this one may be related to #1015
The new 0.8.1 release should fix this issue. If it doesn't, let me know by dropping a comment and I will reopen the issue.
Describe the bug
If for some reason the backup process did not run correctly (the backup file was corrupted: significantly smaller in size and impossible to uncompress), the restore process won't restore the instance correctly; therefore the seed job won't generate jobs, which results in instance downtime.
Perhaps the backup or restore process should check whether the last backup file is corrupted, i.e. whether the backup was performed correctly.
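On the restore side, one possible shape for such a check is to skip corrupted archives and fall back to the newest one that passes an integrity test (a sketch only; gzip shown for portability, file naming assumed, and backup names are assumed to contain no whitespace):

```shell
#!/bin/bash

# Return the newest backup in $1 that passes an integrity test, instead of
# blindly restoring the most recent file.
pick_latest_valid() {
  local dir="$1" f
  # ls -t sorts newest first; assumes no whitespace in backup filenames
  for f in $(ls -t "$dir"/*.tar.gz 2>/dev/null); do
    if gzip -t "$f" 2>/dev/null; then
      printf '%s\n' "$f"
      return 0
    fi
  done
  echo "no valid backup found in $dir" >&2
  return 1
}
```

With this, a truncated latest backup would cost one backup interval of history instead of taking down the instance.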
To Reproduce
Not sure why the backup ended with a corrupted file.
Additional information
Jenkins Operator version: v0.8.0-beta2
Unfortunately, I didn't keep the operator's logs from this run, but if it happens again (it wasn't the first time), I will add them in a comment below.