Unable to restore backup ended with corrupted file

jenkinsci / kubernetes-operator

Kubernetes native Jenkins Operator

https://jenkinsci.github.io/kubernetes-operator



Unable to restore backup ended with corrupted file #906

Closed ljackiewicz closed 4 weeks ago

ljackiewicz commented 10 months ago

Describe the bug: If for some reason the backup process did not run correctly (the backup file was corrupted: it was significantly smaller in size and could not be uncompressed), the restore process won't restore the instance correctly. As a result, the seed job won't generate jobs, which results in instance downtime.

Perhaps the backup or restore process should check that the last backup file is not corrupted, i.e. that the backup was performed correctly.
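The kind of check suggested above could look roughly like the sketch below. It is not the operator's code: verify_backup is a hypothetical helper, and the file suffixes (.tar.zstd, .tar.zst, .tar.gz) are assumptions about how the backup files are named. It only uses the standard integrity tests of the compressors (zstd -t, gzip -t) plus a tar listing.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the operator's backup scripts.
# Returns 0 only if the archive is non-empty, decompresses cleanly,
# and can be listed as a tar archive.
verify_backup() {
    local archive="$1"

    # A missing or empty file is never a valid backup.
    [ -s "$archive" ] || return 1

    case "$archive" in
        *.tar.zstd|*.tar.zst)
            # Assumed suffix for zstd-compressed backups.
            zstd -t -q "$archive" 2>/dev/null &&
                zstd -dc -q "$archive" 2>/dev/null | tar -tf - >/dev/null 2>&1
            ;;
        *.tar.gz)
            # Assumed suffix for the older gzip mode.
            gzip -t "$archive" 2>/dev/null &&
                tar -tzf "$archive" >/dev/null 2>&1
            ;;
        *)
            # Unknown format: treat as invalid rather than guess.
            return 1
            ;;
    esac
}
```

A restore step could call verify_backup on the chosen file and refuse to proceed (or fall back to an older backup) when it returns non-zero. A truncated archive, like the one described in this issue, fails the compressor's integrity test.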

To Reproduce: Not sure why the backup ended with a corrupted file.

Additional information Jenkins Operator version: v0.8.0-beta2

Unfortunately, I didn't keep the operator's logs of this run, but if it happens again (it wasn't the first time), I will add them in the comment below.

WesselAtWork commented 10 months ago

Figured out what's going on. ZSTD is slower and more resource intensive. Additionally, my backup drive is a pretty slow NFS link. Shelling into the backup container and running time bash -f backup.sh 9999 a few times showed that creating a backup of about 1 GiB compressed takes roughly 1.5 to 3 minutes.

The Jenkins Pod has a terminationGracePeriodSeconds: 30, so I assume that when the operator or a user kills the pod, the backup container has only 30s to do the backup before the pod (with the backup container) is deleted by k8s.

Looking at the Jenkins CRD, I didn't see a terminationGracePeriodSeconds option, so instead I set makeBackupBeforePodDeletion: false and interval: 500, because I also realized that multiple backups were running at the same time with the default interval: 30.

That fixed it for me.

Haven't tested without limits [Quota Scheduler] on the backup container, so I'm unsure whether that greatly affects ZSTD compression speed.
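For the overlapping runs mentioned above, one possible guard is a simple lock around the backup command. The following is a minimal sketch, not something the operator ships: with_backup_lock, the lock path, and the exit code 75 are all made up for illustration. It relies only on the fact that mkdir either creates the directory or fails, so exactly one caller gets the lock.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper, not part of the operator: run a command only
# if no other backup currently holds the lock directory.
with_backup_lock() {
    local lock_dir="$1"
    shift

    # mkdir is atomic: exactly one concurrent caller can create the directory.
    if ! mkdir "$lock_dir" 2>/dev/null; then
        echo "backup already running, skipping this run" >&2
        return 75   # arbitrary "try again later" code chosen for this sketch
    fi

    # Run the wrapped command and remember its exit status.
    local status=0
    "$@" || status=$?

    # Release the lock before reporting the command's status.
    rmdir "$lock_dir"
    return "$status"
}
```

Usage would be something like with_backup_lock /tmp/backup.lock bash backup.sh 9999. Keeping the lock on container-local storage such as /tmp (an assumption about the container layout) means a killed container does not leave a stale lock behind on the backup volume.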

WesselAtWork commented 10 months ago

Also, I observed that my "interrupted" backups were smaller than "full" backups and that they failed with an "Unexpected EOF" error.

WesselAtWork commented 10 months ago

It's actually very easy to interrupt them during normal operation, and it completely breaks the operator. Looking for better workarounds.

brokenpip3 commented 8 months ago

I was not able to replicate this issue. Can you send some operator logs and the output of "ls -l" in the /backup directory? We can add another step in the backup script that verifies the backup.

However, you can also use the old gzip mode by using an older version of the backup image, as specified here: https://github.com/jenkinsci/kubernetes-operator/releases/tag/v0.8.0-beta2

brokenpip3 commented 8 months ago

Also, if it can happen that we end up with corrupted backups, something is not working in this trap: https://github.com/jenkinsci/kubernetes-operator/blob/63e8a76b95d9a1cf25a68151f031e9702fd7d10a/backup/pvc/bin/backup.sh#L9 which should prevent a malformed file.

Are you sure that neither the Jenkins master nor the backup container is restarting? Do you have any pod restarts?

brokenpip3 commented 8 months ago

In my instances I never saw this error. Any logs or more info would help to understand the issue.

evgenii-denisov commented 6 months ago

It seems very easy to break Jenkins by deleting the pod while a backup is in progress. I suppose backup creation needs to be an atomic operation: for example, create the backup outside of the backup directory or in a temporary subdirectory, and use mv to "commit" the backup. Functionality to clean up files left over from previous failed backups is also needed.
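The "write elsewhere, then mv" idea above can be sketched as follows. This is an illustration only, not the operator's backup.sh and not necessarily what any later fix does: atomic_backup, the .tmp subdirectory, and the <number>.tar.gz naming are assumptions, and gzip is used instead of zstd just to keep the sketch free of extra dependencies.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of an atomic backup: build the archive in a
# temporary subdirectory and rename it into place only when complete.
atomic_backup() {
    local src_dir="$1" backup_dir="$2" number="$3"
    local tmp_dir="$backup_dir/.tmp"            # assumed staging directory
    local tmp_file="$tmp_dir/$number.tar.gz"
    local final_file="$backup_dir/$number.tar.gz"

    mkdir -p "$tmp_dir" || return 1

    # Write the archive where the restore step never looks.
    if ! tar -czf "$tmp_file" -C "$src_dir" .; then
        rm -f "$tmp_file"
        return 1
    fi

    # Staging and final paths are on the same filesystem, so mv is a
    # rename: the final name appears either complete or not at all.
    mv "$tmp_file" "$final_file"
}
```

If the pod is deleted halfway through, the partial archive stays in the staging directory and the restore step, which only reads the final names, never sees it. The leftover file still has to be removed at some point.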

brokenpip3 commented 6 months ago

It seems very easy to break Jenkins by deleting the pod while a backup is in progress.

you can use this:

https://github.com/jenkinsci/kubernetes-operator/blob/332eabe8d75f95fdfbfb78789852d299ae0118c1/chart/jenkins-operator/values.yaml#L221-L222

Functionality to clean up files left over from previous failed backups is also needed.

I'm open to receiving any PRs. The backup script code is here and it's not complex to modify or extend.
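As a starting point for such a PR, cleanup of leftovers from failed runs might look like the sketch below. It assumes, hypothetically, that interrupted backups leave their partial files in a .tmp subdirectory of the backup directory; cleanup_stale_tmp and the 60-minute default are made up for illustration. The -mmin and -delete options are supported by GNU and BSD find, but are not guaranteed on every minimal image.

```shell
#!/usr/bin/env bash
# Hypothetical helper: delete partial backup files that were left in
# the staging directory by interrupted runs.
cleanup_stale_tmp() {
    local backup_dir="$1"
    local max_age_min="${2:-60}"        # only touch files older than this
    local tmp_dir="$backup_dir/.tmp"    # assumed staging directory

    # Nothing to clean if no backup has ever been staged.
    [ -d "$tmp_dir" ] || return 0

    # The age threshold keeps a backup that is currently being written
    # from being deleted underneath the running process.
    find "$tmp_dir" -type f -mmin "+$max_age_min" -delete
}
```

The threshold should be comfortably longer than the slowest expected backup, for example well above the 1.5 to 3 minutes measured earlier in this thread.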

skillcoder commented 3 months ago

Here is the PR to fix corrupted backups: https://github.com/jenkinsci/kubernetes-operator/pull/1000

DionJones615 commented 1 month ago

I suspect this one may be related to #1015

brokenpip3 commented 4 weeks ago

The new 0.8.1 release should fix this issue. If it doesn't, drop a comment and I will re-open the issue.