The same thing is true for restoring the backup, by the way... we see two consecutive runs:
2024-02-01T13:37:48.329Z INFO controller-jenkins Restoring backup '1386' {"cr": "test"}
2024-02-01T13:38:43.901Z INFO controller-jenkins Restoring backup '1386' {"cr": "test"}
Why is this message triggered twice? And why does reconciliation continue without waiting for the restore to finish?
We have confirmed on the pod/container that there are two instances of restore.sh running... which does not seem to be the way it should work.
This is odd, can you send me the ps -ef output from inside the backup container? I have never seen this issue with only a few seconds between two backups. Can you also send the Jenkins CRD? Remove the sensitive parts, only the backup config is needed.
Thanks for taking a look into it... Here is the output of ps -ef:
root@jenkins-test:/home/user/bin# ps -ef
UID PID PPID C STIME TTY TIME CMD
root 1 0 0 10:39 ? 00:00:00 /bin/sh -c ./run.sh
root 7 1 0 10:39 ? 00:00:00 bash ./run.sh
root 9 7 0 10:39 ? 00:00:00 sleep 7200
root 148 0 0 10:42 ? 00:00:00 bash /home/user/bin/restore.sh 1433
root 162 148 0 10:42 ? 00:00:00 sleep 900
root 302 0 0 10:43 ? 00:00:00 bash /home/user/bin/restore.sh 1433
root 316 302 0 10:43 ? 00:00:00 sleep 900
root 332 0 0 10:49 pts/0 00:00:00 sh -c clear; (bash || ash || sh)
root 339 332 0 10:49 pts/0 00:00:00 sh -c clear; (bash || ash || sh)
root 340 339 0 10:49 pts/0 00:00:00 bash
root 341 340 0 10:49 pts/0 00:00:00 ps -ef
and the CRD:
The scripts are adapted and mounted from our ConfigMap.
It might be worth mentioning that we have PROD and TEST running in the very same cluster in different namespaces... (the operator is duplicated as well, each instance only watching its own namespace).
This is clearly a new instance of the operator trying to restore the same backup. We may need to add a safer mechanism that prevents the operator from running the restore command again, and prevents a backup from starting while another one is still running. Will do somewhere in the near future.
We've also seen duplicated restores, but in our case it happens every time on every instance, and the runs appear to be consecutive, not concurrent.
This PR fixed it for us. Unsure if this is related, but worth a look: #1021
After giving it a second look, I think the backup issue occurs if jenkins-operator restarts. Similar to the restore issue I discovered, the .Update() function can fail if the client is not defined yet. As with my fix for Restore, I added the same .Get() for Backup.
Thanks a lot for the restore fix! However, for the double backup/restore it is more complicated than this. The problem is that with the current logic one of the first actions is to run the backup script in Jenkins with a simple kube exec into the pod. With this logic, if the operator crashes/respawns, after the restart it has no way to tell whether a backup process is still running in the Jenkins pod. This could be fixed at two levels: in the operator itself, or in the backup script.
Honestly, based on the operator code I am more inclined to add this logic to the backup script, since the original authors of this operator made the operator agnostic of the type and logic of the backup. It should not be too complex; I can try in the following days.
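For illustration, a minimal sketch of what such a script-level guard could look like (the lock file path and the exit behaviour are assumptions for this example, not taken from the actual PR):

#!/bin/bash
# Sketch: refuse to start a second backup while one is already running.
# /tmp/backup.lock is a hypothetical path chosen only for this example.
LOCK_FILE=/tmp/backup.lock

# Open the lock file on file descriptor 9 and try to take an exclusive,
# non-blocking lock; if another instance holds it, bail out. Whether a
# skipped run should exit 0 or non-zero depends on how the operator
# is meant to treat it.
exec 9>"$LOCK_FILE"
if ! flock -n 9; then
    echo "another backup is already running, exiting" >&2
    exit 0
fi

# ... the actual backup logic would run here; the lock is released
# automatically when the script exits ...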
I started the work on this PR
You can test it with this temporary image: quay.io/jenkins-kubernetes-operator/backup-pvc:7a4cbf98
OK, I think it is now stable enough to test: quay.io/jenkins-kubernetes-operator/backup-pvc:5f5c8e17
Any feedback will be appreciated; I will add some docs before merging the PR.
The new 0.8.1 release should fix this issue. If it does not, drop a comment and I will re-open the issue.
Due to the size of the backup, we only trigger one every two hours (interval: 7200).
For some as yet unknown reason, in some cases the backup is triggered twice within a very short time. Our backup script now detects that problem and exits (without error).
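To illustrate, a duplicate-run check of roughly that kind could look like the following (a sketch, not our exact script):

# Count running instances of this script; the current run matches too,
# so a count above 1 means another backup is already in progress.
if [ "$(pgrep -c -f backup.sh)" -gt 1 ]; then
    echo "backup.sh is already running, skipping this run"
    exit 0
fi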
Find an excerpt from the logs below.
Backups 1284 and 1287 worked as expected; backups 1285, 1286, 1288 and 1289 were run twice.
We have already slightly adapted the backup and restore scripts in order to trace the issues. A crucial part might be the proper way of emitting and handling signals in the script.
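As a rough illustration of the kind of signal handling we mean (a sketch only; the variable names are placeholders, not our actual script):

# Forward termination signals and clean up partial output, so an
# interrupted run does not leave a half-written archive behind.
cleanup() {
    [ -n "$child_pid" ] && kill "$child_pid" 2>/dev/null
    rm -f "$BACKUP_TMP"   # BACKUP_TMP: placeholder for the temporary archive
    exit 1
}
trap cleanup INT TERM

tar -czf "$BACKUP_TMP" -C "$JENKINS_HOME" . &  # archive in the background
child_pid=$!
wait "$child_pid"   # wait (instead of a foreground tar) so the trap can fire
mv "$BACKUP_TMP" "$BACKUP_DEST"  # publish the archive only on success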
Used backup script: