jenkinsci / kubernetes-operator

Kubernetes native Jenkins Operator
https://jenkinsci.github.io/kubernetes-operator
Other
591 stars 231 forks source link

Backup is Triggered with same backupNumber #962

Closed pniederlag closed 3 days ago

pniederlag commented 5 months ago

Due to the size of the backup we only trigger a backup every two hours (interval: 7200).

For some - yet unknown - reasons in some cases the backup is being triggered twice in a very short time. Our backup script by now detects that problem and will exit (without error).

Find an excerpt from the logs below.

backups 1284 and 1287 did work according to expectations. backups 1285, 1286, 1288 und 1289 have been run twice

We have already slightly adapted the backup and restore scripts in order to trace the issues. A crucial part might be the proper way of emitting and handling signals in the script.

grep -i backup /tmp/jenkins-operator.log | grep -E 'Performing|Skipping backup'
2024-01-25T23:06:03.961Z    INFO    controller-jenkins  Performing backup '1284'    {"cr": "prod"}
2024-01-25T23:06:55.151Z    DEBUG   controller-jenkins  Skipping backup {"cr": "prod"}
2024-01-25T23:35:22.977Z    DEBUG   controller-jenkins  Skipping backup {"cr": "prod"}
2024-01-25T23:35:23.758Z    DEBUG   controller-jenkins  Skipping backup {"cr": "prod"}
2024-01-26T00:38:18.545Z    DEBUG   controller-jenkins  Skipping backup {"cr": "prod"}
2024-01-26T00:38:19.286Z    DEBUG   controller-jenkins  Skipping backup {"cr": "prod"}
2024-01-26T01:06:03.918Z    INFO    controller-jenkins  Performing backup '1285'    {"cr": "prod"}
2024-01-26T01:06:55.557Z    INFO    controller-jenkins  Performing backup '1285'    {"cr": "prod"}
2024-01-26T01:06:56.787Z    DEBUG   controller-jenkins  Skipping backup {"cr": "prod"}
2024-01-26T01:06:57.705Z    DEBUG   controller-jenkins  Skipping backup {"cr": "prod"}
2024-01-26T01:35:37.510Z    DEBUG   controller-jenkins  Skipping backup {"cr": "prod"}
2024-01-26T01:35:38.172Z    DEBUG   controller-jenkins  Skipping backup {"cr": "prod"}
2024-01-26T02:32:46.358Z    DEBUG   controller-jenkins  Skipping backup {"cr": "prod"}
2024-01-26T02:32:47.077Z    DEBUG   controller-jenkins  Skipping backup {"cr": "prod"}
2024-01-26T02:33:58.747Z    DEBUG   controller-jenkins  Skipping backup {"cr": "prod"}
2024-01-26T02:33:59.429Z    DEBUG   controller-jenkins  Skipping backup {"cr": "prod"}
2024-01-26T03:06:03.917Z    INFO    controller-jenkins  Performing backup '1286'    {"cr": "prod"}
2024-01-26T03:06:55.142Z    INFO    controller-jenkins  Performing backup '1286'    {"cr": "prod"}
2024-01-26T03:06:56.532Z    DEBUG   controller-jenkins  Skipping backup {"cr": "prod"}
2024-01-26T03:06:57.253Z    DEBUG   controller-jenkins  Skipping backup {"cr": "prod"}
2024-01-26T03:32:40.676Z    DEBUG   controller-jenkins  Skipping backup {"cr": "prod"}
2024-01-26T03:32:41.379Z    DEBUG   controller-jenkins  Skipping backup {"cr": "prod"}
2024-01-26T04:29:58.425Z    DEBUG   controller-jenkins  Skipping backup {"cr": "prod"}
2024-01-26T04:29:59.113Z    DEBUG   controller-jenkins  Skipping backup {"cr": "prod"}
2024-01-26T05:06:05.006Z    INFO    controller-jenkins  Performing backup '1287'    {"cr": "prod"}
2024-01-26T05:06:56.149Z    DEBUG   controller-jenkins  Skipping backup {"cr": "prod"}
2024-01-26T05:06:56.870Z    DEBUG   controller-jenkins  Skipping backup {"cr": "prod"}
2024-01-26T05:30:03.700Z    DEBUG   controller-jenkins  Skipping backup {"cr": "prod"}
2024-01-26T05:30:04.440Z    DEBUG   controller-jenkins  Skipping backup {"cr": "prod"}
2024-01-26T05:31:51.007Z    DEBUG   controller-jenkins  Skipping backup {"cr": "prod"}
2024-01-26T05:31:52.746Z    DEBUG   controller-jenkins  Skipping backup {"cr": "prod"}
2024-01-26T06:22:14.584Z    DEBUG   controller-jenkins  Skipping backup {"cr": "prod"}
2024-01-26T06:22:16.399Z    DEBUG   controller-jenkins  Skipping backup {"cr": "prod"}
2024-01-26T06:28:27.716Z    DEBUG   controller-jenkins  Skipping backup {"cr": "prod"}
2024-01-26T06:28:28.687Z    DEBUG   controller-jenkins  Skipping backup {"cr": "prod"}
2024-01-26T07:06:03.963Z    INFO    controller-jenkins  Performing backup '1288'    {"cr": "prod"}
2024-01-26T07:06:55.935Z    INFO    controller-jenkins  Performing backup '1288'    {"cr": "prod"}
2024-01-26T07:06:57.606Z    DEBUG   controller-jenkins  Skipping backup {"cr": "prod"}
2024-01-26T07:06:58.476Z    DEBUG   controller-jenkins  Skipping backup {"cr": "prod"}
2024-01-26T07:23:46.648Z    DEBUG   controller-jenkins  Skipping backup {"cr": "prod"}
2024-01-26T07:25:58.041Z    DEBUG   controller-jenkins  Skipping backup {"cr": "prod"}
2024-01-26T07:25:58.733Z    DEBUG   controller-jenkins  Skipping backup {"cr": "prod"}
2024-01-26T08:23:19.967Z    DEBUG   controller-jenkins  Skipping backup {"cr": "prod"}
2024-01-26T08:23:20.914Z    DEBUG   controller-jenkins  Skipping backup {"cr": "prod"}
2024-01-26T09:03:23.164Z    DEBUG   controller-jenkins  Skipping backup {"cr": "prod"}
2024-01-26T09:03:23.898Z    DEBUG   controller-jenkins  Skipping backup {"cr": "prod"}
2024-01-26T09:06:04.003Z    INFO    controller-jenkins  Performing backup '1289'    {"cr": "prod"}
2024-01-26T09:06:55.305Z    INFO    controller-jenkins  Performing backup '1289'    {"cr": "prod"}
2024-01-26T09:06:56.471Z    DEBUG   controller-jenkins  Skipping backup {"cr": "prod"}
2024-01-26T09:06:57.307Z    DEBUG   controller-jenkins  Skipping backup {"cr": "prod"}
2024-01-26T09:25:09.829Z    DEBUG   controller-jenkins  Skipping backup {"cr": "prod"}
2024-01-26T09:25:10.578Z    DEBUG   controller-jenkins  Skipping backup {"cr": "prod"}
2024-01-26T09:39:58.047Z    DEBUG   controller-jenkins  Skipping backup {"cr": "prod"}
2024-01-26T09:39:58.783Z    DEBUG   controller-jenkins  Skipping backup {"cr": "prod"}

used backup script


  backup.sh: |
    #!/usr/bin/env bash

    set -eo pipefail
    #set -xv

    [[ ! $# -eq 1 ]] && echo "Usage: $0 backup_number" && exit 1;
    [[ -z "${BACKUP_DIR}" ]] && echo "Required 'BACKUP_DIR' env not set" && exit 1;
    [[ -z "${JENKINS_HOME}" ]] && echo "Required 'JENKINS_HOME' env not set" && exit 1;

    backup_number=$1
    echo "Running backup"

    LOCK_FILE="${BACKUP_DIR}/.backup-running"
    # debug logging to check for weird errors START
    echo "----------------------------------------" >> ${BACKUP_DIR}/debug.log
    echo "$(date +'%F %H:%M:%S %Z') Custom backup ${backup_number} START" >> ${BACKUP_DIR}/debug.log

    # 1. check: test LOCK_FILE to prevent concurrent backups
    if [[ -f "${LOCK_FILE}" ]]; then
      echo "$(date +'%F %H:%M:%S %Z') lock ${LOCK_FILE} exists, SHOULD NOT HAPPEN" >> ${BACKUP_DIR}/debug.log
      # ignore if older than one day
      if [ "$(find ${LOCK_FILE} -mtime +1)" ]; then
        echo "$(date +'%F %H:%M:%S %Z') ignoring ${LOCK_FILE} as its more than one day old" >> ${BACKUP_DIR}/debug.log
      else
        echo "$(date +'%F %H:%M:%S %Z') skipping backup because it is locked by ${LOCK_FILE}, SHOULD NOT HAPPEN" >> ${BACKUP_DIR}/debug.log
        exit 0
      fi
    fi

    BACKUP_TMP_DIR=$(mktemp -d)
    # as we now work via tmp-dir it is very unlikely the backup already exists in final destination
    # yet we keep this check in here as an additional safety check
    # 2. check: this backup number already finished?
    if [[ -e ${BACKUP_DIR}/${backup_number}.tar.gz ]]; then
      echo "${backup_number} already existing, SHOULD NOT HAPPEN" >> ${BACKUP_DIR}/debug.log
      echo "$(date +'%F %H:%M:%S %Z') Custom backup ${backup_number} FAIL/EXISTS" >> ${BACKUP_DIR}/debug.log
      exit 0
    fi
    # debug logging to check for weird errors END

    # Start backup by setting the lock-file, prevent concurrent executions
    touch ${LOCK_FILE}
    # the user of the jenkins-operator will create a fresh/new directory on each installation
    # that's the reason we keep the directory of this special user out of the backup
    # we must catch exit code 1, as tar will exit with status 1 when files have been changed during backup
    tar -C ${JENKINS_HOME} --ignore-failed-read --no-wildcards-match-slash --anchored --exclude=jobs/seed-job-post-init --exclude=jobs/*/workspace* --exclude=jobs/*/config.xml --exclude=jobs/*/state.xml --exclude=users/jenkinsoperato_* --exclude=users/jenkinsdev_* --exclude=jobs/**/builds/[0-9]*/libs -czf "${BACKUP_TMP_DIR}/${backup_number}.tar.gz" jobs users userContent ||  [[ $? -eq 1 ]]
    cp ${BACKUP_TMP_DIR}/${backup_number}.tar.gz ${BACKUP_DIR}/${backup_number}.tar.gz

    # Cleanup of tmp dir and lock-file
    rm -rf ${BACKUP_TMP_DIR}
    rm -f ${LOCK_FILE}

    if [[ ! -s ${BACKUP_DIR}/${backup_number}.tar.gz ]]; then
      echo "backup file '${BACKUP_DIR}/${backup_number}.tar.gz' is empty"
      echo "$(date +'%F %H:%M:%S %Z') Custom backup ${backup_number} FAIL/EMPTY" >> ${BACKUP_DIR}/debug.log
      exit 1
    fi

    echo "$(date +'%F %H:%M:%S %Z') Custom backup ${backup_number} DONE" >> ${BACKUP_DIR}/debug.log

    echo "Done"
    exit 0
pniederlag commented 5 months ago

Same thing is true for restoring the backup... btw... we see two consecutive runs

2024-02-01T13:37:48.329Z    INFO    controller-jenkins  Restoring backup '1386' {"cr": "test"}
2024-02-01T13:38:43.901Z    INFO    controller-jenkins  Restoring backup '1386' {"cr": "test"}

Why is this message triggered twice? and why does reconcilliation continue without waiting for the restore to finish?

We have confirmed on the pod/container that there are two instances of restore.sh running... which seems not the way it should work

brokenpip3 commented 5 months ago

We have confirmed on the pod/container that there are two instances of restore.sh running... which seems not the way it should work

this is odd, can you send me the ps -ef inside the backup container? I never saw this issue with few seconds between 2 backups. Can you also send the jenkins crd? remove the sensitive parts, only the backup config

pniederlag commented 5 months ago

Thx for taking a look into it... Here is the output of 'ps -ef'.

root@jenkins-test:/home/user/bin# ps -ef
UID          PID    PPID  C STIME TTY          TIME CMD
root           1       0  0 10:39 ?        00:00:00 /bin/sh -c ./run.sh
root           7       1  0 10:39 ?        00:00:00 bash ./run.sh
root           9       7  0 10:39 ?        00:00:00 sleep 7200
root         148       0  0 10:42 ?        00:00:00 bash /home/user/bin/restore.sh 1433
root         162     148  0 10:42 ?        00:00:00 sleep 900
root         302       0  0 10:43 ?        00:00:00 bash /home/user/bin/restore.sh 1433
root         316     302  0 10:43 ?        00:00:00 sleep 900
root         332       0  0 10:49 pts/0    00:00:00 sh -c clear; (bash || ash || sh)
root         339     332  0 10:49 pts/0    00:00:00 sh -c clear; (bash || ash || sh)
root         340     339  0 10:49 pts/0    00:00:00 bash
root         341     340  0 10:49 pts/0    00:00:00 ps -ef

and the crd

jenkins-test-yaml.txt

scripts are adapted and mounted from our configmap.

It might be worth mentioning that we have PROD and TEST running in the very same cluster on different namespaces... (operator is duplicated as well and each one only watching its own namespace)

brokenpip3 commented 2 months ago

This is clearly a new instance of the operator that is trying to restore the same backup, we may need to add a more safe mechanism that will avoid the operator to run again the restore command and the backup to avoid to start if another is running. Will do somewhere in the near future

DionJones615 commented 2 weeks ago

We've also seen duplicated restores, but in this case it's every time on every instance and they appear to be consecutive, not concurrent.

This PR fixed it for us. Unsure if this is related, but worth a look. #1021

After giving it a second look, I think the backup issue occurs if jenkins-operator restarts. Similar to the restore issue I discovered, the .Update() function can fail if the client is not defined yet. Similar to my fix for Restore, I added the same .Get() for Backup.

brokenpip3 commented 2 weeks ago

Thanks a lot for the restore fix! However for the double backup/restore it's more complicated than this. The problem is that with the current logic one of the first action will be to run the backup script in jenkins with a simple kube exec into the pod. With this logic if the operator will crash/respawn at the same time after the restart it will be not able to understand if there is a backup process that is still in running in the jenkins pod. This could be fixed at 2 level:

Honestly based on the operator code I'm more incline to add this logic to the backup script, since the original authors of this operator make the operator to be agnostic of the type and logic of the backup. It should be not so complex, I can try in the following days

brokenpip3 commented 2 weeks ago

I started the work on this PR

you can test it with this temporary image: quay.io/jenkins-kubernetes-operator/backup-pvc:7a4cbf98

brokenpip3 commented 2 weeks ago

Ok I think that now is stable to test: quay.io/jenkins-kubernetes-operator/backup-pvc:5f5c8e17 any feedback will be appreciated, I will add some docs before merging the PR.

brokenpip3 commented 3 days ago

The new 0.8.1 should fix this issue, let me know if it's not like that, drop a comment and I will re-open the issue.