uselagoon / build-deploy-tool

Tool to generate build resources

Lagoon schedules conflicting backup checks #325

Open smlx opened 2 weeks ago

smlx commented 2 weeks ago

Lagoon has created backup schedules for two different environments of the same project. Here is the diff showing an identical schedule in two different namespaces:

 apiVersion: backup.appuio.ch/v1alpha1
 kind: Schedule
 metadata:
   name: k8up-lagoon-backup-schedule
-  namespace: foo-staging
+  namespace: foo-pr-1037
 spec:
   backend:
     repoPasswordSecretRef:
       key: repo-pw
       name: baas-repo-pw
     s3:
       bucket: baas-cluster-id0/baas-foo
   backup:
     resources: {}
     schedule: 23 1 * * *
   check:
     resources: {}
     schedule: 23 7 * * 1
   prune:
     resources: {}
     retention:
       keepDaily: 7
       keepMonthly: 1
       keepWeekly: 6
     schedule: 23 4 * * 0

These schedules cause two checks to run at the same time, one in each namespace. This sometimes works if the repository is small (so the check is quick), but it often fails because restic takes an exclusive lock on the repository during a check. If the check that wins the race for the lock takes longer than the retry window of the other check jobs (which seems to be 5x over ~2 minutes), the other checks always fail.

The same problem exists with prune schedules because that command also takes an exclusive lock.

Backups are not affected because that command only takes an append lock.
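To make the conflict concrete, here is a minimal Go sketch that flags namespaces sharing the same `check` or `prune` cron expression against the same bucket. The `Schedule` and `Conflict` types and the `FindConflicts` function are hypothetical names invented for illustration; they are not part of k8up or build-deploy-tool. The sketch only compares cron strings for equality, so it would not catch different expressions that happen to overlap in time.

```go
package main

import (
	"fmt"
	"sort"
)

// Schedule is a hypothetical, simplified view of the Schedule resource
// shown in the diff above. It is not the real k8up type.
type Schedule struct {
	Namespace string
	Bucket    string
	Backup    string // cron expression; append lock only, so not checked
	Check     string // cron expression; takes an exclusive lock
	Prune     string // cron expression; takes an exclusive lock
}

// Conflict records several namespaces running the same exclusive-lock
// job at the same cron time against the same bucket.
type Conflict struct {
	Bucket     string
	Job        string
	Cron       string
	Namespaces []string
}

// FindConflicts groups schedules by (bucket, job, cron) and returns every
// group with more than one namespace. Only check and prune are considered.
func FindConflicts(schedules []Schedule) []Conflict {
	type key struct{ bucket, job, cron string }
	groups := map[key][]string{}
	for _, s := range schedules {
		jobs := map[string]string{"check": s.Check, "prune": s.Prune}
		for job, cron := range jobs {
			if cron == "" {
				continue
			}
			k := key{s.Bucket, job, cron}
			groups[k] = append(groups[k], s.Namespace)
		}
	}

	var conflicts []Conflict
	for k, namespaces := range groups {
		if len(namespaces) < 2 {
			continue
		}
		sort.Strings(namespaces)
		conflicts = append(conflicts, Conflict{k.bucket, k.job, k.cron, namespaces})
	}

	// Sort for deterministic output, since map iteration order is random.
	sort.Slice(conflicts, func(i, j int) bool {
		a, b := conflicts[i], conflicts[j]
		if a.Bucket != b.Bucket {
			return a.Bucket < b.Bucket
		}
		if a.Job != b.Job {
			return a.Job < b.Job
		}
		return a.Cron < b.Cron
	})
	return conflicts
}

func main() {
	// Values copied from the diff in this issue.
	schedules := []Schedule{
		{"foo-staging", "baas-cluster-id0/baas-foo", "23 1 * * *", "23 7 * * 1", "23 4 * * 0"},
		{"foo-pr-1037", "baas-cluster-id0/baas-foo", "23 1 * * *", "23 7 * * 1", "23 4 * * 0"},
	}
	for _, c := range FindConflicts(schedules) {
		fmt.Printf("%s %q on %s: %v\n", c.Job, c.Cron, c.Bucket, c.Namespaces)
	}
}
```

Run against the two schedules from the diff, this reports one conflict for `check` and one for `prune`, and none for `backup`.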

Ideas for solving the issue:

shreddedbacon commented 2 weeks ago

Yeah, this will need some thinking about.