Make gRPC service report backups as FAILED if lose their callback future

rzvoncek commented 5 months ago

Fixes https://github.com/k8ssandra/k8ssandra-operator/issues/1312.

This PR alters how the BackupMan reports the status of a backup. Previously, it would only look at the storage and report the backup as IN_PROGRESS if the backup had started (that is, if something had been written to the storage). It did not check whether the backup was actually still running.

With this change, it also checks whether there is an active future waiting for the backup to finish. The future's presence indicates a healthy state, which in turn means the backup has a high chance of completing. A backup that has started but no longer has a future is reported as FAILED.
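The decision described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `BackupStatus`, `resolve_status`, the `started`/`finished` flags, and the `futures` dictionary are all hypothetical names chosen for the example, and the real BackupMan may track its futures differently.

```python
from concurrent.futures import Future
from enum import Enum
from typing import Dict


class BackupStatus(Enum):
    # Hypothetical status values mirroring the ones named in this PR.
    UNKNOWN = "UNKNOWN"
    IN_PROGRESS = "IN_PROGRESS"
    SUCCESS = "SUCCESS"
    FAILED = "FAILED"


def resolve_status(
    backup_name: str,
    started: bool,
    finished: bool,
    futures: Dict[str, Future],
) -> BackupStatus:
    """Combine what the storage says with whether a future is still registered.

    `started` / `finished` stand in for whatever markers the storage holds;
    `futures` stands in for the in-memory registry of callback futures.
    """
    if finished:
        # The storage shows a complete backup; the future no longer matters.
        return BackupStatus.SUCCESS
    if not started:
        return BackupStatus.UNKNOWN
    if backup_name in futures:
        # Someone is still waiting on the backup, so treat it as healthy.
        return BackupStatus.IN_PROGRESS
    # Started, not finished, and nobody is waiting: the future was lost,
    # for example because the server restarted mid-backup.
    return BackupStatus.FAILED


# After a restart the registry is empty, so a half-written backup is FAILED.
print(resolve_status("backup-a", started=True, finished=False, futures={}))
```

Under these assumptions, a backup that was interrupted by a restart stops being reported as IN_PROGRESS indefinitely, because the new server starts with an empty registry.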

However, the backup itself can complete without the future completing and running the associated callback. When I tried to cover this in the integration steps, I was able to restart the gRPC server, but I was unable to kill the actual process that does the backup. So the new server (with a new BackupMan) came back and saw the backup as complete.

The benefit of this change is more visible in the world of k8ssandra-operator, where a pod might restart mid-backup. Killing a pod during a backup would lead to the following situation:

kubectl get MedusaBackupJob,MedusaBackup -n k8ssandra-operator

NAME                                           STARTED   FINISHED
medusabackupjob.medusa.k8ssandra.io/backup-1   12m       11m
medusabackupjob.medusa.k8ssandra.io/backup-2   7m38s     7m23s
medusabackupjob.medusa.k8ssandra.io/backup-3   51s

NAME                                        STARTED   FINISHED   NODES   FILES   SIZE        COMPLETED   STATUS
medusabackup.medusa.k8ssandra.io/backup-1   12m       11m        1       184     143.53 KB   1           SUCCESS
medusabackup.medusa.k8ssandra.io/backup-2   7m38s     7m23s      1       296     227.16 KB   1           SUCCESS

backup-3 would never finish; it would continue to be reported as IN_PROGRESS. This is because the MedusaBackupJob status would always show:

status:
  inProgress:
    - firstcluster-dc1-default-sts-0
  startTime: '2024-06-19T14:05:18Z'

Deploying a Medusa container built off this branch actually heals the situation:

Every 1.0s: kubectl get MedusaBackupJob,MedusaBackup -n k8ssandra-operator

NAME                                           STARTED   FINISHED
medusabackupjob.medusa.k8ssandra.io/backup-1   15m       15m
medusabackupjob.medusa.k8ssandra.io/backup-2   10m       10m
medusabackupjob.medusa.k8ssandra.io/backup-3   4m9s      17s

NAME                                        STARTED   FINISHED   NODES   FILES   SIZE        COMPLETED   STATUS
medusabackup.medusa.k8ssandra.io/backup-1   15m       15m        1       184     143.53 KB   1           SUCCESS
medusabackup.medusa.k8ssandra.io/backup-2   10m       10m        1       296     227.16 KB   1           SUCCESS
medusabackup.medusa.k8ssandra.io/backup-3   4m9s      17s        1               0.00 B                  FAILED

This is because the MedusaBackupJob status now lists the pod as failed and records a finish time:

status:
  failed:
    - firstcluster-dc1-default-sts-0
  finishTime: '2024-06-19T14:09:10Z'
  startTime: '2024-06-19T14:05:18Z'

sonarcloud[bot] commented 5 months ago

Quality Gate passed

Issues
1 New issue
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarCloud

sonarcloud[bot] commented 1 month ago

Quality Gate passed

Issues
1 New issue
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarCloud

thelastpickle / cassandra-medusa

Make gRPC service report backups as FAILED if lose their callback future #786
