MongoDB backup is stuck on Status: Waiting and the backup-agent container is not doing anything after the Kubernetes scheduler restarted it during the execution of a restore.
More about the problem
After asking for a backup through a PerconaServerMongoDBBackup yml definition, I expect to see an ongoing backup whenever no other action (backup/restore) is in progress.
Steps to reproduce
Start a MongoDB cluster in unsafe mode with only 1 replica (this is useful for development environments) and fill it with some data (let's say about 600 MB of gzipped data);
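For illustration, the relevant part of the PerconaServerMongoDB cr.yaml for such a single-member setup could look like this (a sketch assuming operator 1.15+ field names; the cluster and replset names match the logs below, everything else is a placeholder):

apiVersion: psmdb.percona.com/v1
kind: PerconaServerMongoDB
metadata:
  name: mongodb-percona-cluster
spec:
  # allow a replica set with fewer than 3 members
  # (pre-1.15 operators use spec.allowUnsafeConfigurations: true instead)
  unsafeFlags:
    replsetSize: true
  replsets:
    - name: rs0
      size: 1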
Do a MongoDB backup with the following yml and wait for it to complete (Status = Ready); this uploads the backup to our AWS S3 bucket:
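The exact yml isn't reproduced here; a minimal PerconaServerMongoDBBackup definition would look like this (the name backup1 is assumed, and storageName is a placeholder for the S3 storage configured in the cluster CR):

apiVersion: psmdb.percona.com/v1
kind: PerconaServerMongoDBBackup
metadata:
  name: backup1   # assumed name; the report later refers to backup2 and restore1
spec:
  clusterName: mongodb-percona-cluster
  storageName: aws-s3   # placeholder: must match a storage defined in the cluster CR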
Drop the collections on the MongoDB ReplicaSet (just to avoid _id clashes in the next step);
Now ask for a restore of the above backup with the following yml (this works as intended: I verified the logs and the data inside the MongoDB ReplicaSet):
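A minimal PerconaServerMongoDBRestore definition along these lines (again a sketch; restore1 is the name the report refers to later):

apiVersion: psmdb.percona.com/v1
kind: PerconaServerMongoDBRestore
metadata:
  name: restore1
spec:
  clusterName: mongodb-percona-cluster
  backupName: backup1   # the backup created in step 2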
Ask for another backup with the following yml (keep in mind that at this point the previous restore process is still in progress):
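The same kind of definition as the first backup, only the name changes (a sketch, with storageName still a placeholder):

apiVersion: psmdb.percona.com/v1
kind: PerconaServerMongoDBBackup
metadata:
  name: backup2
spec:
  clusterName: mongodb-percona-cluster
  storageName: aws-s3   # placeholder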
backup2 will be put in Status=Waiting;
At this point the Kubernetes scheduler kills the backup-agent container of the MongoDB replica pod because of memory issues and restarts it;
Now if you do a kubectl get psmdb-backup, you'll see that backup2 is in Error status, and if you do a kubectl get psmdb-restore, you'll see that restore1 is also in Error status (OK, I can take that);
From this point onwards, no backup/restore will be possible through any yml: new requests are just queued with Status=Waiting.
The new backup-agent container logs state that it is waiting for incoming requests:
2024/03/05 16:36:01 [entrypoint] starting `pbm-agent`
2024-03-05T16:36:05.000+0000 I pbm-agent:
Version: 2.3.0
Platform: linux/amd64
GitCommit: 3b1c2e263901cf041c6b83547f6f28ac2879911f
GitBranch: release-2.3.0
BuildTime: 2023-09-20_14:42_UTC
GoVersion: go1.19
2024-03-05T16:36:05.000+0000 I starting PITR routine
2024-03-05T16:36:05.000+0000 I node: rs0/mongodb-percona-cluster-rs0-0.mongodb-percona-cluster-rs0.default.svc.cluster.local:27017
2024-03-05T16:36:05.000+0000 I listening for the commands
Versions
Kubernetes v1.27.9 in an 8-node cluster with 4 GB of RAM each, in Azure Cloud; PBM (backup-agent) 2.3.0, as shown in the logs above.
Anything else?
The same bug also applies to scheduled (cron) backups, so it's not an issue triggered only by the on-demand backup/restore requests: they are likewise kept in Waiting status.
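For reference, these scheduled backups live in the cluster CR rather than in separate yml requests; a sketch of such a spec.backup section (schedule and names are placeholders) looks like:

  backup:
    enabled: true
    tasks:
      - name: daily-backup      # placeholder task name
        enabled: true
        schedule: "0 3 * * *"   # placeholder cron expression
        storageName: aws-s3     # placeholder: the same S3 storage as above
        compressionType: gzip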
The bug does NOT happen when using a ReplicaSet with at least 3 replicas (the default topology).