percona / percona-server-mongodb-operator

Percona Operator for MongoDB
https://www.percona.com/doc/kubernetes-operator-for-psmongodb/
Apache License 2.0

Backups/Restores are in Waiting Status after Kubernetes scheduler restarted the backup-agent container #1463

Open AlcipPopa opened 6 months ago

AlcipPopa commented 6 months ago

Report

A MongoDB backup is stuck in Status: Waiting and the backup-agent container is doing nothing after the Kubernetes scheduler restarted the backup-agent container during the execution of a restore:

(Screenshot: 2024-03-06 15-57-14)

More about the problem

I expect to see an ongoing backup after requesting one through a PerconaServerMongoDBBackup YAML definition, when no other actions (backups/restores) are in progress.

Steps to reproduce

Start a MongoDB cluster in unsafe mode with only 1 replica (this is useful for development environments) and fill it with some data (let's say about 600 MB of gzipped data);
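
For reference, a minimal sketch of such a single-replica, unsafe-mode cluster definition (assuming the operator 1.15.x CR schema; the bucket and Secret names below are illustrative, not taken from this report):

apiVersion: psmdb.percona.com/v1
kind: PerconaServerMongoDB
metadata:
  name: mongodb-percona-cluster
spec:
  crVersion: 1.15.0
  image: percona/percona-server-mongodb:5.0.20-17
  allowUnsafeConfigurations: true    # allows a replica set with fewer than 3 members
  replsets:
    - name: rs0
      size: 1
  backup:
    enabled: true
    image: percona/percona-backup-mongodb:2.3.0
    storages:
      eu-central-1:
        type: s3
        s3:
          region: eu-central-1
          bucket: my-backup-bucket              # illustrative bucket name
          credentialsSecret: my-s3-credentials  # illustrative Secret name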

Take a MongoDB backup and wait for it to complete (Status = Ready) using the following YAML (this uploads the backup to our AWS S3 bucket):

apiVersion: psmdb.percona.com/v1
kind: PerconaServerMongoDBBackup
metadata:
  finalizers:
    - delete-backup
  name: backup1
spec:
  clusterName: mongodb-percona-cluster
  storageName: eu-central-1
  type: logical

Drop the collections on the MongoDB replica set (just to avoid _id clashes in the next step);

Now request a restore of the above backup with the following YAML (this works as intended; I verified the logs and the data inside the MongoDB replica set):

apiVersion: psmdb.percona.com/v1
kind: PerconaServerMongoDBRestore
metadata:
  name: restore1
spec:
  clusterName: mongodb-percona-cluster
  backupName: backup1

Request another backup with the following YAML (keep in mind that at this point the previous restore is still in progress):

apiVersion: psmdb.percona.com/v1
kind: PerconaServerMongoDBBackup
metadata:
  finalizers:
    - delete-backup
  name: backup2
spec:
  clusterName: mongodb-percona-cluster
  storageName: eu-central-1
  type: logical

backup2 will be put in Status=Waiting;

At this point the Kubernetes scheduler kills the backup-agent container in the MongoDB replica pod because of memory pressure and restarts it;

Now if you run kubectl get psmdb-backup, you'll see that backup2 is in Error status, and if you run kubectl get psmdb-restore, you'll see that restore1 is also in Error status (OK, I can accept that);

From this point onwards, no backup/restore is possible through any YAML definition, because every new request is left in Status=Waiting indefinitely.

The new backup-agent container logs state that it is waiting for incoming requests:

2024/03/05 16:36:01 [entrypoint] starting `pbm-agent`
2024-03-05T16:36:05.000+0000 I pbm-agent:
Version:   2.3.0
Platform:  linux/amd64
GitCommit: 3b1c2e263901cf041c6b83547f6f28ac2879911f
GitBranch: release-2.3.0
BuildTime: 2023-09-20_14:42_UTC
GoVersion: go1.19
2024-03-05T16:36:05.000+0000 I starting PITR routine
2024-03-05T16:36:05.000+0000 I node: rs0/mongodb-percona-cluster-rs0-0.mongodb-percona-cluster-rs0.default.svc.cluster.local:27017
2024-03-05T16:36:05.000+0000 I listening for the commands

Versions

  1. Kubernetes v1.27.9 in an 8-node cluster with 4 GB of RAM per node, on Azure Cloud
  2. Operator image percona/percona-server-mongodb-operator:1.15.0
  3. Database image percona/percona-server-mongodb:5.0.20-17

Anything else?

The same bug also applies to scheduled (cron) backups, so it is not triggered only by on-demand backup/restore requests: they too are kept in Waiting status. The bug does NOT happen when using a replica set with at least 3 members (the default topology).
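
For reference, such scheduled backups are defined in the cluster CR roughly along these lines (a sketch assuming the operator 1.15.x backup.tasks schema; the task name, schedule, and retention are illustrative):

spec:
  backup:
    enabled: true
    tasks:
      - name: daily-backup        # illustrative task name
        enabled: true
        schedule: "0 2 * * *"     # illustrative schedule (every day at 02:00)
        storageName: eu-central-1
        type: logical
        keep: 3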

spron-in commented 6 months ago

Nice catch @AlcipPopa. @hors I think we had something in our backlog about it. Thoughts?