percona / percona-server-mongodb-operator

Percona Operator for MongoDB
https://www.percona.com/doc/kubernetes-operator-for-psmongodb/
Apache License 2.0

psmdb-backup irregularly fails to complete due to various reasons #1637

Open michaelSchmidMaloon opened 2 months ago

michaelSchmidMaloon commented 2 months ago

Report

We're running several psmdb clusters in multiple kubernetes clusters. For most of them, the psmdb-backup has failed at least once during its lifetime. We have three psmdb clusters where the backup fails irregularly, around once a week.

We found a multitude of seemingly unrelated underlying issues that we can't trace back to infrastructure or faulty configuration.

We can't find indicators for when a backup will fail, and we see no difference between a run that fails and runs that then work again for multiple days back to back.

The affected clusters differ in storage size and throughput, ranging from some 10 MB to a few GB, but nothing in the range of > 100 GB.

We first noticed this problem in October 2023.

More about the problem

pbm-agent hangs

To debug the faulty pods, we open a shell to the pbm-agent container and issue pbm commands. Sometimes the pbm-agent is unresponsive and needs to be killed from inside the container with

kill -2 $(grep -E "^Pid:" $(grep -l pbm-agent /proc/*[0-9]/status) | rev | cut -f1 | rev)
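
For reference, the one-liner above locates the pbm-agent process via /proc/*[0-9]/status and sends it SIGINT (signal 2). The same steps split out and commented (nothing new assumed; the paths and signal are taken verbatim from the command):

# list the /proc status files of processes whose status mentions pbm-agent
files=$(grep -l pbm-agent /proc/*[0-9]/status)
# take the last tab-separated field of the "Pid:" line(s), i.e. the PID(s)
pids=$(grep -E "^Pid:" $files | rev | cut -f1 | rev)
# send SIGINT to the hung agent(s); $pids is left unquoted in case there is more than one PID
kill -2 $pids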

Election winner does nothing

Logs of the backup-agent container state that an election has started to decide which agent is responsible for creating the backup. One agent gets elected and the others respect that, so they don't start a backup. Unfortunately, sometimes the elected agent does not start the backup either.
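
When this happens, the election outcome and the agents' state can at least be inspected from inside the backup-agent container with the PBM CLI (a sketch; the flags below are from recent PBM releases and may differ per version):

# show agents, the currently running operation and recent backups
pbm status
# backup-related log lines from all agents
pbm logs --event backup --tail 100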

Stale lock in database

Mostly in combination with the above issue, backup processes sometimes stop during execution and leave a stale lock in the database, preventing subsequent backup jobs from creating new backups.
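
To confirm a stale lock, PBM's own metadata can be inspected directly. A hedged sketch, assuming the layout of recent PBM versions where locks live in the admin database (collections pbmLock and pbmLockOp); verify against your PBM version before deleting anything:

mongosh "mongodb://<backup-user>:<password>@<replset-host>/admin" --eval '
  // print any locks PBM is still holding, with their opid and last heartbeat
  db.getSiblingDB("admin").pbmLock.find().forEach(printjson);
  db.getSiblingDB("admin").pbmLockOp.find().forEach(printjson);
'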

Starting deadline exceeded

Other times the backup-agent logs Error: starting deadline exceeded, which also often leaves a stale lock in the database.
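
For completeness: the deadline behind this error is itself a PBM setting. A sketch of raising it, assuming the backup.timeouts.startingStatus option (in seconds) exposed by recent PBM versions; the value is illustrative:

pbm config --set backup.timeouts.startingStatus=120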

Steps to reproduce

  1. Create psmdb-cluster (3 nodes for our use-case) with psmdb-operator
  2. Configure daily backups to be written to a cloud provider (e.g. AWS S3); see the example CR fragment after this list
  3. Wait some time
  4. psmdb-backup will fail
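
A minimal sketch of step 2 as a fragment of the PerconaServerMongoDB custom resource (bucket, region, secret name and schedule are placeholders, not our real values):

  backup:
    enabled: true
    storages:
      s3-backup:
        type: s3
        s3:
          bucket: <bucket-name>
          region: <region>
          credentialsSecret: <cluster-name>-backup-s3
    tasks:
      - name: daily-backup
        enabled: true
        schedule: "0 3 * * *"
        storageName: s3-backup
        compressionType: gzip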

Versions

  1. Kubernetes v1.28.11
  2. Operator chart version 1.16.2 / 1.16.3
  3. Database 1.16.3

Anything else?

No response

michaelSchmidMaloon commented 2 months ago

Tonight's backup failed with

[...]
2024-09-02T10:07:28.000+0000 I [pitr] oplog slicer is paused for lock [Snapshot backup, opid: 66d58445d58df8df2b89d091]
2024/09/02 10:07:35 [#################.......]  <database>.<collection>  136762220/190760640  (71.7%)
2024-09-02T10:07:35.760+0000    Mux close namespace <database>.<collection>
2024-09-02T10:07:35.760+0000    Mux finish
2024-09-02T10:07:35.760+0000    archive writer: error writing data for collection `<database>.<collection>` to disk: error writing to file: short write / io: read/write on closed pipe
2024-09-02T10:07:35.000+0000 I [backup/2024-09-02T09:24:21Z] dropping tmp collections
2024-09-02T10:07:36.000+0000 I [backup/2024-09-02T09:24:21Z] created chunk 2024-09-02T10:04:25 - 2024-09-02T10:07:35
2024-09-02T10:07:36.000+0000 I [backup/2024-09-02T09:24:21Z] mark RS as error `mongodump: decompose: archive parser: corruption found in archive; ParserConsumer.BodyBSON() ( "<database>.<collection>": write: upload: "<database>.<collection>": upload to S3: MultipartUpload: upload multipart failed
    upload id: <upload-id>
caused by: RequestError: send request failed
caused by: Put "<upload-url>": write tcp <upload-source>-><upload-target>: write: connection timed out )`: <nil>
2024-09-02T10:07:36.000+0000 I [backup/2024-09-02T09:24:21Z] mark backup as error `mongodump: decompose: archive parser: corruption found in archive; ParserConsumer.BodyBSON() ( "<database>.<collection>": write: upload: "<database>.<collection>": upload to S3: MultipartUpload: upload multipart failed
    upload id: <upload-id>
caused by: RequestError: send request failed
caused by: Put "<upload-url>": write tcp <upload-source>-><upload-target>: write: connection timed out )`: <nil>
2024-09-02T10:07:36.000+0000 E [backup/2024-09-02T09:24:21Z] backup: mongodump: decompose: archive parser: corruption found in archive; ParserConsumer.BodyBSON() ( "<database>.<collection>": write: upload: "<database>.<collection>": upload to S3: MultipartUpload: upload multipart failed
    upload id: <upload-id>
caused by: RequestError: send request failed
caused by: Put "<upload-url>": write tcp <upload-source>-><upload-target>: write: connection timed out )
2024-09-02T10:07:36.000+0000 D [backup/2024-09-02T09:24:21Z] releasing lock
michaelswierszcz commented 2 months ago

seeing this as well. we had upgraded our clusters and operator from v1.15 -> v1.16 and these intermittent backup failures started happening daily.

percona-backup-mongodb 2.3.0 -> 2.4.1

michaelswierszcz commented 2 months ago

reverted back 2.4.1 -> 2.3.0 and all my backups started working again. clusters and operator remain on v1.16
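
for reference, the PBM version is controlled by the backup image in the PerconaServerMongoDB CR (a minimal sketch; the tag matches the 2.3.0 mentioned above):

  backup:
    image: percona/percona-backup-mongodb:2.3.0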

hors commented 2 months ago

Tonight's backup failed with [...] (log quoted from the comment above)

It is difficult to understand the root of the issue. Maybe it is some storage connection problem. I can suggest increasing numMaxRetries.
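
For reference, a sketch of how the retry setting could be raised via the PBM CLI, assuming the storage.s3.retryer options available in recent PBM versions (the value is illustrative):

pbm config --set storage.s3.retryer.numMaxRetries=10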

asemarian commented 1 month ago

Same problem, though in my case they're all failing for the exact same reason: starting deadline exceeded. The last ~7 backups have failed in the same way. I'm using v1.16.1 of the operator and database server, running on Kubernetes v1.30.5.

michaelswierszcz commented 1 month ago

quick update - i've switched back to v2.6.0 and added cpu and memory to my node, haven't had a failure since

hors commented 1 week ago

@michaelswierszcz @asemarian PSMDBO 1.18.0 with PBM 2.7.0 was released yesterday. PBM 2.6.0/2.7.0 have a lot of bug fixes.