michaelSchmidMaloon opened 2 months ago
Tonight's backup failed with
[...]
2024-09-02T10:07:28.000+0000 I [pitr] oplog slicer is paused for lock [Snapshot backup, opid: 66d58445d58df8df2b89d091]
2024/09/02 10:07:35 [#################.......] <database>.<collection> 136762220/190760640 (71.7%)
2024-09-02T10:07:35.760+0000 Mux close namespace <database>.<collection>
2024-09-02T10:07:35.760+0000 Mux finish
2024-09-02T10:07:35.760+0000 archive writer: error writing data for collection `<database>.<collection>` to disk: error writing to file: short write / io: read/write on closed pipe
2024-09-02T10:07:35.000+0000 I [backup/2024-09-02T09:24:21Z] dropping tmp collections
2024-09-02T10:07:36.000+0000 I [backup/2024-09-02T09:24:21Z] created chunk 2024-09-02T10:04:25 - 2024-09-02T10:07:35
2024-09-02T10:07:36.000+0000 I [backup/2024-09-02T09:24:21Z] mark RS as error `mongodump: decompose: archive parser: corruption found in archive; ParserConsumer.BodyBSON() ( "<database>.<collection>": write: upload: "<database>.<collection>": upload to S3: MultipartUpload: upload multipart failed
upload id: <upload-id>
caused by: RequestError: send request failed
caused by: Put "<upload-url>": write tcp <upload-source>-><upload-target>: write: connection timed out )`: <nil>
2024-09-02T10:07:36.000+0000 I [backup/2024-09-02T09:24:21Z] mark backup as error `mongodump: decompose: archive parser: corruption found in archive; ParserConsumer.BodyBSON() ( "<database>.<collection>": write: upload: "<database>.<collection>": upload to S3: MultipartUpload: upload multipart failed
upload id: <upload-id>
caused by: RequestError: send request failed
caused by: Put "<upload-url>": write tcp <upload-source>-><upload-target>: write: connection timed out )`: <nil>
2024-09-02T10:07:36.000+0000 E [backup/2024-09-02T09:24:21Z] backup: mongodump: decompose: archive parser: corruption found in archive; ParserConsumer.BodyBSON() ( "<database>.<collection>": write: upload: "<database>.<collection>": upload to S3: MultipartUpload: upload multipart failed
upload id: <upload-id>
caused by: RequestError: send request failed
caused by: Put "<upload-url>": write tcp <upload-source>-><upload-target>: write: connection timed out )
2024-09-02T10:07:36.000+0000 D [backup/2024-09-02T09:24:21Z] releasing lock
Seeing this as well. We had upgraded our clusters and operator from v1.15 -> v1.16 and these intermittent backup failures started happening daily.
percona-backup-mongodb 2.3.0 -> 2.4.1
Reverted back from 2.4.1 -> 2.3.0 and all my backups started working again. Clusters and operator remain on v1.16.
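For anyone who needs to pin PBM independently of the operator version, the backup image can be set explicitly in the PerconaServerMongoDB custom resource; a minimal sketch (the image tag is only an example matching the downgrade described above):

spec:
  backup:
    enabled: true
    # pin the backup-agent sidecar to a specific PBM release
    image: percona/percona-backup-mongodb:2.3.0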
Tonight's backup failed with
[...]
It is difficult to understand the root of the issue. Maybe it is some storage connection problem. I can suggest increasing numMaxRetries.
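For reference, the S3 retry behavior is part of the PBM storage configuration; a minimal sketch of the relevant section, assuming the retryer block available in recent PBM releases (values are illustrative, check the configuration reference for your version):

storage:
  type: s3
  s3:
    bucket: <bucket>
    region: <region>
    retryer:
      # number of upload retries before PBM gives up; the default is lower
      numMaxRetries: 10
      minRetryDelay: 30ms
      maxRetryDelay: 5m

When PBM runs as the operator's backup-agent sidecar, this configuration is generated from the custom resource, so changes should be made there (if your CRD version exposes these fields) rather than edited by hand.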
Same problem, though in my case they're failing for the exact same reason: starting deadline exceeded. The last ~7 backups have all failed in the same way. I'm using v1.16.1 of the operator and database server, running on Kubernetes v1.30.5.
Quick update - I've switched back to v2.6.0 and added CPU and memory to my node; haven't had a failure since.
@michaelswierszcz @asemarian PSMDBO 1.18.0 with PBM 2.7.0 was released yesterday. PBM 2.6.0/2.7.0 include a lot of bug fixes.
Report
We're running several psmdb clusters in multiple Kubernetes clusters. For most of them the psmdb-backup has failed at least once during their lifetime. We have three psmdb clusters where the backup fails irregularly, around once a week.
We found a multitude of apparent underlying issues that we cannot trace back to infrastructure problems or faulty configuration.
We can't find indicators for when a backup will fail, and we see no difference between a backup that fails and one that then works for multiple days back to back.
The affected clusters differ in storage size and throughput, ranging from some 10 MB to a few GB, but none are larger than 100 GB.
We first noticed this problem in October 2023.
More about the problem
pbm-agent hangs
To debug the faulty pods, we open a shell to the pbm-agent container and issue pbm commands. Sometimes the pbm-agent is unresponsive and needs to be killed from inside the container.
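For context, a typical debugging session on an affected pod looks roughly like this (pod and namespace names are placeholders; backup-agent is the sidecar container name used by the operator):

kubectl exec -it <psmdb-pod> -n <namespace> -c backup-agent -- bash
pbm status            # agents, currently running operations, last backups
pbm logs --tail 100   # recent log entries collected from the agents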
Election winner does nothing
Logs of the backup-agent container state that an election has started for which agent is responsible for creating a backup. One gets elected and the others respect that, thus not starting a backup. Unfortunately, sometimes the elected agent does not start the backup either.
Stale lock in database
Mostly in combination with the above issue, backup processes sometimes stop during execution and leave a stale lock in the database, preventing subsequent backup jobs from creating new backups.
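As a workaround we inspect the lock state directly; a sketch via mongosh, assuming PBM keeps its locks in the pbmLock / pbmLockOp collections of the admin database (verify this against your PBM version, and only remove a lock after confirming that no backup is actually running):

use admin
db.pbmLock.find().pretty()   // check whether the heartbeat on the lock looks stale
db.pbmLock.deleteMany({})    // remove the stale lock so the next scheduled backup can start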
Starting deadline exceeded
Other times the backup-agent logs "Error: starting deadline exceeded", also often creating a stale lock in the database.
Steps to reproduce
Versions
Anything else?
No response