vitessio / vitess

Vitess is a database clustering system for horizontal scaling of MySQL.
http://vitess.io
Apache License 2.0

Bug Report: Race condition on uploading backup and manifest #16825

Open rvrangel opened 5 days ago

rvrangel commented 5 days ago

Overview of the Issue

We hit a rare issue where a backup was reported as done, but only the MANIFEST was written to S3 and the actual backup file was missing!

Looking at the logs, it seems we tried to complete the upload of backup.xbstream.gz after the MANIFEST was written, breaking the contract highlighted here.

Reproduction Steps

Not easy to reproduce, since it depends on S3 throttling us, but it might be possible to write a test that simulates this kind of behaviour.

Binary Version

Running v15 from our Slack branch.

Operating System and Environment details

Not OS related.

Log Fragments

I0920 19:04:40.356237 3431799 xtrabackupengine.go:357] xtrabackup stderr: 2024-09-20T19:04:40.355928-07:00 0 [Note] [MY-011825] [Xtrabackup] completed OK!
I0920 19:04:40.750777 3431799 xtrabackupengine.go:709] Found position: <<redacted>>
I0920 19:04:40.750838 3431799 xtrabackupengine.go:146] Closing backup file backup.xbstream.gz
I0920 19:04:40.750850 3431799 xtrabackupengine.go:201] Writing backup MANIFEST
I0920 19:04:40.751397 3431799 xtrabackupengine.go:237] Backup completed
I0920 19:04:40.751421 3431799 xtrabackupengine.go:146] Closing backup file MANIFEST
W0920 19:04:41.425825 3431799 rpc_server.go:80] TabletManager.Backup(concurrency:4)(on us_east_1c-0169388481 from ) error: MultipartUpload: upload multipart failed
	upload id: <<redacted>>
caused by: Throttling: Rate exceeded
	status code: 400, request id: <<redacted>>

deepthi commented 4 days ago

@frouioui looks like you might have run into a similar issue and fixed it in #16806?

EDIT: Taking another look, they look like different race conditions.