vitessio / vitess

Vitess is a database clustering system for horizontal scaling of MySQL.
http://vitess.io
Apache License 2.0

`BuiltinBackupEngine`: Retry failed files #17259

Open frouioui opened 2 days ago

frouioui commented 2 days ago

Current Implementation

When taking a backup or restoring with the builtin backup engine today, if any file fails to be read or written, the entire process fails and we mark the backup as unusable. See code. Some of the errors we can encounter when backing up or restoring a file are network issues, storage layer issues, or other things outside the user's control. Failing the entire process when such an error happens is not ideal, especially for long-running backups or restores that can take a few hours to complete. Taking transient network issues as an example, retrying the file will likely succeed.

Proposed Enhancement

Ideally we would have a per-file retry mechanism that retries a file once if it fails. Once we detect a failure, we check whether we have reached the maximum number of retries: if we have, we record the error and cancel the operation as we do today; if we have not, we log that we are retrying the given file and call be.backupFile once more. The file system storage and the other backends (GCS, S3, etc.) do not append to an existing object; they overwrite the value if we add a file again using the same key, so there is no need to "clear" what was previously written to the backend storage.

The maximum number of retries should be a constant set to 1.

This enhancement will only apply to the builtin backup engine. For the xtrabackup and mysqlsh engines, we directly start the underlying process and have no control over how they handle each file.

Trade-Offs

This enhancement brings a small trade-off: potentially longer backups/restores when a file fails persistently. If a file is bound to fail no matter what, we will retry it once, making the process a bit longer, before canceling the entire process and failing out. On the other hand, if a file only fails once, we retry it, the file passes, and the process succeeds instead of failing.