After talking it over offline, this will not be implemented for several reasons:
I am currently working on another enhancement that will help make backups (with the builtin backup engine) faster in case of failures: https://github.com/vitessio/vitess/issues/17259.
Introduction
This RFC proposes an enhancement to `vtbackup` that adds two new phases to the backup process. The first is a per-file read-after-write check: each file is read back after being written to ensure its hash matches the one computed during the write. The second is an end-to-end restorability check: the new backup is restored to verify that it can actually be restored. The goal of this enhancement is to verify the integrity and restorability of each backup. These new phases should reduce the risk of unexpected failures during recovery, specifically errors that are not caught by our backup engine. These bugs or errors can happen on either side, Vitess or the storage layer, and may be the result of faulty code, unexpected behavior, or network/storage issues.

Existing Implementation
Currently, `vtbackup` creates new backups without verifying their restorability. The process is as follows (a simplified sketch is shown at the end of this section):

1. Start the `vtbackup` process.
2. Start a `mysqld` process from the latest backup, if there is any.
3. Wait for the `mysqld` process to catch up on the primary.
4. Take a new backup, then prune old backups and exit.

While this is efficient, it lacks a mechanism to ensure that new backups are valid and restorable. Issues in our backup process can happen at any time and for any reason: a bug introduced by a developer, a network issue, or an issue on the storage layer (AWS, GCP, or others). A recent example that impacted some deployments of Vitess is https://github.com/vitessio/vitess/issues/17063: a bug was introduced in the builtin backup engine, causing backups to report success even when they failed.
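For illustration only, here is a minimal Go sketch of that flow; the helper names are hypothetical and do not correspond to actual vtbackup functions:

```go
// A minimal sketch of the existing vtbackup flow described above, using
// hypothetical helper names; this is not the actual vtbackup code, only an
// illustration of the sequence of steps.
package main

import (
	"context"
	"log"
)

func restoreLatestBackup(ctx context.Context) error { return nil } // start mysqld from the latest backup, if any
func catchUpOnPrimary(ctx context.Context) error    { return nil } // replicate until caught up with the primary
func takeNewBackup(ctx context.Context) error       { return nil } // write the new backup to backup storage

func main() {
	ctx := context.Background()
	steps := []func(context.Context) error{restoreLatestBackup, catchUpOnPrimary, takeNewBackup}
	for _, step := range steps {
		if err := step(ctx); err != nil {
			log.Fatalf("vtbackup step failed: %v", err)
		}
	}
	// Today the process ends here: success is reported without verifying
	// that the backup we just wrote can actually be restored.
}
```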
Problem
As detailed in the section above, the current implementation lacks a restorability check, both at the per-file level and at a more global level. A backup can therefore report itself as successful even when it is not, leading to potential issues in a production deployment: subsequent `vtbackup` calls will restore from this invalid backup and fail, so no new backups get created. If this goes unnoticed for some time, we may reach a point where the latest valid backup has been removed or is too old for the current binary logs.

Existing Solutions

This section covers what already exists in other distributed database systems.
CockroachDB
CockroachDB provides several query statements that enable its users to check the validity of a backup at different levels: cluster, database, or table. It is also possible to check the restorability of the schema only; this ensures the schema can be restored, but does not verify the actual data (https://www.cockroachlabs.com/docs/stable/backup-validation).
Cassandra
Cassandra has a tool named Medusa that handles backups and restores. This tool has a `verify` command that ensures the hashes stored in the manifest match those of the stored files.

Proposed Solution
The solution is divided into two phases: an end-to-end restorability check and a per-file read-after-write check. This section details both phases.
End-to-End Restorability Check
Currently, taking a backup is the last step before pruning existing backups and exiting `vtbackup`. The end-to-end restorability check would come right after taking a backup:

1. `vtbackup` asks the `mysqlctl` process (created earlier in the process) to perform a second restore using the new backup.
2. If the restore succeeds, `vtbackup` marks the new backup as successful and exits.
3. If the restore fails, `vtbackup` logs what went wrong during the restore, giving end-users proactive feedback on what would happen if they tried to restore this backup.

This check will reuse the `mysqlctl` process that gets created early on and the same restore logic that already exists in the code base, which simplifies the implementation. A sketch of where this check would slot in is shown below.
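To make the flow concrete, here is a hedged Go sketch of where the check could fit; all function names are hypothetical placeholders, and the real implementation would reuse the existing restore code path instead:

```go
// An illustrative sketch of the end-to-end restorability check. The helpers
// below are hypothetical stand-ins, not actual vtbackup or mysqlctl APIs.
package main

import (
	"context"
	"fmt"
	"log"
)

func takeBackup(ctx context.Context) (string, error)       { return "backup-001", nil }
func restoreBackup(ctx context.Context, name string) error { return nil } // would reuse the existing restore logic via mysqlctl
func pruneOldBackups(ctx context.Context) error             { return nil }

func finishBackup(ctx context.Context, verifyRestore bool) error {
	name, err := takeBackup(ctx)
	if err != nil {
		return err
	}
	if verifyRestore {
		// New phase: restore the backup we just took before marking it successful.
		if err := restoreBackup(ctx, name); err != nil {
			return fmt.Errorf("backup %q failed the restorability check: %w", name, err)
		}
	}
	return pruneOldBackups(ctx)
}

func main() {
	if err := finishBackup(context.Background(), true); err != nil {
		log.Fatal(err)
	}
	log.Println("backup verified, old backups pruned")
}
```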
Read-After-Write

We currently write each file from inside a goroutine, limited to a certain number of concurrent files by the `--concurrency` flag. Writing a file involves several steps: getting an `io.Reader` pointing to the file on disk, getting an `io.Writer` from the `BackupHandle`, and copying the content from the `io.Reader` to the `io.Writer` while logging some stats and computing a hash of everything written. Finally, we fetch a hash for the file we wrote, which we will use later to create the manifest file.

The implementation I am proposing is to compare the hash computed while writing the file with the hash obtained after reading the file back. We would execute a lightweight version of `restoreFile` that does not actually write the file to the filesystem or log the stats, but does everything else: reading from the storage layer and decompressing. Once we have fully read the file, we get a hash that we can directly compare with the hash computed while writing the file. If the hashes do not match, the current file is marked as failed, failing the entire backup process. A minimal sketch of this check follows below.

Consideration was given to reading the files back in a separate set of goroutines with its own concurrency limit (which could be the same as the one used for writing), but I think that would make the implementation a lot more complex, as we would need to retro-validate files after they have been written. Moreover, we can document this clearly and suggest that users increase their concurrency limit (`--concurrency`) if they opt in to this extra verification step.
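Below is a minimal, self-contained sketch of the hash comparison at the core of this check; the in-memory buffer stands in for the `BackupHandle` and the real `backupFile`/`restoreFile` code paths, which are not reproduced here, and SHA-256 is used only as an example hash:

```go
// Sketch of the read-after-write idea: hash the data while writing it, then
// re-read the stored copy (without writing to the local filesystem) and
// compare the two hashes.
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"strings"
)

// writeWithHash copies src to dst while computing a hash of everything written,
// mirroring how a backup engine can hash a file as it uploads it.
func writeWithHash(dst io.Writer, src io.Reader) (string, error) {
	h := sha256.New()
	if _, err := io.Copy(io.MultiWriter(dst, h), src); err != nil {
		return "", err
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

// readBackHash re-reads the stored file and hashes it: the "lightweight
// restoreFile" described above, minus filesystem writes and stats logging.
func readBackHash(stored io.Reader) (string, error) {
	h := sha256.New()
	if _, err := io.Copy(h, stored); err != nil {
		return "", err
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

func main() {
	var remote bytes.Buffer // stand-in for the io.Writer given by the BackupHandle
	wrote, err := writeWithHash(&remote, strings.NewReader("file contents"))
	if err != nil {
		panic(err)
	}
	read, err := readBackHash(bytes.NewReader(remote.Bytes()))
	if err != nil {
		panic(err)
	}
	if wrote != read {
		fmt.Println("hash mismatch: mark this file, and the whole backup, as failed")
		return
	}
	fmt.Println("read-after-write check passed:", wrote)
}
```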
Implementation

Implementing these two proposed solutions will require adding at least one new flag to enable/disable these checks. I would find it preferable to add two new flags, one per check, as each has its own advantages and disadvantages. Both flags will be disabled by default. A sketch of how such flags could be registered follows below.
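As a sketch only, assuming two hypothetical flag names (neither is final) and the spf13/pflag package for flag parsing:

```go
// Illustrative registration of the two opt-in flags; the names below are
// placeholders, not the actual flags that would ship in vtbackup.
package main

import (
	"fmt"

	"github.com/spf13/pflag"
)

var (
	verifyFiles   = pflag.Bool("backup-verify-files", false, "read every file back after writing it and compare hashes (read-after-write)")
	verifyRestore = pflag.Bool("backup-verify-restore", false, "perform a full restore of the new backup before declaring it successful")
)

func main() {
	pflag.Parse()
	fmt.Println("read-after-write check enabled:", *verifyFiles)
	fmt.Println("end-to-end restore check enabled:", *verifyRestore)
}
```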
Trade-Offs
The two proposed solutions come with a significant impact on performance. The total runtime of `vtbackup` will increase for two reasons: every file now has to be read back for hash verification, and a full restore is performed after the backup is complete. Resource usage will also increase, both for the `vtbackup` process and for the storage layer.

These trade-offs can be mitigated by making each check optional, so users can enable them only for critical keyspaces, and by documenting the trade-offs in our docs.
Benefits
These two new checks will improve the reliability of our backups without requiring manual testing from users. They should also give more confidence to end-users running Vitess in production, where backups play a crucial role. The problems described above should be largely prevented with these two checks enabled.