
RFC: Restorability Checks in `vtbackup` #17240

Closed frouioui closed 3 days ago

frouioui commented 1 week ago

Introduction

This RFC proposes an enhancement to vtbackup by adding two new phases to the backup process. The first phase is a per-file check: after writing each file, it is read back to ensure the hashes match. The second is an end-to-end restorability check that restores the new backup to confirm it can actually be restored. The goal of this enhancement is to verify the integrity and restorability of each backup. These new phases should reduce the risk of unexpected failures during recovery, specifically errors that are not caught by our backup engine. Such bugs or errors can happen on either side, Vitess or the storage layer, and may be the result of faulty code, unexpected behavior, or network/storage issues.

Existing Implementation

Currently, vtbackup creates new backups without verifying their restorability. The process is as follows:

  1. Bring up a new vtbackup process
  2. Restore that mysqld process from the latest backup, if there is any
  3. Wait for the mysqld process to catch up on the primary
  4. Take a new backup
  5. Prune old backups (optional, depending on flags)
  6. Exit
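
For illustration, here is a minimal Go sketch of this flow, using hypothetical helper names rather than the actual vtbackup code:

```go
// Sketch only: these helpers are hypothetical stand-ins for the existing
// vtbackup/mysqlctl logic, not the real function names.
func runVtbackup(ctx context.Context) error {
	mysqld, err := startMysqld(ctx) // 1. bring up a local mysqld
	if err != nil {
		return err
	}
	defer mysqld.Shutdown(ctx)

	// 2-3. restore from the latest backup (if any) and catch up on the primary
	if err := restoreLatestBackup(ctx, mysqld); err != nil {
		return err
	}
	if err := catchUpOnPrimary(ctx, mysqld); err != nil {
		return err
	}

	// 4-5. take a new backup, then optionally prune old ones
	if err := takeBackup(ctx, mysqld); err != nil {
		return err
	}
	return pruneOldBackups(ctx) // 6. exit
}
```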

While this is efficient, it lacks a mechanism to ensure that new backups are valid and restorable. Issues in our backup process can happen at any time and for any reason: a bug introduced by a developer, a network issue, or an issue in the storage layer (AWS, GCP, or others). A recent example that impacted some deployments of Vitess is https://github.com/vitessio/vitess/issues/17063: a bug was introduced in the builtinbackup engine, causing backups to report success even when they failed.

Problem

As detailed in the section above, the current implementation lacks a restorability check both at a per-file level and at a more global level. This can lead to a backup reporting itself as successful even when it is not, creating potential issues in a production deployment.

Existing Solutions

This section covers what already exists in other distributed database systems.

CockroachDB

CockroachDB provides several query statements that enable its users to check the validity of a backup at different levels: cluster, database, or per-table. It also makes it possible to check the restorability of the schema only, which ensures the schema can be restored but does not validate the actual data. (https://www.cockroachlabs.com/docs/stable/backup-validation)

Cassandra

Cassandra has a tool named Medusa that handles backups and restores. This tool has a verify command that ensures the hashes stored in the manifest match those of the stored files.

Proposed Solution

The solution is divided into two phases: an end-to-end restorability check, and a per-file check (read-after-write). The following subsections detail both phases.

End-to-End Restorability Check

Currently, taking a backup is the last step before pruning existing backups and exiting vtbackup. The end-to-end restorability check would come right after taking a backup.

This check will reuse the mysqlctl process that gets created early on and will use the same restore logic that already exists in the code base, which facilitates the implementation.
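
As a rough sketch, with hypothetical helpers standing in for the existing backup and restore code paths, the check would slot in like this:

```go
// takeBackup, restoreBackup and pruneOldBackups are hypothetical wrappers
// around the existing vtbackup logic, not real function names.
func backupWithRestorabilityCheck(ctx context.Context) error {
	bh, err := takeBackup(ctx)
	if err != nil {
		return err
	}
	// Reuse the mysqld managed by vtbackup and the existing restore logic to
	// restore the backup we just created; any failure means the backup is
	// not restorable and vtbackup should exit with an error.
	if err := restoreBackup(ctx, bh); err != nil {
		return fmt.Errorf("backup failed the restorability check: %w", err)
	}
	return pruneOldBackups(ctx)
}
```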

Read-After-Write

We currently write each file from inside a goroutine, with the number of concurrent files limited by the --concurrency flag. Writing a file involves several steps: getting an io.Reader pointing to the file on disk, getting an io.Writer from the BackupHandle, and copying the content from the io.Reader to the io.Writer while logging some stats and computing a hash as we write. Finally, we obtain a hash for the file we wrote, which we later use to create the manifest file.
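
As a minimal sketch of that write path (using a CRC32 hash purely for illustration and leaving out the stats logging), the hash can be computed while copying, without a second pass over the file:

```go
package backupsketch

import (
	"fmt"
	"hash/crc32"
	"io"
)

// backupFileWithHash copies src (the file on disk) to dst (the io.Writer
// provided by the BackupHandle) and returns the hash computed along the way.
func backupFileWithHash(dst io.Writer, src io.Reader) (string, error) {
	hasher := crc32.NewIEEE()
	// Every byte written to dst is also fed to the hasher via MultiWriter,
	// so the hash is computed as we write.
	if _, err := io.Copy(io.MultiWriter(dst, hasher), src); err != nil {
		return "", err
	}
	return fmt.Sprintf("%08x", hasher.Sum32()), nil
}
```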

The implementation I am proposing is to compare the hash computed while writing the file with the hash we obtain after reading the file back. We would execute a lightweight version of restoreFile that does not actually write the file to the filesystem or log the stats, but does everything else: reading from the storage layer and decompressing. Once we have fully read the file, we get a hash that we can directly compare with the hash computed while writing the file. If the hashes do not match, the current file is marked as 'failed', failing the entire backup process.
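
A corresponding sketch of the verification side (again with CRC32 for illustration): the lightweight read path feeds the bytes straight into a hasher instead of writing them to disk, then compares the result against the hash recorded during the write.

```go
package backupsketch

import (
	"fmt"
	"hash/crc32"
	"io"
)

// verifyFileHash re-reads a just-written backup file (readBack would be the
// decompressed stream coming back from the storage layer) and compares its
// hash against the one computed during the write. Nothing is written to the
// local filesystem: the hasher consumes all the bytes.
func verifyFileHash(readBack io.Reader, writtenHash string) error {
	hasher := crc32.NewIEEE()
	if _, err := io.Copy(hasher, readBack); err != nil {
		return err
	}
	if readHash := fmt.Sprintf("%08x", hasher.Sum32()); readHash != writtenHash {
		return fmt.Errorf("hash mismatch: wrote %s, read back %s", writtenHash, readHash)
	}
	return nil
}
```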

Consideration was given to reading the files back in a separate set of goroutines with its own concurrency limit (which could be the same as the one used for writing), but I think that would make the implementation a lot more complex, as we would need to retro-validate files after they have been written. Moreover, I think we could document this clearly and suggest that users increase their concurrency limit (--concurrency) if they opt in to this extra verification step.
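
To illustrate that design choice, here is a rough sketch of keeping the check inside the same worker goroutine, so the existing --concurrency limit bounds both the write and the read-back; writeFile and readBackAndVerify are hypothetical helpers:

```go
// backupFiles writes and verifies each file in the same goroutine, bounded by
// a simple buffered-channel semaphore standing in for --concurrency.
func backupFiles(files []string, concurrency int) error {
	sem := make(chan struct{}, concurrency)
	errs := make(chan error, len(files))
	var wg sync.WaitGroup
	for _, f := range files {
		wg.Add(1)
		go func(f string) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a concurrency slot
			defer func() { <-sem }() // release it when done
			hash, err := writeFile(f)
			if err == nil {
				// Read the file back and compare hashes before moving on.
				err = readBackAndVerify(f, hash)
			}
			if err != nil {
				errs <- fmt.Errorf("%s: %w", f, err)
			}
		}(f)
	}
	wg.Wait()
	close(errs)
	// Any single failure fails the whole backup.
	for err := range errs {
		return err
	}
	return nil
}
```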

Implementation

Implementing these two proposed solutions will require adding at least one new flag to enable/disable these checks. However, I find it preferable to add two new flags, one to enable/disable each check, as they each have their own advantages and disadvantages. Both flags will be disabled by default.
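
For example, the two opt-in flags could be registered on vtbackup's flag set roughly like this (the flag names below are purely hypothetical, nothing is merged):

```go
// Hypothetical flag names, one per check, both disabled by default.
var (
	verifyFileHashes   bool
	restoreAfterBackup bool
)

func registerRestorabilityFlags(fs *pflag.FlagSet) {
	fs.BoolVar(&verifyFileHashes, "backup-verify-file-hashes", false,
		"Read back each file after writing it and compare its hash with the one computed during the write.")
	fs.BoolVar(&restoreAfterBackup, "backup-verify-restorability", false,
		"Restore the newly created backup after it completes to verify it is restorable.")
}
```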

Trade-Offs

The two proposed solutions come with a significant impact on performance. The total runtime of vtbackup will increase for two reasons: every file now needs to be read back for hash verification, and a full restore is performed after the backup completes.

Resource usage will also increase, as more compute resources will be needed for both the vtbackup process and the storage layer.

These trade-offs can be mitigated by making each check optional, allowing users to enable them only for critical keyspaces, and by documenting the trade-offs in our docs.

Benefits

These two new checks will improve the reliability of our backups without requiring manual testing from users. They should also give more confidence to end users running Vitess in production, where backups play a crucial role. The problems described above should be largely mitigated with these two checks enabled.

frouioui commented 3 days ago

After talking it over offline, this will not be implemented for several reasons:

  1. New bugs, similar to https://github.com/vitessio/vitess/issues/17063, can be introduced in the restore code path, which would make the entire restorability check faulty and unreliable.
  2. The problems listed above about backup and binlog retention are not Vitess's responsibility.
  3. If a vtbackup job fails and goes unnoticed, the same thing could happen even with a restorability check, which would still expose the problems listed.
  4. The code complexity these proposed solutions would bring is too great compared to the real benefit, and the performance cost and time added to a vtbackup job would also outweigh the benefits.

I am currently working on another enhancement that will help make backups (with the builtin backup engine) faster in case of failures: https://github.com/vitessio/vitess/issues/17259.