
RFC: Restorability Checks in `vtbackup` #17240

Closed frouioui closed 3 days ago

frouioui commented 1 week ago

Introduction

This RFC proposes an enhancement to vtbackup by adding two new phases to the backup process. The first phase is a per-file check: after writing each file, it is read back to ensure the hashes match. The second is an end-to-end restorability check that restores the new backup to confirm it can actually be restored. The goal of this enhancement is to verify the integrity and restorability of each backup. These new phases should reduce the risk of unexpected failures during recovery, specifically errors that are not caught by our backup engine. Such bugs or errors can happen on either side, Vitess or the storage layer, and may be the result of faulty code, unexpected behavior, or network/storage issues.

Existing Implementation

Currently, vtbackup creates new backups without verifying their restorability. The process is as follows:

  1. Bring up a new vtbackup process
  2. Restore that mysqld process from the latest backup, if there is any
  3. Wait for the mysqld process to catch up on the primary
  4. Take a new backup
  5. Prune old backups (optional, depending on flags)
  6. Exit
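
For illustration, here is a minimal Go sketch of this flow, using hypothetical helper names rather than the actual vtbackup code:

```go
// Sketch only: these helpers are hypothetical stand-ins for the existing
// vtbackup/mysqlctl logic, not the real function names.
func runVtbackup(ctx context.Context) error {
	mysqld, err := startMysqld(ctx) // 1. bring up a local mysqld
	if err != nil {
		return err
	}
	defer mysqld.Shutdown(ctx)

	// 2-3. restore from the latest backup (if any) and catch up on the primary
	if err := restoreLatestBackup(ctx, mysqld); err != nil {
		return err
	}
	if err := catchUpOnPrimary(ctx, mysqld); err != nil {
		return err
	}

	// 4-5. take a new backup, then optionally prune old ones
	if err := takeBackup(ctx, mysqld); err != nil {
		return err
	}
	return pruneOldBackups(ctx) // 6. exit
}
```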

While this is efficient, it lacks a mechanism to ensure that new backups are valid and restorable. Issues in our backup process can happen at any time and for any reason: a bug introduced by a developer, a network issue, or an issue in the storage layer (AWS, GCP, or others). A recent example that impacted some deployments of Vitess is https://github.com/vitessio/vitess/issues/17063: a bug was introduced in the builtinbackup engine, causing backups to report success even when they failed.

Problem

As detailed in the section above, the current implementation lacks a restorability check both at a per-file level and at a more global level. This can lead to a backup reporting itself as successful even when it is not, creating potential issues in a production deployment.

Existing Solutions

This section covers what already exists in other distributed database systems.

CockroachDB

CockroachDB provides several query statements that enable its users to check the validity of a backup at different levels: cluster, database, or per-table. It also makes it possible to check the restorability of the schema only, which ensures the schema can be restored but does not validate the actual data. (https://www.cockroachlabs.com/docs/stable/backup-validation)

Cassandra

Cassandra has a tool named Medusa that handles backups and restores. This tool has a verify command that ensures the hashes stored in the manifest match those of the stored files.

Proposed Solution

The solution is divided into two phases: an end-to-end restorability check, and a per-file check (read-after-write). The following subsections detail both phases.

End-to-End Restorability Check

Currently, taking a backup is the last step before pruning existing backups and exiting vtbackup. The end-to-end restorability check would come right after taking a backup.

This check will reuse the mysqlctl process that gets created early on and will use the same restore logic that already exists in the code base, which facilitates the implementation.
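
As a rough sketch, with hypothetical helpers standing in for the existing backup and restore code paths, the check would slot in like this:

```go
// takeBackup, restoreBackup and pruneOldBackups are hypothetical wrappers
// around the existing vtbackup logic, not real function names.
func backupWithRestorabilityCheck(ctx context.Context) error {
	bh, err := takeBackup(ctx)
	if err != nil {
		return err
	}
	// Reuse the mysqld managed by vtbackup and the existing restore logic to
	// restore the backup we just created; any failure means the backup is
	// not restorable and vtbackup should exit with an error.
	if err := restoreBackup(ctx, bh); err != nil {
		return fmt.Errorf("backup failed the restorability check: %w", err)
	}
	return pruneOldBackups(ctx)
}
```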

Read-After-Write

We currently write each file from inside a goroutine, with the number of concurrent files limited by the --concurrency flag. Writing a file involves several steps: getting an io.Reader pointing to the file on disk, getting an io.Writer from the BackupHandle, and copying the content from the io.Reader to the io.Writer while logging some stats and computing a hash as we write. Finally, we obtain a hash for the file we wrote, which we later use to create the manifest file.
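
As a minimal sketch of that write path (using a CRC32 hash purely for illustration and leaving out the stats logging), the hash can be computed while copying, without a second pass over the file:

```go
package backupsketch

import (
	"fmt"
	"hash/crc32"
	"io"
)

// backupFileWithHash copies src (the file on disk) to dst (the io.Writer
// provided by the BackupHandle) and returns the hash computed along the way.
func backupFileWithHash(dst io.Writer, src io.Reader) (string, error) {
	hasher := crc32.NewIEEE()
	// Every byte written to dst is also fed to the hasher via MultiWriter,
	// so the hash is computed as we write.
	if _, err := io.Copy(io.MultiWriter(dst, hasher), src); err != nil {
		return "", err
	}
	return fmt.Sprintf("%08x", hasher.Sum32()), nil
}
```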

The implementation I am proposing is to compare the hash computed while writing the file with the hash we obtain after reading the file back. We would execute a lightweight version of restoreFile that does not actually write the file to the filesystem or log the stats, but does everything else: reading from the storage layer and decompressing. Once we have fully read the file, we get a hash that we can directly compare with the hash computed while writing the file. If the hashes do not match, the current file is marked as 'failed', failing the entire backup process.
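
A corresponding sketch of the verification side (again with CRC32 for illustration): the lightweight read path feeds the bytes straight into a hasher instead of writing them to disk, then compares the result against the hash recorded during the write.

```go
package backupsketch

import (
	"fmt"
	"hash/crc32"
	"io"
)

// verifyFileHash re-reads a just-written backup file (readBack would be the
// decompressed stream coming back from the storage layer) and compares its
// hash against the one computed during the write. Nothing is written to the
// local filesystem: the hasher consumes all the bytes.
func verifyFileHash(readBack io.Reader, writtenHash string) error {
	hasher := crc32.NewIEEE()
	if _, err := io.Copy(hasher, readBack); err != nil {
		return err
	}
	if readHash := fmt.Sprintf("%08x", hasher.Sum32()); readHash != writtenHash {
		return fmt.Errorf("hash mismatch: wrote %s, read back %s", writtenHash, readHash)
	}
	return nil
}
```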

Consideration was given to reading the files back in a separate set of goroutines with its own concurrency limit (which could be the same as the one used for writing), but I think that would make the implementation a lot more complex, as we would need to retro-validate files after they have been written. Moreover, I think we could document this clearly and suggest that users increase their concurrency limit (--concurrency) if they opt in to this extra verification step.
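
To illustrate that design choice, here is a rough sketch of keeping the check inside the same worker goroutine, so the existing --concurrency limit bounds both the write and the read-back; writeFile and readBackAndVerify are hypothetical helpers:

```go
// backupFiles writes and verifies each file in the same goroutine, bounded by
// a simple buffered-channel semaphore standing in for --concurrency.
func backupFiles(files []string, concurrency int) error {
	sem := make(chan struct{}, concurrency)
	errs := make(chan error, len(files))
	var wg sync.WaitGroup
	for _, f := range files {
		wg.Add(1)
		go func(f string) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a concurrency slot
			defer func() { <-sem }() // release it when done
			hash, err := writeFile(f)
			if err == nil {
				// Read the file back and compare hashes before moving on.
				err = readBackAndVerify(f, hash)
			}
			if err != nil {
				errs <- fmt.Errorf("%s: %w", f, err)
			}
		}(f)
	}
	wg.Wait()
	close(errs)
	// Any single failure fails the whole backup.
	for err := range errs {
		return err
	}
	return nil
}
```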

Implementation

Implementing these two proposed solutions will require adding at least one new flag to enable/disable these checks. However, I find it preferable to add two new flags, one to enable/disable each check, as they each have their own advantages and disadvantages. Both flags will be disabled by default.
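
For example, the two opt-in flags could be registered on vtbackup's flag set roughly like this (the flag names below are purely hypothetical, nothing is merged):

```go
// Hypothetical flag names, one per check, both disabled by default.
var (
	verifyFileHashes   bool
	restoreAfterBackup bool
)

func registerRestorabilityFlags(fs *pflag.FlagSet) {
	fs.BoolVar(&verifyFileHashes, "backup-verify-file-hashes", false,
		"Read back each file after writing it and compare its hash with the one computed during the write.")
	fs.BoolVar(&restoreAfterBackup, "backup-verify-restorability", false,
		"Restore the newly created backup after it completes to verify it is restorable.")
}
```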

Trade-Offs

The two proposed solutions come with a significant impact on performance. The total runtime of vtbackup will increase for two reasons: every file now needs to be read back for hash verification, and a full restore is performed after the backup completes.

Resource usage will also increase, as more compute resources will be needed for both the vtbackup process and the storage layer.

These trade-offs can be mitigated by making each check optional, allowing users to enable them only for critical keyspaces, and by documenting the trade-offs in our docs.

Benefits

These two new checks will improve the reliability of our backups without requiring manual testing from users. They should also give more confidence to end users running Vitess in production, where backups play a crucial role. The problems described above should be largely mitigated with these two checks enabled.

frouioui commented 3 days ago

After talking it over offline, this will not be implemented for several reasons:

  1. New bugs, similar to https://github.com/vitessio/vitess/issues/17063, can be introduced in the restore code path, which would make the entire restorability check faulty and unreliable.
  2. The problems listed above about backup and binlog retention are not Vitess's responsibility.
  3. If a vtbackup job fails and goes unnoticed, the same thing could happen even with a restorability check, which would still expose the problems listed.
  4. The code complexity these proposed solutions would bring is too great compared to the real benefit, and the performance cost and time added to a vtbackup job would also outweigh the benefits.

I am currently working on another enhancement that will help make backups (with the builtin backup engine) faster in case of failures: https://github.com/vitessio/vitess/issues/17259.