vitessio / vitess

Vitess is a database clustering system for horizontal scaling of MySQL.
http://vitess.io
Apache License 2.0

RFC: incremental backup and point-in-time recovery #11227

Open shlomi-noach opened 2 years ago

shlomi-noach commented 2 years ago

We wish to implement a native solution for (offline) incremental backup and compatible point-in-time recovery in Vitess. There is already a Work in Progress PR. But let's first describe the problem, the offered solution, and how it differs from an existing prior implementation.

Background

Point-in-time recoveries make it possible to recover a database to a specific (or approximate) timestamp or position. The classic use case is a catastrophic change to the data, e.g. an unintentional DELETE FROM <table> or similar. Normally the damage only applies to a subset of the data; the database is generally still valid, and the app is still able to function. As such, we want to fix the specific damage inflicted. The flow is to restore the data on an offline/non-serving server, to a point in time immediately before the damage was done. It's then typically a manual process of salvaging the specific damaged records.

It's also possible to just throw away everything and roll back the entire database to that point in time, though that is an uncommon use case.

A point in time can be either an actual timestamp, or, more accurately, a position. Specifically in MySQL 5.7 and above, this will be a GTID set, the @@gtid_executed just before the damage. Since every transaction gets its own GTID value, it should be possible to restore to single-transaction granularity (whereas a timestamp is a coarser measurement).

A point-in-time recovery is possible by combining a full backup restore with an incremental stream of changes applied since that backup. There are two main techniques, in three different forms:

  1. Using binary logs, stored offline
  2. Using a binary log live stream
  3. Using Xtrabackup incremental backup

This RFC wishes to address (1). There is already prior work for (2). Right now we do not wish to address (3).

The existing prior work addresses (2), and specifically assumes:

Suggested solution, backup

We wish to implement a more general solution by actually backing up binary logs as part of the backup process. These can be stored on local disk, in S3, etc., the same way as any Vitess backup is stored. In fact, an incremental backup will be listed just like any other backup, and this listing is also the key to performing a restore.

The user will take an incremental backup similarly to how they take a full backup:

An incremental backup needs to have a starting point, given via the --incremental_from_pos flag. The incremental backup must cover that position, but does not have to start exactly at that position: it can start with an earlier position. See diagram below. The backup ends with the rough position of the time the backup was requested. It will cover the exact point in time where the request was made, and possibly extend slightly beyond that.
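
For illustration, and mirroring the restore invocation shown later in this RFC, an incremental backup request could look like the following (the tablet alias and position are reused from the examples elsewhere in this RFC; the exact syntax is subject to the PR):

vtctlclient -- Backup --incremental_from_pos "MySQL56/16b1039f-22b6-11ed-b765-0a43f95f28a3:1-867" zone1-0000000102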

An incremental backup is taken by copying binary logs. To do that, there is no need to shut down the MySQL server; it is free to be fully operational and serve traffic while the backup takes place. The backup process will rotate binary logs (FLUSH BINARY LOGS) so as to ensure the files it is backing up are safely immutable.
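
As a very rough sketch of that step, and not the builtin engine's actual code (the function name, package name, and use of database/sql are illustrative assumptions), the "capture position, then seal the files to copy" shape looks like:

package backupsketch

import "database/sql"

// sealBinlogs captures the GTID set executed so far, then rotates the binary
// logs. After the rotation, all of the captured transactions live in closed
// binlog files, which can be copied safely while the server keeps serving
// traffic. The requested --incremental_from_pos is the backup's lower bound;
// the sealed files cover at least the captured position, possibly slightly more.
func sealBinlogs(db *sql.DB) (string, error) {
	var position string
	if err := db.QueryRow("SELECT @@GLOBAL.gtid_executed").Scan(&position); err != nil {
		return "", err
	}
	if _, err := db.Exec("FLUSH BINARY LOGS"); err != nil {
		return "", err
	}
	return position, nil
}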

A manifest of an incremental backup may look like so:

{
  "BackupMethod": "builtin",
  "Position": "MySQL56/16b1039f-22b6-11ed-b765-0a43f95f28a3:1-883",
  "FromPosition": "MySQL56/16b1039f-22b6-11ed-b765-0a43f95f28a3:1-867",
  "Incremental": true,
  "BackupTime": "2022-08-25T12:55:05Z",
  "FinishedTime": "2022-08-25T12:55:05Z",
  "ServerUUID": "1ea0631b-22b6-11ed-933f-0a43f95f28a3",
  "TabletAlias": "zone1-0000000102",
  "CompressionEngine": "pargzip",
  "FileEntries": [
     ..
  ]
}

Suggested solution, restore/recovery

Again, this rides the familiar RestoreFromBackup command. A restore looks like:

vtctlclient -- RestoreFromBackup  --restore_to_pos  "MySQL56/16b1039f-22b6-11ed-b765-0a43f95f28a3:1-10000" zone1-0000000102

Vitess will attempt to find a path that recovers the database to that point in time. The path consists of exactly one full backup, followed by zero or more incremental restores. There could be exactly one such path, there could be multiple paths, or there could be no path. Consider the following scenarios:

Recovery scenario 1

[diagram: point-in-time recovery path, scenario 1]

This is the classic scenario. A full backup takes place at e.g. 12:10, then an incremental backup is taken from exactly that point and is valid up to 13:20, then the next one from exactly that point, valid up to 16:15, etc.

To restore the database to e.g. 20:00 (let's assume that's at position 16b1039f-22b6-11ed-b765-0a43f95f28a3:1-10000), we will restore the full backup, followed by incrementals 1 -> 2 -> 3 -> 4. Note that 4 exceeds 20:00 and vitess will only apply changes up to 20:00, or to be more precise, up to 16b1039f-22b6-11ed-b765-0a43f95f28a3:1-10000.

Recovery scenario 2

[diagram: point-in-time recovery path, scenario 2]

The above is actually identical to the first scenario. Notice how the first incremental backup precedes the full backup, and how backups 2 & 3 overlap. This is fine! We take strong advantage of MySQL's GTIDs. Because the overlapping transactions in 2 and 3 are consistently identified by the same GTIDs, MySQL is able to ignore the duplicates as we apply both restores one after the other.

Recovery scenario 3

[diagram: point-in-time recovery path, scenario 3]

In the above we have four different paths for recovery!

Any of them is valid; Vitess should choose however it pleases, ideally using as few backups as possible (hence preferring the 2nd or 4th option).

Recovery scenario 4

If we wanted to restore up to 22:15, then there's no incremental backup that can take us there, and the operation must fail before it even begins.

Finding paths

Vitess should be able to determine the recovery path before actually applying anything. It is able to do so by reading the available manifests and finding the shortest valid path to the requested point in time. Using a greedy algorithm, it will seek the most recent full backup at or before the requested time, and then the shortest sequence of incremental backups that takes us to that point.
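
To make the greedy idea concrete, here is a minimal, self-contained Go sketch, not the actual Vitess implementation: GTID sets are reduced to plain "everything up to N" sequence numbers, and the manifest carries only the fields that matter for path finding.

package pitrpath

import "fmt"

// manifest is a stand-in for a backup manifest: FromSeq/Seq play the role of
// the FromPosition/Position GTID sets.
type manifest struct {
	Incremental  bool
	FromSeq, Seq int64
}

// findPath picks the most recent full backup whose position is at or before
// the target, then greedily chains incremental backups: at each step it takes
// the one that starts within what is already covered and reaches furthest,
// until the target position is covered.
func findPath(backups []manifest, target int64) ([]manifest, error) {
	fullIdx := -1
	for i, b := range backups {
		if !b.Incremental && b.Seq <= target {
			if fullIdx == -1 || b.Seq > backups[fullIdx].Seq {
				fullIdx = i // the most recent qualifying full backup wins
			}
		}
	}
	if fullIdx == -1 {
		return nil, fmt.Errorf("no full backup at or before target %d", target)
	}
	path := []manifest{backups[fullIdx]}
	covered := backups[fullIdx].Seq
	for covered < target {
		best := -1
		for i, b := range backups {
			// Overlaps are fine (GTIDs deduplicate); gaps are not.
			if b.Incremental && b.FromSeq <= covered && b.Seq > covered {
				if best == -1 || b.Seq > backups[best].Seq {
					best = i
				}
			}
		}
		if best == -1 {
			return nil, fmt.Errorf("no incremental backup continues from position %d", covered)
		}
		path = append(path, backups[best])
		covered = backups[best].Seq
	}
	return path, nil
}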

Backups from multiple sources

Scenario (3) looks contrived, until you consider that backups may be taken from different tablets. These have different binary logs, rotated at different times -- but all share the same sequence of GTIDs. Since an incremental backup consists of full binary log copies, there could be overlaps between binary logs backed up from different tablets/MySQL servers.

Vitess should not care about the identity of the sources, should not care about the binary log names (one server's binlog.0000289 may come before another server's binlog.0000101), and should not care about the binary log count. It should only care about the GTID range an incremental backup covers: from (exclusive) to (inclusive).
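
In the simplified terms of the path-finding sketch above, that "from (exclusive), to (inclusive)" rule boils down to a single check when deciding whether a candidate incremental backup can be applied next (again a sketch, not the actual implementation):

package backupcover

// canApply reports whether an incremental backup covering the GTID range
// (from, to] can be applied on top of a server whose restored position is
// "covered". GTID sets are simplified to integer sequence numbers; real code
// would use GTID-set containment instead of integer comparisons.
func canApply(covered, from, to int64) bool {
	// Starting earlier than what we already have is fine (duplicates are
	// skipped thanks to GTIDs); starting later would leave a gap, and the
	// backup must actually advance our position.
	return from <= covered && to > covered
}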

Restore time

It should be noted that an incremental restore based on binary logs means sequentially applying changes to a server. This may take minutes or hours, depending on how many binary log events we need to apply.

Testing

As usual, testing is to take place in:

Thoughts welcome. Please see https://github.com/vitessio/vitess/pull/11097 for Work In Progress.

deepthi commented 2 years ago

Nicely written proposal. A few questions/comments:

shlomi-noach commented 2 years ago

In terms of topology, the restored tablet is part of the same existing keyspace/shard. Correct?

Correct, and we need the logic to prevent it from auto-replicating.

Restored tablet will not be serving, and most likely will be lagging.

:+1:

Any alerts that might be generated by this situation are the responsibility of the (human) operator to work around.

:+1:

shlomi-noach commented 2 years ago

In terms of topology, the restored tablet is part of the same existing keyspace/shard. Correct?

Thinking more on this, I'm not sure which is the preferred way: use the same keyspace or create a new keyspace. Using the same keyspace leads to the risk of the server unintentionally getting attached to the replication stream. In fact, that's what's happening in my dev env right now: I make a point in time restore, and then vitess auto-configures the restored server to replicate -- even though I skip the replication configuration in the restore process.

Is there a way to forcefully prevent the restored server from joining the replication stream?

shlomi-noach commented 2 years ago

The current implementation now reuses the same keyspace, sets the tablet type to DRAINED, and ensures not to start replication.

mattlord commented 1 year ago

The current implementation now reuses the same keyspace, sets the tablet type to DRAINED, and ensures not to start replication.

Re-using the same keyspace seems more logical to me at first thought. You can prevent tablet repair in a number of standard ways:

  1. Setting the tablet type to BACKUP or RESTORE; RESTORE seems more relevant than DRAINED
  2. Touching tabletmanager.replicationStoppedFile
  3. I think using --disable_active_reparents

@GuptaManan100 would know better

GuptaManan100 commented 1 year ago

BACKUP is meant for tablets that are in the midst of taking backups, and RESTORE for the ones that are being restored. I am not sure how we use DRAINED. From the VTOrc perspective, all three are ignored, so if we use any of the three, then VTOrc won't repair replication on them. The same goes for the replication manager too: it won't fix replication. So we shouldn't need to add the replication stopped file.

If we do want to disable the replication manager explicitly (even though in my opinion it shouldn't be required), then there is a new flag that was added recently - disable-replication-manager.

One thing that could be an issue is the setting of replication parameters by the tablet manager when it first starts. We can prevent that from happening with disable_active_reparents as @mattlord pointed out. We could also fix this step by checking the tablet type as we do for the other two, so we won't need this flag either. I can make that change if we decide to go with this alternative.

I looked at the linked PR and I think it has all the changes that should be needed. There is already code to stop vtctld from setting up replication after the restore is complete, and also code in the restore flow itself on the vttablets to not start replication. Since we set the type to DRAINED in the end, neither VTOrc nor the replication manager should be repairing replication.

@shlomi-noach Do you know where the replication is fixed by vitess in your tests? I don't think there is any other place, other than the 3 mentioned ☝️, that repairs replication. I can help debug if we are seeing that replication is being repaired after the restore.

EDIT: Looked at the test in the PR and it is using a replica tablet that is already running to run the recovery process, so the initialization code shouldn't matter either.

shlomi-noach commented 1 year ago

Do you know where the replication is fixed by vitess in your tests?

@GuptaManan100 I don't think they are? The PITR tests are all good and validate that replication does not get fixed.

@mattlord like @GuptaManan100 said, I think the BACKUP and RESTORE types are for actively-being-backed-up-or-restored tablets. I think DRAINED makes the most sense, because by default vitess will not serve any traffic from DRAINED, but will allow explicit connections to read from the tablet (mysql -h @drained ...).

GuptaManan100 commented 1 year ago

@shlomi-noach Okay great! I was looking at

I make a point in time restore, and then vitess auto-configures the restored server to replicate -- even though I skip the replication configuration in the restore process.

which made me think that replication was being repaired by something in Vitess, even though I wasn't expecting it to be. Maybe you made code changes after that comment which resolved it.

And I agree that DRAINED should be the ideal type to use given our alternatives.

shlomi-noach commented 1 year ago

which made me think that replication was being repaired by something in Vitess

Sorry I wasn't clear. There was this problem, and I found what it was that forced replication to start. It was part of the Restore process itself, in vtctld.

GuptaManan100 commented 1 year ago

Oh! I see. I had added that in response to an issue wherein, if there was a PRS while a tablet was in the restore state, its semi-sync settings weren't set up correctly when it finally transitioned back to replica. The changes in your PR, as far as this flow is concerned, are perfect 💯

derekperkins commented 1 year ago

Link to a prior discussion

shlomi-noach commented 1 year ago

A few words about https://github.com/vitessio/vitess/pull/13156: this PR supports incremental backup/recovery for Xtrabackup. It uses binary logs, as with the builtin engine. It does not use Xtrabackup's own incremental backup feature, which copies InnoDB pages.

With https://github.com/vitessio/vitess/pull/13156, it is possible to run full & incremental backups using Xtrabackup engine, without taking down a MySQL server.

The incremental restore process is similar to that of builtin: take down a server, restore a full backup, start the server, then apply binary logs as appropriate.

https://github.com/vitessio/vitess/pull/13156 is merged in release-17.0.

Note that this still only supports --restore-to-pos, which means:

Support for a point in time (as in, restore to a given timestamp) will be added next.

shlomi-noach commented 1 year ago

Supporting a point-in-time recovery:

We want to be able to recover one or all shards up to a specified point in time, i.e. a timestamp. We want to be able to restore to any point in time at a 1-second resolution. We will technically be able to restore at microsecond resolution, but for now let's discuss a 1-second resolution.

Whether we restore a single shard or multiple shards, the operation will take place independently on each shard.

When restoring multiple shards to the same point in time, the user should be aware that the shards may not be in full sync with each other. Time granularity, clock skews etc., can all mean that the restored shards may not be 100% consistent with an actual historical point-in-time.

As for the algorithm we will go by: it's a bit different from a restore-to-position because:

The most reliable information we have is the original committed timestamp value in the binary log. This event header remains true to the primary's commit time, even if read from a replica.

The way to run a point-in-time recovery is a bit upside-down from restore-to-pos:

  1. We will first find an incremental backup whose binlog entry range (first and last binlog entries in the backup) represents a timestamp range that includes our desired point-in-time to restore (see the sketch after this list)
  2. Then, we will work backwards to find previous incremental backups, until we hit a full backup, such that they are all recoverable in sequence (i.e. no GTID gaps)
  3. But, also, since we support incremental backups with Xtrabackup, it is possible that the full backup overlaps with one or more incremental backup binary logs (binary logs are not rotated in an Xtrabackup backup). We also require that the full backup must have been completed before our desired point-in-time to restore.
  4. Once we have such a sequence -- full backup, then incremental backups, leading to a timestamp later than our desired point-in-time -- we apply as follows:
    • Restore the full backup. We know it reflects a state from before our desired point-in-time
    • Restore all incremental backups in the sequence. In each incremental backup we restore all binary logs
    • As a reminder, we allow binlog overlaps, as we rely on MySQL GTIDs to skip duplicate transactions
    • In extracting any of the binary logs, we will use mysqlbinlog --stop-datetime to ensure no event gets applied that is later than our desired point-in-time
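
A simplified sketch of that backwards search, and not the Vitess implementation: it assumes each incremental backup's manifest records the commit timestamps of its first and last binlog events, and it again reduces GTID sets to integer sequence numbers.

package pitrtime

import (
	"fmt"
	"time"
)

// incrementalBackup covers the GTID range (FromSeq, Seq]; FirstEvent and
// LastEvent are the commit timestamps of its first and last binlog events.
type incrementalBackup struct {
	FromSeq, Seq          int64
	FirstEvent, LastEvent time.Time
}

// fullBackup is complete up to Seq and finished at Finished.
type fullBackup struct {
	Seq      int64
	Finished time.Time
}

// pitrPath finds an incremental backup whose binlog-event timestamps include
// the target time, then works backwards through earlier incrementals until a
// full backup that completed before the target closes the chain with no GTID
// gaps. The returned chain is ordered oldest-first: the caller restores the
// full backup, applies each incremental, and cuts events past the target
// (e.g. via mysqlbinlog --stop-datetime).
func pitrPath(fulls []fullBackup, incs []incrementalBackup, target time.Time) (fullBackup, []incrementalBackup, error) {
	last := -1
	for i, inc := range incs {
		if !inc.FirstEvent.After(target) && !inc.LastEvent.Before(target) {
			last = i
		}
	}
	if last == -1 {
		return fullBackup{}, nil, fmt.Errorf("no incremental backup covers %v", target)
	}
	chain := []incrementalBackup{incs[last]}
	need := incs[last].FromSeq // everything up to "need" must still be covered
	for {
		for _, f := range fulls {
			if f.Seq >= need && f.Finished.Before(target) {
				return f, chain, nil // overlap with the chain is fine
			}
		}
		prev := -1
		for i, inc := range incs {
			if inc.Seq >= need && inc.FromSeq < need {
				prev = i
			}
		}
		if prev == -1 {
			return fullBackup{}, nil, fmt.Errorf("GTID gap: no backup covers position %d", need)
		}
		chain = append([]incrementalBackup{incs[prev]}, chain...)
		need = incs[prev].FromSeq
	}
}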

I tend towards requiring the user to supply the point-in-time strictly in UTC, but we can work this out. Anything is possible, of course, but I wonder what is more correct UX-wise.

shlomi-noach commented 1 year ago

WIP for restore-to-time: https://github.com/vitessio/vitess/pull/13270

shlomi-noach commented 1 year ago

It's been pointed out by the community (Vitess slack, #feat-backup channel) as well as by @deepthi that the current flow for PITR requires taking an existing replica out of service. The request is to be able to initialize a new replica with PITR flags, so that it is created and restored from backup with PITR all in one go.

PR incoming.