vitessio / vitess

Vitess is a database clustering system for horizontal scaling of MySQL.
http://vitess.io
Apache License 2.0

RFC: incremental backup and point-in-time recovery #11227

Open shlomi-noach opened 2 years ago

shlomi-noach commented 2 years ago

We wish to implement a native solution for (offline) incremental backup and compatible point-in-time recovery in Vitess. There is already a Work in Progress PR. But let's first describe the problem, the offered solution, and how it differs from an existing prior implementation.

Background

Point-in-time recoveries make it possible to recover a database to a specific (or approximate) timestamp or position. The classic use case is a catastrophic change to the data, e.g. an unintentional DELETE FROM <table> or similar. Normally the damage only applies to a subset of the data; the database is generally still valid, and the app is still able to function. As such, we want to fix the specific damage inflicted. The flow is to restore the data on an offline/non-serving server, to a point in time immediately before the damage was done. It's then typically a manual process of salvaging the specific damaged records.

It's also possible to just throw away everything and roll back the entire database to that point in time, though that is an uncommon use case.

A point in time can be either an actual timestamp, or, more accurately, a position. Specifically in MySQL 5.7 and above, this will be a GTID set, the @@gtid_executed just before the damage. Since every transaction gets its own GTID value, it should be possible to restore to single-transaction granularity (whereas a timestamp is a coarser measurement).

A point-in-time recovery is possible by combining a full backup restore with an incremental stream of changes applied since that backup. There are two main techniques, in three different forms:

  1. Using binary logs, stored offline
  2. Using a binary log live stream
  3. Using Xtrabackup incremental backup

This RFC wishes to address (1). There is already prior work for (2). Right now we do not wish to address (3).

The existing prior work addresses (2), and specifically assumes:

Suggested solution, backup

We wish to implement a more general solution by actually backing up binary logs as part of the backup process. These can be stored on local disk, in S3, etc., the same way as any Vitess backup is stored. In fact, an incremental backup will be listed just like any other backup, and this listing is also the key to performing a restore.

The user will take an incremental backup similarly to how they take a full backup:

An incremental backup needs to have a starting point, given via the --incremental_from_pos flag. The incremental backup must cover that position, but does not have to start exactly at that position: it can start with an earlier position. See diagram below. The backup ends with the rough position of the time the backup was requested. It will cover the exact point in time where the request was made, and possibly extend slightly beyond that.
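
For illustration, and mirroring the restore invocation shown later in this RFC, an incremental backup request could look like the following (the tablet alias and position are reused from the examples elsewhere in this RFC; the exact syntax is subject to the PR):

vtctlclient -- Backup --incremental_from_pos "MySQL56/16b1039f-22b6-11ed-b765-0a43f95f28a3:1-867" zone1-0000000102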

An incremental backup is taken by copying binary logs. To do that, there is no need to shut down the MySQL server; it is free to be fully operational and serve traffic while the backup takes place. The backup process will rotate binary logs (FLUSH BINARY LOGS) so as to ensure the files it is backing up are safely immutable.
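
As a very rough sketch of that step, and not the builtin engine's actual code (the function name, package name, and use of database/sql are illustrative assumptions), the "capture position, then seal the files to copy" shape looks like:

package backupsketch

import "database/sql"

// sealBinlogs captures the GTID set executed so far, then rotates the binary
// logs. After the rotation, all of the captured transactions live in closed
// binlog files, which can be copied safely while the server keeps serving
// traffic. The requested --incremental_from_pos is the backup's lower bound;
// the sealed files cover at least the captured position, possibly slightly more.
func sealBinlogs(db *sql.DB) (string, error) {
	var position string
	if err := db.QueryRow("SELECT @@GLOBAL.gtid_executed").Scan(&position); err != nil {
		return "", err
	}
	if _, err := db.Exec("FLUSH BINARY LOGS"); err != nil {
		return "", err
	}
	return position, nil
}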

A manifest of an incremental backup may look like so:

{
  "BackupMethod": "builtin",
  "Position": "MySQL56/16b1039f-22b6-11ed-b765-0a43f95f28a3:1-883",
  "FromPosition": "MySQL56/16b1039f-22b6-11ed-b765-0a43f95f28a3:1-867",
  "Incremental": true,
  "BackupTime": "2022-08-25T12:55:05Z",
  "FinishedTime": "2022-08-25T12:55:05Z",
  "ServerUUID": "1ea0631b-22b6-11ed-933f-0a43f95f28a3",
  "TabletAlias": "zone1-0000000102",
  "CompressionEngine": "pargzip",
  "FileEntries": [
     ..
  ]
}

Suggested solution, restore/recovery

Again, this rides the familiar RestoreFromBackup command. A restore looks like:

vtctlclient -- RestoreFromBackup  --restore_to_pos  "MySQL56/16b1039f-22b6-11ed-b765-0a43f95f28a3:1-10000" zone1-0000000102

Vitess will attempt to find a path that recovers the database to that point in time. The path consists of exactly one full backup, followed by zero or more incremental restores. There could be exactly one such path, there could be multiple paths, or there could be no path. Consider the following scenarios:

Recovery scenario 1

[diagram: point-in-time recovery path, scenario 1]

This is the classic scenario. A full backup takes place at e.g. 12:10, then an incremental backup is taken from exactly that point and is valid up to 13:20, then the next one from exactly that point, valid up to 16:15, etc.

To restore the database to e.g. 20:00 (let's assume that's at position 16b1039f-22b6-11ed-b765-0a43f95f28a3:1-10000), we will restore the full backup, followed by incrementals 1 -> 2 -> 3 -> 4. Note that 4 exceeds 20:00 and vitess will only apply changes up to 20:00, or to be more precise, up to 16b1039f-22b6-11ed-b765-0a43f95f28a3:1-10000.

Recovery scenario 2

[diagram: point-in-time recovery path, scenario 2]

The above is actually identical to the first scenario. Notice how the first incremental backup precedes the full backup, and how backups 2 & 3 overlap. This is fine! We take strong advantage of MySQL's GTIDs. Because the overlapping transactions in 2 and 3 are consistently identified by the same GTIDs, MySQL is able to ignore the duplicates as we apply both restores one after the other.

Recovery scenario 3

[diagram: point-in-time recovery path, scenario 3]

In the above we have four different paths for recovery!

Any of them is valid; Vitess should choose however it pleases, ideally using as few backups as possible (hence preferring the 2nd or 4th option).

Recovery scenario 4

If we wanted to restore up to 22:15, then there's no incremental backup that can take us there, and the operation must fail before it even begins.

Finding paths

Vitess should be able to determine the recovery path before actually applying anything. It is able to do so by reading the available manifests and finding the shortest valid path to the requested point in time. Using a greedy algorithm, it will seek the most recent full backup at or before the requested time, and then the shortest sequence of incremental backups that takes us to that point.
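
To make the greedy idea concrete, here is a minimal, self-contained Go sketch, not the actual Vitess implementation: GTID sets are reduced to plain "everything up to N" sequence numbers, and the manifest carries only the fields that matter for path finding.

package pitrpath

import "fmt"

// manifest is a stand-in for a backup manifest: FromSeq/Seq play the role of
// the FromPosition/Position GTID sets.
type manifest struct {
	Incremental  bool
	FromSeq, Seq int64
}

// findPath picks the most recent full backup whose position is at or before
// the target, then greedily chains incremental backups: at each step it takes
// the one that starts within what is already covered and reaches furthest,
// until the target position is covered.
func findPath(backups []manifest, target int64) ([]manifest, error) {
	fullIdx := -1
	for i, b := range backups {
		if !b.Incremental && b.Seq <= target {
			if fullIdx == -1 || b.Seq > backups[fullIdx].Seq {
				fullIdx = i // the most recent qualifying full backup wins
			}
		}
	}
	if fullIdx == -1 {
		return nil, fmt.Errorf("no full backup at or before target %d", target)
	}
	path := []manifest{backups[fullIdx]}
	covered := backups[fullIdx].Seq
	for covered < target {
		best := -1
		for i, b := range backups {
			// Overlaps are fine (GTIDs deduplicate); gaps are not.
			if b.Incremental && b.FromSeq <= covered && b.Seq > covered {
				if best == -1 || b.Seq > backups[best].Seq {
					best = i
				}
			}
		}
		if best == -1 {
			return nil, fmt.Errorf("no incremental backup continues from position %d", covered)
		}
		path = append(path, backups[best])
		covered = backups[best].Seq
	}
	return path, nil
}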

Backups from multiple sources

Scenario (3) looks contrived, until you consider that backups may be taken from different tablets. These have different binary logs, rotated at different times -- but all share the same sequence of GTIDs. Since an incremental backup consists of full binary log copies, there could be overlaps between binary logs backed up from different tablets/MySQL servers.

Vitess should not care about the identity of the sources, should not care about the binary log names (one server's binlog.0000289 may come before another server's binlog.0000101), and should not care about the binary log count. It should only care about the GTID range an incremental backup covers: from (exclusive) to (inclusive).
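
In the simplified terms of the path-finding sketch above, that "from (exclusive), to (inclusive)" rule boils down to a single check when deciding whether a candidate incremental backup can be applied next (again a sketch, not the actual implementation):

package backupcover

// canApply reports whether an incremental backup covering the GTID range
// (from, to] can be applied on top of a server whose restored position is
// "covered". GTID sets are simplified to integer sequence numbers; real code
// would use GTID-set containment instead of integer comparisons.
func canApply(covered, from, to int64) bool {
	// Starting earlier than what we already have is fine (duplicates are
	// skipped thanks to GTIDs); starting later would leave a gap, and the
	// backup must actually advance our position.
	return from <= covered && to > covered
}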

Restore time

It should be noted that an incremental restore based on binary logs means sequentially applying changes to a server. This may take minutes or hours, depending on how many binary log events we need to apply.

Testing

As usual, testing is to take place in:

Thoughts welcome. Please see https://github.com/vitessio/vitess/pull/11097 for Work In Progress.

deepthi commented 2 years ago

Nicely written proposal. A few questions/comments:

shlomi-noach commented 2 years ago

In terms of topology, the restored tablet is part of the same existing keyspace/shard. Correct?

Correct, and we need the logic to prevent it from auto-replicating.

Restored tablet will not be serving, and most likely will be lagging.

:+1:

Any alerts that might be generated by this situation are the responsibility of the (human) operator to work around.

:+1:

shlomi-noach commented 2 years ago

In terms of topology, the restored tablet is part of the same existing keyspace/shard. Correct?

Thinking more on this, I'm not sure which is the preferred way: use the same keyspace or create a new keyspace. Using the same keyspace leads to the risk of the server unintentionally getting attached to the replication stream. In fact, that's what's happening in my dev env right now: I make a point in time restore, and then vitess auto-configures the restored server to replicate -- even though I skip the replication configuration in the restore process.

Is there a way to forcefully prevent the restored server from joining the replication stream?

shlomi-noach commented 2 years ago

The current implementation now reuses the same keyspace, sets the tablet type to DRAINED, and ensures not to start replication.

mattlord commented 1 year ago

The current implementation now reuses the same keyspace, sets the tablet type to DRAINED, and ensures not to start replication.

Re-using the same keyspace seems more logical to me at first thought. You can prevent tablet repair in a number of standard ways:

  1. Setting the tablet type to BACKUP or RESTORE; RESTORE seems more relevant than DRAINED
  2. Touching tabletmanager.replicationStoppedFile
  3. I think using --disable_active_reparents

@GuptaManan100 would know better

GuptaManan100 commented 1 year ago

BACKUP is meant for tablets that are in the midst of taking backups, and RESTORE for the ones that are being restored. I am not sure how we use DRAINED. From the VTOrc perspective, all three are ignored, so if we use any of the three, then VTOrc won't repair replication on them. The same goes for the replication manager too: it won't fix replication. So we shouldn't need to add the replication stopped file.

If we do want to disable the replication manager explicitly (even though in my opinion it shouldn't be required), then there is a new flag that was added recently - disable-replication-manager.

One thing that could be an issue is the setting of replication parameters by the tablet manager when it first starts. We can prevent that from happening with disable_active_reparents as @mattlord pointed out. We could also fix this step by checking the tablet type as we do for the other two, so we won't need this flag either. I can make that change if we decide to go with this alternative.

I looked at the linked PR and I think it has all the changes that should be needed. There is already code to stop vtctld from setting up replication after the restore is complete, and also code in the restore flow itself on the vttablets to not start replication. Since we set the type to DRAINED in the end, neither VTOrc nor the replication manager should be repairing replication.

@shlomi-noach Do you know where the replication is fixed by vitess in your tests? I don't think there is any other place, other than the 3 mentioned ☝️, that repairs replication. I can help debug if we are seeing that replication is being repaired after the restore.

EDIT: Looked at the test in the PR and it is using a replica tablet that is already running to run the recovery process, so the initialization code shouldn't matter either.

shlomi-noach commented 1 year ago

Do you know where the replication is fixed by vitess in your tests?

@GuptaManan100 I don't think they are? The PITR tests are all good and validate that replication does not get fixed.

@mattlord like @GuptaManan100 said, I think the BACKUP and RESTORE types are for actively-being-backed-up-or-restored tablets. I think DRAINED makes the most sense, because by default vitess will not serve any traffic from DRAINED, but will allow explicit connections to read from the tablet (mysql -h @drained ...).

GuptaManan100 commented 1 year ago

@shlomi-noach Okay great! I was looking at

I make a point in time restore, and then vitess auto-configures the restored server to replicate -- even though I skip the replication configuration in the restore process.

which made me think that replication was being repaired by something in Vitess, even though I wasn't expecting it to be. Maybe you made code changes after that comment which resolved it.

And I agree that DRAINED should be the ideal type to use given our alternatives.

shlomi-noach commented 1 year ago

which made me think that replication was being repaired by something in Vitess

Sorry I wasn't clear. There was this problem, and I found what it was that forced replication to start. It was part of the Restore process itself, in vtctld.

GuptaManan100 commented 1 year ago

Oh! I see. I had added that in response to an issue wherein, if there was a PRS while a tablet was in the restore state, its semi-sync settings weren't set up correctly when it finally transitioned back to replica. The changes in your PR, as far as this flow is concerned, are perfect 💯

derekperkins commented 1 year ago

Link to a prior discussion

shlomi-noach commented 1 year ago

A few words about https://github.com/vitessio/vitess/pull/13156: this PR supports incremental backup/recovery for Xtrabackup. It uses binary logs, as with the builtin engine. It does not use Xtrabackup's own incremental backup feature, which copies InnoDB pages.

With https://github.com/vitessio/vitess/pull/13156, it is possible to run full & incremental backups using Xtrabackup engine, without taking down a MySQL server.

The incremental restore process is similar to that of builtin: take down a server, restore a full backup, start the server, then apply binary logs as appropriate.

https://github.com/vitessio/vitess/pull/13156 is merged in release-17.0.

Note that this still only supports --restore-to-pos, which means:

Support for a point in time (as in, restore to a given timestamp) will be added next.

shlomi-noach commented 1 year ago

Supporting a point-in-time recovery:

We want to be able to recover one or all shards up to a specified point in time, i.e. a timestamp. We want to be able to restore to any point in time at a 1-second resolution. We will technically be able to restore at microsecond resolution, but for now let's discuss a 1-second resolution.

Whether we restore a single shard or multiple shards, the operation will take place independently on each shard.

When restoring multiple shards to the same point in time, the user should be aware that the shards may not be in full sync with each other. Time granularity, clock skews etc., can all mean that the restored shards may not be 100% consistent with an actual historical point-in-time.

As for the algorithm we will go by: it's a bit different from a restore-to-position because:

The most reliable information we have is the original committed timestamp value in the binary log. This event header remains true to the primary's commit time, even if read from a replica.

The way to run a point-in-time recovery is a bit upside-down from restore-to-pos:

  1. We will first find an incremental backup whose binlog entry range (first and last binlog entries in the backup) represents a timestamp range that includes our desired point-in-time to restore (see the sketch after this list)
  2. Then, we will work backwards to find previous incremental backups, until we hit a full backup, such that they are all recoverable in sequence (i.e. no GTID gaps)
  3. But, also, since we support incremental backups with Xtrabackup, it is possible that the full backup overlaps with one or more incremental backup binary logs (binary logs are not rotated in an Xtrabackup backup). We also require that the full backup must have been completed before our desired point-in-time to restore.
  4. Once we have such a sequence -- full backup, then incremental backups, leading to a timestamp later than our desired point-in-time -- we apply as follows:
    • Restore the full backup. We know it reflects a state from before our desired point-in-time
    • Restore all incremental backups in the sequence. In each incremental backup we restore all binary logs
    • As a reminder, we allow binlog overlaps, as we rely on MySQL GTIDs to skip duplicate transactions
    • In extracting any of the binary logs, we will use mysqlbinlog --stop-datetime to ensure no event gets applied that is later than our desired point-in-time
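
A simplified sketch of that backwards search, and not the Vitess implementation: it assumes each incremental backup's manifest records the commit timestamps of its first and last binlog events, and it again reduces GTID sets to integer sequence numbers.

package pitrtime

import (
	"fmt"
	"time"
)

// incrementalBackup covers the GTID range (FromSeq, Seq]; FirstEvent and
// LastEvent are the commit timestamps of its first and last binlog events.
type incrementalBackup struct {
	FromSeq, Seq          int64
	FirstEvent, LastEvent time.Time
}

// fullBackup is complete up to Seq and finished at Finished.
type fullBackup struct {
	Seq      int64
	Finished time.Time
}

// pitrPath finds an incremental backup whose binlog-event timestamps include
// the target time, then works backwards through earlier incrementals until a
// full backup that completed before the target closes the chain with no GTID
// gaps. The returned chain is ordered oldest-first: the caller restores the
// full backup, applies each incremental, and cuts events past the target
// (e.g. via mysqlbinlog --stop-datetime).
func pitrPath(fulls []fullBackup, incs []incrementalBackup, target time.Time) (fullBackup, []incrementalBackup, error) {
	last := -1
	for i, inc := range incs {
		if !inc.FirstEvent.After(target) && !inc.LastEvent.Before(target) {
			last = i
		}
	}
	if last == -1 {
		return fullBackup{}, nil, fmt.Errorf("no incremental backup covers %v", target)
	}
	chain := []incrementalBackup{incs[last]}
	need := incs[last].FromSeq // everything up to "need" must still be covered
	for {
		for _, f := range fulls {
			if f.Seq >= need && f.Finished.Before(target) {
				return f, chain, nil // overlap with the chain is fine
			}
		}
		prev := -1
		for i, inc := range incs {
			if inc.Seq >= need && inc.FromSeq < need {
				prev = i
			}
		}
		if prev == -1 {
			return fullBackup{}, nil, fmt.Errorf("GTID gap: no backup covers position %d", need)
		}
		chain = append([]incrementalBackup{incs[prev]}, chain...)
		need = incs[prev].FromSeq
	}
}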

I tend towards requiring the user to supply the point-in-time strictly in UTC, but we can work this out. Anything is possible, of course, but I wonder what is more correct UX-wise.

shlomi-noach commented 1 year ago

WIP for restore-to-time: https://github.com/vitessio/vitess/pull/13270

shlomi-noach commented 1 year ago

It's been pointed out by the community (Vitess slack, #feat-backup channel) as well as by @deepthi that the current flow for PITR requires taking an existing replica out of service. The request is to be able to initialize a new replica with PITR flags, so that it is created and restored from backup with PITR all in one go.

PR incoming.