vitessio / vitess

Vitess is a database clustering system for horizontal scaling of MySQL.
http://vitess.io
Apache License 2.0

Bug Report: incremental backup & restore: failure to take incremental backups in a multi tablet scenario #13517

Closed shlomi-noach closed 11 months ago

shlomi-noach commented 1 year ago

Overview of the Issue

In a cluster with multiple REPLICA/RDONLY tablets, it's possible to create a situation where vtctlclient -- Backup --incremental_from_pos=auto fails to take the backup.

The gist of the scenario: one of the tablets is restored from backup (which wipes out its binary logs and sets gtid_purged) and then takes an incremental backup, which runs fine. A subsequent attempt to take an incremental backup on another tablet then fails.

Reproduction Steps

Use examples/local. Assume:

Run the following sequence. The interleaved ApplySchema commands are there only to generate enough binary log entries between the operations.

vtctlclient -- Backup zone1-0000000102
vtctldclient ApplySchema --ddl-strategy="vitess" --sql "alter table corder force" commerce && sleep 2
vtctlclient -- Backup --incremental_from_pos=auto zone1-0000000102
vtctldclient ApplySchema --ddl-strategy="vitess" --sql "alter table corder force" commerce && sleep 2
vtctldclient RestoreFromBackup zone1-0000000102
vtctldclient ApplySchema --ddl-strategy="vitess" --sql "alter table corder force" commerce && sleep 2
vtctlclient -- Backup --incremental_from_pos=auto zone1-0000000102
vtctldclient ApplySchema --ddl-strategy="vitess" --sql "alter table corder force" commerce && sleep 2
vtctlclient -- Backup --incremental_from_pos=auto zone1-0000000100

The last command, Backup --incremental_from_pos=auto zone1-0000000100, fails with output similar to:

I0717 07:47:43.728526 2090851 main.go:96] I0717 07:47:43.728145 backup.go:110] I0717 07:47:43.727878 builtinbackupengine.go:202] Executing Backup at 2023-07-17 07:47:43.727768003 +0000 UTC m=+217.129511829 for keyspace/shard commerce/0 on tablet zone1-0000000100, concurrency: 4, compress: true, incrementalFromPos: auto
I0717 07:47:43.741621 2090851 main.go:96] I0717 07:47:43.741426 backup.go:110] I0717 07:47:43.741189 builtinbackupengine.go:260] auto evaluating incremental_from_pos
I0717 07:47:43.742018 2090851 main.go:96] I0717 07:47:43.741901 backup.go:110] I0717 07:47:43.741720 builtinbackupengine.go:279] auto evaluated incremental_from_pos: MySQL56/b696e26a-2475-11ee-9d38-0a43f95f28a3:563-571
E0717 07:47:43.765510 2090851 main.go:96] E0717 07:47:43.765311 backup.go:110] E0717 07:47:43.765064 backup.go:163] backup is not usable, aborting it: [Code: FAILED_PRECONDITION
Mismatching GTID entries. Requested backup pos has entries not found in the binary logs, and binary logs have entries not found in the requested backup pos. Neither fully contains the other. Requested pos=b696e26a-2475-11ee-9d38-0a43f95f28a3:563-571, binlog pos=b696e26a-2475-11ee-9d38-0a43f95f28a3:1-269

cannot get binary logs to backup in incremental backup]
Backup Error: rpc error: code = Unknown desc = TabletManager.Backup on zone1-0000000100 error: cannot get binary logs to backup in incremental backup: Mismatching GTID entries. Requested backup pos has entries not found in the binary logs, and binary logs have entries not found in the requested backup pos. Neither fully contains the other. Requested pos=b696e26a-2475-11ee-9d38-0a43f95f28a3:563-571, binlog pos=b696e26a-2475-11ee-9d38-0a43f95f28a3:1-269: cannot get binary logs to backup in incremental backup: Mismatching GTID entries. Requested backup pos has entries not found in the binary logs, and binary logs have entries not found in the requested backup pos. Neither fully contains the other. Requested pos=b696e26a-2475-11ee-9d38-0a43f95f28a3:563-571, binlog pos=b696e26a-2475-11ee-9d38-0a43f95f28a3:1-269
E0717 07:47:43.790505 2090851 main.go:105] remote error: rpc error: code = Unknown desc = TabletManager.Backup on zone1-0000000100 error: cannot get binary logs to backup in incremental backup: Mismatching GTID entries. Requested backup pos has entries not found in the binary logs, and binary logs have entries not found in the requested backup pos. Neither fully contains the other. Requested pos=b696e26a-2475-11ee-9d38-0a43f95f28a3:563-571, binlog pos=b696e26a-2475-11ee-9d38-0a43f95f28a3:1-269: cannot get binary logs to backup in incremental backup: Mismatching GTID entries. Requested backup pos has entries not found in the binary logs, and binary logs have entries not found in the requested backup pos. Neither fully contains the other. Requested pos=b696e26a-2475-11ee-9d38-0a43f95f28a3:563-571, binlog pos=b696e26a-2475-11ee-9d38-0a43f95f28a3:1-269
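
To make the "Neither fully contains the other" message concrete, here is a minimal Python sketch of a GTID-set containment check applied to the two positions from the log above. The helper names (parse_gtid_set, contains) are hypothetical and are not Vitess APIs; the sketch also assumes each set is already normalized, with adjacent intervals merged.

```python
def parse_gtid_set(text):
    """Parse 'uuid:1-5:7,uuid2:3' (optionally prefixed with a flavor such as
    'MySQL56/') into {uuid: [(start, end), ...]}."""
    text = text.split("/", 1)[-1]  # drop an optional flavor prefix
    result = {}
    for part in text.split(","):
        part = part.strip()
        if not part:
            continue
        uuid, *ranges = part.split(":")
        intervals = result.setdefault(uuid, [])
        for r in ranges:
            start, _, end = r.partition("-")
            intervals.append((int(start), int(end or start)))
    return result


def contains(outer, inner):
    """True if every interval of `inner` lies inside a single interval of
    `outer` (assumes normalized, non-adjacent intervals)."""
    for uuid, intervals in inner.items():
        for start, end in intervals:
            if not any(s <= start and end <= e for s, e in outer.get(uuid, [])):
                return False
    return True


# Values copied from the error message above.
requested = parse_gtid_set("b696e26a-2475-11ee-9d38-0a43f95f28a3:563-571")
binlog = parse_gtid_set("b696e26a-2475-11ee-9d38-0a43f95f28a3:1-269")

print(contains(binlog, requested))  # False: the binary logs lack 563-571
print(contains(requested, binlog))  # False: the requested pos lacks 1-269
```

Both checks come out false, which matches the FAILED_PRECONDITION error: the requested position 563-571 and the reported binlog position 1-269 do not overlap at all.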

The last successful incremental backup on 102 is:

{
  "BackupMethod": "builtin",
  "Position": "MySQL56/b696e26a-2475-11ee-9d38-0a43f95f28a3:563-571",
  "PurgedPosition": "MySQL56/b696e26a-2475-11ee-9d38-0a43f95f28a3:1-562",
  "FromPosition": "MySQL56/b696e26a-2475-11ee-9d38-0a43f95f28a3:1-562",
  "Incremental": true,
  "BackupTime": "2023-07-17T07:47:43Z",
  "FinishedTime": "2023-07-17T07:47:43Z",
  "ServerUUID": "34bb1d4c-2476-11ee-85a9-0a43f95f28a3",
  "TabletAlias": "zone1-0000000102",
  "Keyspace": "commerce",
  "Shard": "0",
  "MySQLVersion": "/home/shlomi/opt/mysql/8.0.23/bin/mysqld  Ver 8.0.23 for Linux on x86_64 (Source distribution)\n",
  "UpgradeSafe": false,
  "CompressionEngine": "pargzip",
  "FileEntries": [
    {
      "Base": "BinLog",
      "Name": "vt-0000000102-bin.000001",
      "Hash": "4925e8df",
      "ParentPath": ""
    }
  ],
  "SkipCompress": false,
  "ExternalDecompressor": ""
}

The issue is we do not calculate gtid_purged correctly.
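
As an observation on the numbers in the manifest above: Position covers only 563-571, while PurgedPosition covers 1-562. The Python sketch below (helper names are hypothetical, not Vitess code) merges the two and shows that only their union forms the contiguous range 1-571. This is a reading of the manifest values, not a description of the actual fix.

```python
def parse_ranges(position):
    """Return (uuid, [(start, end), ...]) for a single-UUID position string,
    ignoring an optional flavor prefix such as 'MySQL56/'."""
    gtid = position.split("/", 1)[-1]
    uuid, *ranges = gtid.split(":")
    intervals = []
    for r in ranges:
        start, _, end = r.partition("-")
        intervals.append((int(start), int(end or start)))
    return uuid, intervals


def merge(intervals):
    """Merge overlapping or adjacent integer intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Values copied from the manifest above.
manifest = {
    "Position": "MySQL56/b696e26a-2475-11ee-9d38-0a43f95f28a3:563-571",
    "PurgedPosition": "MySQL56/b696e26a-2475-11ee-9d38-0a43f95f28a3:1-562",
}

uuid, position = parse_ranges(manifest["Position"])
_, purged = parse_ranges(manifest["PurgedPosition"])
combined = merge(position + purged)
print(combined)  # [(1, 571)]
```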

Binary Version

v17, v18

Operating System and Environment details

-

Log Fragments

No response

shlomi-noach commented 11 months ago

Addressed by https://github.com/vitessio/vitess/pull/13555, which adds a series of endtoend tests that reproduce the error scenario (and now pass, given the fix in the PR), along with a number of unit tests. The fix itself is essentially one line.