Open ankitthakwani opened 1 year ago
Interesting observation, when I manually fire the correct START SLAVE SQL_THREAD UNTIL SQL_BEFORE_GTIDS command then PITR process catches up
mysql> START SLAVE SQL_THREAD UNTIL SQL_BEFORE_GTIDS = 'c35c3322-eb10-11ed-aeb1-3c5731cc0ff6:153';
Query OK, 0 rows affected, 1 warning (0.08 sec)
mysql> SHOW SLAVE STATUS\G
Empty set, 1 warning (0.00 sec)
mysql> SHOW MASTER STATUS\G
*************************** 1. row ***************************
File: vt-0000000909-bin.000009
Position: 3227
Binlog_Do_DB:
Binlog_Ignore_DB:
Executed_Gtid_Set: c35c3322-eb10-11ed-aeb1-3c5731cc0ff6:1-152,
f7393495-eb11-11ed-8cf4-3c5731cc0ff6:1
1 row in set (0.00 sec)
mysql> SET time_zone = '+00:00';
Query OK, 0 rows affected (0.00 sec)
mysql> SELECT * FROM txnInfo;
+----+----------------------------+
| id | created |
+----+----------------------------+
| 1 | 2023-05-08 10:07:26.649175 |
| 2 | 2023-05-08 10:07:26.683593 |
| 3 | 2023-05-08 10:07:26.700026 |
| 4 | 2023-05-08 10:07:26.767802 |
| 5 | 2023-05-08 10:07:26.792034 |
| 6 | 2023-05-08 10:07:26.804216 |
| 7 | 2023-05-08 10:07:26.807450 |
| 8 | 2023-05-08 10:07:26.862235 |
| 9 | 2023-05-08 10:07:26.910654 |
| 10 | 2023-05-08 10:07:26.967557 |
+----+----------------------------+
10 rows in set (0.00 sec)
I manually fired the above START SLAVE SQL_THREAD UNTIL SQL_BEFORE_GTIDS = 'c35c3322-eb10-11ed-aeb1-3c5731cc0ff6:153'; just before I0508 17:34:05.212090 timestamp, see the timestamps in following vttablet log
I0508 15:42:39.204094 52987 query.go:81] exec CHANGE MASTER TO MASTER_HOST='127.0.0.1', MASTER_PORT=17900, MASTER_USER='vt_repl_pitr', MASTER_PASSWORD='Password@123', MASTER_AUTO_POSITION=1;
I0508 15:42:39.304614 52987 query.go:81] exec START SLAVE UNTIL SQL_BEFORE_GTIDS = 'f7393495-eb11-11ed-8cf4-3c5731cc0ff6:1'
I0508 15:42:39.315523 52987 restore.go:486] Waiting for position to reach%!(EXTRA string=f7393495-eb11-11ed-8cf4-3c5731cc0ff6:1)
I0508 17:34:05.212090 52987 query.go:81] exec STOP SLAVE
I0508 17:34:05.225934 52987 query.go:81] exec RESET SLAVE ALL
I0508 17:34:05.314984 52987 restore.go:304] Restore: changing tablet type to REPLICA for cell:"dc1" uid:909
I0508 17:34:05.315011 52987 tm_state.go:186] Changing Tablet Type: REPLICA for cell:"dc1" uid:909
I0508 17:34:05.315105 52987 tm_state.go:381] Publishing state: alias:{cell:"dc1" uid:909} hostname:"127.0.0.1" port_map:{key:"grpc" value:16909} port_map:{key:"vt" value:15909} keyspace:"pitr_restored" shard:"0" type:REPLICA db_name_override:"vt_pitr" mysql_hostname:"127.0.0.1" mysql_port:17909 db_server_version:"8.0.29" default_conn_collation:255
I0508 17:34:05.352438 52987 tm_state.go:101] In tmState.Open()
I0508 17:34:05.352487 52987 shard_sync.go:79] Change to tablet state
I0508 17:34:05.352659 52987 restore.go:136] No vttablet_restore_done hook.
I0508 17:34:05.353294 52987 updatestreamctl.go:228] Enabling update stream, dbname: vt_pitr
I0508 17:34:05.353354 52987 state_manager.go:221] Starting transition to REPLICA Serving, timestamp: 0001-01-01 00:00:00 +0000 UTC
I0508 17:34:05.353430 52987 tablegc.go:206] TableGC - started execution of Close. Acquiring initMutex lock
I0508 17:34:05.353436 52987 tablegc.go:209] TableGC - acquired lock
I0508 17:34:05.353441 52987 tablegc.go:211] TableGC - no collector is open
I0508 17:34:05.353456 52987 engine.go:90] messager Engine - started execution of Close. Acquiring mu lock
I0508 17:34:05.353462 52987 engine.go:92] messager Engine - acquired mu lock
I0508 17:34:05.353467 52987 engine.go:95] messager Engine is not open
I0508 17:34:05.355010 52987 engine.go:227] Schema Engine: opening
I0508 17:34:05.447053 52987 engine.go:502] schema engine created [txnInfo], altered [], dropped []
I0508 17:34:05.447102 52987 engine.go:157] VStreamer: opening
I0508 17:34:05.447116 52987 query_engine.go:270] Query Engine: opening
I0508 17:34:05.448824 52987 tx_engine.go:157] TxEngine transition: AcceptingReadOnly
I0508 17:34:05.448878 52987 stateful_connection_pool.go:77] Starting transaction id: {1683540737351698135}
I0508 17:34:05.448938 52987 repltracker.go:108] Replication Tracker: going into non-primary mode
I0508 17:34:05.457069 52987 state_manager.go:611] TabletServer transition: REPLICA: Not connected to mysql -> REPLICA: Serving for tablet :pitr_restored/0
I0508 17:34:05.459683 52987 state_manager.go:604] Tablet Init took 6708117 ms
I0508 17:34:05.459696 52987 state_manager.go:711] State: exiting lameduck
I0508 17:34:05.459704 52987 tm_state.go:381] Publishing state: alias:{cell:"dc1" uid:909} hostname:"127.0.0.1" port_map:{key:"grpc" value:16909} port_map:{key:"vt" value:15909} keyspace:"pitr_restored" shard:"0" type:REPLICA db_name_override:"vt_pitr" mysql_hostname:"127.0.0.1" mysql_port:17909 db_server_version:"8.0.29" default_conn_collation:255
I think the issue is at https://github.com/vitessio/vitess/blob/fd81a3a4c6e7d563c13a9e552fe25222caedc9ca/go/vt/vttablet/tabletmanager/restore.go#L449
func (tm *TabletManager) catchupToGTID(ctx context.Context, afterGTIDPos string, beforeGTIDPos string) error {
var afterGTIDStr string
if afterGTIDPos != "" {
afterGTIDParsed, err := mysql.DecodePosition(afterGTIDPos)
if err != nil {
return err
}
afterGTIDStr = afterGTIDParsed.GTIDSet.Last()
}
I cannot understand why is afterGTIDParsed.GTIDSet.Last() being used here. The changing GTID can be the one appearing earlier in GTID set for e.g. c35c3322-eb10-11ed-aeb1-3c5731cc0ff6:1-162,f7393495-eb11-11ed-8cf4-3c5731cc0ff6:1
Overview of the Issue
PITR not working as described in https://vitess.io/docs/16.0/reference/features/recovery/
After restoring the backup, the state is not getting rolled forward as described
Reproduction Steps
Steps to reproduce
Create a single shard keyspace - "pitr" and start correspoding vttablets and mysql
Take backup ( I am using compressed and streamed xtrabackup and saving it on S3), , lets call is DS0(Database state 0)
vtctlclient --server 127.0.0.1:15999 BackupShard pitr/0
Insert some data in the primary instance to increase GTID counter, lets call is DS1
Make a note of the current timestamp
Insert some data in the primary instance to increase GTID counter, lets call is DS2
As per the documents in https://vitess.io/docs/16.0/reference/features/recovery/, create a new snapshot keyspace - pitr_restored, use the timestamp from Point 4
vtctldclient --server 127.0.0.1:15999 CreateKeyspace --type=SNAPSHOT --base-keyspace=pitr --snapshot-timestamp=2023-05-05T09:33:00Z pitr_restored
Start the corresponding vttablet and mysql for pitr_restored keyspace
Expected database restore state to be DS1 but it is DS0 which means Binary log apply is not working as expected.
Following are logs from the steps above on my system, demonstrating the issue
Start the mysql and vttablet of pitr_restore keyspace, following are the logs
In above it is clear to see that it is stuck waiting for f7393495-eb11-11ed-8cf4-3c5731cc0ff6:1 which is not correct.
Binary Version
Operating System and Environment details
Log Fragments
No response