Open reillski opened 1 week ago
You're referencing an Aurora v2 instance. Bump yourself up to v3 and try again and you'll likely run into the problems others are.
select @@aurora_binlog_replication_max_yield_seconds,@@aurora_enable_repl_bin_log_filtering,@@aurora_version;
+-----------------------------------------------+----------------------------------------+------------------+
| @@aurora_binlog_replication_max_yield_seconds | @@aurora_enable_repl_bin_log_filtering | @@aurora_version |
+-----------------------------------------------+----------------------------------------+------------------+
| 0 | 1 | 3.06.0 |
+-----------------------------------------------+----------------------------------------+------------------+
For me, gh-ost fails on Aurora v3
Thanks @rbanks54
I just tested on 3.06 and 3.07 and the cutover was successful. Using default parameters apart from binlog_format=row
. Only way I can reliably repro is using aurora_binlog_replication_max_yield_seconds
on 2.x. Tried using gh-ost versions: gh-ost-1.1.6-1
and gh-ost-1.1.5-1
mysql> select @@aurora_binlog_replication_max_yield_seconds,@@aurora_enable_repl_bin_log_filtering,@@aurora_version;
+-----------------------------------------------+----------------------------------------+------------------+
| @@aurora_binlog_replication_max_yield_seconds | @@aurora_enable_repl_bin_log_filtering | @@aurora_version |
+-----------------------------------------------+----------------------------------------+------------------+
| 0 | 1 | 3.06.0 |
+-----------------------------------------------+----------------------------------------+------------------+
1 row in set (0.00 sec)
$ gh-ost --max-load=Threads_running=25 --critical-load=Threads_running=1000 --chunk-size=1000 --max-lag-millis=1500 --user="user" --password="password" --host=<host>.us-east-1.rds.amazonaws.com --allow-on-master --database="sysbench" --table="sbtest1" --verbose --alter="engine=innodb" --switch-to-rbr --allow-master-master --cut-over=default --exact-rowcount --concurrent-rowcount --default-retries=120 --panic-flag-file=/tmp/ghost.panic.flag --postpone-cut-over-flag-file=/tmp/ghost.postpone.flag --initially-drop-old-table --execute
2024-06-19 05:03:20 INFO starting gh-ost 1.1.5
2024-06-19 05:03:20 INFO Migrating `sysbench`.`sbtest1`
2024-06-19 05:03:20 INFO inspector connection validated on <host>.us-east-1.rds.amazonaws.com:3306
2024-06-19 05:03:20 INFO User has REPLICATION CLIENT, REPLICATION SLAVE privileges, and has ALL privileges on `sysbench`.*
2024-06-19 05:03:20 INFO binary logs validated on <host>.us-east-1.rds.amazonaws.com:3306
2024-06-19 05:03:20 INFO Restarting replication on <host>.us-east-1.rds.amazonaws.com:3306 to make sure binlog settings apply to replication thread
2024-06-19 05:03:20 INFO Inspector initiated on ip-172-17-5-173:3306, version 8.0.34
2024-06-19 05:03:20 INFO Table found. Engine=InnoDB
2024-06-19 05:03:20 INFO Estimated number of rows via EXPLAIN: 247316
2024-06-19 05:03:20 INFO Recursively searching for replication master
2024-06-19 05:03:20 INFO Master found to be ip-172-17-5-173:3306
2024-06-19 05:03:20 INFO log_slave_updates validated on <host>.us-east-1.rds.amazonaws.com:3306
2024-06-19 05:03:20 INFO streamer connection validated on <host>.us-east-1.rds.amazonaws.com:3306
2024-06-19 05:03:20 INFO Connecting binlog streamer at mysql-bin-changelog.000005:97727481
[2024/06/19 05:03:20] [info] binlogsyncer.go:133 create BinlogSyncer with config {99999 mysql <host>.us-east-1.rds.amazonaws.com 3306 user false false <nil> false UTC true 0 0s 0s 0 false}
[2024/06/19 05:03:20] [info] binlogsyncer.go:354 begin to sync binlog from position (mysql-bin-changelog.000005, 97727481)
[2024/06/19 05:03:20] [info] binlogsyncer.go:203 register slave for master server <host>.us-east-1.rds.amazonaws.com:3306
2024-06-19 05:03:20 INFO applier connection validated on <host>.us-east-1.rds.amazonaws.com:3306
2024-06-19 05:03:20 INFO applier connection validated on <host>.us-east-1.rds.amazonaws.com:3306
2024-06-19 05:03:20 INFO will use time_zone='SYSTEM' on applier
2024-06-19 05:03:20 INFO Examining table structure on applier
2024-06-19 05:03:20 INFO Applier initiated on ip-172-17-5-173:3306, version 8.0.34
2024-06-19 05:03:20 INFO Dropping table `sysbench`.`_sbtest1_del`
[2024/06/19 05:03:20] [info] binlogsyncer.go:723 rotate to (mysql-bin-changelog.000005, 97727481)
2024-06-19 05:03:20 INFO rotate to next log from mysql-bin-changelog.000005:0 to mysql-bin-changelog.000005
2024-06-19 05:03:20 INFO Table dropped
2024-06-19 05:03:20 INFO Dropping table `sysbench`.`_sbtest1_ghc`
2024-06-19 05:03:20 INFO Table dropped
2024-06-19 05:03:20 INFO Creating changelog table `sysbench`.`_sbtest1_ghc`
2024-06-19 05:03:20 INFO Changelog table created
2024-06-19 05:03:20 INFO Creating ghost table `sysbench`.`_sbtest1_gho`
2024-06-19 05:03:20 INFO Ghost table created
2024-06-19 05:03:20 INFO Altering ghost table `sysbench`.`_sbtest1_gho`
2024-06-19 05:03:20 INFO Ghost table altered
2024-06-19 05:03:20 INFO Altering ghost table AUTO_INCREMENT value `sysbench`.`_sbtest1_gho`
2024-06-19 05:03:20 INFO Ghost table AUTO_INCREMENT altered
2024-06-19 05:03:20 INFO Created postpone-cut-over-flag-file: /tmp/ghost.postpone.flag
2024-06-19 05:03:20 INFO Waiting for ghost table to be migrated. Current lag is 0s
2024-06-19 05:03:20 INFO Intercepted changelog state GhostTableMigrated
2024-06-19 05:03:20 INFO Handled changelog state GhostTableMigrated
2024-06-19 05:03:20 INFO Chosen shared unique key is PRIMARY
2024-06-19 05:03:20 INFO Shared columns are id,k,c,pad
2024-06-19 05:03:20 INFO Listening on unix socket file: /tmp/gh-ost.sysbench.sbtest1.sock
2024-06-19 05:03:20 INFO As instructed, counting rows in the background; meanwhile I will use an estimated count, and will update it later on
2024-06-19 05:03:20 INFO As instructed, I'm issuing a SELECT COUNT(*) on the table. This may take a while
2024-06-19 05:03:20 INFO Migration min values: [1]
2024-06-19 05:03:20 INFO Migration max values: [250000]
2024-06-19 05:03:20 INFO Waiting for first throttle metrics to be collected
2024-06-19 05:03:20 INFO Exact number of rows via COUNT: 249998
2024-06-19 05:03:20 INFO First throttle metrics collected
# Migrating `sysbench`.`sbtest1`; Ghost table is `sysbench`.`_sbtest1_gho`
# Migrating ip-172-17-5-173:3306; inspecting ip-172-17-5-173:3306; executing on i-09fce0c3dab1a3eb2.ec2.internal
# Migration started at Wed Jun 19 05:03:20 +0000 2024
# chunk-size: 1000; max-lag-millis: 1500ms; dml-batch-size: 10; max-load: Threads_running=25; critical-load: Threads_running=1000; nice-ratio: 0.000000
# throttle-additional-flag-file: /tmp/gh-ost.throttle
# postpone-cut-over-flag-file: /tmp/ghost.postpone.flag [set]
# panic-flag-file: /tmp/ghost.panic.flag
# Serving on unix socket: /tmp/gh-ost.sysbench.sbtest1.sock
Copy: 0/249998 0.0%; Applied: 0; Backlog: 0/1000; Time: 0s(total), 0s(copy); streamer: mysql-bin-changelog.000005:97921758; Lag: 0.02s, HeartbeatLag: 0.02s, State: migrating; ETA: N/A
2024-06-19 05:03:20 INFO Intercepted changelog state ReadMigrationRangeValues
2024-06-19 05:03:20 INFO Handled changelog state ReadMigrationRangeValues
Copy: 33990/250092 13.6%; Applied: 417; Backlog: 0/1000; Time: 1s(total), 1s(copy); streamer: mysql-bin-changelog.000005:105687523; Lag: 0.01s, HeartbeatLag: 0.02s, State: migrating; ETA: 6s
Copy: 66972/250177 26.8%; Applied: 826; Backlog: 4/1000; Time: 2s(total), 2s(copy); streamer: mysql-bin-changelog.000005:113281935; Lag: 0.01s, HeartbeatLag: 0.12s, State: migrating; ETA: 5s
Copy: 98944/250228 39.5%; Applied: 1208; Backlog: 5/1000; Time: 3s(total), 3s(copy); streamer: mysql-bin-changelog.000005:120652666; Lag: 0.01s, HeartbeatLag: 0.12s, State: migrating; ETA: 4s
Copy: 129909/250285 51.9%; Applied: 1629; Backlog: 13/1000; Time: 4s(total), 4s(copy); streamer: mysql-bin-changelog.000005:127882787; Lag: 0.01s, HeartbeatLag: 0.02s, State: migrating; ETA: 3s
[2024/06/19 05:03:25] [info] binlogsyncer.go:723 rotate to (mysql-bin-changelog.000006, 4)
[2024/06/19 05:03:25] [info] binlogsyncer.go:723 rotate to (mysql-bin-changelog.000006, 4)
2024-06-19 05:03:25 INFO rotate to next log from mysql-bin-changelog.000006:134290721 to mysql-bin-changelog.000006
2024-06-19 05:03:25 INFO rotate to next log from mysql-bin-changelog.000006:0 to mysql-bin-changelog.000006
Copy: 157851/250334 63.1%; Applied: 2030; Backlog: 3/1000; Time: 5s(total), 5s(copy); streamer: mysql-bin-changelog.000006:179404; Lag: 0.01s, HeartbeatLag: 0.02s, State: migrating; ETA: 2s
Copy: 185780/250375 74.2%; Applied: 2469; Backlog: 8/1000; Time: 6s(total), 6s(copy); streamer: mysql-bin-changelog.000006:6933020; Lag: 0.01s, HeartbeatLag: 0.12s, State: migrating; ETA: 2s
Copy: 213698/250403 85.3%; Applied: 2910; Backlog: 1/1000; Time: 7s(total), 7s(copy); streamer: mysql-bin-changelog.000006:13614396; Lag: 0.01s, HeartbeatLag: 0.02s, State: migrating; ETA: 1s
Copy: 243611/250410 97.3%; Applied: 3297; Backlog: 7/1000; Time: 8s(total), 8s(copy); streamer: mysql-bin-changelog.000006:20681642; Lag: 0.01s, HeartbeatLag: 0.02s, State: migrating; ETA: due
2024-06-19 05:03:28 INFO Row copy complete
Copy: 249584/249584 100.0%; Applied: 3392; Backlog: 0/1000; Time: 8s(total), 8s(copy); streamer: mysql-bin-changelog.000006:22142335; Lag: 0.01s, HeartbeatLag: 0.04s, State: migrating; ETA: due
Copy: 249584/249584 100.0%; Applied: 3392; Backlog: 387/1000; Time: 9s(total), 8s(copy); streamer: mysql-bin-changelog.000006:23264031; Lag: 0.01s, HeartbeatLag: 0.12s, State: postponing cut-over; ETA: due
Copy: 249584/249584 100.0%; Applied: 4078; Backlog: 224/1000; Time: 10s(total), 8s(copy); streamer: mysql-bin-changelog.000006:25137638; Lag: 0.01s, HeartbeatLag: 0.02s, State: postponing cut-over; ETA: due
Copy: 249584/249584 100.0%; Applied: 4558; Backlog: 214/1000; Time: 11s(total), 8s(copy); streamer: mysql-bin-changelog.000006:26805088; Lag: 0.01s, HeartbeatLag: 0.12s, State: postponing cut-over; ETA: due
Copy: 249584/249584 100.0%; Applied: 4885; Backlog: 389/1000; Time: 12s(total), 8s(copy); streamer: mysql-bin-changelog.000006:28456583; Lag: 0.01s, HeartbeatLag: 0.02s, State: postponing cut-over; ETA: due
Copy: 249584/249584 100.0%; Applied: 5592; Backlog: 216/1000; Time: 13s(total), 8s(copy); streamer: mysql-bin-changelog.000006:30251766; Lag: 0.01s, HeartbeatLag: 0.12s, State: postponing cut-over; ETA: due
Copy: 249584/249584 100.0%; Applied: 6272; Backlog: 79/1000; Time: 14s(total), 8s(copy); streamer: mysql-bin-changelog.000006:32113736; Lag: 0.01s, HeartbeatLag: 0.02s, State: postponing cut-over; ETA: due
Copy: 249584/249584 100.0%; Applied: 6391; Backlog: 550/1000; Time: 15s(total), 8s(copy); streamer: mysql-bin-changelog.000006:33782524; Lag: 0.01s, HeartbeatLag: 0.02s, State: postponing cut-over; ETA: due
2024-06-19 05:03:35 INFO Grabbing voluntary lock: gh-ost.89.lock
2024-06-19 05:03:35 INFO Setting LOCK timeout as 6 seconds
2024-06-19 05:03:35 INFO Looking for magic cut-over table
2024-06-19 05:03:35 INFO Creating magic cut-over table `sysbench`.`_sbtest1_del`
2024-06-19 05:03:35 INFO Magic cut-over table created
2024-06-19 05:03:35 INFO Locking `sysbench`.`sbtest1`, `sysbench`.`_sbtest1_del`
2024-06-19 05:03:35 INFO Tables locked
2024-06-19 05:03:35 INFO Session locking original & magic tables is 89
2024-06-19 05:03:35 INFO Writing changelog state: AllEventsUpToLockProcessed:1718773415465582443
2024-06-19 05:03:35 INFO Intercepted changelog state AllEventsUpToLockProcessed
2024-06-19 05:03:35 INFO Handled changelog state AllEventsUpToLockProcessed
2024-06-19 05:03:35 INFO Waiting for events up to lock
2024-06-19 05:03:35 INFO Waiting for events up to lock: got AllEventsUpToLockProcessed:1718773415465582443
2024-06-19 05:03:35 INFO Done waiting for events up to lock; duration=175.935047ms
# Migrating `sysbench`.`sbtest1`; Ghost table is `sysbench`.`_sbtest1_gho`
# Migrating ip-172-17-5-173:3306; inspecting ip-172-17-5-173:3306; executing on i-09fce0c3dab1a3eb2.ec2.internal
# Migration started at Wed Jun 19 05:03:20 +0000 2024
# chunk-size: 1000; max-lag-millis: 1500ms; dml-batch-size: 10; max-load: Threads_running=25; critical-load: Threads_running=1000; nice-ratio: 0.000000
# throttle-additional-flag-file: /tmp/gh-ost.throttle
# postpone-cut-over-flag-file: /tmp/ghost.postpone.flag
# panic-flag-file: /tmp/ghost.panic.flag
# Serving on unix socket: /tmp/gh-ost.sysbench.sbtest1.sock
Copy: 249584/249584 100.0%; Applied: 7082; Backlog: 0/1000; Time: 15s(total), 8s(copy); streamer: mysql-bin-changelog.000006:34501096; Lag: 0.01s, HeartbeatLag: 0.04s, State: migrating; ETA: due
2024-06-19 05:03:35 INFO Setting RENAME timeout as 3 seconds
2024-06-19 05:03:35 INFO Session renaming tables is 83
2024-06-19 05:03:35 INFO Issuing and expecting this to block: rename /* gh-ost */ table `sysbench`.`sbtest1` to `sysbench`.`_sbtest1_del`, `sysbench`.`_sbtest1_gho` to `sysbench`.`sbtest1`
2024-06-19 05:03:35 INFO Found atomic RENAME to be blocking, as expected. Double checking the lock is still in place (though I don't strictly have to)
2024-06-19 05:03:35 INFO Checking session lock: gh-ost.89.lock
2024-06-19 05:03:35 INFO Connection holding lock on original table still exists
2024-06-19 05:03:35 INFO Will now proceed to drop magic table and unlock tables
2024-06-19 05:03:35 INFO Dropping magic cut-over table
2024-06-19 05:03:35 INFO Releasing lock from `sysbench`.`sbtest1`, `sysbench`.`_sbtest1_del`
2024-06-19 05:03:35 INFO Tables unlocked
2024-06-19 05:03:35 INFO Tables renamed
2024-06-19 05:03:35 INFO Lock & rename duration: 206.238705ms. During this time, queries on `sbtest1` were blocked
[2024/06/19 05:03:35] [info] binlogsyncer.go:164 syncer is closing...
[2024/06/19 05:03:35] [error] binlogstreamer.go:77 close sync with err: sync is been closing...
2024-06-19 05:03:35 INFO Closed streamer connection. err=<nil>
[2024/06/19 05:03:35] [info] binlogsyncer.go:179 syncer is closed
2024-06-19 05:03:35 INFO Dropping table `sysbench`.`_sbtest1_ghc`
2024-06-19 05:03:35 INFO Table dropped
2024-06-19 05:03:35 INFO Am not dropping old table because I want this operation to be as live as possible. If you insist I should do it, please add `--ok-to-drop-table` next time. But I prefer you do not. To drop the old table, issue:
2024-06-19 05:03:35 INFO -- drop table `sysbench`.`_sbtest1_del`
2024-06-19 05:03:35 INFO Done migrating `sysbench`.`sbtest1`
2024-06-19 05:03:35 INFO Removing socket file: /tmp/gh-ost.sysbench.sbtest1.sock
2024-06-19 05:03:35 INFO Tearing down inspector
2024-06-19 05:03:35 INFO Tearing down applier
2024-06-19 05:03:35 INFO Tearing down streamer
2024-06-19 05:03:35 INFO Tearing down throttler
# Done
mysql> select @@aurora_binlog_replication_max_yield_seconds,@@aurora_enable_repl_bin_log_filtering,@@aurora_version;
+-----------------------------------------------+----------------------------------------+------------------+
| @@aurora_binlog_replication_max_yield_seconds | @@aurora_enable_repl_bin_log_filtering | @@aurora_version |
+-----------------------------------------------+----------------------------------------+------------------+
| 0 | 1 | 3.07.0 |
+-----------------------------------------------+----------------------------------------+------------------+
1 row in set (0.00 sec)
[ec2-user@i-09fce0c3dab1a3eb2 ~]$ gh-ost --max-load=Threads_running=25 --critical-load=Threads_running=1000 --chunk-size=1000 --max-lag-millis=1500 --user="user" --password="password" --host=<host>us-east-1.rds.amazonaws.com --allow-on-master --database="sysbench" --table="sbtest1" --verbose --alter="engine=innodb" --switch-to-rbr --allow-master-master --cut-over=default --exact-rowcount --concurrent-rowcount --default-retries=120 --panic-flag-file=/tmp/ghost.panic.flag --postpone-cut-over-flag-file=/tmp/ghost.postpone.flag --initially-drop-old-table --execute
2024-06-19 05:12:28 INFO starting gh-ost 1.1.5
2024-06-19 05:12:28 INFO Migrating `sysbench`.`sbtest1`
2024-06-19 05:12:28 INFO inspector connection validated on <host>us-east-1.rds.amazonaws.com:3306
2024-06-19 05:12:28 INFO User has REPLICATION CLIENT, REPLICATION SLAVE privileges, and has ALL privileges on `sysbench`.*
2024-06-19 05:12:28 INFO binary logs validated on <host>us-east-1.rds.amazonaws.com:3306
2024-06-19 05:12:28 INFO Restarting replication on <host>us-east-1.rds.amazonaws.com:3306 to make sure binlog settings apply to replication thread
2024-06-19 05:12:28 INFO Inspector initiated on ip-172-17-5-142:3306, version 8.0.36
2024-06-19 05:12:28 INFO Table found. Engine=InnoDB
2024-06-19 05:12:28 INFO Estimated number of rows via EXPLAIN: 246674
2024-06-19 05:12:28 INFO Recursively searching for replication master
2024-06-19 05:12:28 INFO Master found to be ip-172-17-5-142:3306
2024-06-19 05:12:28 INFO log_slave_updates validated on <host>us-east-1.rds.amazonaws.com:3306
2024-06-19 05:12:28 INFO streamer connection validated on <host>us-east-1.rds.amazonaws.com:3306
2024-06-19 05:12:28 INFO Connecting binlog streamer at mysql-bin-changelog.000006:9395302
[2024/06/19 05:12:28] [info] binlogsyncer.go:133 create BinlogSyncer with config {99999 mysql <host>us-east-1.rds.amazonaws.com 3306 user false false <nil> false UTC true 0 0s 0s 0 false}
[2024/06/19 05:12:28] [info] binlogsyncer.go:354 begin to sync binlog from position (mysql-bin-changelog.000006, 9395302)
[2024/06/19 05:12:28] [info] binlogsyncer.go:203 register slave for master server <host>us-east-1.rds.amazonaws.com:3306
2024-06-19 05:12:28 INFO applier connection validated on <host>us-east-1.rds.amazonaws.com:3306
2024-06-19 05:12:28 INFO applier connection validated on <host>us-east-1.rds.amazonaws.com:3306
2024-06-19 05:12:28 INFO will use time_zone='SYSTEM' on applier
2024-06-19 05:12:28 INFO Examining table structure on applier
2024-06-19 05:12:28 INFO Applier initiated on ip-172-17-5-142:3306, version 8.0.36
2024-06-19 05:12:28 INFO Dropping table `sysbench`.`_sbtest1_del`
2024-06-19 05:12:28 INFO Table dropped
2024-06-19 05:12:28 INFO Dropping table `sysbench`.`_sbtest1_ghc`
2024-06-19 05:12:28 INFO Table dropped
2024-06-19 05:12:28 INFO Creating changelog table `sysbench`.`_sbtest1_ghc`
[2024/06/19 05:12:28] [info] binlogsyncer.go:723 rotate to (mysql-bin-changelog.000006, 9395302)
2024-06-19 05:12:28 INFO rotate to next log from mysql-bin-changelog.000006:0 to mysql-bin-changelog.000006
2024-06-19 05:12:28 INFO Changelog table created
2024-06-19 05:12:28 INFO Creating ghost table `sysbench`.`_sbtest1_gho`
2024-06-19 05:12:28 INFO Ghost table created
2024-06-19 05:12:28 INFO Altering ghost table `sysbench`.`_sbtest1_gho`
2024-06-19 05:12:28 INFO Ghost table altered
2024-06-19 05:12:28 INFO Altering ghost table AUTO_INCREMENT value `sysbench`.`_sbtest1_gho`
2024-06-19 05:12:28 INFO Ghost table AUTO_INCREMENT altered
2024-06-19 05:12:28 INFO Created postpone-cut-over-flag-file: /tmp/ghost.postpone.flag
2024-06-19 05:12:28 INFO Intercepted changelog state GhostTableMigrated
2024-06-19 05:12:28 INFO Waiting for ghost table to be migrated. Current lag is 0s
2024-06-19 05:12:28 INFO Handled changelog state GhostTableMigrated
2024-06-19 05:12:28 INFO Chosen shared unique key is PRIMARY
2024-06-19 05:12:28 INFO Shared columns are id,k,c,pad
2024-06-19 05:12:28 INFO Listening on unix socket file: /tmp/gh-ost.sysbench.sbtest1.sock
2024-06-19 05:12:28 INFO As instructed, counting rows in the background; meanwhile I will use an estimated count, and will update it later on
2024-06-19 05:12:28 INFO As instructed, I'm issuing a SELECT COUNT(*) on the table. This may take a while
2024-06-19 05:12:28 INFO Migration min values: [1]
2024-06-19 05:12:28 INFO Migration max values: [250000]
2024-06-19 05:12:28 INFO Waiting for first throttle metrics to be collected
2024-06-19 05:12:28 INFO Exact number of rows via COUNT: 249991
2024-06-19 05:12:28 INFO First throttle metrics collected
# Migrating `sysbench`.`sbtest1`; Ghost table is `sysbench`.`_sbtest1_gho`
# Migrating ip-172-17-5-142:3306; inspecting ip-172-17-5-142:3306; executing on i-09fce0c3dab1a3eb2.ec2.internal
# Migration started at Wed Jun 19 05:12:28 +0000 2024
# chunk-size: 1000; max-lag-millis: 1500ms; dml-batch-size: 10; max-load: Threads_running=25; critical-load: Threads_running=1000; nice-ratio: 0.000000
# throttle-additional-flag-file: /tmp/gh-ost.throttle
# postpone-cut-over-flag-file: /tmp/ghost.postpone.flag [set]
# panic-flag-file: /tmp/ghost.panic.flag
# Serving on unix socket: /tmp/gh-ost.sysbench.sbtest1.sock
Copy: 0/249991 0.0%; Applied: 0; Backlog: 0/1000; Time: 0s(total), 0s(copy); streamer: mysql-bin-changelog.000006:9570832; Lag: 0.02s, HeartbeatLag: 0.02s, State: migrating; ETA: N/A
2024-06-19 05:12:28 INFO Intercepted changelog state ReadMigrationRangeValues
2024-06-19 05:12:28 INFO Handled changelog state ReadMigrationRangeValues
Copy: 29993/250100 12.0%; Applied: 442; Backlog: 4/1000; Time: 1s(total), 1s(copy); streamer: mysql-bin-changelog.000006:16349500; Lag: 0.01s, HeartbeatLag: 0.02s, State: migrating; ETA: 7s
Copy: 59975/250178 24.0%; Applied: 832; Backlog: 0/1000; Time: 2s(total), 2s(copy); streamer: mysql-bin-changelog.000006:23464017; Lag: 0.01s, HeartbeatLag: 0.02s, State: migrating; ETA: 6s
Copy: 88946/250229 35.5%; Applied: 1199; Backlog: 14/1000; Time: 3s(total), 3s(copy); streamer: mysql-bin-changelog.000006:30258476; Lag: 0.01s, HeartbeatLag: 0.02s, State: migrating; ETA: 5s
Copy: 115910/250285 46.3%; Applied: 1590; Backlog: 13/1000; Time: 4s(total), 4s(copy); streamer: mysql-bin-changelog.000006:36607789; Lag: 0.01s, HeartbeatLag: 0.02s, State: migrating; ETA: 4s
Copy: 142868/250339 57.1%; Applied: 2016; Backlog: 8/1000; Time: 5s(total), 5s(copy); streamer: mysql-bin-changelog.000006:43082810; Lag: 0.01s, HeartbeatLag: 0.02s, State: migrating; ETA: 3s
Copy: 169805/250371 67.8%; Applied: 2365; Backlog: 4/1000; Time: 6s(total), 6s(copy); streamer: mysql-bin-changelog.000006:49467553; Lag: 0.01s, HeartbeatLag: 0.02s, State: migrating; ETA: 2s
Copy: 196750/250394 78.6%; Applied: 2755; Backlog: 9/1000; Time: 7s(total), 7s(copy); streamer: mysql-bin-changelog.000006:55926488; Lag: 0.01s, HeartbeatLag: 0.12s, State: migrating; ETA: 1s
Copy: 222669/250411 88.9%; Applied: 3139; Backlog: 0/1000; Time: 8s(total), 8s(copy); streamer: mysql-bin-changelog.000006:62177925; Lag: 0.01s, HeartbeatLag: 0.02s, State: migrating; ETA: due
Copy: 247582/250414 98.9%; Applied: 3582; Backlog: 2/1000; Time: 9s(total), 9s(copy); streamer: mysql-bin-changelog.000006:68357393; Lag: 0.01s, HeartbeatLag: 0.02s, State: migrating; ETA: due
2024-06-19 05:12:37 INFO Row copy complete
Copy: 249569/249569 100.0%; Applied: 3612; Backlog: 0/1000; Time: 9s(total), 9s(copy); streamer: mysql-bin-changelog.000006:68840292; Lag: 0.01s, HeartbeatLag: 0.10s, State: migrating; ETA: due
Copy: 249569/249569 100.0%; Applied: 3615; Backlog: 441/1000; Time: 10s(total), 9s(copy); streamer: mysql-bin-changelog.000006:70195592; Lag: 0.01s, HeartbeatLag: 0.02s, State: postponing cut-over; ETA: due
Copy: 249569/249569 100.0%; Applied: 4333; Backlog: 290/1000; Time: 11s(total), 9s(copy); streamer: mysql-bin-changelog.000006:72053715; Lag: 0.01s, HeartbeatLag: 0.02s, State: postponing cut-over; ETA: due
Copy: 249569/249569 100.0%; Applied: 5053; Backlog: 100/1000; Time: 12s(total), 9s(copy); streamer: mysql-bin-changelog.000006:73833081; Lag: 0.01s, HeartbeatLag: 0.02s, State: postponing cut-over; ETA: due
Copy: 249569/249569 100.0%; Applied: 5210; Backlog: 529/1000; Time: 13s(total), 9s(copy); streamer: mysql-bin-changelog.000006:75499945; Lag: 0.01s, HeartbeatLag: 0.02s, State: postponing cut-over; ETA: due
Copy: 249569/249569 100.0%; Applied: 6058; Backlog: 209/1000; Time: 14s(total), 9s(copy); streamer: mysql-bin-changelog.000006:77389881; Lag: 0.01s, HeartbeatLag: 0.02s, State: postponing cut-over; ETA: due
Copy: 249569/249569 100.0%; Applied: 6578; Backlog: 206/1000; Time: 15s(total), 9s(copy); streamer: mysql-bin-changelog.000006:79224057; Lag: 0.01s, HeartbeatLag: 0.02s, State: postponing cut-over; ETA: due
Copy: 249569/249569 100.0%; Applied: 6900; Backlog: 460/1000; Time: 16s(total), 9s(copy); streamer: mysql-bin-changelog.000006:80953330; Lag: 0.01s, HeartbeatLag: 0.02s, State: postponing cut-over; ETA: due
Copy: 249569/249569 100.0%; Applied: 7765; Backlog: 132/1000; Time: 17s(total), 9s(copy); streamer: mysql-bin-changelog.000006:82906479; Lag: 0.02s, HeartbeatLag: 0.02s, State: postponing cut-over; ETA: due
Copy: 249569/249569 100.0%; Applied: 8125; Backlog: 347/1000; Time: 18s(total), 9s(copy); streamer: mysql-bin-changelog.000006:84637821; Lag: 0.01s, HeartbeatLag: 0.02s, State: postponing cut-over; ETA: due
Copy: 249569/249569 100.0%; Applied: 8665; Backlog: 373/1000; Time: 19s(total), 9s(copy); streamer: mysql-bin-changelog.000006:86474673; Lag: 0.01s, HeartbeatLag: 0.02s, State: postponing cut-over; ETA: due
Copy: 249569/249569 100.0%; Applied: 9355; Backlog: 168/1000; Time: 20s(total), 9s(copy); streamer: mysql-bin-changelog.000006:88128534; Lag: 0.01s, HeartbeatLag: 0.02s, State: postponing cut-over; ETA: due
Copy: 249569/249569 100.0%; Applied: 9609; Backlog: 495/1000; Time: 21s(total), 9s(copy); streamer: mysql-bin-changelog.000006:89868954; Lag: 0.01s, HeartbeatLag: 0.02s, State: postponing cut-over; ETA: due
Copy: 249569/249569 100.0%; Applied: 10419; Backlog: 162/1000; Time: 22s(total), 9s(copy); streamer: mysql-bin-changelog.000006:91730606; Lag: 0.02s, HeartbeatLag: 0.02s, State: postponing cut-over; ETA: due
Copy: 249569/249569 100.0%; Applied: 10869; Backlog: 270/1000; Time: 23s(total), 9s(copy); streamer: mysql-bin-changelog.000006:93477136; Lag: 0.01s, HeartbeatLag: 0.02s, State: postponing cut-over; ETA: due
Copy: 249569/249569 100.0%; Applied: 11270; Backlog: 374/1000; Time: 24s(total), 9s(copy); streamer: mysql-bin-changelog.000006:95108250; Lag: 0.01s, HeartbeatLag: 0.02s, State: postponing cut-over; ETA: due
Copy: 249569/249569 100.0%; Applied: 12093; Backlog: 106/1000; Time: 25s(total), 9s(copy); streamer: mysql-bin-changelog.000006:97023664; Lag: 0.01s, HeartbeatLag: 0.02s, State: postponing cut-over; ETA: due
Copy: 249569/249569 100.0%; Applied: 12363; Backlog: 414/1000; Time: 26s(total), 9s(copy); streamer: mysql-bin-changelog.000006:98721735; Lag: 0.01s, HeartbeatLag: 0.02s, State: postponing cut-over; ETA: due
Copy: 249569/249569 100.0%; Applied: 12986; Backlog: 309/1000; Time: 27s(total), 9s(copy); streamer: mysql-bin-changelog.000006:100533872; Lag: 0.01s, HeartbeatLag: 0.02s, State: postponing cut-over; ETA: due
Copy: 249569/249569 100.0%; Applied: 13745; Backlog: 62/1000; Time: 28s(total), 9s(copy); streamer: mysql-bin-changelog.000006:102297914; Lag: 0.01s, HeartbeatLag: 0.02s, State: postponing cut-over; ETA: due
Copy: 249569/249569 100.0%; Applied: 13905; Backlog: 442/1000; Time: 29s(total), 9s(copy); streamer: mysql-bin-changelog.000006:103944827; Lag: 0.01s, HeartbeatLag: 0.02s, State: postponing cut-over; ETA: due
Copy: 249569/249569 100.0%; Applied: 14605; Backlog: 263/1000; Time: 30s(total), 9s(copy); streamer: mysql-bin-changelog.000006:105781890; Lag: 0.01s, HeartbeatLag: 0.02s, State: postponing cut-over; ETA: due
Copy: 249569/249569 100.0%; Applied: 15438; Backlog: 11/1000; Time: 31s(total), 9s(copy); streamer: mysql-bin-changelog.000006:107730161; Lag: 0.01s, HeartbeatLag: 0.02s, State: postponing cut-over; ETA: due
Copy: 249569/249569 100.0%; Applied: 15458; Backlog: 482/1000; Time: 32s(total), 9s(copy); streamer: mysql-bin-changelog.000006:109295632; Lag: 0.01s, HeartbeatLag: 0.02s, State: postponing cut-over; ETA: due
Copy: 249569/249569 100.0%; Applied: 16132; Backlog: 339/1000; Time: 33s(total), 9s(copy); streamer: mysql-bin-changelog.000006:111135451; Lag: 0.01s, HeartbeatLag: 0.02s, State: postponing cut-over; ETA: due
2024-06-19 05:13:01 INFO Grabbing voluntary lock: gh-ost.46.lock
2024-06-19 05:13:01 INFO Setting LOCK timeout as 6 seconds
2024-06-19 05:13:01 INFO Looking for magic cut-over table
2024-06-19 05:13:01 INFO Creating magic cut-over table `sysbench`.`_sbtest1_del`
2024-06-19 05:13:01 INFO Magic cut-over table created
2024-06-19 05:13:01 INFO Locking `sysbench`.`sbtest1`, `sysbench`.`_sbtest1_del`
2024-06-19 05:13:01 INFO Tables locked
2024-06-19 05:13:01 INFO Session locking original & magic tables is 46
2024-06-19 05:13:01 INFO Writing changelog state: AllEventsUpToLockProcessed:1718773981687017615
2024-06-19 05:13:01 INFO Intercepted changelog state AllEventsUpToLockProcessed
2024-06-19 05:13:01 INFO Handled changelog state AllEventsUpToLockProcessed
2024-06-19 05:13:01 INFO Waiting for events up to lock
2024-06-19 05:13:02 INFO Waiting for events up to lock: got AllEventsUpToLockProcessed:1718773981687017615
2024-06-19 05:13:02 INFO Done waiting for events up to lock; duration=515.236968ms
# Migrating `sysbench`.`sbtest1`; Ghost table is `sysbench`.`_sbtest1_gho`
# Migrating ip-172-17-5-142:3306; inspecting ip-172-17-5-142:3306; executing on i-09fce0c3dab1a3eb2.ec2.internal
# Migration started at Wed Jun 19 05:12:28 +0000 2024
# chunk-size: 1000; max-lag-millis: 1500ms; dml-batch-size: 10; max-load: Threads_running=25; critical-load: Threads_running=1000; nice-ratio: 0.000000
# throttle-additional-flag-file: /tmp/gh-ost.throttle
# postpone-cut-over-flag-file: /tmp/ghost.postpone.flag
# panic-flag-file: /tmp/ghost.panic.flag
# Serving on unix socket: /tmp/gh-ost.sysbench.sbtest1.sock
Copy: 249569/249569 100.0%; Applied: 16525; Backlog: 0/1000; Time: 33s(total), 9s(copy); streamer: mysql-bin-changelog.000006:111508579; Lag: 0.01s, HeartbeatLag: 0.04s, State: migrating; ETA: due
2024-06-19 05:13:02 INFO Setting RENAME timeout as 3 seconds
2024-06-19 05:13:02 INFO Session renaming tables is 47
2024-06-19 05:13:02 INFO Issuing and expecting this to block: rename /* gh-ost */ table `sysbench`.`sbtest1` to `sysbench`.`_sbtest1_del`, `sysbench`.`_sbtest1_gho` to `sysbench`.`sbtest1`
2024-06-19 05:13:02 INFO Found atomic RENAME to be blocking, as expected. Double checking the lock is still in place (though I don't strictly have to)
2024-06-19 05:13:02 INFO Checking session lock: gh-ost.46.lock
2024-06-19 05:13:02 INFO Connection holding lock on original table still exists
2024-06-19 05:13:02 INFO Will now proceed to drop magic table and unlock tables
2024-06-19 05:13:02 INFO Dropping magic cut-over table
2024-06-19 05:13:02 INFO Releasing lock from `sysbench`.`sbtest1`, `sysbench`.`_sbtest1_del`
2024-06-19 05:13:02 INFO Tables unlocked
2024-06-19 05:13:02 INFO Tables renamed
2024-06-19 05:13:02 INFO Lock & rename duration: 548.080798ms. During this time, queries on `sbtest1` were blocked
[2024/06/19 05:13:02] [info] binlogsyncer.go:164 syncer is closing...
[2024/06/19 05:13:02] [error] binlogstreamer.go:77 close sync with err: sync is been closing...
2024-06-19 05:13:02 INFO Closed streamer connection. err=<nil>
2024-06-19 05:13:02 INFO Dropping table `sysbench`.`_sbtest1_ghc`
[2024/06/19 05:13:02] [info] binlogsyncer.go:179 syncer is closed
2024-06-19 05:13:02 INFO Table dropped
2024-06-19 05:13:02 INFO Am not dropping old table because I want this operation to be as live as possible. If you insist I should do it, please add `--ok-to-drop-table` next time. But I prefer you do not. To drop the old table, issue:
2024-06-19 05:13:02 INFO -- drop table `sysbench`.`_sbtest1_del`
2024-06-19 05:13:02 INFO Done migrating `sysbench`.`sbtest1`
2024-06-19 05:13:02 INFO Removing socket file: /tmp/gh-ost.sysbench.sbtest1.sock
2024-06-19 05:13:02 INFO Tearing down inspector
2024-06-19 05:13:02 INFO Tearing down applier
2024-06-19 05:13:02 INFO Tearing down streamer
2024-06-19 05:13:02 INFO Tearing down throttler
# Done
Test: gh-ost command:
gh-ost --max-load=Threads_running=25 --critical-load=Threads_running=1000 --chunk-size=1000 --max-lag-millis=1500 --user="user" --password="password" --host=<host>.us-east-1.rds.amazonaws.com --allow-on-master --database="sysbench" --table="sbtest1" --verbose --alter="engine=innodb" --switch-to-rbr --allow-master-master --cut-over=default --exact-rowcount --concurrent-rowcount --default-retries=120 --panic-flag-file=/tmp/ghost.panic.flag --postpone-cut-over-flag-file=/tmp/ghost.postpone.flag --initially-drop-old-table --execute
sysbench command running in parallel:
sysbench ./sysbench/src/lua/oltp_write_only.lua --db-driver=mysql --mysql-user=user --mysql-password=password --mysql-db=sysbench --mysql-host=<host>-east-1.rds.amazonaws.com --mysql-port=3306 --threads=10 --rand-type=uniform --db-ps-mode=disable --tables=5 --forced-shutdown --table-size=250000 --report-interval=1 --time=0 --skip-trx run
Interesting! We'd tried it on slightly earlier aurora version and had problems. We bumped our minor version just the other day and hadn't tried since, but an attempt just now works.
We need to do a little more investigation as we may have made a mistake in our config that we didn't notice when making our earlier attempts
Whelp... turns out we had a #facepalm level configuration issue. Thanks for posting this issue as it caused us to go back and triple check things (and find our obvious-in-hindsight problem)
Thanks for double checking this! Based on testing this seems to confirm
aurora_enable_repl_bin_log_filtering
but aurora_binlog_replication_max_yield_seconds
aurora_binlog_replication_max_yield_seconds
is no longer required in the latest Aurora versions, and recommendation is to leave it set to zero. (versions 2.10+/3.01+).
You don't need to adjust any configuration parameters to turn on this optimization. In particular, if you adjust the configuration parameter aurora_binlog_replication_max_yield_seconds to a nonzero value in earlier Aurora MySQL versions, set it back to zero for Aurora MySQL 2.10 and higher.
Since the default for aurora_binlog_replication_max_yield_seconds
is zero in latest versions, I would recommend updating gh-ost docs to:
aurora_enable_repl_bin_log_filtering
aurora_binlog_replication_max_yield_seconds
is not set to a non-zero value to prevent issues with heartbeat lag during cutover.
Hi!
In the RDS documentation it recommends disabling
aurora_enable_repl_bin_log_filtering
when using GH-OST in Aurora MySQL. Looking through the history I can see a few references to this potentially causing issues with cutovers, so I was looking to see if there was a reproducible case which can demonstrate this?aurora_enable_repl_bin_log_filtering
BackgroundIn Aurora, the writer instance in a cluster sends redo log records to the aurora storage volume and reader nodes in the cluster. Since Aurora MySQL stores binary logs in the cluster volume, redo log records are also be generated for binary logs under the hood. This is all transparent to users, and the binary logs are presented like they would be in community MySQL. When
aurora_enable_repl_bin_log_filtering
is enabled, redo logs for binary log records will still be sent to Aurora storage, but they will not be sent to aurora readers within the same cluster. Without filtering, these redo logs will be sent to reader instances and discarded since binary logs are not accessible from readers. This can lead to unnecessary amplification of network traffic on the writer/reader so the recommendation is to leaveaurora_enable_repl_bin_log_filtering
enabled. Binlog filtering is always enabled in Aurora MySQL version 3, so the parameter was removed in the Aurora 3 major version. This setting should not be confused with MySQL binlog replication filtering.Since binary logs are not accessible on readers, and
aurora_enable_repl_bin_log_filtering
only affects internal transportation of redo, I highly suspect this may not be the read cause of the cutover issues seen.Theory
What I suspect is the real cause is the
aurora_binlog_replication_max_yield_seconds
parameter.aurora_binlog_replication_max_yield_seconds
Backgroundaurora_binlog_replication_max_yield_seconds
was introduced along withaurora_binlog_read_buffer_size
to improve read performance of binary log consumer threads. The idea here was, to increase the IO read size(aurora_binlog_read_buffer_size
) for each binlog read req made by a consumer thread to improve IO efficiency/throughput of binlog consumer threads. Instead of reading binary logs from aurora storage in 8K chunks, you could configure the read buffer size of each binlog read request. The drawback was that if you were reading in larger chunks, it could lead to more contention with foreground transactions. To allow users configure this tradeoffaurora_binlog_replication_max_yield_seconds
was introduced. Instead of constantly contending with foreground transactions on the active binary log you could configure binary log consumers to "backoff" or "yield" for a number of seconds defined byaurora_binlog_replication_max_yield_seconds
.Note: In Aurora version 2.10 the binlog IO cache was introduced(more information here) which removed the need for the above yield second configurations. The binary log I/O cache aims to minimize read I/O from the Aurora storage layer by keeping the most recent binary log change events in its circular cache on your writer DB instance. This enhancement is enabled by default, so should not require any additional configurations. from docs:
What I suspect here is that if
aurora_binlog_replication_max_yield_seconds
is configured to a non-zero value, the heartbeat in GH-OST cutovers will never reach zero for long enough to allow the cutover succeed, as it will read an event, sleep("yield"), repeat.Test
I done some basic testing on my side to validate this and it seems to be the case, but would love to hear from others. With yield seconds set to
60
, my heartbeatLag does not decrease:Shortly after I set
aurora_binlog_replication_max_yield_seconds
back to zero in my parameter group, the cutover succeeded:Would love to hear what you think.
Thanks, Marc