dontstopbelieveing opened this issue 8 months ago
Created this PR to update the documentation: https://github.com/github/gh-ost/pull/1388
After this error, gh-ost went back to the migrating stage, supposedly to do rollbacks, which caused an availability issue and we had to kill the process. What is it rolling back? Since no cut-over was done, should it not simply abandon things and release the locks?
@dontstopbelieveing could you elaborate on this point?
To my knowledge gh-ost doesn't have any concept of "rolling something back", so I wonder if what you're seeing is a MySQL/InnoDB behaviour.
I'll add what we see in our test here. We land at the postpone cut-over stage:
Copy: 100000000/100000000 100.0%; Applied: 751753; Backlog: 0/1000; Time: 1h27m30s(total), 1h8m1s(copy); streamer: mysql-bin-changelog.232697:134650910; Lag: 0.09s, HeartbeatLag: 9.49s, State: postponing cut-over; ETA: due
[2024/03/09 11:31:04] [info] binlogsyncer.go:723 rotate to (mysql-bin-changelog.232698, 4)
[2024/03/09 11:31:04] [info] binlogsyncer.go:723 rotate to (mysql-bin-changelog.232698, 4)
2024-03-09 11:31:04 INFO rotate to next log from mysql-bin-changelog.232698:134667787 to mysql-bin-changelog.232698
2024-03-09 11:31:04 INFO rotate to next log from mysql-bin-changelog.232698:0 to mysql-bin-changelog.232698
Copy: 100000000/100000000 100.0%; Applied: 752224; Backlog: 0/1000; Time: 1h28m0s(total), 1h8m1s(copy); streamer: mysql-bin-changelog.232698:134396827; Lag: 0.09s, HeartbeatLag: 18.59s, State: postponing cut-over; ETA: due
[2024/03/09 11:31:24] [info] binlogsyncer.go:723 rotate to (mysql-bin-changelog.232699, 4)
[2024/03/09 11:31:24] [info] binlogsyncer.go:723 rotate to (mysql-bin-changelog.232699, 4)
2024-03-09 11:31:24 INFO rotate to next log from mysql-bin-changelog.232699:134397338 to mysql-bin-changelog.232699
2024-03-09 11:31:24 INFO rotate to next log from mysql-bin-changelog.232699:0 to mysql-bin-changelog.232699
2024-03-09 11:31:24 INFO rotate to next log from mysql-bin-changelog.232699:0 to mysql-bin-changelog.232699
And then, once we delete the cut-over flag file:
2024-03-09 11:31:27 INFO Grabbing voluntary lock: gh-ost.2374.lock
2024-03-09 11:31:27 INFO Setting LOCK timeout as 6 seconds
2024-03-09 11:31:27 INFO Looking for magic cut-over table
2024-03-09 11:31:27 INFO Creating magic cut-over table `sbtest`.`_sbtest1_del`
2024-03-09 11:31:27 INFO Magic cut-over table created
2024-03-09 11:31:27 INFO Locking `sbtest`.`sbtest1`, `sbtest`.`_sbtest1_del`
2024-03-09 11:31:27 INFO Tables locked
2024-03-09 11:31:27 INFO Session locking original & magic tables is 2374
2024-03-09 11:31:27 INFO Writing changelog state: AllEventsUpToLockProcessed:1709983887171918805
2024-03-09 11:31:27 INFO Waiting for events up to lock
2024-03-09 11:31:30 ERROR Timeout while waiting for events up to lock
2024-03-09 11:31:30 ERROR 2024-03-09 11:31:30 ERROR Timeout while waiting for events up to lock
2024-03-09 11:31:30 INFO Looking for magic cut-over table
2024-03-09 11:31:30 INFO Will now proceed to drop magic table and unlock tables
2024-03-09 11:31:30 INFO Dropping magic cut-over table
2024-03-09 11:31:30 INFO Dropping magic cut-over table
2024-03-09 11:31:30 INFO Dropping table `sbtest`.`_sbtest1_del`
So far, so good. At this point I would expect the metadata locks to be released, but they don't get released, and the log has these entries:
Copy: 100000000/100000000 100.0%; Applied: 752633; Backlog: 0/1000; Time: 1h28m10s(total), 1h8m1s(copy); streamer: mysql-bin-changelog.232699:134374965; Lag: 0.09s, HeartbeatLag: 7.19s, State: migrating; ETA: due
Copy: 100000000/100000000 100.0%; Applied: 752633; Backlog: 0/1000; Time: 1h28m15s(total), 1h8m1s(copy); streamer: mysql-bin-changelog.232699:134374965; Lag: 0.09s, HeartbeatLag: 12.19s, State: migrating; ETA: due
Copy: 100000000/100000000 100.0%; Applied: 752633; Backlog: 0/1000; Time: 1h28m20s(total), 1h8m1s(copy); streamer: mysql-bin-changelog.232699:134374965; Lag: 0.09s, HeartbeatLag: 17.19s, State: migrating; ETA: due
Copy: 100000000/100000000 100.0%; Applied: 752633; Backlog: 0/1000; Time: 1h28m25s(total), 1h8m1s(copy); streamer: mysql-bin-changelog.232699:134374965; Lag: 0.09s, HeartbeatLag: 22.19s, State: migrating; ETA: due
2024-03-09 11:31:46 INFO rotate to next log from mysql-bin-changelog.232700:134391842 to mysql-bin-changelog.232700
2024-03-09 11:31:46 INFO rotate to next log from mysql-bin-changelog.232700:0 to mysql-bin-changelog.232700
[2024/03/09 11:31:46] [info] binlogsyncer.go:723 rotate to (mysql-bin-changelog.232700, 4)
[2024/03/09 11:31:46] [info] binlogsyncer.go:723 rotate to (mysql-bin-changelog.232700, 4)
2024-03-09 11:31:47 INFO Intercepted changelog state AllEventsUpToLockProcessed
2024-03-09 11:31:47 INFO Handled changelog state AllEventsUpToLockProcessed
Copy: 100000000/100000000 100.0%; Applied: 752689; Backlog: 0/1000; Time: 1h28m30s(total), 1h8m1s(copy); streamer: mysql-bin-changelog.232700:135022773; Lag: 0.09s, HeartbeatLag: 5.79s, State: migrating; ETA: due
Copy: 100000000/100000000 100.0%; Applied: 752689; Backlog: 0/1000; Time: 1h28m35s(total), 1h8m1s(copy); streamer: mysql-bin-changelog.232700:135022773; Lag: 0.09s, HeartbeatLag: 10.79s, State: migrating; ETA: due
Copy: 100000000/100000000 100.0%; Applied: 752689; Backlog: 0/1000; Time: 1h28m40s(total), 1h8m1s(copy); streamer: mysql-bin-changelog.232700:135022773; Lag: 0.09s, HeartbeatLag: 15.79s, State: migrating; ETA: due
Copy: 100000000/100000000 100.0%; Applied: 752689; Backlog: 0/1000; Time: 1h28m45s(total), 1h8m1s(copy); streamer: mysql-bin-changelog.232700:135022773; Lag: 0.09s, HeartbeatLag: 20.79s, State: migrating; ETA: due
This continued till we manually killed the gh-ost process.
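One way to check the claim that the metadata locks were never released is to query performance_schema directly while gh-ost is stuck in this state. The sketch below is not part of gh-ost; it assumes MySQL's metadata-lock instrumentation ('wait/lock/metadata/sql/mdl') is enabled, uses the go-sql-driver/mysql driver, and the connection details and sbtest table names are placeholders taken from the log above.

```go
// check_mdl.go - a minimal diagnostic sketch (not part of gh-ost) listing which
// sessions still hold metadata locks on the original and magic tables.
// Assumptions: performance_schema is on with the 'wait/lock/metadata/sql/mdl'
// instrument enabled; DSN, schema and table names are placeholders.
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	db, err := sql.Open("mysql", "user:pass@tcp(127.0.0.1:3306)/")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	rows, err := db.Query(`
		SELECT ml.OBJECT_SCHEMA, ml.OBJECT_NAME, ml.LOCK_TYPE, ml.LOCK_STATUS,
		       t.PROCESSLIST_ID
		FROM performance_schema.metadata_locks ml
		JOIN performance_schema.threads t ON t.THREAD_ID = ml.OWNER_THREAD_ID
		WHERE ml.OBJECT_SCHEMA = ? AND ml.OBJECT_NAME IN (?, ?)`,
		"sbtest", "sbtest1", "_sbtest1_del")
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var schema, name, lockType, lockStatus string
		var processlistID sql.NullInt64
		if err := rows.Scan(&schema, &name, &lockType, &lockStatus, &processlistID); err != nil {
			log.Fatal(err)
		}
		// Each row is a metadata lock still held (or pending) on one of the tables.
		owner := "background thread"
		if processlistID.Valid {
			owner = fmt.Sprintf("processlist id %d", processlistID.Int64)
		}
		fmt.Printf("%s.%s %s lock is %s, owner: %s\n", schema, name, lockType, lockStatus, owner)
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
}
```

GRANTED rows owned by the gh-ost session's processlist id would confirm that the locks really are still held after the failed cut-over.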
I am baffled by two issues -
The reason I said "rolling something back" is that the effect I see on MySQL is similar to what a long-running transaction doing a rollback would cause. It might not actually be a rollback; it might be gh-ost running something else.
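If it really were InnoDB undoing a large transaction, that should show up in information_schema.innodb_trx with trx_state = 'ROLLING BACK' (and in SHOW ENGINE INNODB STATUS). A small companion sketch to the metadata-lock query above, under the same driver and connection assumptions:

```go
// Companion to the metadata-lock sketch above (same imports and *sql.DB setup).
// Lists transactions InnoDB reports as rolling back; an empty result would
// point away from the "long rollback" theory.
func printRollingBackTrx(db *sql.DB) error {
	rows, err := db.Query(`
		SELECT trx_mysql_thread_id, trx_started, trx_rows_modified
		FROM information_schema.innodb_trx
		WHERE trx_state = 'ROLLING BACK'`)
	if err != nil {
		return err
	}
	defer rows.Close()

	for rows.Next() {
		var connectionID, rowsModified int64
		var started string
		if err := rows.Scan(&connectionID, &started, &rowsModified); err != nil {
			return err
		}
		fmt.Printf("connection %d rolling back since %s (%d rows modified)\n",
			connectionID, started, rowsModified)
	}
	return rows.Err()
}
```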
This comes out of an issue we kept running into that caused great pain and outages.
We would run with
Then, once row copy was complete, we would be in the migrating stage for a long time applying binlogs. At this point the heartbeat lag would be 10-30 seconds. We thought that if we increased max-lag-millis from 1500 to 10000 this would give us less throttling and speed up binlog reading and applying (silly us!). The heartbeat lag would drop below 10 seconds, we would remove the cut-over file, and then we would run into "ERROR Timeout while waiting for events up to lock", which made sense, since 10 seconds is greater than the cut-over lock timeout of 6 seconds.
Our ask is that we edit the documentation to point out this important side effect of a seemingly innocent parameter, as is evident here: https://github.com/github/gh-ost/blob/master/go/logic/migrator.go#L504
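To make the relationship concrete, here is an illustrative sketch of the gate as we understand it from the linked code and from our own runs: the heartbeat lag effectively has to be below both --max-lag-millis and the cut-over lock timeout before the cut-over can succeed. canCutOver is a hypothetical helper, not a gh-ost function, and the numbers are taken from the logs above.

```go
// Illustrative sketch only (not the gh-ost source): raising --max-lag-millis
// to 10000 relaxes the first bound, but the 6-second lock timeout seen in the
// log above still bounds how much heartbeat lag a cut-over can tolerate.
package main

import (
	"fmt"
	"time"
)

// canCutOver is a hypothetical helper, not a gh-ost function.
func canCutOver(heartbeatLag, maxLagMillis, cutOverLockTimeout time.Duration) bool {
	return heartbeatLag < maxLagMillis && heartbeatLag < cutOverLockTimeout
}

func main() {
	heartbeatLag := 9490 * time.Millisecond // "HeartbeatLag: 9.49s" from the status line
	maxLagMillis := 10 * time.Second        // --max-lag-millis raised from 1500 to 10000
	lockTimeout := 6 * time.Second          // "Setting LOCK timeout as 6 seconds" from the log

	fmt.Println(canCutOver(heartbeatLag, maxLagMillis, lockTimeout)) // false: 9.49s exceeds the 6s lock timeout
}
```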
I also have two questions:
For context, we are on AWS Aurora, and the high heartbeat lag is a side effect of aurora_binlog_replication_max_yield_seconds being set to a non-zero value.
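For anyone reproducing this, the current value should be readable from the instance itself (the parameter is typically managed through the Aurora cluster parameter group). A minimal companion sketch under the same driver and *sql.DB assumptions as the earlier queries:

```go
// Companion sketch (same driver and *sql.DB assumptions as above): read back
// the Aurora-specific setting we believe is driving the heartbeat lag.
func auroraBinlogMaxYieldSeconds(db *sql.DB) (string, error) {
	var name, value string
	err := db.QueryRow(
		"SHOW GLOBAL VARIABLES LIKE 'aurora_binlog_replication_max_yield_seconds'",
	).Scan(&name, &value)
	return value, err
}
```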