dontstopbelieveing opened this issue 8 months ago
Created this PR to update the documentation: https://github.com/github/gh-ost/pull/1388
After this error, gh-ost went back to the migrating stage, supposedly to do rollbacks, which caused an availability issue and we had to kill the process. What is it rolling back? Since no cut-over was done, should it not simply abandon things and release the locks?
@dontstopbelieveing could you elaborate on this point?
To my knowledge gh-ost doesn't have any concept of "rolling something back", so I wonder if what you're seeing is a MySQL/InnoDB behaviour.
I'll add what we see in our test here. We land at the postpone cut-over stage:
Copy: 100000000/100000000 100.0%; Applied: 751753; Backlog: 0/1000; Time: 1h27m30s(total), 1h8m1s(copy); streamer: mysql-bin-changelog.232697:134650910; Lag: 0.09s, HeartbeatLag: 9.49s, State: postponing cut-over; ETA: due
[2024/03/09 11:31:04] [info] binlogsyncer.go:723 rotate to (mysql-bin-changelog.232698, 4)
[2024/03/09 11:31:04] [info] binlogsyncer.go:723 rotate to (mysql-bin-changelog.232698, 4)
2024-03-09 11:31:04 INFO rotate to next log from mysql-bin-changelog.232698:134667787 to mysql-bin-changelog.232698
2024-03-09 11:31:04 INFO rotate to next log from mysql-bin-changelog.232698:0 to mysql-bin-changelog.232698
Copy: 100000000/100000000 100.0%; Applied: 752224; Backlog: 0/1000; Time: 1h28m0s(total), 1h8m1s(copy); streamer: mysql-bin-changelog.232698:134396827; Lag: 0.09s, HeartbeatLag: 18.59s, State: postponing cut-over; ETA: due
[2024/03/09 11:31:24] [info] binlogsyncer.go:723 rotate to (mysql-bin-changelog.232699, 4)
[2024/03/09 11:31:24] [info] binlogsyncer.go:723 rotate to (mysql-bin-changelog.232699, 4)
2024-03-09 11:31:24 INFO rotate to next log from mysql-bin-changelog.232699:134397338 to mysql-bin-changelog.232699
2024-03-09 11:31:24 INFO rotate to next log from mysql-bin-changelog.232699:0 to mysql-bin-changelog.232699
2024-03-09 11:31:24 INFO rotate to next log from mysql-bin-changelog.232699:0 to mysql-bin-changelog.232699
And then, once we delete the cut-over flag file:
2024-03-09 11:31:27 INFO Grabbing voluntary lock: gh-ost.2374.lock
2024-03-09 11:31:27 INFO Setting LOCK timeout as 6 seconds
2024-03-09 11:31:27 INFO Looking for magic cut-over table
2024-03-09 11:31:27 INFO Creating magic cut-over table `sbtest`.`_sbtest1_del`
2024-03-09 11:31:27 INFO Magic cut-over table created
2024-03-09 11:31:27 INFO Locking `sbtest`.`sbtest1`, `sbtest`.`_sbtest1_del`
2024-03-09 11:31:27 INFO Tables locked
2024-03-09 11:31:27 INFO Session locking original & magic tables is 2374
2024-03-09 11:31:27 INFO Writing changelog state: AllEventsUpToLockProcessed:1709983887171918805
2024-03-09 11:31:27 INFO Waiting for events up to lock
2024-03-09 11:31:30 ERROR Timeout while waiting for events up to lock
2024-03-09 11:31:30 ERROR 2024-03-09 11:31:30 ERROR Timeout while waiting for events up to lock
2024-03-09 11:31:30 INFO Looking for magic cut-over table
2024-03-09 11:31:30 INFO Will now proceed to drop magic table and unlock tables
2024-03-09 11:31:30 INFO Dropping magic cut-over table
2024-03-09 11:31:30 INFO Dropping magic cut-over table
2024-03-09 11:31:30 INFO Dropping table `sbtest`.`_sbtest1_del`
So far, so good. At this point I would expect the metadata locks to be released, but they don't get released, and the log has these entries:
Copy: 100000000/100000000 100.0%; Applied: 752633; Backlog: 0/1000; Time: 1h28m10s(total), 1h8m1s(copy); streamer: mysql-bin-changelog.232699:134374965; Lag: 0.09s, HeartbeatLag: 7.19s, State: migrating; ETA: due
Copy: 100000000/100000000 100.0%; Applied: 752633; Backlog: 0/1000; Time: 1h28m15s(total), 1h8m1s(copy); streamer: mysql-bin-changelog.232699:134374965; Lag: 0.09s, HeartbeatLag: 12.19s, State: migrating; ETA: due
Copy: 100000000/100000000 100.0%; Applied: 752633; Backlog: 0/1000; Time: 1h28m20s(total), 1h8m1s(copy); streamer: mysql-bin-changelog.232699:134374965; Lag: 0.09s, HeartbeatLag: 17.19s, State: migrating; ETA: due
Copy: 100000000/100000000 100.0%; Applied: 752633; Backlog: 0/1000; Time: 1h28m25s(total), 1h8m1s(copy); streamer: mysql-bin-changelog.232699:134374965; Lag: 0.09s, HeartbeatLag: 22.19s, State: migrating; ETA: due
2024-03-09 11:31:46 INFO rotate to next log from mysql-bin-changelog.232700:134391842 to mysql-bin-changelog.232700
2024-03-09 11:31:46 INFO rotate to next log from mysql-bin-changelog.232700:0 to mysql-bin-changelog.232700
[2024/03/09 11:31:46] [info] binlogsyncer.go:723 rotate to (mysql-bin-changelog.232700, 4)
[2024/03/09 11:31:46] [info] binlogsyncer.go:723 rotate to (mysql-bin-changelog.232700, 4)
2024-03-09 11:31:47 INFO Intercepted changelog state AllEventsUpToLockProcessed
2024-03-09 11:31:47 INFO Handled changelog state AllEventsUpToLockProcessed
Copy: 100000000/100000000 100.0%; Applied: 752689; Backlog: 0/1000; Time: 1h28m30s(total), 1h8m1s(copy); streamer: mysql-bin-changelog.232700:135022773; Lag: 0.09s, HeartbeatLag: 5.79s, State: migrating; ETA: due
Copy: 100000000/100000000 100.0%; Applied: 752689; Backlog: 0/1000; Time: 1h28m35s(total), 1h8m1s(copy); streamer: mysql-bin-changelog.232700:135022773; Lag: 0.09s, HeartbeatLag: 10.79s, State: migrating; ETA: due
Copy: 100000000/100000000 100.0%; Applied: 752689; Backlog: 0/1000; Time: 1h28m40s(total), 1h8m1s(copy); streamer: mysql-bin-changelog.232700:135022773; Lag: 0.09s, HeartbeatLag: 15.79s, State: migrating; ETA: due
Copy: 100000000/100000000 100.0%; Applied: 752689; Backlog: 0/1000; Time: 1h28m45s(total), 1h8m1s(copy); streamer: mysql-bin-changelog.232700:135022773; Lag: 0.09s, HeartbeatLag: 20.79s, State: migrating; ETA: due
This continued till we manually killed the gh-ost process.
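One way to check the claim that the metadata locks were never released is to query performance_schema directly while gh-ost is stuck in this state. The sketch below is not part of gh-ost; it assumes MySQL's metadata-lock instrumentation ('wait/lock/metadata/sql/mdl') is enabled, uses the go-sql-driver/mysql driver, and the connection details and sbtest table names are placeholders taken from the log above.

```go
// check_mdl.go - a minimal diagnostic sketch (not part of gh-ost) listing which
// sessions still hold metadata locks on the original and magic tables.
// Assumptions: performance_schema is on with the 'wait/lock/metadata/sql/mdl'
// instrument enabled; DSN, schema and table names are placeholders.
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	db, err := sql.Open("mysql", "user:pass@tcp(127.0.0.1:3306)/")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	rows, err := db.Query(`
		SELECT ml.OBJECT_SCHEMA, ml.OBJECT_NAME, ml.LOCK_TYPE, ml.LOCK_STATUS,
		       t.PROCESSLIST_ID
		FROM performance_schema.metadata_locks ml
		JOIN performance_schema.threads t ON t.THREAD_ID = ml.OWNER_THREAD_ID
		WHERE ml.OBJECT_SCHEMA = ? AND ml.OBJECT_NAME IN (?, ?)`,
		"sbtest", "sbtest1", "_sbtest1_del")
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var schema, name, lockType, lockStatus string
		var processlistID sql.NullInt64
		if err := rows.Scan(&schema, &name, &lockType, &lockStatus, &processlistID); err != nil {
			log.Fatal(err)
		}
		// Each row is a metadata lock still held (or pending) on one of the tables.
		owner := "background thread"
		if processlistID.Valid {
			owner = fmt.Sprintf("processlist id %d", processlistID.Int64)
		}
		fmt.Printf("%s.%s %s lock is %s, owner: %s\n", schema, name, lockType, lockStatus, owner)
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
}
```

GRANTED rows owned by the gh-ost session's processlist id would confirm that the locks really are still held after the failed cut-over.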
I am baffled by two issues -
The reason I said "rolling something back" is that the effect I see on MySQL is similar to what a long-running transaction doing a rollback would cause. It might not actually be a rollback; it might be gh-ost running something else.
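If it really were InnoDB undoing a large transaction, that should show up in information_schema.innodb_trx with trx_state = 'ROLLING BACK' (and in SHOW ENGINE INNODB STATUS). A small companion sketch to the metadata-lock query above, under the same driver and connection assumptions:

```go
// Companion to the metadata-lock sketch above (same imports and *sql.DB setup).
// Lists transactions InnoDB reports as rolling back; an empty result would
// point away from the "long rollback" theory.
func printRollingBackTrx(db *sql.DB) error {
	rows, err := db.Query(`
		SELECT trx_mysql_thread_id, trx_started, trx_rows_modified
		FROM information_schema.innodb_trx
		WHERE trx_state = 'ROLLING BACK'`)
	if err != nil {
		return err
	}
	defer rows.Close()

	for rows.Next() {
		var connectionID, rowsModified int64
		var started string
		if err := rows.Scan(&connectionID, &started, &rowsModified); err != nil {
			return err
		}
		fmt.Printf("connection %d rolling back since %s (%d rows modified)\n",
			connectionID, started, rowsModified)
	}
	return rows.Err()
}
```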
This comes out of an issue we kept running into that caused great pain and outages.
We would run with
Then, once row copy was complete, we would be in the migrating stage for a long time applying binlogs. At this point the heartbeat lag would be 10-30 seconds. We thought that if we increased max-lag-millis from 1500 to 10000 this would give us less throttling and speed up binlog reading and applying (silly us!). The heartbeat lag would drop below 10 seconds, we would remove the cut-over file, and then we would run into "ERROR Timeout while waiting for events up to lock", which made sense, since 10 seconds is greater than the cut-over lock timeout of 6 seconds.
Our ask is that we edit the documentation to point out this important side effect of a seemingly innocent parameter, as is evident here: https://github.com/github/gh-ost/blob/master/go/logic/migrator.go#L504
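To make the relationship concrete, here is an illustrative sketch of the gate as we understand it from the linked code and from our own runs: the heartbeat lag effectively has to be below both --max-lag-millis and the cut-over lock timeout before the cut-over can succeed. canCutOver is a hypothetical helper, not a gh-ost function, and the numbers are taken from the logs above.

```go
// Illustrative sketch only (not the gh-ost source): raising --max-lag-millis
// to 10000 relaxes the first bound, but the 6-second lock timeout seen in the
// log above still bounds how much heartbeat lag a cut-over can tolerate.
package main

import (
	"fmt"
	"time"
)

// canCutOver is a hypothetical helper, not a gh-ost function.
func canCutOver(heartbeatLag, maxLagMillis, cutOverLockTimeout time.Duration) bool {
	return heartbeatLag < maxLagMillis && heartbeatLag < cutOverLockTimeout
}

func main() {
	heartbeatLag := 9490 * time.Millisecond // "HeartbeatLag: 9.49s" from the status line
	maxLagMillis := 10 * time.Second        // --max-lag-millis raised from 1500 to 10000
	lockTimeout := 6 * time.Second          // "Setting LOCK timeout as 6 seconds" from the log

	fmt.Println(canCutOver(heartbeatLag, maxLagMillis, lockTimeout)) // false: 9.49s exceeds the 6s lock timeout
}
```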
I also have two questions:
For context, we are on AWS Aurora, and the high heartbeat lag is a side effect of aurora_binlog_replication_max_yield_seconds being set to a non-zero value.
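For anyone reproducing this, the current value should be readable from the instance itself (the parameter is typically managed through the Aurora cluster parameter group). A minimal companion sketch under the same driver and *sql.DB assumptions as the earlier queries:

```go
// Companion sketch (same driver and *sql.DB assumptions as above): read back
// the Aurora-specific setting we believe is driving the heartbeat lag.
func auroraBinlogMaxYieldSeconds(db *sql.DB) (string, error) {
	var name, value string
	err := db.QueryRow(
		"SHOW GLOBAL VARIABLES LIKE 'aurora_binlog_replication_max_yield_seconds'",
	).Scan(&name, &value)
	return value, err
}
```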