github / gh-ost

GitHub's Online Schema-migration Tool for MySQL

ERROR Timeout while waiting for events up to lock #799

Open knepe opened 4 years ago

knepe commented 4 years ago

We have a pretty large and, at times, busy table where we try to add column xxx int null. Everything seems fine until the cut-over phase, where it hits an error and then retries for some time but never succeeds. I have checked other issues here that report this error but haven't found anything that helps. The binlog is not stuck; it keeps applying and rotating. This is on RDS Aurora with a normal setup. We haven't changed any parameters, but I'm starting to think some DB parameter might need to be changed, though I'm not sure which one. We are running latest master and have tried a few different command flags, for example --cut-over-exponential-backoff and increasing --cut-over-lock-timeout-seconds and --default-retries. This is the full command right now:

gh-ost \
  --user=$user \
  --password=$password \
  --host=$host \
  --database=$database \
  --table=$table \
  --alter="$alter" \
  --max-load=Threads_running=25 \
  --critical-load=Threads_running=1000 \
  --chunk-size=1000 \
  --max-lag-millis=1500 \
  --verbose \
  --alter="engine=innodb" \
  --allow-master-master \
  --exact-rowcount \
  --concurrent-rowcount \
  --default-retries=120 \
  --allow-on-master \
  --initially-drop-ghost-table \
  --initially-drop-old-table \
  --cut-over-lock-timeout-seconds=10 \
  --assume-rbr \
  --cut-over-exponential-backoff \
  --execute

Copy: 79876601/79876601 100.0%; Applied: 16344; Backlog: 0/1000; Time: 3h42m10s(total), 2h20m24s(copy); streamer: mysql-bin-changelog.013410:1363370; Lag: 0.01s, State: migrating; ETA: due
2019-11-06 14:04:27 ERROR Timeout while waiting for events up to lock
2019-11-06 14:04:27 ERROR 2019-11-06 14:04:27 ERROR Timeout while waiting for events up to lock
2019-11-06 14:04:27 INFO Looking for magic cut-over table
2019-11-06 14:04:27 INFO Will now proceed to drop magic table and unlock tables
2019-11-06 14:04:27 INFO Dropping magic cut-over table
2019-11-06 14:04:27 INFO Dropping magic cut-over table
2019-11-06 14:04:27 INFO Dropping table `xxx`.`_xxx_del`
2019-11-06 14:04:27 INFO Table dropped
2019-11-06 14:04:27 INFO Releasing lock from `xxx`.`xxx`, `xxx`.`_xxx_del`
2019-11-06 14:04:27 INFO Tables unlocked
2019-11-06 14:04:27 INFO Grabbing voluntary lock: gh-ost.428716346.lock
2019-11-06 14:04:27 INFO Setting LOCK timeout as 20 seconds
2019-11-06 14:04:27 INFO Looking for magic cut-over table
2019-11-06 14:04:27 INFO Creating magic cut-over table `xxx`.`_xxx_del`
2019-11-06 14:04:27 INFO Magic cut-over table created
2019-11-06 14:04:27 INFO Locking `xxx`.`xxx`, `xxx`.`_xxx_del`
2019-11-06 14:04:27 INFO Tables locked
2019-11-06 14:04:27 INFO Session locking original & magic tables is 428716346
2019-11-06 14:04:27 INFO Writing changelog state: AllEventsUpToLockProcessed:1573049067764852856
2019-11-06 14:04:27 INFO Waiting for events up to lock
Copy: 79876601/79876601 100.0%; Applied: 16344; Backlog: 0/1000; Time: 3h42m15s(total), 2h20m24s(copy); streamer: mysql-bin-changelog.013410:16394106; Lag: 0.01s, State: migrating; ETA: due
Copy: 79876601/79876601 100.0%; Applied: 16344; Backlog: 0/1000; Time: 3h42m20s(total), 2h20m24s(copy); streamer: mysql-bin-changelog.013410:32408062; Lag: 0.01s, State: migrating; ETA: due
2019-11-06 14:04:37 ERROR Timeout while waiting for events up to lock
2019-11-06 14:04:37 ERROR 2019-11-06 14:04:37 ERROR Timeout while waiting for events up to lock
2019-11-06 14:04:37 INFO Looking for magic cut-over table
2019-11-06 14:04:37 INFO Will now proceed to drop magic table and unlock tables
2019-11-06 14:04:37 INFO Dropping magic cut-over table
2019-11-06 14:04:37 INFO Dropping magic cut-over table
2019-11-06 14:04:37 INFO Dropping table `xxx`.`_xxx_del`
2019-11-06 14:04:37 INFO Table dropped
2019-11-06 14:04:37 INFO Removing socket file: /tmp/gh-ost.xxx.xxx.sock
2019-11-06 14:04:37 INFO Releasing lock from `xxx`.`xxx`, `xxx`.`_xxx_del`
2019-11-06 14:04:37 FATAL 2019-11-06 14:04:37 ERROR Timeout while waiting for events up to lock
knepe commented 4 years ago

This is what the _ghc table looks like, if that helps:

[screenshot of the `_ghc` table contents]

So if I'm reading the code correctly, it's waiting for that AllEventsUpToLockProcessed entry with id 2 to appear in the binlogs, but for some reason it never seems to end up there?
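For anyone else debugging this, here is a rough way to inspect that changelog table by hand. This is a sketch only: the schema/table names are placeholders matching the redacted `xxx` above, and the column layout is assumed from the screenshot.

```bash
# Sketch: peek at gh-ost's changelog (_ghc) table on the master.
# The cut-over waits for the latest 'state' row (AllEventsUpToLockProcessed)
# to come back to gh-ost through the binlog stream.
mysql -h "$host" -u "$user" -p"$password" -e "
  SELECT id, hint, value, last_update
  FROM \`xxx\`.\`_xxx_ghc\`
  ORDER BY id DESC
  LIMIT 5;"
```

If the latest state row is present on the master but gh-ost never logs "Intercepted changelog state AllEventsUpToLockProcessed", the event is being written but not reaching gh-ost's binlog reader in time.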

cool-firer commented 4 years ago

any solution?

knepe commented 4 years ago

Nope, tried a LOT of different things, but always the same error :(

cool-firer commented 4 years ago

me too :(

shlomi-noach commented 4 years ago

Very sorry for the late response. Running with --debug --stack will provide additional, useful information, if that is possible.

knepe commented 4 years ago

Okay, running this now. Will get back in ~2 hrs when it's at the cut-over phase.

knepe commented 4 years ago

This is the log when it tries to do the cut-over; this repeats some number of times (I think I set the retry count to 120). Does this help, or do you need logs from an earlier stage?


2019-12-09 12:19:51 DEBUG Getting nothing in the write queue. Sleeping...
2019-12-09 12:19:52 DEBUG Getting nothing in the write queue. Sleeping...
2019-12-09 12:19:53 DEBUG Getting nothing in the write queue. Sleeping...
2019-12-09 12:19:54 DEBUG Getting nothing in the write queue. Sleeping...
2019-12-09 12:19:55 DEBUG Getting nothing in the write queue. Sleeping...
2019-12-09 12:19:56 DEBUG Getting nothing in the write queue. Sleeping...
2019-12-09 12:19:57 DEBUG Getting nothing in the write queue. Sleeping...
2019-12-09 12:19:58 DEBUG Getting nothing in the write queue. Sleeping...
2019-12-09 12:19:59 DEBUG Getting nothing in the write queue. Sleeping...
2019-12-09 12:20:00 DEBUG Getting nothing in the write queue. Sleeping...
2019-12-09 12:20:01 DEBUG Getting nothing in the write queue. Sleeping...
2019-12-09 12:20:02 DEBUG Getting nothing in the write queue. Sleeping...
2019-12-09 12:20:03 DEBUG Getting nothing in the write queue. Sleeping...
2019-12-09 12:20:04 DEBUG Getting nothing in the write queue. Sleeping...
2019-12-09 12:20:05 DEBUG checking for cut-over postpone
2019-12-09 12:20:05 DEBUG checking for cut-over postpone: complete
2019-12-09 12:20:05 INFO Grabbing voluntary lock: gh-ost.451914278.lock
2019-12-09 12:20:05 INFO Setting LOCK timeout as 20 seconds
2019-12-09 12:20:05 INFO Looking for magic cut-over table
2019-12-09 12:20:05 INFO Creating magic cut-over table `xxx`.`_xxx_del`
2019-12-09 12:20:05 INFO Magic cut-over table created
2019-12-09 12:20:05 INFO Locking `xxx`.`xxx`, `xxx`.`_xxx_del`
2019-12-09 12:20:05 INFO Tables locked
2019-12-09 12:20:05 INFO Session locking original & magic tables is 451914278
2019-12-09 12:20:05 INFO Writing changelog state: AllEventsUpToLockProcessed:1575894005304855457
2019-12-09 12:20:05 INFO Waiting for events up to lock
2019-12-09 12:20:05 DEBUG Getting nothing in the write queue. Sleeping...
Copy: 80807022/80807022 100.0%; Applied: 3; Backlog: 0/1000; Time: 3h15m50s(total), 2h13m16s(copy); streamer: mysql-bin-changelog.015006:71801235; Lag: 0.01s, State: migrating; ETA: due
2019-12-09 12:20:06 DEBUG Getting nothing in the write queue. Sleeping...
2019-12-09 12:20:07 DEBUG Getting nothing in the write queue. Sleeping...
2019-12-09 12:20:08 DEBUG Getting nothing in the write queue. Sleeping...
2019-12-09 12:20:09 DEBUG Getting nothing in the write queue. Sleeping...
2019-12-09 12:20:10 DEBUG Getting nothing in the write queue. Sleeping...
Copy: 80807022/80807022 100.0%; Applied: 3; Backlog: 0/1000; Time: 3h15m55s(total), 2h13m16s(copy); streamer: mysql-bin-changelog.015006:81706458; Lag: 0.11s, State: migrating; ETA: due
2019-12-09 12:20:11 DEBUG Getting nothing in the write queue. Sleeping...
2019-12-09 12:20:12 DEBUG Getting nothing in the write queue. Sleeping...
2019-12-09 12:20:13 DEBUG Getting nothing in the write queue. Sleeping...
2019-12-09 12:20:14 DEBUG Getting nothing in the write queue. Sleeping...
2019-12-09 12:20:15 ERROR Timeout while waiting for events up to lock
2019-12-09 12:20:15 ERROR 2019-12-09 12:20:15 ERROR Timeout while waiting for events up to lock
goroutine 1 [running]:
runtime/debug.Stack(0x5f, 0x100, 0xc0000ce0a0)
        /usr/local/go/src/runtime/debug/stack.go:24 +0x9d
runtime/debug.PrintStack()
        /usr/local/go/src/runtime/debug/stack.go:16 +0x22
github.com/github/gh-ost/vendor/github.com/outbrain/golib/log.logErrorEntry(0x2, 0x8a5f00, 0xc00007d7e0, 0xc0000c79d8, 0x1)
        /home/ec2-user/gh-ost/.gopath/src/github.com/github/gh-ost/vendor/github.com/outbrain/golib/log/log.go:178 +0xe7
github.com/github/gh-ost/vendor/github.com/outbrain/golib/log.Errore(...)
        /home/ec2-user/gh-ost/.gopath/src/github.com/github/gh-ost/vendor/github.com/outbrain/golib/log/log.go:224
github.com/github/gh-ost/go/logic.(*Migrator).atomicCutOver(0xc000124000, 0x0, 0x0)
        /home/ec2-user/gh-ost/.gopath/src/github.com/github/gh-ost/go/logic/migrator.go:634 +0x9d3
github.com/github/gh-ost/go/logic.(*Migrator).cutOver(0xc000124000, 0xd4bee0, 0xd4bee0)
        /home/ec2-user/gh-ost/.gopath/src/github.com/github/gh-ost/go/logic/migrator.go:524 +0x26b
github.com/github/gh-ost/go/logic.(*Migrator).retryOperationWithExponentialBackoff(0xc000124000, 0xc0002173e0, 0x0, 0x0, 0x0, 0x77ff01, 0xc0002173e0)
        /home/ec2-user/gh-ost/.gopath/src/github.com/github/gh-ost/go/logic/migrator.go:169 +0xde
github.com/github/gh-ost/go/logic.(*Migrator).Migrate(0xc000124000, 0x0, 0x0)
        /home/ec2-user/gh-ost/.gopath/src/github.com/github/gh-ost/go/logic/migrator.go:413 +0x8f8
main.main()
        /home/ec2-user/gh-ost/.gopath/src/github.com/github/gh-ost/go/cmd/gh-ost/main.go:278 +0x2725
2019-12-09 12:20:15 INFO Looking for magic cut-over table
2019-12-09 12:20:15 INFO Will now proceed to drop magic table and unlock tables
2019-12-09 12:20:15 INFO Dropping magic cut-over table
2019-12-09 12:20:15 INFO Releasing lock from `xxx`.`xxx`, `xxx`.`_xxx_del`
2019-12-09 12:20:15 INFO Tables unlocked
2019-12-09 12:20:15 DEBUG Getting nothing in the write queue. Sleeping...
Copy: 80807022/80807022 100.0%; Applied: 3; Backlog: 0/1000; Time: 3h16m0s(total), 2h13m16s(copy); streamer: mysql-bin-changelog.015006:92163787; Lag: 0.01s, State: migrating; ETA: due
shlomi-noach commented 4 years ago

Thank you for the logs, they seem to be enough. This is weird:

2019-12-09 12:20:05 INFO Writing changelog state: AllEventsUpToLockProcessed:1575894005304855457
2019-12-09 12:20:05 INFO Waiting for events up to lock
2019-12-09 12:20:05 DEBUG Getting nothing in the write queue. Sleeping...
Copy: 80807022/80807022 100.0%; Applied: 3; Backlog: 0/1000; Time: 3h15m50s(total), 2h13m16s(copy); streamer: mysql-bin-changelog.015006:71801235; Lag: 0.01s, State: migrating; ETA: due
2019-12-09 12:20:06 DEBUG Getting nothing in the write queue. Sleeping...
2019-12-09 12:20:07 DEBUG Getting nothing in the write queue. Sleeping...
2019-12-09 12:20:08 DEBUG Getting nothing in the write queue. Sleeping...
2019-12-09 12:20:09 DEBUG Getting nothing in the write queue. Sleeping...
2019-12-09 12:20:10 DEBUG Getting nothing in the write queue. Sleeping...
Copy: 80807022/80807022 100.0%; Applied: 3; Backlog: 0/1000; Time: 3h15m55s(total), 2h13m16s(copy); streamer: mysql-bin-changelog.015006:81706458; Lag: 0.11s, State: migrating; ETA: due
2019-12-09 12:20:11 DEBUG Getting nothing in the write queue. Sleeping...
2019-12-09 12:20:12 DEBUG Getting nothing in the write queue. Sleeping...
2019-12-09 12:20:13 DEBUG Getting nothing in the write queue. Sleeping...
2019-12-09 12:20:14 DEBUG Getting nothing in the write queue. Sleeping...

What this means is that gh-ost wrote an entry to the *_ghc table on the master, but never received that change back through the binary log.

Are there any replication filters on this cluster?
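For completeness, filters would show up in the standard MySQL status output; a quick sketch, using the same placeholder credentials as above:

```bash
# Sketch: look for binlog / replication filters on the host gh-ost connects to.
mysql -h "$host" -u "$user" -p"$password" -e "SHOW MASTER STATUS\G"  # Binlog_Do_DB / Binlog_Ignore_DB
mysql -h "$host" -u "$user" -p"$password" -e "SHOW SLAVE STATUS\G"   # Replicate_Do_DB / Replicate_Ignore_Table / ...
```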

knepe commented 4 years ago

Hmm, I googled a bit on binlog filtering on AWS, and it seems it's enabled by default.

The binlog filtering feature automatically reduces network bandwidth for replication messages. Because the Aurora Replicas don't use the binlog information that is included in the replication messages, that data is omitted from the messages sent to those nodes. You control this feature by changing the aurora_enable_repl_bin_log_filtering parameter. This parameter is on by default. Because this optimization is intended to be transparent, you might turn off this setting only during diagnosis or troubleshooting for issues related to replication. For example, you can do so to match the behavior of an older Aurora MySQL cluster where this feature was not available.

https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/AuroraMySQL.Replication.html#AuroraMySQL.Replication.Performance

Will try tomorrow with this off and see what the result is.
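For reference, a hedged sketch of how that setting can be checked and turned off. The cluster parameter group name below is a placeholder, and aurora_enable_repl_bin_log_filtering is managed through the DB cluster parameter group rather than a session-level setting:

```bash
# Sketch only: check the current value from the MySQL side...
mysql -h "$host" -u "$user" -p"$password" \
  -e "SHOW GLOBAL VARIABLES LIKE 'aurora_enable_repl_bin_log_filtering';"

# ...and disable it on the DB cluster parameter group (group name is a placeholder).
aws rds modify-db-cluster-parameter-group \
  --db-cluster-parameter-group-name my-aurora-cluster-params \
  --parameters "ParameterName=aurora_enable_repl_bin_log_filtering,ParameterValue=0,ApplyMethod=immediate"
```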

knepe commented 4 years ago

Unfortunately, after setting aurora_enable_repl_bin_log_filtering: 0 I still get the exact same logs and error :( Any more ideas? :)

knepe commented 4 years ago

No wait, it actually worked :) I spoke too soon. After something like 100 retries I got this:


2019-12-10 10:32:23 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:23 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:23 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:23 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:23 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:23 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:23 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:23 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:23 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:23 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:23 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:23 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:23 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:23 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:23 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:24 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:24 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:24 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:24 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:24 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:24 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:24 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:24 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:24 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:24 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:24 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:24 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:24 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:24 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:24 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:24 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:24 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:24 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:24 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:24 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:24 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:24 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:24 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:24 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:24 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:24 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:24 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:24 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:24 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:24 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:25 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:25 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:25 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:25 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:25 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:25 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:25 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:25 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:25 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:25 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:25 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:25 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:25 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:25 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:25 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:25 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:25 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:25 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:25 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:25 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:25 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:25 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:25 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:25 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:25 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:25 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:25 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:25 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:26 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:26 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:26 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:26 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:26 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:26 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:26 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:26 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:26 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:26 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:26 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:26 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:26 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:26 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:26 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:26 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:26 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:26 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:26 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:26 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:26 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:26 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:26 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:26 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:26 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:26 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:27 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:27 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:27 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:27 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:27 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:27 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:27 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:27 INFO Handled changelog state AllEventsUpToLockProcessed

And it did the cut-over
shlomi-noach commented 4 years ago

Are you running directly on the master? (--host flag). If not, can you do that?

shlomi-noach commented 4 years ago

Ha ok!

knepe commented 4 years ago

Haha :) Finally, nice catch about that binlog filtering thing @shlomi-noach :) Should maybe be added to the RDS docs?

shlomi-noach commented 4 years ago

Should maybe be added to the RDS docs?

@knepe sure, it should. Now, I am not using RDS/Aurora myself and have little understanding of the configuration involved. Would you be willing to create a pull request?

knepe commented 4 years ago

Yes, sure I'll make a PR :) Closing this and thanks for your help!

knepe commented 4 years ago

Re-opening this, because I've unfortunately had only random success with this, so I'm not sure the parameter actually helps. I've run the same migration on 3 clusters now with that parameter set to 0, and it worked on 2 of the 3. I have checked the parameters and they are the same for all clusters, so I'm not sure whether I just got lucky on the 2 clusters where it worked. I also ran a much smaller migration on one of the clusters where the original migration worked, and it is failing now too. The error and logs are the same as before.

What I've found is that when it DOES work, it goes into a state where it spams this:

2019-12-17 10:27:47 ERROR Timeout while waiting for events up to lock
2019-12-17 10:27:47 ERROR 2019-12-17 10:27:47 ERROR Timeout while waiting for events up to lock
2019-12-17 10:27:47 INFO Looking for magic cut-over table
2019-12-17 10:27:47 INFO Will now proceed to drop magic table and unlock tables
2019-12-17 10:27:47 INFO Dropping magic cut-over table
2019-12-17 10:27:47 INFO Dropping magic cut-over table
2019-12-17 10:27:47 INFO Dropping table `xxx`.`xxx_del`
2019-12-17 10:27:48 INFO Releasing lock from `xxx`.`xxx`, `xxx`.`xxx_del`
2019-12-17 10:27:48 INFO Table dropped
2019-12-17 10:27:48 INFO Tables unlocked
2019-12-17 10:27:48 INFO Grabbing voluntary lock: gh-ost.1339250.lock
2019-12-17 10:27:48 INFO Setting LOCK timeout as 6 seconds
2019-12-17 10:27:48 INFO Looking for magic cut-over table
2019-12-17 10:27:48 INFO Creating magic cut-over table `xxx`.`xxx_del`
2019-12-17 10:27:48 INFO Magic cut-over table created
2019-12-17 10:27:48 INFO Locking `xxx`.`xxx`, `xxx`.`xxx_del`
2019-12-17 10:27:48 INFO Tables locked
2019-12-17 10:27:48 INFO Session locking original & magic tables is 1339250
2019-12-17 10:27:48 INFO Writing changelog state: AllEventsUpToLockProcessed:1576578468842536707
2019-12-17 10:27:48 INFO Waiting for events up to lock
2019-12-17 10:27:51 ERROR Timeout while waiting for events up to lock
2019-12-17 10:27:51 ERROR 2019-12-17 10:27:51 ERROR Timeout while waiting for events up to lock
2019-12-17 10:27:51 INFO Looking for magic cut-over table
2019-12-17 10:27:51 INFO Will now proceed to drop magic table and unlock tables
2019-12-17 10:27:51 INFO Dropping magic cut-over table
2019-12-17 10:27:51 INFO Dropping magic cut-over table
2019-12-17 10:27:51 INFO Dropping table `xxx`.`xxx_del`
2019-12-17 10:27:52 INFO Table dropped
2019-12-17 10:27:52 INFO Releasing lock from `xxx`.`xxx`, `xxx`.`xxx_del`
2019-12-17 10:27:52 INFO Tables unlocked
2019-12-17 10:27:52 INFO Grabbing voluntary lock: gh-ost.1339247.lock
2019-12-17 10:27:52 INFO Setting LOCK timeout as 6 seconds
2019-12-17 10:27:52 INFO Looking for magic cut-over table
2019-12-17 10:27:52 INFO Creating magic cut-over table `xxx`.`xxx_del`
2019-12-17 10:27:52 INFO Magic cut-over table created
2019-12-17 10:27:52 INFO Locking `xxx`.`xxx`, `xxx`.`xxx_del`
2019-12-17 10:27:52 INFO Tables locked
2019-12-17 10:27:52 INFO Session locking original & magic tables is 1339247
2019-12-17 10:27:52 INFO Writing changelog state: AllEventsUpToLockProcessed:1576578472423229562
2019-12-17 10:27:52 INFO Waiting for events up to lock

If you check the timestamps, it becomes very aggressive, and after it has spammed this for a while, it eventually starts spamming these logs:

2019-12-17 10:28:37 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-17 10:28:37 INFO Handled changelog state AllEventsUpToLockProcessed

and it will complete the migration after this.

knepe commented 4 years ago

FYI: I managed to get the failing migration through by increasing retries to something like 5000; it eventually worked after 1000+ retries. It does not seem very stable, though.
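For anyone wanting to reproduce that workaround, these are the flags involved (a sketch with the same placeholder variables as above; the values mirror what was reported in this thread and are not recommendations):

```bash
# Sketch: the cut-over retry knobs mentioned in this thread.
# --default-retries raises the number of cut-over attempts,
# --cut-over-exponential-backoff spaces the attempts out, and
# --cut-over-lock-timeout-seconds bounds how long each attempt waits.
gh-ost \
  --user=$user --password=$password --host=$host \
  --database=$database --table=$table --alter="$alter" \
  --allow-on-master --assume-rbr \
  --default-retries=5000 \
  --cut-over-exponential-backoff \
  --cut-over-lock-timeout-seconds=10 \
  --execute
```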

romberli commented 4 years ago

Any news? I've got the same problem. It seems to happen on large tables; in my case the table is 58 GB. After the failed run I dropped the _gho table manually, which took nearly 20 seconds. Is that related to the cut-over phase?

wileyveteran commented 4 years ago

I'm also seeing this on AWS Aurora (binlog filtering disabled, running on the master). It works sometimes and not others. I don't think it is a timeout issue related to cut-over, because I have the timeout set at 10 seconds and when I do the cut-over manually it only takes ~5 seconds. This was a 1-hour process; I don't really want to try it on a much longer one if I am going to get to the end and have it fail.

Update: Looks like my issue was related to the exact-rowcount option issue mentioned here https://github.com/github/gh-ost/issues/830

zhenxingCN commented 3 years ago

Any news? In my situation the table is 1.5 TB, and the cut-over phase fails in the same way.

/opt/gh-ost \
  --conf='/opt/ghost.conf' \
  --database=test \
  --table=oh_order_sales \
  --alter='CHANGE COLUMN ORDER_ID ORDER_ID VARCHAR(150)' \
  --max-load=Threads_running=100 \
  --critical-load=Threads_running=1000 \
  --critical-load-interval-millis=100 \
  --chunk-size=100000 \
  --host=172.20.134.3 \
  --port=3306 \
  --verbose \
  --cut-over=default \
  --default-retries=10 \
  --allow-on-master \
  --initially-drop-ghost-table \
  --dml-batch-size=100 \
  --assume-rbr \
  --assume-master-host=172.20.134.3:3306 \
  --timestamp-old-table \
  --panic-flag-file=/tmp/sbtest1.panic \
  --throttle-flag-file=/tmp/sbtest1.throttle \
  --postpone-cut-over-flag-file=/tmp/sbtest1.postpone \
  --serve-socket-file=/tmp/sbtest1.gh-ost.socket \
  --stack \
  --debug \
  --execute 2>&1 | tee gh-ost_execute.log

gh-ost_execute.log

ccoffey commented 3 years ago

My company ran into the exact same problem while using gh-ost to run migrations against our Aurora clusters.

After a lot of deep diving, we finally put together this PR which makes the problem visible and protects against cutOver failures.

Now when we run migrations against Aurora, we can see that the value of HeartbeatLag grows larger and larger during a long-running migration on a busy table:

Copy: 23000/2172914 1.1%; Applied: 0; Backlog: 0/1000; Time: 1m10s(total), 3s(copy); streamer: mysql-bin-changelog.046700:117936636; Lag: 0.12s, HeartbeatLag: 0.12s, State: migrating; ETA: 4m41s 
Copy: 75000/2172914 3.5%; Applied: 0; Backlog: 0/1000; Time: 1m15s(total), 8s(copy); streamer: mysql-bin-changelog.046700:128606180; Lag: 0.12s, HeartbeatLag: 0.12s, State: migrating; ETA: 3m44s 
Copy: 128500/2172914 5.9%; Applied: 0; Backlog: 0/1000; Time: 1m20s(total), 13s(copy); streamer: mysql-bin-changelog.046700:134279989; Lag: 0.02s, HeartbeatLag: 2.52s, State: migrating; ETA: 3m26s 
Copy: 179000/2172914 8.2%; Applied: 0; Backlog: 0/1000; Time: 1m25s(total), 18s(copy); streamer: mysql-bin-changelog.046700:134279989; Lag: 0.02s, HeartbeatLag: 7.52s, State: migrating; ETA: 3m20s 
Copy: 228000/2172914 10.5%; Applied: 0; Backlog: 0/1000; Time: 1m30s(total), 23s(copy); streamer: mysql-bin-changelog.046700:134279989; Lag: 0.12s, HeartbeatLag: 12.52s, State: migrating; ETA: 3m16s 
Copy: 279500/2172914 12.9%; Applied: 0; Backlog: 0/1000; Time: 1m35s(total), 28s(copy); streamer: mysql-bin-changelog.046700:134279989; Lag: 0.02s, HeartbeatLag: 17.52s, State: migrating; ETA: 3m9s 
Copy: 330000/2172914 15.2%; Applied: 0; Backlog: 0/1000; Time: 1m40s(total), 33s(copy); streamer: mysql-bin-changelog.046700:134279989; Lag: 0.02s, HeartbeatLag: 22.52s, State: migrating; ETA: 3m4s 
Copy: 378000/2172914 17.4%; Applied: 0; Backlog: 0/1000; Time: 1m45s(total), 38s(copy); streamer: mysql-bin-changelog.046700:134279989; Lag: 0.02s, HeartbeatLag: 27.52s, State: migrating; ETA: 3m0s 
Copy: 427000/2172914 19.7%; Applied: 0; Backlog: 0/1000; Time: 1m50s(total), 43s(copy); streamer: mysql-bin-changelog.046700:134279989; Lag: 0.02s, HeartbeatLag: 32.52s, State: migrating; ETA: 2m55s 
Copy: 477500/2172914 22.0%; Applied: 0; Backlog: 0/1000; Time: 1m55s(total), 48s(copy); streamer: mysql-bin-changelog.046700:134279989; Lag: 0.02s, HeartbeatLag: 37.52s, State: migrating; ETA: 2m50s 
Copy: 530000/2172914 24.4%; Applied: 0; Backlog: 0/1000; Time: 2m0s(total), 53s(copy); streamer: mysql-bin-changelog.046700:134279989; Lag: 0.02s, HeartbeatLag: 42.52s, State: migrating; ETA: 2m44s 

By the time we get to the cut-over stage, HeartbeatLag can be minutes. This means there is absolutely no chance that cut-over could succeed; instead, it will time out after 3 seconds with ERROR Timeout while waiting for events up to lock, and then retry up to 60 times.

In the PR that we have submitted, we simply wait, before cutOver, for the HeartbeatLag to drop to a safe level. This looks as follows:

Row copy complete
Waiting for heartbeat lag to be low enough to proceed
current HeartbeatLag (44.16s) is too high, it needs to be less than --max-lag-millis (1.50s) to continue

Eventually, HeartbeatLag drops enough to proceed, and cutOver succeeds on the first try.

Heartbeat lag is low enough, proceeding

Update: Because of the visibility into HeartbeatLag that this PR enabled, we were able to experiment with a number of Aurora parameters and eventually figure out why HeartbeatLag was growing so quickly.

On our clusters, the parameter aurora_binlog_replication_max_yield_seconds defaulted to 60 seconds. By changing this value to 0 (on a test cluster), we were able to completely remove the HeartbeatLag that we were seeing.

Instead, we started to see the following while running migrations:

Copy: 2342000/2256823 103.8%; Applied: 0; Backlog: 0/1000; Time: 4m0s(total), 3m51s(copy); streamer: mysql-bin-changelog.046761:69254547; Lag: 0.11s, HeartbeatLag: 0.11s, State: migrating; ETA: due
2021-02-03 13:57:36 INFO Row copy complete
Copy: 2401160/2401160 100.0%; Applied: 0; Backlog: 0/1000; Time: 4m5s(total), 3m56s(copy); streamer: mysql-bin-changelog.046761:81901819; Lag: 0.01s, HeartbeatLag: 0.07s, State: migrating; ETA: due 
2021-02-03 13:57:36 INFO Waiting for heartbeat lag to be low enough to proceed
2021-02-03 13:57:36 INFO Heartbeat lag is low enough, proceeding

This parameter is described in Aurora's Optimizing binary log replication doc.

Warning: We have not changed this parameter in our production environment as it may not be safe for us to do so. We are actually leaning towards keeping this value at its default of 60 seconds and instead relying on the safety added to gh-ost by this PR.
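For anyone who does want to experiment on a test cluster, a hedged sketch of checking and changing that parameter; the cluster parameter group name is a placeholder, and as the warning above says, whether 0 is appropriate depends entirely on your workload:

```bash
# Sketch only, for a test cluster: inspect the current value...
mysql -h "$host" -u "$user" -p"$password" \
  -e "SHOW GLOBAL VARIABLES LIKE 'aurora_binlog_replication_max_yield_seconds';"

# ...and change it through the DB cluster parameter group (group name is a placeholder).
aws rds modify-db-cluster-parameter-group \
  --db-cluster-parameter-group-name my-test-cluster-params \
  --parameters "ParameterName=aurora_binlog_replication_max_yield_seconds,ParameterValue=0,ApplyMethod=immediate"
```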

timvaillancourt commented 3 years ago

@knepe https://github.com/github/gh-ost/pull/921 is now merged to master. Can you confirm this fixes the issue?

bbuchalter commented 3 years ago

Should https://github.com/github/gh-ost/pull/805 be reverted?

shlomi-noach commented 3 years ago

Should #805 be reverted?

@bbuchalter what is the context for reverting #805? It's an Aurora documentation PR?

bbuchalter commented 3 years ago

@shlomi-noach it's based on this comment:

Re-opening this, because I've had random success with this unfortunately, so I'm not sure this parameter actually helps.

FWIW, Gusto does not use this advice and has not encountered this issue. It seems this may not be a hard requirement?

shlomi-noach commented 3 years ago

All, I have no insight into this as I'm not using RDS myself.

bbuchalter commented 3 years ago

As a data point, Gusto uses RDS and we do not have aurora_enable_repl_bin_log_filtering set to zero while using gh-ost.