github / gh-ost

GitHub's Online Schema-migration Tool for MySQL

Heartbeat Lag with Aurora blocking cutover. #1081

Open myusuf3 opened 2 years ago

myusuf3 commented 2 years ago

Hello everyone!

I am currently running a long-running migration on Aurora MySQL 5.7, against a copy of my production database. With Aurora-style replication (meaning not actual binlog replication) I have successfully run it on smaller tables, but when I run a migration on the scale of 20-24 hours I keep running into a HeartbeatLag that only ever increases and never recovers once it starts to drift.

Here is a sample output from a recent run.

[2022/01/23 20:42:52] [info] binlogsyncer.go:791 rotate to (mysql-bin-changelog.011973, 4)
[2022/01/23 20:42:52] [info] binlogsyncer.go:791 rotate to (mysql-bin-changelog.011973, 4)
2022-01-23 20:42:52 INFO rotate to next log from mysql-bin-changelog.011973:137795900 to mysql-bin-changelog.011973
2022-01-23 20:42:52 INFO rotate to next log from mysql-bin-changelog.011973:0 to mysql-bin-changelog.011973
Copy: 111620000/904947480 12.3%; Applied: 0; Backlog: 0/1000; Time: 3h11m30s(total), 3h11m30s(copy); streamer: mysql-bin-changelog.011973:41926866; Lag: 0.02s, HeartbeatLag: 871.02s, State: migrating; ETA: 22h41m3s
[2022/01/23 20:43:27] [info] binlogsyncer.go:791 rotate to (mysql-bin-changelog.011974, 4)
[2022/01/23 20:43:27] [info] binlogsyncer.go:791 rotate to (mysql-bin-changelog.011974, 4)
2022-01-23 20:43:27 INFO rotate to next log from mysql-bin-changelog.011974:138079395 to mysql-bin-changelog.011974
2022-01-23 20:43:27 INFO rotate to next log from mysql-bin-changelog.011974:0 to mysql-bin-changelog.011974
Copy: 111880000/904947480 12.4%; Applied: 0; Backlog: 0/1000; Time: 3h12m0s(total), 3h12m0s(copy); streamer: mysql-bin-changelog.011974:23055116; Lag: 0.02s, HeartbeatLag: 892.02s, State: migrating; ETA: 22h41m0s
[2022/01/23 20:44:02] [info] binlogsyncer.go:791 rotate to (mysql-bin-changelog.011975, 4)
[2022/01/23 20:44:02] [info] binlogsyncer.go:791 rotate to (mysql-bin-changelog.011975, 4)
2022-01-23 20:44:02 INFO rotate to next log from mysql-bin-changelog.011975:139227492 to mysql-bin-changelog.011975
2022-01-23 20:44:02 INFO rotate to next log from mysql-bin-changelog.011975:0 to mysql-bin-changelog.011975
Copy: 112140000/904947480 12.4%; Applied: 0; Backlog: 0/1000; Time: 3h12m30s(total), 3h12m30s(copy); streamer: mysql-bin-changelog.011975:2853151; Lag: 0.02s, HeartbeatLag: 911.92s, State: migrating; ETA: 22h40m56s
Copy: 112400000/904947480 12.4%; Applied: 0; Backlog: 0/1000; Time: 3h13m0s(total), 3h13m0s(copy); streamer: mysql-bin-changelog.011975:121871428; Lag: 0.02s, HeartbeatLag: 931.22s, State: migrating; ETA: 22h40m52s
[2022/01/23 20:44:40] [info] binlogsyncer.go:791 rotate to (mysql-bin-changelog.011976, 4)
[2022/01/23 20:44:40] [info] binlogsyncer.go:791 rotate to (mysql-bin-changelog.011976, 4)
2022-01-23 20:44:40 INFO rotate to next log from mysql-bin-changelog.011976:148195656 to mysql-bin-changelog.011976
2022-01-23 20:44:40 INFO rotate to next log from mysql-bin-changelog.011976:0 to mysql-bin-changelog.011976
Copy: 112650000/904947480 12.4%; Applied: 0; Backlog: 0/1000; Time: 3h13m30s(total), 3h13m30s(copy); streamer: mysql-bin-changelog.011976:92692665; Lag: 0.02s, HeartbeatLag: 951.42s, State: migrating; ETA: 22h40m56s
[2022/01/23 20:45:15] [info] binlogsyncer.go:791 rotate to (mysql-bin-changelog.011977, 4)
[2022/01/23 20:45:15] [info] binlogsyncer.go:791 rotate to (mysql-bin-changelog.011977, 4)
2022-01-23 20:45:15 INFO rotate to next log from mysql-bin-changelog.011977:140138087 to mysql-bin-changelog.011977
2022-01-23 20:45:15 INFO rotate to next log from mysql-bin-changelog.011977:0 to mysql-bin-changelog.011977
Copy: 112910000/904947480 12.5%; Applied: 0; Backlog: 0/1000; Time: 3h14m0s(total), 3h14m0s(copy); streamer: mysql-bin-changelog.011977:71588220; Lag: 0.02s, HeartbeatLag: 971.02s, State: migrating; ETA: 22h40m51s

I have completed this same migration up to the cut-over phase, and it was simply blocked there because HeartbeatLag was so high by the time the copy finished. It is really odd that once the HeartbeatLag kicks in it never returns to lower values, only ever increasing from then on.
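For anyone else hitting this: as far as I can tell, gh-ost writes a heartbeat row into its changelog table (_<table>_ghc) every --heartbeat-interval-millis, and HeartbeatLag is how long ago the binlog streamer last saw one of those rows come back through the binlog stream, while Lag reads the same row with a direct query; cut-over appears to wait for HeartbeatLag to drop below roughly --max-lag-millis, which would explain why it stays blocked here. A quick sanity check that heartbeats are at least still being written (host, schema, and table names below are placeholders for my setup):

# placeholder names; the _<table>_ghc changelog table is created by gh-ost itself
mysql -h my-aurora-endpoint -u myuser -p -e \
  "SELECT hint, value, last_update FROM mydb._mytable_ghc WHERE hint = 'heartbeat';"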

I have also set the binlog retention to 72 hours.

call mysql.rds_set_configuration('binlog retention hours', 72);
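The current value can be double-checked with the matching show procedure (endpoint and credentials are placeholders):

mysql -h my-aurora-endpoint -u myuser -p -e "CALL mysql.rds_show_configuration;"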

I even tried to forcibly trigger the cut-over, and it replied that nothing is blocking the cut-over.
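For anyone reproducing this, the cut-over can be nudged through gh-ost's interactive socket; the socket path below assumes the default naming (/tmp/gh-ost.<schema>.<table>.sock) with placeholder schema/table names, and needs a netcat that supports UNIX sockets:

# ask gh-ost for its current state and whether anything blocks cut-over
echo status | nc -U /tmp/gh-ost.mydb.mytable.sock
# stop postponing and proceed to cut-over (removing the postpone flag file has the same effect)
echo unpostpone | nc -U /tmp/gh-ost.mydb.mytable.sock
rm -f /tmp/ghost.postpone.flag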

I have followed all of the recommended Aurora settings, aside from running the migration directly on the master.

Here is the command I ran.

 gh-ost \
--max-load=Threads_running=100 \
--critical-load=Threads_running=1000 \
--chunk-size=10000 \
--throttle-control-replicas="" \
--max-lag-millis=1500 \
--user="" \
--password="" \
--host=db \
--allow-on-master \
--database="" \
--table="" \
--verbose \
--alter="" \
--switch-to-rbr \
--allow-master-master \
--cut-over=default \
--exact-rowcount \
--concurrent-rowcount \
--default-retries=120 \
--panic-flag-file=/tmp/ghost.panic.flag \
--postpone-cut-over-flag-file=/tmp/ghost.postpone.flag \
--initially-drop-ghost-table \
--execute

Any guidance from other Aurora users would be appreciated.

zhang17310521020 commented 2 years ago

Hello, I have the same issue. Has this been solved?

myusuf3 commented 2 years ago

> Hello, I have the same issue. Has this been solved?

Nope.

Dasiu commented 6 days ago

Bump.

I am testing gh-ost and observed a very similar issue. In my case it was Cloud SQL for MySQL (MySQL in GCP). I executed gh-ost on a DB host without a replica. The migration was long (~20 hours), was throttled a few times, and the connection was lost and recovered a few times. At some point "Lag" and "HeartbeatLag" drifted and never went down. Ultimately all rows were copied and all binlog changes were applied, but gh-ost kept applying binlog events forever (the only ones left were the ones it itself generated; there was no more traffic on this DB host by then).

zhang17310521020 commented 6 days ago

Finally, I dealt with this problem by using online DDL instead.
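That is, a native in-place ALTER rather than gh-ost; roughly along these lines, with placeholder names and assuming the change supports INPLACE:

# illustrative only: MySQL native online DDL for an INPLACE-capable change
mysql -h my-host -u myuser -p -e \
  "ALTER TABLE mydb.mytable ADD COLUMN new_col INT NULL, ALGORITHM=INPLACE, LOCK=NONE;"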
