github / gh-ost

GitHub's Online Schema-migration Tool for MySQL

gh-ost in a loop of not being able to cut-over. #784

Open tomkrouper opened 5 years ago

tomkrouper commented 5 years ago

Situation:

The migration was 100% complete, except that the binlog "Backlog" was at 1000/1000. I didn't realize gh-ost was several hours behind, and I attempted to cut-over.

Cut-over:

--postpone-cut-over-flag-file was set. When I attempted to cut over, the cut-over timed out. We use a socket command to unpostpone our migrations (a sketch of that command is shown after the log). Here is the log.

Copy: 1504139030/1504139030 100.0%; Applied: 958335966; Backlog: 1000/1000; Time: 146h50m49s(total), 126h57m23s(copy); streamer: mysql-bin.151538:57981919; State: postponing cut-over; ETA: due
2019-09-29 14:43:15 INFO Grabbing voluntary lock: gh-ost.220158630.lock
2019-09-29 14:43:15 INFO Setting LOCK timeout as 6 seconds
2019-09-29 14:43:15 INFO Looking for magic cut-over table
2019-09-29 14:43:15 INFO Creating magic cut-over table `my_database`.`_my_table_20190923115225_del`
2019-09-29 14:43:15 INFO Magic cut-over table created
2019-09-29 14:43:15 INFO Locking `my_database`.`my_table`, `my_database`.`_my_table_20190923115225_del`
2019-09-29 14:43:15 INFO Tables locked
2019-09-29 14:43:15 INFO Session locking original & magic tables is 220158630
2019-09-29 14:43:15 INFO Writing changelog state: AllEventsUpToLockProcessed:1569793395598070384
2019-09-29 14:43:15 INFO Waiting for events up to lock
Copy: 1504139030/1504139030 100.0%; Applied: 958341156; Backlog: 1000/1000; Time: 146h50m50s(total), 126h57m23s(copy); streamer: mysql-bin.151538:65463369; State: migrating; ETA: due
2019-09-29 14:43:18 ERROR Timeout while waiting for events up to lock
2019-09-29 14:43:18 ERROR 2019-09-29 14:43:18 ERROR Timeout while waiting for events up to lock
2019-09-29 14:43:18 INFO Looking for magic cut-over table
2019-09-29 14:43:18 INFO Will now proceed to drop magic table and unlock tables
2019-09-29 14:43:18 INFO Dropping magic cut-over table
2019-09-29 14:43:18 INFO Releasing lock from `my_database`.`my_table`, `my_database`.`_my_table_20190923115225_del`
2019-09-29 14:43:18 INFO Tables unlocked
2019-09-29 14:43:19 INFO executing gh-ost-on-begin-postponed hook: /data/github/shell/bin/gh-ost-hooks/prod/gh-ost-on-begin-postponed-hook

Copy: 1504139030/1504139030 100.0%; Applied: 958355546; Backlog: 1000/1000; Time: 146h50m55s(total), 126h57m23s(copy); streamer: mysql-bin.151538:88596107; State: postponing cut-over; ETA: due
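
For reference, the "socket command to unpostpone" mentioned above is gh-ost's interactive unpostpone command, sent over the migration's unix socket. A minimal sketch, assuming the default socket path for this schema and table (the exact command and path on our hosts may differ):

echo unpostpone | nc -U /tmp/gh-ost.my_database.my_table.sock

Deleting the file named by --postpone-cut-over-flag-file has the same effect: gh-ost stops postponing and attempts the cut-over.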

Where I probably screwed up to expose the bug:

Hoping to get the cut-over to work, I went to the host that was running the migration and deleted the postpone flag file. This didn't seem to work, so after a short time I put the flag file back in place. Shortly after this @ggunson pointed out the binary log lag to me.

So we waited. When things looked like they were close to caught up again, I attempted the socket cut-over again. It failed, but then we noticed something in the stream: it stopped making any progress. It sat on the same binary log and position. We waited; it stayed the same. Then we tried a cut-over again. The stream moved forward again (then got stuck again). At this point I just deleted the postpone file again, and I watched the migration attempt and fail to cut-over several times before the migration itself failed.

Is this a bug?

Probably yes. Is it going to be easy to reproduce? Probably no. If you have any thoughts, let me know.

ggunson commented 5 years ago

Some extra information on the incident:

Deleting the postpone flag would never have worked here, because gh-ost was always going to be hours behind in replication, and it needs to be caught up in order to cut over.

"Backlog 1000/1000" is not a great status metric to give, since it's the same value whether it's 1000 events behind or 1 billion. A table is not ready for cut-over if gh-ost is even minutes behind in replication on a busy cluster (it was possibly fine when it first announced it was ready to cut-over, but wasn't when it said it after each failed cut-over command). And finding how far behind gh-ost really is requires looking at the replica's current binlog file, which is an outside step to take and not documented.

I think we could possibly reproduce this bug using a delayed replica.
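
A minimal sketch of setting that up for a repro attempt, assuming a MySQL 5.7-style test replica we are free to delay (host name and delay value are illustrative):

mysql -h delayed-replica -e "STOP SLAVE; CHANGE MASTER TO MASTER_DELAY = 3600; START SLAVE;"

Then run gh-ost with the delayed replica as its inspected host, let the row copy finish, and attempt the cut-over while the delay keeps the binlog stream an hour behind.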

ggunson commented 5 years ago

This has repeated itself with the second execution of the same migration.

It said it was ready to cut over, and it had only been 10 minutes between that announcement and my running the command.

Its last status output before that was

Copy: 1470149100/1541034958 95.4%; Applied: 95986884; Backlog: 0/1000; Time: 46h0m0s(total), 45h59m58s(copy); streamer: mysql-bin.154942:103213077; State: migrating; ETA: 2h13m4s

All status output after the first cut-over attempt says

Copy: 1493042250/1493042250 100.0%; Applied: 95986884; Backlog: 0/1000; Time: 46h39m29s(total), 46h27m57s(copy); streamer: mysql-bin.154942:103213077; State: postponing cut-over; ETA: due

Copy: 1493042250/1493042250 100.0%; Applied: 95986884; Backlog: 0/1000; Time: 46h43m53s(total), 46h27m57s(copy); streamer: mysql-bin.154942:103213077; State: postponing cut-over; ETA: due

The times are updated but the streamer value stays the same, so I'd guess that means it's blocked (I don't have easy access to the screen/tmux running the migration session to check).

shlomi-noach commented 5 years ago

The 2nd execution failure was caused by a restart (a reboot, actually) of the inspected server; gh-ost apparently retried enough times to see it come back up and continued running, without actually making progress. See https://github.com/github/gh-ost/issues/785

shlomi-noach commented 5 years ago

A few thoughts before attempting to reproduce the behavior:

Back to the issue -- I'm merely guessing the logs from that 1st execution are no longer available. I will try to devise a plan to reproduce based on @ggunson's idea of a delayed replica.

acharis commented 4 years ago

I think I might have seen this as well.

 PROD - iad02/vt  [vitess@redacteddb-0-1/mysql /]$ echo status | nc -U /tmp/gh-ost.RedactedDb.redacted_table.sock
# Migrating `RedactedDb`.`redacted_table`; Ghost table is `RedactedDb`.`_redacted_table_gho`
# Migrating redacteddb-0-2:3306; inspecting redacteddb-0-1:3306; executing on redacteddb-0-1
# Migration started at Fri Nov 15 22:07:40 +0000 2019
# chunk-size: 1000; max-lag-millis: 1500ms; dml-batch-size: 10; max-load: Threads_running=100; critical-load: Threads_running=1000; nice-ratio: 0.000000
# throttle-additional-flag-file: /tmp/gh-ost.throttle
# throttle-control-replicas count: 3
# postpone-cut-over-flag-file: /tmp/ghost-cutover [set]
# Serving on unix socket: /tmp/gh-ost.RedactedDb.redacted_table.sock
Copy: 49272862/49272862 100.0%; Applied: 37189163; Backlog: 1000/1000; Time: 13h28m26s(total), 7h32m9s(copy); streamer: vt-1060060001-bin.045687:856158742; State: postponing cut-over; ETA: due
 PROD - iad02/vt  [vitess@redacteddb-0-1/mysql /]$ rm /tmp/ghost-cutover
 PROD - iad02/vt  [vitess@redacteddb-0-1/mysql /]$ echo status | nc -U /tmp/gh-ost.RedactedDb.redacted_table.sock
# Migrating `RedactedDb`.`redacted_table`; Ghost table is `RedactedDb`.`_redacted_table_gho`
# Migrating redacteddb-0-2:3306; inspecting redacteddb-0-1:3306; executing on redacteddb-0-1
# Migration started at Fri Nov 15 22:07:40 +0000 2019
# chunk-size: 1000; max-lag-millis: 1500ms; dml-batch-size: 10; max-load: Threads_running=100; critical-load: Threads_running=1000; nice-ratio: 0.000000
# throttle-additional-flag-file: /tmp/gh-ost.throttle
# throttle-control-replicas count: 3
# postpone-cut-over-flag-file: /tmp/ghost-cutover
# Serving on unix socket: /tmp/gh-ost.RedactedDb.redacted_table.sock
Copy: 49272862/49272862 100.0%; Applied: 37198403; Backlog: 1000/1000; Time: 13h28m38s(total), 7h32m9s(copy); streamer: vt-1060060001-bin.045687:958432539; State: migrating; ETA: due

I still have the _ghc table if that would help. I don't see any log file anywhere, so we don't have that...

Some time after the above interaction via the socket, gh-ost was no longer running; however, the cut-over didn't happen. If this is different from the above, I'm happy to open another issue.

luke-deep-space commented 1 year ago

it needs to be caught up in order to cut-over

This comment helped me. I didn't realise gh-ost wanted replication to be completely caught up before cut-over. I thought it might cut over if it knew the target table was not impacted by any pending replication events, but I guess I was mistaken.