github / gh-ost

GitHub's Online Schema-migration Tool for MySQL

gh-ost in a loop of not being able to cut-over. #784

Open tomkrouper opened 5 years ago

tomkrouper commented 5 years ago

Situation:

The migration was 100% complete, except that the binlog "Backlog" was at 1000/1000. I didn't realize gh-ost was several hours behind, and I attempted to cut-over.

Cut-over:

--postpone-cut-over-flag-file was set. When I attempted to cut over, the cut-over timed out. We use a socket command to unpostpone our migrations (a sketch of that command is shown after the log). Here is the log.

Copy: 1504139030/1504139030 100.0%; Applied: 958335966; Backlog: 1000/1000; Time: 146h50m49s(total), 126h57m23s(copy); streamer: mysql-bin.151538:57981919; State: postponing cut-over; ETA: due
2019-09-29 14:43:15 INFO Grabbing voluntary lock: gh-ost.220158630.lock
2019-09-29 14:43:15 INFO Setting LOCK timeout as 6 seconds
2019-09-29 14:43:15 INFO Looking for magic cut-over table
2019-09-29 14:43:15 INFO Creating magic cut-over table `my_database`.`_my_table_20190923115225_del`
2019-09-29 14:43:15 INFO Magic cut-over table created
2019-09-29 14:43:15 INFO Locking `my_database`.`my_table`, `my_database`.`_my_table_20190923115225_del`
2019-09-29 14:43:15 INFO Tables locked
2019-09-29 14:43:15 INFO Session locking original & magic tables is 220158630
2019-09-29 14:43:15 INFO Writing changelog state: AllEventsUpToLockProcessed:1569793395598070384
2019-09-29 14:43:15 INFO Waiting for events up to lock
Copy: 1504139030/1504139030 100.0%; Applied: 958341156; Backlog: 1000/1000; Time: 146h50m50s(total), 126h57m23s(copy); streamer: mysql-bin.151538:65463369; State: migrating; ETA: due
2019-09-29 14:43:18 ERROR Timeout while waiting for events up to lock
2019-09-29 14:43:18 ERROR 2019-09-29 14:43:18 ERROR Timeout while waiting for events up to lock
2019-09-29 14:43:18 INFO Looking for magic cut-over table
2019-09-29 14:43:18 INFO Will now proceed to drop magic table and unlock tables
2019-09-29 14:43:18 INFO Dropping magic cut-over table
2019-09-29 14:43:18 INFO Releasing lock from `my_database`.`my_table`, `my_database`.`_my_table_20190923115225_del`
2019-09-29 14:43:18 INFO Tables unlocked
2019-09-29 14:43:19 INFO executing gh-ost-on-begin-postponed hook: /data/github/shell/bin/gh-ost-hooks/prod/gh-ost-on-begin-postponed-hook

Copy: 1504139030/1504139030 100.0%; Applied: 958355546; Backlog: 1000/1000; Time: 146h50m55s(total), 126h57m23s(copy); streamer: mysql-bin.151538:88596107; State: postponing cut-over; ETA: due
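
For reference, the "socket command to unpostpone" mentioned above is gh-ost's interactive unpostpone command, sent over the migration's unix socket. A minimal sketch, assuming the default socket path for this schema and table (the exact command and path on our hosts may differ):

echo unpostpone | nc -U /tmp/gh-ost.my_database.my_table.sock

Deleting the file named by --postpone-cut-over-flag-file has the same effect: gh-ost stops postponing and attempts the cut-over.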

Where I probably screwed up to expose the bug:

Hoping to get the cut-over to work, I went to the host that was running the migration and deleted the postpone flag file. This didn't seem to work, so after a short time I put the flag file back in place. Shortly after this @ggunson pointed out the binary log lag to me.

So we waited. When things looked like they were close to caught up again, I attempted the socket cut-over again. It failed, but then we noticed something in the stream: it stopped making any progress. It sat on the same binary log and position. We waited; it stayed the same. Then we tried a cut-over again. The stream moved forward again (then got stuck again). At this point I just deleted the postpone file again, and I watched the migration attempt and fail to cut-over several times before the migration itself failed.

Is this a bug?

Probably yes. Is it going to be easy to reproduce? Probably no. If you have any thoughts, let me know.

ggunson commented 5 years ago

Some extra information on the incident:

Deleting the postpone flag would never have worked here, because gh-ost was always going to be hours behind in replication, and it needs to be caught up in order to cut over.

"Backlog 1000/1000" is not a great status metric to give, since it's the same value whether it's 1000 events behind or 1 billion. A table is not ready for cut-over if gh-ost is even minutes behind in replication on a busy cluster (it was possibly fine when it first announced it was ready to cut-over, but wasn't when it said it after each failed cut-over command). And finding how far behind gh-ost really is requires looking at the replica's current binlog file, which is an outside step to take and not documented.

I think we could possibly reproduce this bug using a delayed replica.
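
A minimal sketch of setting that up for a repro attempt, assuming a MySQL 5.7-style test replica we are free to delay (host name and delay value are illustrative):

mysql -h delayed-replica -e "STOP SLAVE; CHANGE MASTER TO MASTER_DELAY = 3600; START SLAVE;"

Then run gh-ost with the delayed replica as its inspected host, let the row copy finish, and attempt the cut-over while the delay keeps the binlog stream an hour behind.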

ggunson commented 5 years ago

This has repeated itself with the second execution of the same migration.

It said it was ready to cut over, and it had only been 10 minutes between that announcement and my running the command.

Its last status output before that was

Copy: 1470149100/1541034958 95.4%; Applied: 95986884; Backlog: 0/1000; Time: 46h0m0s(total), 45h59m58s(copy); streamer: mysql-bin.154942:103213077; State: migrating; ETA: 2h13m4s

All status output after the first cut-over attempt says

Copy: 1493042250/1493042250 100.0%; Applied: 95986884; Backlog: 0/1000; Time: 46h39m29s(total), 46h27m57s(copy); streamer: mysql-bin.154942:103213077; State: postponing cut-over; ETA: due

Copy: 1493042250/1493042250 100.0%; Applied: 95986884; Backlog: 0/1000; Time: 46h43m53s(total), 46h27m57s(copy); streamer: mysql-bin.154942:103213077; State: postponing cut-over; ETA: due

The times are updated but the streamer value stays the same, so I'd guess that means it's blocked (I don't have easy access to the screen/tmux running the migration session to check).

shlomi-noach commented 5 years ago

The 2nd execution failure was caused by a restart (a reboot, actually) of the inspected server; gh-ost apparently retried enough times to see it come back up and continued running, without actually making progress. See https://github.com/github/gh-ost/issues/785

shlomi-noach commented 5 years ago

A few thoughts before attempting to reproduce the behavior:

Back to the issue -- I'm merely guessing the logs from that 1st execution are no longer available. I will try to devise a plan to reproduce based on @ggunson's idea of a delayed replica.

acharis commented 4 years ago

I think I might have seen this as well.

 PROD - iad02/vt  [vitess@redacteddb-0-1/mysql /]$ echo status | nc -U /tmp/gh-ost.RedactedDb.redacted_table.sock
# Migrating `RedactedDb`.`redacted_table`; Ghost table is `RedactedDb`.`_redacted_table_gho`
# Migrating redacteddb-0-2:3306; inspecting redacteddb-0-1:3306; executing on redacteddb-0-1
# Migration started at Fri Nov 15 22:07:40 +0000 2019
# chunk-size: 1000; max-lag-millis: 1500ms; dml-batch-size: 10; max-load: Threads_running=100; critical-load: Threads_running=1000; nice-ratio: 0.000000
# throttle-additional-flag-file: /tmp/gh-ost.throttle
# throttle-control-replicas count: 3
# postpone-cut-over-flag-file: /tmp/ghost-cutover [set]
# Serving on unix socket: /tmp/gh-ost.RedactedDb.redacted_table.sock
Copy: 49272862/49272862 100.0%; Applied: 37189163; Backlog: 1000/1000; Time: 13h28m26s(total), 7h32m9s(copy); streamer: vt-1060060001-bin.045687:856158742; State: postponing cut-over; ETA: due
 PROD - iad02/vt  [vitess@redacteddb-0-1/mysql /]$ rm /tmp/ghost-cutover
 PROD - iad02/vt  [vitess@redacteddb-0-1/mysql /]$ echo status | nc -U /tmp/gh-ost.RedactedDb.redacted_table.sock
# Migrating `RedactedDb`.`redacted_table`; Ghost table is `RedactedDb`.`_redacted_table_gho`
# Migrating redacteddb-0-2:3306; inspecting redacteddb-0-1:3306; executing on redacteddb-0-1
# Migration started at Fri Nov 15 22:07:40 +0000 2019
# chunk-size: 1000; max-lag-millis: 1500ms; dml-batch-size: 10; max-load: Threads_running=100; critical-load: Threads_running=1000; nice-ratio: 0.000000
# throttle-additional-flag-file: /tmp/gh-ost.throttle
# throttle-control-replicas count: 3
# postpone-cut-over-flag-file: /tmp/ghost-cutover
# Serving on unix socket: /tmp/gh-ost.RedactedDb.redacted_table.sock
Copy: 49272862/49272862 100.0%; Applied: 37198403; Backlog: 1000/1000; Time: 13h28m38s(total), 7h32m9s(copy); streamer: vt-1060060001-bin.045687:958432539; State: migrating; ETA: due

I still have the _ghc table if that would help. I don't see any log file anywhere, so we don't have that...

Some time after the above interaction via the socket, gh-ost was no longer running; however, the cut-over didn't happen. If this is different from the above, I'm happy to open another issue.

luke-deep-space commented 1 year ago

it needs to be caught up in order to cut-over

This comment helped me. I didn't realise gh-ost wanted replication to be completely caught up before cut-over. I thought it might cut over if it knew the target table was not impacted by any pending replication events, but I guess I was mistaken.