github / gh-ost

GitHub's Online Schema-migration Tool for MySQL

[Question] Drop of _ghc table after successful completion #886

Open s4mur4i opened 4 years ago

s4mur4i commented 4 years ago

Hello,

I have a question regarding the cleanup phase of gh-ost. After running an alter, gh-ost exited successfully. Here are the logs from that run:

Copy: 66215590/66215590 100.0%; Applied: 958916; Backlog: 408/1000; Time: 35m30s(total), 33m55s(copy); streamer: mysql-bin.003556:178345318; Lag: 0.04s, State: postponing cut-over; ETA: due
2020-09-23 13:02:11 INFO Grabbing voluntary lock: gh-ost.3470966548.lock
2020-09-23 13:02:11 INFO Setting LOCK timeout as 6 seconds
2020-09-23 13:02:11 INFO Looking for magic cut-over table
2020-09-23 13:02:11 INFO Creating magic cut-over table `table`.`_users_del`
2020-09-23 13:02:11 INFO Magic cut-over table created
2020-09-23 13:02:11 INFO Locking `table`.`users`, `table`.`_users_del`
2020-09-23 13:02:11 INFO Tables locked
2020-09-23 13:02:11 INFO Session locking original & magic tables is 3470966548
2020-09-23 13:02:11 INFO Writing changelog state: AllEventsUpToLockProcessed:1600866131471181760
2020-09-23 13:02:11 INFO Waiting for events up to lock
2020-09-23 13:02:11 INFO Intercepted changelog state AllEventsUpToLockProcessed
2020-09-23 13:02:11 INFO Handled changelog state AllEventsUpToLockProcessed
Copy: 66215590/66215590 100.0%; Applied: 965676; Backlog: 146/1000; Time: 35m45s(total), 33m55s(copy); streamer: mysql-bin.003556:212318228; Lag: 0.04s, State: migrating; ETA: due
2020-09-23 13:02:12 INFO Waiting for events up to lock: got AllEventsUpToLockProcessed:1600866131471181760
2020-09-23 13:02:12 INFO Done waiting for events up to lock; duration=753.747866ms
# Migrating `table`.`users`; Ghost table is `table`.`_users_gho`
# Migrating primary.address:3306; inspecting secondary.address:3306; executing on secondary.address
# Migration started at Wed Sep 23 12:26:26 +0000 2020
# chunk-size: 10000; max-lag-millis: 1000ms; dml-batch-size: 10; max-load: Threads_running=100; critical-load: Threads_running=1000; nice-ratio: 0.000000
# throttle-additional-flag-file: /tmp/gh-ost.throttle
# postpone-cut-over-flag-file: /tmp/ghost.postpone.flag
# panic-flag-file: /tmp/ghost.panic.flag
# Serving on unix socket: /tmp/gh-ost.table.users.sock
Copy: 66215590/66215590 100.0%; Applied: 965821; Backlog: 0/1000; Time: 35m46s(total), 33m55s(copy); streamer: mysql-bin.003556:212618836; Lag: 0.04s, State: migrating; ETA: due
2020-09-23 13:02:12 INFO Setting RENAME timeout as 3 seconds
2020-09-23 13:02:12 INFO Session renaming tables is 3471207277
2020-09-23 13:02:12 INFO Issuing and expecting this to block: rename /* gh-ost */ table `table`.`users` to `table`.`_users_del`, `table`.`_users_gho` to `table`.`users`
2020-09-23 13:02:12 INFO Found atomic RENAME to be blocking, as expected. Double checking the lock is still in place (though I don't strictly have to)
2020-09-23 13:02:12 INFO Checking session lock: gh-ost.3470966548.lock
2020-09-23 13:02:12 INFO Connection holding lock on original table still exists
2020-09-23 13:02:12 INFO Will now proceed to drop magic table and unlock tables
2020-09-23 13:02:12 INFO Dropping magic cut-over table
2020-09-23 13:02:12 INFO Releasing lock from `table`.`users`, `table`.`_users_del`
2020-09-23 13:02:12 INFO Tables unlocked
2020-09-23 13:02:12 INFO Tables renamed
2020-09-23 13:02:12 INFO Lock & rename duration: 1.083016852s. During this time, queries on `users` were blocked
2020-09-23 13:02:12 INFO Looking for magic cut-over table
[2020/09/23 13:02:12] [info] binlogsyncer.go:164 syncer is closing...
[2020/09/23 13:02:12] [error] binlogsyncer.go:631 connection was bad
[2020/09/23 13:02:12] [error] binlogstreamer.go:77 close sync with err: Sync was closed
2020-09-23 13:02:12 INFO Closed streamer connection. err=<nil>
[2020/09/23 13:02:12] [info] binlogsyncer.go:179 syncer is closed
2020-09-23 13:02:12 INFO Dropping table `table`.`_users_ghc`
2020-09-23 13:02:12 ERROR Error 1146: Table 'table._users_ghc' doesn't exist
2020-09-23 13:02:12 INFO Table dropped
2020-09-23 13:02:12 INFO Am not dropping old table because I want this operation to be as live as possible. If you insist I should do it, please add `--ok-to-drop-table` next time. But I prefer you do not. To drop the old table, issue:
2020-09-23 13:02:12 INFO -- drop table `table`.`_users_del`
2020-09-23 13:02:12 INFO Done migrating `table`.`users`
2020-09-23 13:02:12 INFO Removing socket file: /tmp/gh-ost.table.users.sock
2020-09-23 13:02:12 ERROR Error 1146: Table 'table._users_ghc' doesn't exist
2020-09-23 13:02:12 INFO Tearing down inspector
2020-09-23 13:02:12 INFO Tearing down applier
2020-09-23 13:02:12 INFO Tearing down streamer
2020-09-23 13:02:12 INFO Tearing down throttler
# Done

After it finished, I went to a tab with a MySQL console on the secondary, ran SHOW TABLES, and saw that the _gho and _ghc tables were still there. Since gh-ost had finished and exited successfully, I assumed they had been left behind accidentally, so I dropped the _ghc table (about 30 seconds after gh-ost finished). On the primary MySQL node I only have the _users_del table, and none of the _ghc or _gho tables. After that, slave replication broke with:

2020-09-23T13:03:05.781954Z 1878531 [ERROR] Slave SQL for channel '': Worker 4 failed executing transaction '6e82c8e7-e7ac-11ea-975d-0ec8ae9dfdc3:5723330704' at master log mysql-bin.003650, end_log_pos 527415518; Error executing row event: 'Table 'table._users_ghc' doesn't exist', Error_code: 1146

In the docs I couldn't really find any details about the cleanup phase. I found an issue asking how to detect whether the _ghc table is dead: https://github.com/github/gh-ost/issues/99 I was interested in what happens to the temporary tables during teardown, and how. I found the function that does the dropping of the _ghc table and others: https://github.com/github/gh-ost/blob/c940a85a28bad68878c5d1622aa7c4e595b35b38/go/logic/applier.go#L245 On another alter that was also successful, the same ERROR for the missing table appears:

[2020/09/23 12:23:24] [info] binlogsyncer.go:164 syncer is closing...
2020-09-23 12:23:24 INFO Closed streamer connection. err=<nil>
2020-09-23 12:23:24 INFO Dropping table `table`.`_user_settings_ghc`
[2020/09/23 12:23:24] [error] binlogsyncer.go:631 connection was bad
[2020/09/23 12:23:24] [error] binlogstreamer.go:77 close sync with err: Sync was closed
[2020/09/23 12:23:24] [info] binlogsyncer.go:179 syncer is closed
2020-09-23 12:23:25 INFO Table dropped
2020-09-23 12:23:25 INFO Am not dropping old table because I want this operation to be as live as possible. If you insist I should do it, please add `--ok-to-drop-table` next time. But I prefer you do not. To drop the old table, issue:
2020-09-23 12:23:25 INFO -- drop table `table`.`_user_settings_del`
2020-09-23 12:23:25 INFO Done migrating `table`.`user_settings`
2020-09-23 12:23:25 INFO Removing socket file: /tmp/gh-ost.table.user_settings.sock
2020-09-23 12:23:25 INFO Tearing down inspector
2020-09-23 12:23:25 ERROR Error 1146: Table 'table._user_settings_ghc' doesn't exist
2020-09-23 12:23:25 INFO Tearing down applier
2020-09-23 12:23:25 ERROR sql: database is closed
2020-09-23 12:23:25 INFO Tearing down streamer
2020-09-23 12:23:25 INFO Tearing down throttler
# Done

In this case, teardown completed, and when checking SHOW TABLES only the _del table was present.

For the changelog writer function: https://github.com/github/gh-ost/blob/c940a85a28bad68878c5d1622aa7c4e595b35b38/go/logic/applier.go#L279-L286 I see that it inserts into the table and, on a duplicate key, updates the existing row instead. To test this on a healthy replication setup, I deleted a row from the test slave and then ran a query like this on the master:

insert into data (id,hint,value) values (1234, "heartbeat", "123456") on duplicate key update last_update=NOW(),value=VALUES(value);

I wanted to see whether it would update the row on the master and create it on the slave, but I found that it breaks replication with:

Last_Error: Could not execute Update_rows event on table test.data; Can't find record in 'data', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log mysql-bin.000007, end_log_pos 2357

So simply recreating what was dropped wouldn't help by itself. Digging deeper, I found one solution/workaround that does solve the issue: if I manually insert a row with the same id on the slave and then restart replication, the pending update is applied to that row and replication continues successfully.
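Concretely, the recovery that worked in my test looks roughly like this on the slave (a sketch only; the values must match closely enough for the pending Update_rows event to locate the row, at minimum the primary key, and last_update is left to its default here):

STOP SLAVE;
-- re-create the row that the pending Update_rows event needs to find
INSERT INTO test.data (id, hint, value) VALUES (1234, "heartbeat", "123456");
START SLAVE;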

So in gh-ost's case: https://github.com/github/gh-ost/blob/c940a85a28bad68878c5d1622aa7c4e595b35b38/go/logic/applier.go#L272-L277 there are 3 possible "hints" (and corresponding ids), and if we had inserted those 3 rows on the slave and restarted replication, the slave would have applied the pending updates and finally dropped the table. Is this a correct assumption?
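If that assumption is right, the repair on the replica would have been something along these lines before restarting replication (purely a sketch: the _ghc table would first need to exist again with gh-ost's original definition, and the id-to-hint mapping and column list below are my assumptions, not taken from the gh-ost source):

STOP SLAVE;
-- placeholder rows; the pending row events would then update them in place
INSERT INTO `table`.`_users_ghc` (id, hint, value) VALUES
  (1, "heartbeat", ""),
  (2, "state", ""),
  (3, "throttle", "");
START SLAVE;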

Investigating the changelog events further, I found that the table handles 3 kinds of hints:

"state": I only found it here: https://github.com/github/gh-ost/blob/5e953b7e3eb13716e84a9c4017cc8add0648b8d0/go/logic/migrator.go#L544 It is used during the cut-over: https://github.com/github/gh-ost/blob/5e953b7e3eb13716e84a9c4017cc8add0648b8d0/go/logic/migrator.go#L587 So after the switch it should not receive any further updates.

"heartbeat": This one is written asynchronously: https://github.com/github/gh-ost/blob/c940a85a28bad68878c5d1622aa7c4e595b35b38/go/logic/applier.go#L312 It runs in cycles, and I suspect an update from one of these cycles was still replicating and caused the issue.

"throttle": Comes from the throttle checker: https://github.com/github/gh-ost/blob/c07d08f8b58e170da7031624c1a8ec93e705d1c0/go/logic/throttler.go#L435 After cut-over I don't really believe we would get a throttle event, since the cleanup has already happened.
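To make the heartbeat case concrete: the write I suspect was still replicating would look much like my test query above, just against the changelog table (a sketch; the column list, the id, and the timestamp value are assumptions):

INSERT INTO `table`.`_users_ghc` (id, hint, value)
VALUES (1, "heartbeat", "2020-09-23T13:02:12")
ON DUPLICATE KEY UPDATE last_update=NOW(), value=VALUES(value);

With row-based replication this is logged as a Write_rows or Update_rows event against `table`.`_users_ghc`, which is exactly the kind of event the replica could no longer apply once that table had been dropped there.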

Are my findings about _ghc correct? Next time I know to wait longer for the slave to catch up and drain its pending events; I just wanted to understand the problem and a possible fix for the future.
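For future runs, before touching any leftover gh-ost tables on a replica, I would first confirm the replica has fully drained, roughly like this (a sketch; checking lag alone may not be a complete guarantee):

-- on the replica
SHOW SLAVE STATUS\G
-- wait until Seconds_Behind_Master is 0 and Retrieved_Gtid_Set is fully
-- contained in Executed_Gtid_Set before doing any manual cleanup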

shlomi-noach commented 3 years ago

In your logs:

2020-09-23 13:02:12 ERROR Error 1146: Table 'table._users_ghc' doesn't exist

This shouldn't happen; I'm not sure how this came to be. Does the problem reproduce?

After it finished, I went to a tab with a MySQL console on the secondary, ran SHOW TABLES, and saw that the _gho and _ghc tables were still there. Since gh-ost had finished and exited successfully, I assumed they had been left behind accidentally, so I dropped the _ghc table (about 30 seconds after gh-ost finished). On the primary MySQL node I only have the _users_del table, and none of the _ghc or _gho tables. After that, slave replication broke with...

At least that part makes some sense: you dropped a table on a replica, and replication later broke because the primary (master) did still have that table, and when the table was finally dropped on the primary, the statement could not replicate because the table was not found on the replica. Now, I suspect there may have been significant replication lag.

I'm not sure I understand your conclusions around deleting rows from the _ghc tables. You're deleting rows on the replica, then inserting them again to force replication to recover... Rows in the _ghc tables should have nothing to do with the original problem, which is that the _users_ghc table itself could not be found.

Could you please paste your original CLI command (gh-ost -alter ...) and the full log, which shows which hosts gh-ost identifies as replica & master?