github / gh-ost

GitHub's Online Schema-migration Tool for MySQL

[Question] Drop of _ghc table after successful completion #886

Open s4mur4i opened 4 years ago

s4mur4i commented 4 years ago

Hello,

I have a question regarding the cleanup phase of gh-ost. After running an alter, gh-ost exited successfully. Here are the logs from that run:

Copy: 66215590/66215590 100.0%; Applied: 958916; Backlog: 408/1000; Time: 35m30s(total), 33m55s(copy); streamer: mysql-bin.003556:178345318; Lag: 0.04s, State: postponing cut-over; ETA: due
2020-09-23 13:02:11 INFO Grabbing voluntary lock: gh-ost.3470966548.lock
2020-09-23 13:02:11 INFO Setting LOCK timeout as 6 seconds
2020-09-23 13:02:11 INFO Looking for magic cut-over table
2020-09-23 13:02:11 INFO Creating magic cut-over table `table`.`_users_del`
2020-09-23 13:02:11 INFO Magic cut-over table created
2020-09-23 13:02:11 INFO Locking `table`.`users`, `table`.`_users_del`
2020-09-23 13:02:11 INFO Tables locked
2020-09-23 13:02:11 INFO Session locking original & magic tables is 3470966548
2020-09-23 13:02:11 INFO Writing changelog state: AllEventsUpToLockProcessed:1600866131471181760
2020-09-23 13:02:11 INFO Waiting for events up to lock
2020-09-23 13:02:11 INFO Intercepted changelog state AllEventsUpToLockProcessed
2020-09-23 13:02:11 INFO Handled changelog state AllEventsUpToLockProcessed
Copy: 66215590/66215590 100.0%; Applied: 965676; Backlog: 146/1000; Time: 35m45s(total), 33m55s(copy); streamer: mysql-bin.003556:212318228; Lag: 0.04s, State: migrating; ETA: due
2020-09-23 13:02:12 INFO Waiting for events up to lock: got AllEventsUpToLockProcessed:1600866131471181760
2020-09-23 13:02:12 INFO Done waiting for events up to lock; duration=753.747866ms
# Migrating `table`.`users`; Ghost table is `table`.`_users_gho`
# Migrating primary.address:3306; inspecting secondary.address:3306; executing on secondary.address
# Migration started at Wed Sep 23 12:26:26 +0000 2020
# chunk-size: 10000; max-lag-millis: 1000ms; dml-batch-size: 10; max-load: Threads_running=100; critical-load: Threads_running=1000; nice-ratio: 0.000000
# throttle-additional-flag-file: /tmp/gh-ost.throttle
# postpone-cut-over-flag-file: /tmp/ghost.postpone.flag
# panic-flag-file: /tmp/ghost.panic.flag
# Serving on unix socket: /tmp/gh-ost.table.users.sock
Copy: 66215590/66215590 100.0%; Applied: 965821; Backlog: 0/1000; Time: 35m46s(total), 33m55s(copy); streamer: mysql-bin.003556:212618836; Lag: 0.04s, State: migrating; ETA: due
2020-09-23 13:02:12 INFO Setting RENAME timeout as 3 seconds
2020-09-23 13:02:12 INFO Session renaming tables is 3471207277
2020-09-23 13:02:12 INFO Issuing and expecting this to block: rename /* gh-ost */ table `table`.`users` to `table`.`_users_del`, `table`.`_users_gho` to `table`.`users`
2020-09-23 13:02:12 INFO Found atomic RENAME to be blocking, as expected. Double checking the lock is still in place (though I don't strictly have to)
2020-09-23 13:02:12 INFO Checking session lock: gh-ost.3470966548.lock
2020-09-23 13:02:12 INFO Connection holding lock on original table still exists
2020-09-23 13:02:12 INFO Will now proceed to drop magic table and unlock tables
2020-09-23 13:02:12 INFO Dropping magic cut-over table
2020-09-23 13:02:12 INFO Releasing lock from `table`.`users`, `table`.`_users_del`
2020-09-23 13:02:12 INFO Tables unlocked
2020-09-23 13:02:12 INFO Tables renamed
2020-09-23 13:02:12 INFO Lock & rename duration: 1.083016852s. During this time, queries on `users` were blocked
2020-09-23 13:02:12 INFO Looking for magic cut-over table
[2020/09/23 13:02:12] [info] binlogsyncer.go:164 syncer is closing...
[2020/09/23 13:02:12] [error] binlogsyncer.go:631 connection was bad
[2020/09/23 13:02:12] [error] binlogstreamer.go:77 close sync with err: Sync was closed
2020-09-23 13:02:12 INFO Closed streamer connection. err=<nil>
[2020/09/23 13:02:12] [info] binlogsyncer.go:179 syncer is closed
2020-09-23 13:02:12 INFO Dropping table `table`.`_users_ghc`
2020-09-23 13:02:12 ERROR Error 1146: Table 'table._users_ghc' doesn't exist
2020-09-23 13:02:12 INFO Table dropped
2020-09-23 13:02:12 INFO Am not dropping old table because I want this operation to be as live as possible. If you insist I should do it, please add `--ok-to-drop-table` next time. But I prefer you do not. To drop the old table, issue:
2020-09-23 13:02:12 INFO -- drop table `table`.`_users_del`
2020-09-23 13:02:12 INFO Done migrating `table`.`users`
2020-09-23 13:02:12 INFO Removing socket file: /tmp/gh-ost.table.users.sock
2020-09-23 13:02:12 ERROR Error 1146: Table 'table._users_ghc' doesn't exist
2020-09-23 13:02:12 INFO Tearing down inspector
2020-09-23 13:02:12 INFO Tearing down applier
2020-09-23 13:02:12 INFO Tearing down streamer
2020-09-23 13:02:12 INFO Tearing down throttler
# Done

After it finished, I went to a tab with a MySQL console on the secondary, ran SHOW TABLES, and saw that the _gho and _ghc tables were still there. Since gh-ost had finished and exited successfully, I assumed they had been left behind accidentally, so I dropped the _ghc table (about 30 seconds after gh-ost finished). On the primary MySQL node I only have the _users_del table, and none of the _ghc or _gho tables. After that, slave replication broke with:

2020-09-23T13:03:05.781954Z 1878531 [ERROR] Slave SQL for channel '': Worker 4 failed executing transaction '6e82c8e7-e7ac-11ea-975d-0ec8ae9dfdc3:5723330704' at master log mysql-bin.003650, end_log_pos 527415518; Error executing row event: 'Table 'table._users_ghc' doesn't exist', Error_code: 1146

In the docs I couldn't really find any details about the cleanup phase. I found an issue asking how to detect whether the _ghc table is dead: https://github.com/github/gh-ost/issues/99 I was interested in what happens to the temporary tables during teardown, and how. I found the function that does the dropping of the _ghc table and others: https://github.com/github/gh-ost/blob/c940a85a28bad68878c5d1622aa7c4e595b35b38/go/logic/applier.go#L245 On another alter that was also successful, the same ERROR for the missing table appears:

[2020/09/23 12:23:24] [info] binlogsyncer.go:164 syncer is closing...
2020-09-23 12:23:24 INFO Closed streamer connection. err=<nil>
2020-09-23 12:23:24 INFO Dropping table `table`.`_user_settings_ghc`
[2020/09/23 12:23:24] [error] binlogsyncer.go:631 connection was bad
[2020/09/23 12:23:24] [error] binlogstreamer.go:77 close sync with err: Sync was closed
[2020/09/23 12:23:24] [info] binlogsyncer.go:179 syncer is closed
2020-09-23 12:23:25 INFO Table dropped
2020-09-23 12:23:25 INFO Am not dropping old table because I want this operation to be as live as possible. If you insist I should do it, please add `--ok-to-drop-table` next time. But I prefer you do not. To drop the old table, issue:
2020-09-23 12:23:25 INFO -- drop table `table`.`_user_settings_del`
2020-09-23 12:23:25 INFO Done migrating `table`.`user_settings`
2020-09-23 12:23:25 INFO Removing socket file: /tmp/gh-ost.table.user_settings.sock
2020-09-23 12:23:25 INFO Tearing down inspector
2020-09-23 12:23:25 ERROR Error 1146: Table 'table._user_settings_ghc' doesn't exist
2020-09-23 12:23:25 INFO Tearing down applier
2020-09-23 12:23:25 ERROR sql: database is closed
2020-09-23 12:23:25 INFO Tearing down streamer
2020-09-23 12:23:25 INFO Tearing down throttler
# Done

In this case, teardown completed, and when checking SHOW TABLES only the _del table was present.

For the changelog writer function: https://github.com/github/gh-ost/blob/c940a85a28bad68878c5d1622aa7c4e595b35b38/go/logic/applier.go#L279-L286 I see that it inserts into the table and, on a duplicate key, updates the existing row instead. To test this on a healthy replication setup, I deleted a row from the test slave and then ran a query like this on the master:

insert into data (id,hint,value) values (1234, "heartbeat", "123456") on duplicate key update last_update=NOW(),value=VALUES(value);

I wanted to see whether it would update the row on the master and create it on the slave, but I found that it breaks replication with:

Last_Error: Could not execute Update_rows event on table test.data; Can't find record in 'data', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log mysql-bin.000007, end_log_pos 2357

So simply recreating what was dropped wouldn't help by itself. Digging deeper, I found one solution/workaround that does solve the issue: if I manually insert a row with the same id on the slave and then restart replication, the pending update is applied to that row and replication continues successfully.
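Concretely, the recovery that worked in my test looks roughly like this on the slave (a sketch only; the values must match closely enough for the pending Update_rows event to locate the row, at minimum the primary key, and last_update is left to its default here):

STOP SLAVE;
-- re-create the row that the pending Update_rows event needs to find
INSERT INTO test.data (id, hint, value) VALUES (1234, "heartbeat", "123456");
START SLAVE;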

So in gh-ost's case: https://github.com/github/gh-ost/blob/c940a85a28bad68878c5d1622aa7c4e595b35b38/go/logic/applier.go#L272-L277 there are 3 possible "hints" (and corresponding ids), and if we had inserted those 3 rows on the slave and restarted replication, the slave would have applied the pending updates and finally dropped the table. Is this a correct assumption?
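If that assumption is right, the repair on the replica would have been something along these lines before restarting replication (purely a sketch: the _ghc table would first need to exist again with gh-ost's original definition, and the id-to-hint mapping and column list below are my assumptions, not taken from the gh-ost source):

STOP SLAVE;
-- placeholder rows; the pending row events would then update them in place
INSERT INTO `table`.`_users_ghc` (id, hint, value) VALUES
  (1, "heartbeat", ""),
  (2, "state", ""),
  (3, "throttle", "");
START SLAVE;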

Investigating the changelog events further, I found that the table handles 3 kinds of hints:

"state": I only found it here: https://github.com/github/gh-ost/blob/5e953b7e3eb13716e84a9c4017cc8add0648b8d0/go/logic/migrator.go#L544 It is used during the cut-over: https://github.com/github/gh-ost/blob/5e953b7e3eb13716e84a9c4017cc8add0648b8d0/go/logic/migrator.go#L587 So after the switch it should not receive any further updates.

"heartbeat": This one is written asynchronously: https://github.com/github/gh-ost/blob/c940a85a28bad68878c5d1622aa7c4e595b35b38/go/logic/applier.go#L312 It runs in cycles, and I suspect an update from one of these cycles was still replicating and caused the issue.

"throttle": Comes from the throttle checker: https://github.com/github/gh-ost/blob/c07d08f8b58e170da7031624c1a8ec93e705d1c0/go/logic/throttler.go#L435 After cut-over I don't really believe we would get a throttle event, since the cleanup has already happened.
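To make the heartbeat case concrete: the write I suspect was still replicating would look much like my test query above, just against the changelog table (a sketch; the column list, the id, and the timestamp value are assumptions):

INSERT INTO `table`.`_users_ghc` (id, hint, value)
VALUES (1, "heartbeat", "2020-09-23T13:02:12")
ON DUPLICATE KEY UPDATE last_update=NOW(), value=VALUES(value);

With row-based replication this is logged as a Write_rows or Update_rows event against `table`.`_users_ghc`, which is exactly the kind of event the replica could no longer apply once that table had been dropped there.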

Are my findings about _ghc correct? Next time I know to wait longer for the slave to catch up and drain its pending events; I just wanted to understand the problem and a possible fix for the future.
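For future runs, before touching any leftover gh-ost tables on a replica, I would first confirm the replica has fully drained, roughly like this (a sketch; checking lag alone may not be a complete guarantee):

-- on the replica
SHOW SLAVE STATUS\G
-- wait until Seconds_Behind_Master is 0 and Retrieved_Gtid_Set is fully
-- contained in Executed_Gtid_Set before doing any manual cleanup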

shlomi-noach commented 3 years ago

In your logs:

2020-09-23 13:02:12 ERROR Error 1146: Table 'table._users_ghc' doesn't exist

This shouldn't happen; I'm not sure how this came to be. Does the problem reproduce?

After it finished, I went to a tab with a MySQL console on the secondary, ran SHOW TABLES, and saw that the _gho and _ghc tables were still there. Since gh-ost had finished and exited successfully, I assumed they had been left behind accidentally, so I dropped the _ghc table (about 30 seconds after gh-ost finished). On the primary MySQL node I only have the _users_del table, and none of the _ghc or _gho tables. After that, slave replication broke with...

At least that part makes some sense: you dropped a table on a replica, and replication later broke because the primary (master) did still have that table, and when the table was finally dropped on the primary, the statement could not replicate because the table was not found on the replica. Now, I suspect there may have been significant replication lag.

I'm not sure I understand your conclusions around deleting rows from the _ghc tables. You're deleting rows on the replica, then inserting them again to force replication to recover... Rows in the _ghc tables should have nothing to do with the original problem, which is that the _users_ghc table itself could not be found.

Could you please paste your original CLI command (gh-ost -alter ...) and the full log, which shows which hosts gh-ost identifies as replica & master?