s4mur4i opened this issue 4 years ago
In your logs:
2020-09-23 13:02:12 ERROR Error 1146: Table 'table._users_ghc' doesn't exist
This shouldn't happen; I'm not sure how this came to be. Does the problem reproduce?
after it finished, I went to a tab with the secondary mysql console, did a show tables, and saw that the _gho and _ghc tables were there. As gh-ost finished and exited successfully I thought these were left there accidentally, so I dropped the _ghc table (about 30 seconds after gh-ost finished). On the primary mysql node I only have the _users_del table, none of the _ghc or _gho tables. After that slave replication broke with...
At least that part makes some sense: you dropped a table on a replica, and later on replication broke because the primary (master) did have that table, and when you finally dropped the table on the primary, the statement could not replicate because the table was not found on the replica.
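To make the ordering concrete, here is a minimal sketch of that failure mode, assuming a hypothetical database `appdb` (the name is illustrative, not taken from the logs above):

```sql
-- On the replica, executed by hand (outside replication):
DROP TABLE appdb._users_ghc;

-- The primary still has _users_ghc at this point, and binlog events that
-- reference it have not yet been applied on the replica. When such an event
-- arrives, it fails, e.g.:
--   Error 1146: Table 'appdb._users_ghc' doesn't exist
-- and the replication SQL thread stops until someone intervenes.
```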
Now, I suspect there may have been a significant replication lag.
I'm not sure I understand your conclusions around deleting rows from _ghc tables. You're deleting rows on the replica, then inserting them again to force replication to recover... Rows in _ghc tables should have nothing to do with the original problem, which is that the _users_ghc table could not be found.
Could you please paste your original CLI command (gh-ost -alter ...) and the full log, which shows which hosts gh-ost identifies as replica & master?
Hello,
I have a question regarding the cleanup phase of gh-ost. After running an alter, gh-ost exited successfully. Here are the logs from that run:
After it finished, I went to a tab with the secondary mysql console, did a show tables, and saw that the _gho and _ghc tables were there. As gh-ost finished and exited successfully, I thought these were left there accidentally, so I dropped the _ghc table (about 30 seconds after gh-ost finished). On the primary mysql node I only have the _users_del table, none of the _ghc or _gho tables. After that, slave replication broke with:

In the docs I couldn't really find any details about the cleanup phase. I found an issue asking how to detect whether the _ghc table is dead: https://github.com/github/gh-ost/issues/99 I was interested in what happens to the temp tables during teardown, and how. I found the function that does the dropping of the table for _ghc and others: https://github.com/github/gh-ost/blob/c940a85a28bad68878c5d1622aa7c4e595b35b38/go/logic/applier.go#L245

On another alter that was successful, I see the ERROR for the missing table as well:
In this case, teardown was completed and when checking show tables only the _del table was present.

For the changelog writer function: https://github.com/github/gh-ost/blob/c940a85a28bad68878c5d1622aa7c4e595b35b38/go/logic/applier.go#L279-L286 I see that it inserts into the table but updates it on a possible duplicate. I did one test to see what happens if, with healthy replication, I delete a row from the test slave and then run a query like:
I wanted to test whether it would update the row on the master and create the row on the slave, but I found out that it breaks replication with:
So after the drop, a recreation alone wouldn't help. Investigating deeper, I found one solution/workaround which would solve the issue: if I manually insert a row with the same id and start replication, it updates the row and replication continues successfully.
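A sketch of that experiment and workaround follows; the table `test.ghc_like` and its columns are hypothetical stand-ins for the changelog-shaped table used in the test, and the statements are mine rather than the originals from the report:

```sql
-- 1. On the replica, remove a row by hand:
DELETE FROM test.ghc_like WHERE id = 2;

-- 2. On the primary, write the row in the insert-or-update style of the linked
--    changelog writer:
INSERT INTO test.ghc_like (id, hint, value)
VALUES (2, 'heartbeat', '2020-09-23 13:05:00')
ON DUPLICATE KEY UPDATE last_update = NOW(), value = VALUES(value);
-- The row already exists on the primary, so with row-based replication this is
-- logged as an UPDATE row event; the replica cannot find the matching row and
-- the SQL thread stops, which is why recreating the table alone does not help.

-- 3. The workaround described above: put a row with the same id back on the
--    replica, then restart replication so the pending UPDATE has a row to hit.
INSERT INTO test.ghc_like (id, hint, value) VALUES (2, 'heartbeat', '');
START SLAVE;
```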
So in gh-ost's case: https://github.com/github/gh-ost/blob/c940a85a28bad68878c5d1622aa7c4e595b35b38/go/logic/applier.go#L272-L277 there are 3 possible "hints" or ids, and if we inserted 3 rows and started replication, then the slave would have applied the updates and, as a final step, dropped the table. Is this a correct assumption?
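If that assumption holds, the recovery might look roughly like the sketch below. Everything here is hedged: the DDL only approximates the changelog table's shape, and the seeded ids (1, 2, 3) are placeholders that would have to match the ids referenced by the still-queued row events:

```sql
-- Recreate the dropped changelog table on the replica (approximate shape):
CREATE TABLE appdb._users_ghc (
  id bigint AUTO_INCREMENT,
  last_update timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  hint varchar(64) NOT NULL,
  value varchar(4096) NOT NULL,
  PRIMARY KEY (id),
  UNIQUE KEY hint_uidx (hint)
);

-- Seed one row per hint; the ids here are placeholders and would need to match
-- the ids in the pending row events for the updates to apply:
INSERT INTO appdb._users_ghc (id, hint, value) VALUES
  (1, 'state', ''), (2, 'heartbeat', ''), (3, 'throttle', '');

-- Resume replication; pending changelog updates apply, and the replicated
-- drop of _users_ghc finally removes the table on the replica as well.
START SLAVE;
```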
Investigating changelog events further, I found there are 3 hints/states that the table handles:

- "state": I only found it here: https://github.com/github/gh-ost/blob/5e953b7e3eb13716e84a9c4017cc8add0648b8d0/go/logic/migrator.go#L544 It is used during the cut-over: https://github.com/github/gh-ost/blob/5e953b7e3eb13716e84a9c4017cc8add0648b8d0/go/logic/migrator.go#L587 So after the switch, it should not produce any more updates.
- "heartbeat": this one is async: https://github.com/github/gh-ost/blob/c940a85a28bad68878c5d1622aa7c4e595b35b38/go/logic/applier.go#L312 It runs in cycles, and I suspect an update from this cycle was still syncing and caused the issue.
- "throttle": comes from the throttle checker: https://github.com/github/gh-ost/blob/c07d08f8b58e170da7031624c1a8ec93e705d1c0/go/logic/throttler.go#L435 After cut-over, I don't really believe we would get a throttle event since the cleanup has already happened.
Are my findings about _ghc correct? Next time I know to wait longer for the slave to sync up and empty its events; I just wanted to understand the problem and a possible fix for the future.
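On the "wait longer" point, one way to confirm the replica has nothing left to apply before doing any manual cleanup is to check replication status on the replica first; this is plain MySQL, not gh-ost specific:

```sql
-- On the replica: both threads should be running, there should be no pending
-- error, and everything received should already be executed.
SHOW SLAVE STATUS\G
-- Look at: Slave_IO_Running: Yes, Slave_SQL_Running: Yes,
--          Seconds_Behind_Master: 0, Last_SQL_Error: (empty),
--          and Read_Master_Log_Pos equal to Exec_Master_Log_Pos.
```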