evanelias opened this issue 8 years ago
@evanelias thank you so much for offering to work on this! I'm happy to hand this over to you or collaborate with you as you see fit.
As for the options, it must be option (b). Reason is that --test-on-replica's premise is "test in production without affecting production". Hence, I wouldn't want anything done on the master; it must be kept untouched.
Purging binary logs on the replica is fine, and if we want to be strict about it, we may choose to ask the user for some --approve-deleting-binary-logs-on-a-replica-test-with-gtid-ish flag.
Cool, that makes sense re: going with option (b) since it only impacts the replica. I'll probably have some questions (especially around testing) when I start to dig in, hopefully sometime in November. Looking forward to contributing!
@shlomi-noach @evanelias
I'm pretty close to a PR for this. But this GTID surgery has me wondering. If we purge the GTIDs on the replica, it still won't be promotable right? If anything tries to replicate from it, they'll fail almost immediately with something along the lines of:
Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'The slave is connecting using CHANGE MASTER TO MASTER_AUTO_POSITION = 1, but the master has purged binary logs containing GTIDs that the slave requires.'
No?
If that's true, I'm wondering if instead we skip the bin log for all of the DML statements, but leave the changelog table and any changelog state changes. That would cut down the vast majority of bin log writing for scenarios that are disk space-sensitive.
But then we could also write the empty/dummy transactions to a new sql file (that a CLI switch specifies). That way we don't apply any surgery, or make changes to master. But they can apply the SQL file manually to any other server to make the replica promotable/replicatable again.
tl;dr I think we may still have problems with GTID surgery, but I'm not 100% sure. What if we cut out most binlog writing and write out a new SQL file that the admin can use to fix things?
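For illustration, here's roughly what such a generated SQL file could contain: one empty transaction per GTID that gh-ost created on the replica. This is only a sketch; the UUID and sequence numbers are placeholders, not anything gh-ost emits today.

    -- inject empty transactions for the gh-ost-created GTIDs (placeholder UUID and sequence numbers)
    SET gtid_next = '11111111-1111-1111-1111-111111111111:1';
    BEGIN; COMMIT;
    SET gtid_next = '11111111-1111-1111-1111-111111111111:2';
    BEGIN; COMMIT;
    -- ...one empty transaction per GTID to "forget"...
    SET gtid_next = 'AUTOMATIC';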
I believe the affected replica should still be promotable in typical situations. AFAIK you'd only get error 1236 if you're trying to repoint a severely lagging replica -- the error comes up if you've purged "binary logs containing GTIDs that the slave requires". Any replica that is relatively caught-up won't require anything from those purged binlogs. There's only a problem if there are transactions from the original master, that have made it to the gh-ost replica prior to the GTID surgery, but still haven't made it to any other replicas yet by the time of the promotion. (There's also a problem if there are other errant transactions that were previously made directly to the gh-ost replica... but errant transactions are a GTID promotion problem in general, so not specific to this.)
So realistically, that error should only happen if a promotion occurs very soon after gh-ost completes, and some replicas are lagging enough that they're still behind the point-in-time that the gh-ost run completed. That's definitely a valid concern (worth mentioning in docs at least), but IMO it should be rare enough that the surgery procedure is still preferable vs requiring manual execution of a sql file. That said, your sql file idea is definitely clever, maybe makes sense as an option/alternative? Another possibility would be for users to use another tool for inserting the empty transactions, e.g. Orchestrator.
I should also give the disclaimer that my deep GTID work was back in 2014-2016, and mostly with fb-mysql 5.6 and Percona Server 5.6. So it's definitely possible I'm forgetting something here, or may have inaccuracies vs upstream mysql, especially 5.7 or 8.0.
@evanelias You're right! I missed a key piece of your notes above:
Query @@global.gtid_executed and store in a variable, and then modify it so that the chunk corresponding to the replica's own server_uuid matches its prior state. i.e., figure out what its gtid_executed "should" be if the gh-ost transactions had not occurred.
The bolded part was quite important :)
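In other words, something like this on the replica before gh-ost starts writing (a minimal sketch):

    -- capture the replica's own UUID and its pre-migration GTID state
    SELECT @@global.server_uuid;
    SELECT @@global.gtid_executed;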
One question on "PURGE BINARY LOGS TO ...". PURGE BINARY LOGS BEFORE NOW(); seems to work alright. But I'm curious if that's just a false positive with my test data here. Any advice there? It saves me from having to do some querying of SHOW BINARY LOGS.
There's also a problem if there are other errant transactions that were previously made directly to the gh-ost replica... but errant transactions are a GTID promotion problem in general, so not specific to this.
Yeah this is definitely something I've thought about, because I've run into this scenario in production (a sensu script making sure a diagnostic table was available on all dbs periodically). One thought I've had is to define gtid_next before any binlogged statement we allow, with a gh-ost-generated UUID. That way we'll confidently trim gh-ost-only GTIDs. However, I'm a little iffy on that approach. I'm worried that I may run into some timing issues with the sequence-counter part of the GTID. But that's a bit of a stretch goal, and not the primary problem I'm looking to solve.
If there was a way to dynamically change the server_uuid portion of the automatic GTID generation for the session, that would be aces. But I don't see any way to do that.
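To make the idea concrete, this is roughly what I mean; the UUID, sequence number, and table name are hypothetical, and the caller would have to track the sequence counter itself, which is exactly the part I'm iffy about:

    -- tag a gh-ost write with a gh-ost-chosen UUID (all values hypothetical)
    SET gtid_next = '22222222-2222-2222-2222-222222222222:1';
    INSERT INTO test._sample_gho (id) VALUES (1);  -- stand-in for a real gh-ost statement
    SET gtid_next = 'AUTOMATIC';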
While I'm bugging you: I'm not seeing any documentation of GTID in MySQL 5.5. I can easily panic early if we notice they're wanting this surgery on v5.5 before gh-ost does anything, right?
PURGE BINARY LOGS BEFORE NOW()
That's probably fine, I'd just be a little paranoid about potential edge cases. If the last few steps all occur in the same clock second, does the first new binlog still get purged? Are there any risks related to system timezone differences? etc...
I'm not seeing any documentation of GTID in MySQL 5.5.
Correct, GTID was added in 5.6.
I can easily panic early if we notice they're wanting this surgery on v5.5 before gh-ost does anything, right?
I'd assume so. You'll get an immediate error in the pre-migration step on 5.5 anyway, since gtid_executed won't exist on 5.5.
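One cheap explicit pre-flight check, as a sketch (not necessarily how gh-ost would implement it):

    -- fails with "Unknown system variable" on 5.5; returns OFF/ON on 5.6+
    SELECT @@global.gtid_mode;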
Side note: when obtaining gtid_executed, it's best to use a SELECT, e.g. SELECT @@global.gtid_executed, rather than a SHOW command. The value can sometimes get rather long (after many promotions have occurred), and SHOW truncates long strings.
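For example (the second form is the one that can silently truncate):

    SELECT @@global.gtid_executed;               -- returns the full value, however long
    SHOW GLOBAL VARIABLES LIKE 'gtid_executed';  -- display-width limited; may truncate very long values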
and SHOW truncates long strings.
😮 😮 😮 TIL + Oh dear + need to check a few scripts now!!!
On closer inspection, that long-string-truncation problem might only apply to SHOW VARIABLES, but perhaps not SHOW SLAVE STATUS. Apologies if that caused confusion.
The manual mentions it for SHOW VARIABLES but doesn't say what the limit is:
SHOW VARIABLES is subject to a version-dependent display-width limit. For variables with very long values that are not completely displayed, use SELECT as a workaround
I vaguely remember the limit being 1024 chars, which seems to line up with #define SHOW_VAR_FUNC_BUFF_SIZE in mysql-server's include/mysql/status_var.h. It takes a lot of promotions for gtid_executed to hit that limit, but it is certainly possible.
More questions...
Do you all think this GTID surgery is safe for --allow-on-master? (per @shlomi-noach's description on #149).
--sql-log-bin would default true. When false, writes to ghost table will not be logged to binlogs. This may only work on...
--allow-on-master
And something that has been on my mind:
There's also a problem if there are other errant transactions that were previously made directly to the gh-ost replica... but errant transactions are a GTID promotion problem in general, so not specific to this.
There's already a little nervousness on doing this GTID surgery on a replica that may take writes directly during the migration. But I'm actually more worried about allowing this workflow on a master/writer that is being mutated with operational data.
We could still reduce the binary logging dramatically. But that feels like it leads to some weird situations where we swap to an otherwise empty table.
I'm wondering about allowing this no-sql-log-bin behavior only on replicas (via --test-on-replica and --migrate-on-replica).
Note: I am intentionally combining the behavior of --sql-log-bin=false and this GTID cleanup. I haven't really thought through if they should be separate concepts. Mainly because we can't go 100% --sql-log-bin=false without the GTID cleanup.
Hmm, when would the surgery be necessary on a master? If I understand gh-ost's behavior correctly, the surgery is only ever needed for --test-on-replica or --migrate-on-replica. (There's no notion of errant transactions on a master.)
Not sure I follow #149 mentioning combining --allow-on-master with --sql-log-bin=false either though, so I may just be missing something :) Is that combination intended just to speed up cases where there are no replicas?
Chiming in. The RESET MASTER operation is well understood and doable. You can see it in action in orchestrator.
Basically, if the migrated replica is not itself an intermediate to other replicas, it is possible, once migration is complete, to erase all errant GTIDs from that server. The server will be perfectly promotable.
PURGE BINARY LOGS BEFORE NOW() is not part of that flow. RESET MASTER will auto-purge all the binary logs and start a new one afresh.
There's a little bit of magic that orchestrator runs to identify the UUIDs of all ancestors, but that is not strictly needed with gh-ost. orchestrator wishes to avoid a specific ordering for probing the replication chain, but for gh-ost it makes perfect sense to run a sequential probe into the replication chain (actually we only need to probe the immediate master of the migrated replica) in order to be able to identify the required value for gtid_purged.
Hmm that was a long paragraph and perhaps not well written. It's really a blog post scale.
I only got a good grasp of all the logic involved last year, when I worked on automating said logic via orchestrator.
Thanks @shlomi-noach just trying to make sure I understand.
Basically, if the migrated replica is not itself an intermediate to other replicas, it is possible, once migration is complete, to erase all errant GTIDs from that server. The server will be perfectly promotable.
Ok, so that sounds like as long as I check for the existence of any replicas and bail out early (a todo in my PR), this GTID process should be okay on a master server as well.
PURGE BINARY LOGS BEFORE NOW() is not part of that flow. RESET MASTER will auto-purge all the binary logs and start a new one afresh.
Could you expand on this? Are you saying I should favor PURGE BINARY LOGS TO ... or that I don't even need to run any PURGE BINARY ... commands?
There's a little bit of magic that orchestrator runs to identify the UUIDs of all ancestors, but that is not strictly needed with gh-ost. orchestrator wishes to avoid a specific ordering for probing the replication chain, but for gh-ost it makes perfect sense to run a sequential probe into the replication chain (actually we only need to probe the immediate master of the migrated replica) in order to be able to identify the required value for gtid_purged. Hmm that was a long paragraph and perhaps not well written. It's really a blog post scale.
Sorry, I'm trying to follow here. Are you suggesting that we could rely on gtid_next for gh-ost when you mention the "sequential probe into the replication chain"?
Apologies for the confusion. Just trying to make sure I understand.
this GTID process should be okay on a master server as well.
Define "this GTID process"? You most definitely do not want to RESET MASTER
on a master, since that will fail all replicas.
And, there's no problem with migrating on master
, since that generates GTIDs by the master
, which is trivially a normal thing.
So, migrating on master -- there's no problem to fix, so let's not fix it.
Could you expand on this? Are you saying I should favor PURGE BINARY LOGS TO ... or that I don't even need to run any PURGE BINARY ... commands?
You should not use PURGE BINARY LOGS. I'm not sure where that statement came from; it is not needed, as it is superfluous. RESET MASTER purges all of the binary logs. The correct flow uses a RESET MASTER.
Sorry, I'm trying to follow here. Are you suggesting that we could rely on gtid_next for gh-ost when you mention the "sequential probe into the replication chain"?
Yeah, that paragraph of mine was terribly delivered. Basically what I mean is that when testing-on-replica or running-on-replica, gh-ost is easily able to determine what errant transactions were generated by the migration on that replica, and it's very easy to compute the set of statements that "forget" those errant transactions.
gtid_next has nothing to do with this.
The flow basically is to RESET MASTER, SET GLOBAL gtid_purged=<the-correct-value>, and change master to master_auto_position=1.
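A minimal sketch of that flow on the migrated replica, with replication stopped around it per the procedure discussed elsewhere in this thread (the gtid_purged value is just an illustrative placeholder for the correctly computed set):

    STOP SLAVE;
    RESET MASTER;                               -- clears gtid_executed, gtid_purged, and all binlogs
    SET GLOBAL gtid_purged = '33333333-3333-3333-3333-333333333333:1-1000';  -- placeholder value
    CHANGE MASTER TO MASTER_AUTO_POSITION = 1;
    START SLAVE;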
So, migrating on master -- there's no problem to fix, so let's not fix it.
Ah, I was tying this together with #149, which mentioned supporting --allow-on-master. To be clear, this process we're talking about here should only be allowed for --test-on-replica and --migrate-on-replica, correct?
You should not use PURGE BINARY LOGS. I'm not sure where that statement came from; it is not needed, as it is superfluous.
👍
... when testing-on-replica or running-on-replica, gh-ost is easily able to determine what errant transactions were generated by the migration on that replica, and it's very easy to compute the set of statements that "forget" those errant transactions.
Well, most of the time. If the company has a process that writes to the replicas directly then we'll end up cutting that out because it's hard to determine the GTIDs from gh-ost and the GTIDs from those statements since they both use the same server_uuid. For example, my previous team had a process running CREATE TABLE IF NOT EXISTS ... that ran in a cron across all servers.
Above I had suggested we could attempt to rely on gtid_next before each statement and supply our own server_uuid portion; that way we isolate the gh-ost statements.
The flow basically is to RESET MASTER, SET GLOBAL gtid_purged=<the-correct-value>, and change master to master_auto_position=1.
change master to master_auto_position=1
I trust you, just trying to grok this flow. What would happen if we didn't run this last statement? I'm not doing that right now and everything seemed to be okay, but I entirely recognize I may be missing an important scenario.
You should not use PURGE BINARY LOGS. I'm not sure where that statement came from; it is not needed, as it is superfluous
@shlomi-noach it came from my original proposed process here at the top of #254. PURGE BINARY LOGS is used in two places, for two different reasons, although both relate to edge cases I have actually encountered:
The first purge is to speed up the subsequent RESET MASTER. My thinking is:
* Replication needs to be stopped while we do the RESET MASTER and set gtid_purged.
* RESET MASTER can be very slow on filesystems with slow deletion of large files (e.g. ext3).
* By purging the old binlogs up front, while replication is still running, the later RESET MASTER is guaranteed to be fast, and we don't need to keep replication stopped for very long.
The second purge relates to issues with binlog_gtid_simple_recovery being enabled. On that option's MySQL 5.6 manual page, see the 2nd bullet of the blue Note:
If this option is enabled, gtid_executed and gtid_purged may be initialized incorrectly in
the following situations:
...
* A SET GTID_PURGED statement was issued after the oldest existing binary log was
generated.
If an incorrect GTID set is computed in either situation, it will remain incorrect even if
the server is later restarted, regardless of the value of this option.
To handle this situation, it is necessary to ensure the oldest binlog has the correct previous-gtid-executed header. This can be accomplished by flushing and then purging the logs, after we've already set gtid_purged. Basically we just want to eliminate the first new binlog, which has an empty previous-gtid-executed header.
This doesn't appear to be an issue in mysql 5.7.8+; more on that below.
Two other related things to mention:
I'm hesitant about using the BEFORE NOW() variant on the 2nd purge; see https://github.com/github/gh-ost/issues/254#issuecomment-502524514. But I haven't tested. When I've implemented this surgery in other codebases, there was already a helper method to get the list of binlogs, so it was easy to use PURGE BINARY LOGS TO '<current binlog file>', which feels more precise (see the sketch after the next point).
MySQL 5.7 introduced a mysql.gtid_executed system table, which is an extra point of persistence, described in detail in the manual. Its existence may mean there are extra or different steps involved in 5.7 and 8.0; I haven't worked through this, but would encourage additional testing and experimentation. Also note there are some terrible known shortcomings with that system table. fb-mysql used a different implementation to avoid this (basically exactly what JFG later proposed), but I don't know of any upstream equivalent.
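Regarding the first point above, the "more precise" purge could look something like this; the binlog file name is illustrative, taken from the File column of SHOW MASTER STATUS:

    SHOW MASTER STATUS;                         -- e.g. File: mysql-bin.000042
    PURGE BINARY LOGS TO 'mysql-bin.000042';    -- delete everything older than the current binlog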
actually we only need to probe the immediate master of the migrated replica) in order to be able to identify the required value for gtid_purged
Why not just use the value of gtid_executed on the migrated replica prior to the gh-ost operation? (rather than having to probe the immediate master)
Ok, so that sounds like as long as I check for the existence of any replicas and bail out early (a todo in my PR), this GTID process should be okay on a master server as well.
@zmoazeni The process should not be run on any instance with replicas, and should not be run on any instance taking direct writes. I would recommend having safety checks for both, as otherwise the GTID surgery is catastrophically bad. For the latter check, perhaps confirm the instance is read_only and/or that the portion of gtid_executed corresponding to the server's own uuid does not grow after a few seconds (just as a sanity-check against superuser writes).
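A rough sketch of what such checks could look like (the processlist check for downstream replicas is an assumption on my part, not something stated above):

    SELECT @@global.read_only;            -- expect 1: the instance should not be taking direct writes
    SELECT COUNT(*) FROM information_schema.PROCESSLIST
      WHERE COMMAND LIKE 'Binlog Dump%';  -- expect 0: the instance should have no replicas of its own
    SELECT @@global.gtid_executed;        -- sample, wait a few seconds, sample again, and verify the
                                          -- server's own-UUID component has not grown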
If the company has a process that writes to the replicas directly then we'll end up cutting that out because it's hard to determine the GTIDs from gh-ost and the GTIDs from those statements since they both use the same server_uuid. For example, my previous team had a process running CREATE TABLE IF NOT EXISTS ... that ran in a cron across all servers
Direct writes to replicas should almost certainly be run with SQL_LOG_BIN=0... otherwise they are introducing errant transactions, which I'd view as user error / buggy automation / not something that gh-ost is responsible for handling.
Side note, I'll give the PR a quick scan momentarily. Just a quick warning that my comments on this issue and PR may be a bit brief or delayed this week, as I'm in the middle of launching a beta of a new product as we speak :)
Well, most of the time. If the company has a process that writes to the replicas directly then we'll end up cutting that out because it's hard to determine the GTIDs from gh-ost and the GTIDs from those statements since they both use the same server_uuid. For example, my previous team had a process running CREATE TABLE IF NOT EXISTS ... that ran in a cron across all servers.
@zmoazeni in that case, the replica is already "dirty" with errant GTIDs, as part of your norm. If you wanted to promote it as master you'd have a GTID problem anyway. How does gh-ost make that any better or worse?
change master to master_auto_position=1 I trust you, just trying to grok this flow. What would happen if we didn't run this last statement? I'm not doing that right now and everything seemed to be okay, but I entirely recognize I may be missing an important scenario.
I'm wondering what show slave status\G output you have. What do you have for Auto_position?
@evanelias OMG https://www.skeema.io/ci/ !!!
The first purge is to speed up the subsequent RESET MASTER.
@evanelias ahhhh... that makes perfect sense! Thank you for going into detail!
I confess I don't understand the 5.6 issue:
A SET GTID_PURGED statement was issued after the oldest existing binary log was generated.
If I run a RESET MASTER, then what does "oldest existing binary log" even mean? Immediately at RESET MASTER the binary logs are removed, and a mysql-bin.000001 is created. By definition the set global gtid_purged is then issued after mysql-bin.000001 (now the oldest binary log file) is created.
Why not just use the value of gtid_executed on the migrated replica prior to the GTID surgery? (rather than having to probe the immediate master)
You are right. I was still thinking of orchestrator-related scenarios. That's true. Knowing the UUID of the replica is enough, and gtid_executed minus the replica's own UUID is enough.
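For example, assuming the replica had no transactions under its own server_uuid before the migration, the value to feed into gtid_purged could be computed like this. The huge sequence range is just a crude stand-in for "everything that originated on this server"; a real implementation would more likely parse the set and drop the own-UUID component exactly:

    SELECT GTID_SUBTRACT(
             @@global.gtid_executed,
             CONCAT(@@global.server_uuid, ':1-999999999999')
           ) AS value_for_gtid_purged;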
If I run a RESET MASTER, then what does "oldest existing binary log" even mean? Immediately at RESET MASTER the binary logs are removed, and a mysql-bin.000001 is created. By definition the set global gtid_purged is then issued after mysql-bin.000001 (now the oldest binary log file) is created.
@shlomi-noach that's correct; but the issue is that mysql-bin.000001 has a blank previous-gtid-executed header, which confuses binlog_gtid_simple_recovery in 5.6:
* Upon startup, mysqld needs to reconstruct gtid_executed and gtid_purged. In 5.6, the binary logs are the only persisted source of truth for gtid_executed and gtid_purged.
* There are two ways mysqld can compute gtid_purged in 5.6:
  * With binlog_gtid_simple_recovery disabled (the 5.6 default), mysqld scans through the binary logs to compute gtid_purged. This is safe in all situations, but is potentially very slow.
  * With binlog_gtid_simple_recovery enabled, mysqld only reads the previous-gtid header of the oldest remaining binlog to compute gtid_purged. Ordinarily this makes sense: The list of purged GTIDs is equal to the list of GTIDs that occurred prior to the oldest remaining binlog file. But this optimization does the wrong thing in one specific situation: if a SET gtid_purged = ... was used at some point after the creation of that oldest binlog file. In that case, upon restarting mysqld, gtid_purged will be empty when it should not be.
So to fix this in gh-ost, after running SET gtid_purged = ... we want to end up in a situation where the oldest binlog has a non-empty previous-gtid header. In other words, we need to persist the gtid_purged value that we just set. We can do this by rolling to a new binlog (FLUSH LOCAL BINARY LOGS) -- so it creates mysql-bin.000002, which will have a correct previous-gtid header. And then we delete mysql-bin.000001 (PURGE BINARY LOGS ...) so that mysql-bin.000002 is now the oldest existing binary log, which will be consulted if a restart occurs soon after.
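Concretely, the tail end of that procedure could look like this (file names and the gtid_purged value are illustrative placeholders):

    SET GLOBAL gtid_purged = '33333333-3333-3333-3333-333333333333:1-1000';
    FLUSH LOCAL BINARY LOGS;                  -- rolls to mysql-bin.000002, whose previous-gtid header now reflects the value above
    PURGE BINARY LOGS TO 'mysql-bin.000002';  -- drop mysql-bin.000001, which has an empty previous-gtid header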
I believe this is only necessary in 5.6, and only if binlog_gtid_simple_recovery=1 (which isn't the default in 5.6). That situation was much more common when I created this issue in 2016 :) Nowadays it might be rare enough to just call out in the docs, instead of automating around it. Then again, the impact is pretty bad if this situation is hit and upon restart gtid_purged is miscomputed. I forget whether replication breaks or whether the replica starts re-executing old transactions, but either way it is a bad time and the replica is trashed 😱
we need to persist the gtid_purged value that we just set. We can do this by rolling to a new binlog (FLUSH LOCAL BINARY LOGS) -- so it creates mysql-bin.000002 which will have a correct previous-gtid header. And then we delete mysql-bin.000001 (PURGE BINARY LOGS ...) so that mysql-bin.000002 is now the oldest existing binary log, which will be consulted if a restart occurs soon after.
Ha, OK, again thanks for explaining -- OK cool, I think that's a relatively simple flow to automate. We can choose to do that either conditionally on, or regardless of, the version and the binlog_gtid_simple_recovery value.
As Shlomi mentioned in his recent blog post, GTID is problematic with --test-on-replica or --migrate-on-replica because the replica will no longer be promotable due to having extra gtid_executed transactions. As alluded to in the post, there are workarounds for this. On the GTID side of things, I can think of a couple different ways:
(a) Insert empty transactions with correct gtid_next on the master (and replicating down to all replicas), so that gtid_executed of all nodes now effectively matches the modified replica's; or
(b) Adjust the modified replica's gtid_executed via convoluted RESET MASTER song-and-dance.
Both solutions require SUPER. Hopefully there are RDS equivalents but this needs more research.
If there's interest, I might be able to spend some time on a PR in a few months (just not in Oct, hoping to release another project before the Percona Live submission deadline :) Any guidance on which solution is more palatable?
Option (a) is a bit cumbersome in the --migrate-on-replica case, since it may require a very large number of empty transactions to be inserted, and it involves interacting with the master. OTOH, this could be implemented as a general-purpose gtid_executed-syncing external tool which gh-ost just shells out to at the appropriate time.
Option (b) limits its impact to just the modified replica. The drawbacks are that all binlogs on the replica get deleted (probably ok since they're polluted with all those extra writes anyway); and the modified replica must not have replicas of its own. The automated procedure is something like this:
BEFORE MIGRATION:
1. Query @@global.gtid_executed and store in a variable, and then modify it so that the chunk corresponding to the replica's own server_uuid matches its prior state. i.e., figure out what its gtid_executed "should" be if the gh-ost transactions had not occurred.
AFTER MIGRATION:
1. PURGE BINARY LOGS TO '<current log file>' to delete all old binlogs. This is just done here first so that the later RESET MASTER step is fast, in cases where there are lots of large binlogs and filesystems where deletion is slow
2. STOP SLAVE -- so that we can get a consistent position. The steps after are fast so replication won't need to be stopped for more than a split second.
3. RESET MASTER -- this will nuke gtid_executed and gtid_purged, along with the one remaining current binlog
4. SET GLOBAL GTID_PURGED = '<gtid_executed value calculated from prior step>' -- this also inherently sets gtid_executed
5. FLUSH LOCAL BINARY LOGS -- roll to a new binlog immediately, so that the gtid_executed value gets persisted in 5.6, which weirdly uses the binlogs as the only persisted source of truth
6. PURGE BINARY LOGS TO '<current binlog file>' -- erase the first binlog file in case binlog_gtid_simple_recovery is in use, since it would cause issues upon mysqld restart
7. START SLAVE (unless leaving this as the user's responsibility, I think that's already the assumption in --test-on-replica)