openark / orchestrator

MySQL replication topology management and HA

Orchestrator Recovery Time (version 3.0.11) #648

Open rafael opened 6 years ago

rafael commented 6 years ago

Description

I discussed this a long time ago with @shlomi-noach in Slack and I finally got the chance to start looking into it. I'm trying to better understand the expected time it should take orchestrator to recover a dead master.

In our setup, I'm seeing two patterns in how long it takes to recover a dead master, and I wanted to check whether these are the expected times or whether there are ways to shorten them.

Pattern 1:

| Time | Event |
| --- | --- |
| 11:07:35 | Master is Dead from App perspective |
| 11:07:42 | UnreachableMaster |
| 11:07:46 | DeadMaster |
| 11:07:58 | DeadMaster Recovered |

Total Time: 23s

Pattern 2:

| Time | Event |
| --- | --- |
| 16:23:45 | Master is Dead from App perspective |
| 16:23:52 | UnreachableMaster |
| 16:23:55 | DeadMaster |
| 16:23:57 | DeadMaster Recovered |

Total Time: 12s

From my observations in these tests, the time that it takes orchestrator to declare a DeadMaster is pretty stable (around 10s).

The time it takes to recover seems to vary, and I would love to understand this a bit better. Looking at the audit log, when recovery takes longer, the time is spent in these steps:

| 2018-10-05 11:07:46 | RecoverDeadMaster: regrouping replicas via GTID |
| --- | --- |
| 2018-10-05 11:07:58 | RecoverDeadMaster: 0 postponed functions |

Other notes:

I tried what we discussed in Slack:

change master to master_heartbeat_period=5

But that didn't make a difference in the detection time.

shlomi-noach commented 6 years ago

@rafael both of these times fall within our acceptable failover time expectations, though we tend to land more on the 12sec side than on the 23sec side.

> change master to master_heartbeat_period=5

A 10sec detection time is OK but can be made better; we see detection times as low as 5-7sec. You have updated the heartbeat to 5 on all servers, correct? In chat I suggested setting slave_net_timeout=5, which implicitly sets master_heartbeat_period to half that value. That would shave off a couple more seconds. Did you also set MASTER_CONNECT_RETRY to 1? To elaborate: with both of these settings in place, a replica will detect that the master is gone within the heartbeat interval, plus the 1sec it takes to attempt to reconnect to the master, after which it will most surely claim "I'm broken".
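
To make that concrete, here is a minimal sketch of the replica-side settings being discussed (the explicit MASTER_HEARTBEAT_PERIOD value is illustrative, not from the thread):

```sql
-- Sketch of the replica-side settings discussed above (illustrative values).
SET GLOBAL slave_net_timeout = 5;   -- replica considers the connection dead after 5s of silence

STOP SLAVE;
CHANGE MASTER TO
  MASTER_HEARTBEAT_PERIOD = 2.5,    -- half of slave_net_timeout, set explicitly here for clarity
  MASTER_CONNECT_RETRY = 1;         -- retry the master connection after 1s instead of the 60s default
START SLAVE;
```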

As you work with GTID, and time is spent on regrouping replicas via GTID, I have some hints for you:

(To contrast: with Pseudo-GTID, orchestrator records and caches heuristic "injection points" into the binary logs, and can jump directly to a heuristic last section of the log.)

So: the time it takes to run regrouping replicas via GTID basically depends on the size of the binary logs.

Easy answer (which I have done before): the standard binary log size is 1G. Reduce it to 100MB. You'll get 10x as many binary logs at a smaller size, and 100MB is small enough for the GTID scan to be fast.
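
For reference, a quick way to see where a server stands relative to that suggestion (plain MySQL introspection, nothing orchestrator-specific):

```sql
-- Check the configured binlog size cap (defaults to 1G) and the size of each existing binlog file.
SHOW VARIABLES LIKE 'max_binlog_size';
SHOW BINARY LOGS;   -- one row per binary log file, including its size in bytes
```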

Also, do you happen to have a delayed replica among those replicas which were regrouped? I incidentally optimized that in https://github.com/github/orchestrator/pull/641 (not yet merged).

rafael commented 6 years ago

@shlomi-noach - Correct, I updated all the servers. I will check MASTER_CONNECT_RETRY. Thank you for these pointers. I'll play with them and report back.

Oh, I was actually thinking the same about the GTID issue. Could we apply the same trick we did in Vitess in https://github.com/vitessio/vitess/issues/4161 and flush binary logs before we try to regroup GTIDs?

rafael commented 6 years ago

In the tests I did, the replicas were not delayed, but I think the binlogs were big. I will try to reproduce this situation.

shlomi-noach commented 6 years ago

> Could we apply the same trick we did in Vitess in vitessio/vitess#4161 and flush binary logs before we try to regroup GTIDs?

That won't help you at the time of failover, since there are no incoming writes to populate the newly rotated binary logs.

rafael commented 6 years ago

Ah yes. Forgot about that detail.

shlomi-noach commented 6 years ago

You can either configure the binary logs to be 100MB, or routinely flush them via cron/daemon as soon as they exceed 100MB (just make sure to SET SQL_LOG_BIN=0).
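
A minimal sketch of that routine-flush approach, assuming something outside MySQL (cron or a daemon) has already decided the current binlog has grown past ~100MB:

```sql
-- Rotate the binary log without recording this maintenance work in the binlog,
-- per the SET SQL_LOG_BIN=0 advice above; SQL_LOG_BIN only affects this session.
SET SESSION SQL_LOG_BIN = 0;
FLUSH BINARY LOGS;   -- close the current binary log and open a fresh, empty one
SET SESSION SQL_LOG_BIN = 1;
```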

shlomi-noach commented 5 years ago

BTW you can ask orchestrator to flush logs on a server: `orchestrator -c flush-binary-logs -i <instance>`

rafael commented 5 years ago

Thanks @shlomi-noach. Yes! We've used that one in the past.

Our slave_net_timeout was too high, and with a more careful setup I was able to reproduce some slow detection times similar to the ones we've seen in prod.

We are in the process of rolling out a change to set slave_net_timeout to 5s in prod.

Still thinking what to do about the binlogs issue.

I'll make sure to update here if we see improvements the next time we get a failure!

shlomi-noach commented 5 years ago

Looking forward to your update.

I got to thinking. Since the binlog size is a read-only variable in MySQL (you need to restart the server to apply it), it would be reasonable, in my opinion, to have orchestrator auto-flush binary logs that exceed a configured size. It's also easy to solve via a cronjob, but orchestrator is already there, is already observing the binlog status, and already has the privileges to rotate logs...