openark / orchestrator

MySQL replication topology management and HA
Apache License 2.0

Some questions of failover. #419

Open Xinglao4 opened 6 years ago

Xinglao4 commented 6 years ago

Hi @shlomi-noach, when the MySQL master goes down, how can I know which slave will become the new master? Is it completely random? Firstly, can I prevent some slaves from becoming the new master? I have set up the configuration file like this, but the node named "mysql-sredb06.xh" was still promoted:

 "RecoveryIgnoreHostnameFilters": [
    "mysql-sredb05.xh",
    "mysql-sredb06.xh"
     ],

Recovered from DeadMaster on mysql-sredb04.xh:3306. Failed: mysql-sredb04.xh:3306; Promoted: mysql-sredb06.xh:3306

Secondly, can I prevent promotion when the lag is above a threshold value? And is there any method for filling in the data that is missing on the slave?

Thanks for your answer.

shlomi-noach commented 6 years ago

> how can I know which slave would be the new master? Is it completely random?

It's absolutely not random, but it depends a lot on the state of the topology. Some discussion is at http://code.openark.org/blog/mysql/whats-so-complicated-about-a-master-failover (see "Who to promote?"). Also see https://github.com/github/orchestrator/blob/master/docs/topology-recovery.md#discussion-recovering-a-dead-master, and https://speakerdeck.com/shlominoach/reliable-crash-detection-and-failover-with-orchestrator#18 and onwards.

> Firstly, can I prevent some slaves from becoming the new master?

Please see https://github.com/github/orchestrator/blob/master/docs/deployment.md#adding-promotion-rules. Also supported, but not as recommended, is to list them in the config. If you've used PromotionIgnoreHostnameFilters (not RecoveryIgnoreHostnameFilters) and the host still gets promoted, then it's a bug. Please open a separate issue and give all relevant details (what the topology looks like, as much of the config as possible, what happens, etc.).
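
For illustration, a sketch of what the config-based variant might look like, reusing the two hostnames from the question. It assumes the setting goes in the same JSON configuration file as the fragment posted above, and that each entry is treated as a pattern matched against replica hostnames; check the configuration docs for your orchestrator version before relying on it.

```json
{
  "PromotionIgnoreHostnameFilters": [
    "mysql-sredb05.xh",
    "mysql-sredb06.xh"
  ]
}
```

The promotion rules described in the linked deployment doc remain the recommended approach; the config list is the fallback mentioned above.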

> Secondly, can I prevent promotion when the lag is above a threshold value?

Please see FailMasterPromotionIfSQLThreadNotUpToDate, https://github.com/github/orchestrator/blob/master/docs/configuration-recovery.md#promotion-actions
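
A minimal sketch of that setting, assuming it sits in the same JSON configuration file as the other recovery options. It takes a boolean value rather than a lag threshold.

```json
{
  "FailMasterPromotionIfSQLThreadNotUpToDate": true
}
```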

> And is there any method for filling in the data that is missing on the slave?

There isn't.

Xinglao4 commented 6 years ago

> Please see FailMasterPromotionIfSQLThreadNotUpToDate, https://github.com/github/orchestrator/blob/master/docs/configuration-recovery.md#promotion-actions

As I understand it, this parameter is only a yes/no choice and cannot distinguish between large and small lag. Is that right?

Another question: the parameter values are configured on every orchestrator node. When I want to change a parameter, I need to make the same modification on all nodes. Is there a more convenient method?

shlomi-noach commented 6 years ago

> As I understand it, this parameter is only a yes/no choice and cannot distinguish between large and small lag. Is that right?

Correct.

> Another question: the parameter values are configured on every orchestrator node. When I want to change a parameter, I need to make the same modification on all nodes. Is there a more convenient method?

Sorry, can you please rephrase? I'm not sure I fully understand. But perhaps I should also note that people typically use Puppet, Chef, Ansible, etc. to manage configuration, and applying a parameter change to multiple boxes isn't much of a pain.