Open Xinglao4 opened 6 years ago
I must apologize for misleading you. RecoveryIgnoreHostnameFilters does not indicate servers which cannot be promoted. It indicates servers for which analysis is skipped/ignored.
The parameter you're looking for is PromotionIgnoreHostnameFilters.
Regardless, I advise using a dynamic approach of orchestrator -c register-candidate -i mysql-sredb06.yp:3306 -promotion-rule must_not
I see. Now I've configured it like this:
"PromotionIgnoreHostnameFilters": [
"mysql-sredb06.yp",
"mysql-sredb05.yp"
],
"FailureDetectionPeriodBlockMinutes": 1,
"RecoveryPeriodBlockSeconds": 10,
I tested twice. The first time was OK, but the second time there was still a problem.
2018-03-05 15:38:14 Detected DeadMaster on mysql-sredb03.yp:3306. Affected replicas: 1
2018-03-05 15:38:14 Will recover from DeadMaster on mysql-sredb03.yp:3306
2018-03-05 15:38:25 Will recover from DeadMaster on mysql-sredb03.yp:3306
2018-03-05 15:38:26 Detected DeadMaster on mysql-sredb03.yp:3306. Affected replicas: 1
2018-03-05 15:38:26 Recovered from DeadMaster on mysql-sredb03.yp:3306. Failed: mysql-sredb03.yp:3306; Promoted: mysql-sredb07.yp:3306
2018-03-05 15:38:26 (for all types) Recovered from DeadMaster on mysql-sredb03.yp:3306. Failed: mysql-sredb03.yp:3306; Successor: mysql-sredb07.yp:3306
2018-03-05 15:38:27 Recovered from DeadMaster on mysql-sredb03.yp:3306. Failed: mysql-sredb03.yp:3306; Promoted: mysql-sredb07.yp:3306
2018-03-05 15:38:27 (for all types) Recovered from DeadMaster on mysql-sredb03.yp:3306. Failed: mysql-sredb03.yp:3306; Successor: mysql-sredb07.yp:3306
2018-03-05 15:43:08 Detected DeadMaster on mysql-sredb03.yp:3306. Affected replicas: 3
2018-03-05 15:43:08 Will recover from DeadMaster on mysql-sredb03.yp:3306
2018-03-05 15:43:10 Detected DeadMaster on mysql-sredb03.yp:3306. Affected replicas: 1
2018-03-05 15:43:10 Recovered from DeadMaster on mysql-sredb03.yp:3306. Failed: mysql-sredb03.yp:3306; Promoted: mysql-sredb05.yp:3306
2018-03-05 15:43:10 (for all types) Recovered from DeadMaster on mysql-sredb03.yp:3306. Failed: mysql-sredb03.yp:3306; Successor: mysql-sredb05.yp:3306
Are there any other parameters that need to be configured?
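To compare runs like the two above, the quoted log lines can be checked mechanically. This is a minimal sketch in Python, assuming the log format shown in this thread. `promoted_hosts` and `unexpected_promotions` are hypothetical helper names, not part of orchestrator, and the check is an exact hostname comparison rather than orchestrator's own filter logic.

```python
import re

# Matches the "Recovered from DeadMaster ... Promoted: host:port" lines quoted in this thread.
# The "(for all types)" variants are intentionally not matched.
PROMOTED_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) Recovered from DeadMaster on \S+\. "
    r"Failed: \S+; Promoted: (?P<host>[^:\s]+):(?P<port>\d+)$"
)

def promoted_hosts(log_text):
    """Return (timestamp, hostname) for every 'Promoted:' line, in log order."""
    result = []
    for line in log_text.splitlines():
        match = PROMOTED_RE.match(line.strip())
        if match:
            result.append((match.group("ts"), match.group("host")))
    return result

def unexpected_promotions(log_text, forbidden):
    """Return promotions whose hostname is in the forbidden set (exact match, an assumption)."""
    return [(ts, host) for ts, host in promoted_hosts(log_text) if host in forbidden]

SAMPLE_LOG = """\
2018-03-05 15:38:26 Recovered from DeadMaster on mysql-sredb03.yp:3306. Failed: mysql-sredb03.yp:3306; Promoted: mysql-sredb07.yp:3306
2018-03-05 15:38:26 (for all types) Recovered from DeadMaster on mysql-sredb03.yp:3306. Failed: mysql-sredb03.yp:3306; Successor: mysql-sredb07.yp:3306
2018-03-05 15:43:10 Recovered from DeadMaster on mysql-sredb03.yp:3306. Failed: mysql-sredb03.yp:3306; Promoted: mysql-sredb05.yp:3306
"""

FORBIDDEN = {"mysql-sredb06.yp", "mysql-sredb05.yp"}

for ts, host in unexpected_promotions(SAMPLE_LOG, FORBIDDEN):
    print(ts, "promoted a host that was expected to be ignored:", host)
```

Run against the excerpt above, this flags only the 15:43:10 promotion of mysql-sredb05.yp.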
Your two experiments above look different, and so are not comparable. See the Affected replicas: 1 as compared to Affected replicas: 3.
Can you please repeat and dump the topology before the operation? Also, are you acknowledging the recoveries?
2018-03-05 15:43:08 Detected DeadMaster on mysql-sredb03.yp:3306. Affected replicas: 3
2018-03-05 15:43:08 Will recover from DeadMaster on mysql-sredb03.yp:3306
2018-03-05 15:43:10 Detected DeadMaster on mysql-sredb03.yp:3306. Affected replicas: 1
These log lines were printed during the same experiment, but the Affected replicas value changed. The number of slaves is actually always three, and I don't know why the Affected replicas value changed.
Also, are you acknowledging the recoveries?
As I understand it, acknowledgement is part of the anti-flapping mechanism, and I only need to acknowledge recoveries when a block is in effect. I have set RecoveryPeriodBlockSeconds to 10. Is there still a block?
Can you please repeat and dump the topology before the operation? Also, are you acknowledging the recoveries?
[root@mysql-sredb03.xh ~]# orchestrator-client -c ack-cluster-recoveries -alias mysql-sredb03
orchestrator-client[4370]: reason must be provided
[root@mysql-sredb03.xh ~]# orchestrator-client -c ack-cluster-recoveries -alias mysql-sredb03 --reason="dba has taken taken necessary steps"
/usr/local/bin/orchestrator-client: illegal option -- -
0
orchestrator-client -c ack-cluster-recoveries -alias mysql-sredb03 -reason="dba has taken taken necessary steps"
0
Sorry, how can I acknowledge the recoveries? Is it OK when it returns '0'?
The initial topology of this cluster is like this:
[root@mysql-sredb03.xh ~]# orchestrator-client -c topology -i mysql-sredb03
mysql-sredb03.yp:3306 [0s,ok,5.6.24-72.2-log,rw,ROW,>>,GTID]
+ mysql-sredb05.yp:3306 [0s,ok,5.6.24-72.2-log,rw,ROW,>>,GTID]
+ mysql-sredb06.yp:3306 [0s,ok,5.6.24-72.2-log,rw,ROW,>>,GTID]
+ mysql-sredb07.yp:3306 [0s,ok,5.6.24-72.2-log,rw,ROW,>>,GTID]
Then I stopped MySQL on the master, and the topology changed as follows:
[root@mysql-sredb03.xh ~]# orchestrator-client -c topology -i mysql-sredb03
mysql-sredb03.yp:3306 [unknown,invalid,5.6.24-72.2-log,rw,ROW,>>,GTID]
[root@mysql-sredb03.xh ~]# orchestrator-client -c topology -i mysql-sredb07
mysql-sredb07.yp:3306 [0s,ok,5.6.24-72.2-log,rw,ROW,>>,GTID]
+ mysql-sredb05.yp:3306 [0s,ok,5.6.24-72.2-log,rw,ROW,>>,GTID]
+ mysql-sredb06.yp:3306 [0s,ok,5.6.24-72.2-log,rw,ROW,>>,GTID]
[root@mysql-sredb03.xh ~]# orchestrator-client -c topology -i mysql-sredb03
mysql-sredb03.yp:3306 [0s,ok,5.6.24-72.2-log,rw,ROW,>>,GTID]
- mysql-sredb07.yp:3306 [null,nonreplicating,5.6.24-72.2-log,rw,ROW,>>,GTID]
+ mysql-sredb05.yp:3306 [0s,ok,5.6.24-72.2-log,rw,ROW,>>,GTID]
+ mysql-sredb06.yp:3306 [0s,ok,5.6.24-72.2-log,rw,ROW,>>,GTID]
[root@mysql-sredb03.xh ~]# orchestrator-client -c topology -i mysql-sredb03
mysql-sredb03.yp:3306 [0s,ok,5.6.24-72.2-log,rw,ROW,>>,GTID]
+ mysql-sredb05.yp:3306 [0s,ok,5.6.24-72.2-log,rw,ROW,>>,GTID]
+ mysql-sredb06.yp:3306 [0s,ok,5.6.24-72.2-log,rw,ROW,>>,GTID]
+ mysql-sredb07.yp:3306 [0s,ok,5.6.24-72.2-log,rw,ROW,>>,GTID]
This time the process was OK (although the Affected replicas value still changed), and the instance mysql-sredb07.yp was promoted.
2018-03-06 14:27:51 Detected DeadMaster on mysql-sredb03.yp:3306. Affected replicas: 3
2018-03-06 14:27:51 Will recover from DeadMaster on mysql-sredb03.yp:3306
2018-03-06 14:27:53 Detected DeadMaster on mysql-sredb03.yp:3306. Affected replicas: 1
2018-03-06 14:27:53 Recovered from DeadMaster on mysql-sredb03.yp:3306. Failed: mysql-sredb03.yp:3306; Promoted: mysql-sredb07.yp:3306
2018-03-06 14:27:53 (for all types) Recovered from DeadMaster on mysql-sredb03.yp:3306. Failed: mysql-sredb03.yp:3306; Successor: mysql-sredb07.yp:3306
Then I tested a second time. First, I acknowledged the recoveries via the command line interface:
[root@mysql-sredb03.xh ~]# date
Tue Mar  6 14:29:15 CST 2018
[root@mysql-sredb03.xh ~]# orchestrator-client -c topology -i mysql-sredb03
mysql-sredb03.yp:3306 [0s,ok,5.6.24-72.2-log,rw,ROW,>>,GTID]
+ mysql-sredb05.yp:3306 [0s,ok,5.6.24-72.2-log,rw,ROW,>>,GTID]
+ mysql-sredb06.yp:3306 [0s,ok,5.6.24-72.2-log,rw,ROW,>>,GTID]
+ mysql-sredb07.yp:3306 [0s,ok,5.6.24-72.2-log,rw,ROW,>>,GTID]
[root@mysql-sredb03.xh ~]# orchestrator-client -c ack-cluster-recoveries -alias mysql-sredb03
orchestrator-client[4370]: reason must be provided
[root@mysql-sredb03.xh ~]# orchestrator-client -c ack-cluster-recoveries -alias mysql-sredb03 --reason="dba has taken taken necessary steps"
/usr/local/bin/orchestrator-client: illegal option -- -
0
[root@mysql-sredb03.xh ~]# rchestrator-client -c ack-cluster-recoveries -alias mysql-sredb03 -reason="dba has taken taken necessary steps"
-bash: rchestrator-client: command not found
[root@mysql-sredb03.xh ~]# orchestrator-client -c ack-cluster-recoveries -alias mysql-sredb03 -reason="dba has taken taken necessary steps"
0
Second, I stopped MySQL on the master, and the instance mysql-sredb05.yp was promoted:
2018-03-06 14:34:34 Detected DeadMaster on mysql-sredb03.yp:3306. Affected replicas: 3
2018-03-06 14:34:34 Will recover from DeadMaster on mysql-sredb03.yp:3306
2018-03-06 14:34:36 Detected DeadMaster on mysql-sredb03.yp:3306. Affected replicas: 1
2018-03-06 14:34:36 Recovered from DeadMaster on mysql-sredb03.yp:3306. Failed: mysql-sredb03.yp:3306; Promoted: mysql-sredb05.yp:3306
2018-03-06 14:34:37 (for all types) Recovered from DeadMaster on mysql-sredb03.yp:3306. Failed: mysql-sredb03.yp:3306; Successor: mysql-sredb05.yp:3306
But the config is like this:
[root@mysql-sredb03.xh ~]# cat /etc/orchestrator.conf.json | grep -A3 PromotionIgnoreHostnameFilters
"PromotionIgnoreHostnameFilters": [
"mysql-sredb06.yp",
"mysql-sredb05.yp"
],
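For a quick sanity check of which replicas a configuration like the one above should exclude, here is a minimal Python sketch. It assumes each entry in PromotionIgnoreHostnameFilters is treated as a regular expression searched for within the hostname; that matching rule is an assumption made for illustration, and `ignored_for_promotion` and `split_candidates` are hypothetical helpers, not orchestrator functions.

```python
import json
import re

# The fragment from this thread, wrapped in braces so it parses as standalone JSON.
CONFIG_TEXT = """
{
  "PromotionIgnoreHostnameFilters": [
    "mysql-sredb06.yp",
    "mysql-sredb05.yp"
  ]
}
"""

def ignored_for_promotion(hostname, filters):
    # Assumption: a filter matches if it is found anywhere in the hostname as a regex.
    return any(re.search(pattern, hostname) for pattern in filters)

def split_candidates(replicas, filters):
    """Split replicas into (allowed, ignored) lists, preserving input order."""
    allowed = [h for h in replicas if not ignored_for_promotion(h, filters)]
    ignored = [h for h in replicas if ignored_for_promotion(h, filters)]
    return allowed, ignored

filters = json.loads(CONFIG_TEXT)["PromotionIgnoreHostnameFilters"]
replicas = ["mysql-sredb05.yp", "mysql-sredb06.yp", "mysql-sredb07.yp"]

allowed, ignored = split_candidates(replicas, filters)
print("allowed:", allowed)
print("ignored:", ignored)
```

Under that assumption, only mysql-sredb07.yp remains a promotion candidate, which is why the promotion of mysql-sredb05.yp in the second test looks wrong.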
Sorry, how can I acknowledge the recoveries? Is it OK when it returns '0'?
Agreed that response is unclear. There are two scenarios where 0 makes sense, let's assume it is fine for now.
Is there anything wrong with my two experiments? Or is it actually a bug?
is it actually a bug?
"it" being the acknowledgements? No, just the unclear output. "it" being the failovers? I haven't investigated yet.
"it" being the failovers? I haven't investigated yet.
Yep. Ok. Looking forward to the result.
Can you please clarify: does this behavior reproduce? Any time you run two successive failovers, one of the forbidden servers is promoted on the 2nd attempt?
Also can you please confirm you have restarted orchestrator after making configuration changes, or at least loaded /api/reload-configuration?
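For reference, loading /api/reload-configuration can be scripted. This is a minimal Python sketch, assuming the orchestrator HTTP API is reachable at a base URL such as http://localhost:3000 and that a plain GET is sufficient; both are assumptions to adjust for your deployment, and the function names are hypothetical.

```python
import urllib.request

def reload_configuration_url(base_url):
    """Build the reload endpoint URL from a base URL (the base URL is an assumption)."""
    return base_url.rstrip("/") + "/api/reload-configuration"

def reload_configuration(base_url, timeout=5.0):
    """Issue a GET against the reload endpoint and return (HTTP status, response body)."""
    url = reload_configuration_url(base_url)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status, response.read().decode("utf-8", errors="replace")

# Only the URL is printed here; call reload_configuration(...) against a live service.
print(reload_configuration_url("http://localhost:3000"))
```

Restarting the service remains the simplest way to be sure a configuration change has been picked up.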
Can you please clarify: does this behavior reproduce? Any time you run two successive failovers, one of the forbidden servers is promoted on the 2nd attempt?
I will run more tests to confirm this. I have reproduced it several times, but I don't know whether it happens on every second attempt. Do I need to acknowledge the recoveries every time? I think it's not necessary because the value of "RecoveryPeriodBlockSeconds" is 10.
Also can you please confirm you have restarted orchestrator after making configuration changes,
Yes, I have restarted orchestrator after making configuration changes.
Do I need to acknowledge the recoveries every time? I think it's not necessary because the value of "RecoveryPeriodBlockSeconds" is 10.
If you've waited a little bit beyond 10sec in between, that should be fine and you don't need to acknowledge.
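The gap between the two tests from 2018-03-06 can be checked against the configured block period with simple arithmetic. This is a sketch in Python using timestamps quoted earlier in this thread; exactly which moment orchestrator counts the block from is not stated here, so measuring from the "Recovered" line to the next "Detected" line is an assumption, and the helper names are hypothetical.

```python
from datetime import datetime

TS_FORMAT = "%Y-%m-%d %H:%M:%S"

def seconds_between(earlier, later):
    """Seconds elapsed between two log timestamps."""
    delta = datetime.strptime(later, TS_FORMAT) - datetime.strptime(earlier, TS_FORMAT)
    return delta.total_seconds()

def outside_block_period(last_recovery, next_detection, block_seconds):
    """True if the next detection falls after the block period has elapsed."""
    return seconds_between(last_recovery, next_detection) > block_seconds

last_recovery = "2018-03-06 14:27:53"   # first test: Recovered from DeadMaster
next_detection = "2018-03-06 14:34:34"  # second test: Detected DeadMaster
block_seconds = 10                      # RecoveryPeriodBlockSeconds from the config

print(seconds_between(last_recovery, next_detection))
print(outside_block_period(last_recovery, next_detection, block_seconds))
```

The two tests were 401 seconds apart, well beyond the 10 second block, so an acknowledgement should not have been required between them.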
I cannot reproduce the problem now... If it happens again, I will let you know. Thanks.
Thank you. I'll try to investigate this nonetheless.
Hi @shlomi-noach, as mentioned in https://github.com/github/orchestrator/issues/419: I've used RecoveryIgnoreHostnameFilters and still it gets promoted. The topology looks like this:
I've configured it like this:
When I stop the MySQL server on mysql-sredb03.yp, mysql-sredb06.yp is promoted.
Is there something wrong?