openark / orchestrator

MySQL replication topology management and HA
Apache License 2.0

Maybe a bug during recovery. #426

Open Xinglao4 opened 6 years ago

Xinglao4 commented 6 years ago

Hi @shlomi-noach, as mentioned in https://github.com/github/orchestrator/issues/419: I've used RecoveryIgnoreHostnameFilters, and a filtered host still gets promoted. The topology looks like this:

[root@mysql-sredb02.xh ~]# orchestrator-client -c topology -i mysql-sredb03.yp
mysql-sredb03.yp:3306   [0s,ok,5.6.24-72.2-log,rw,ROW,>>,GTID]
+ mysql-sredb05.yp:3306 [0s,ok,5.6.24-72.2-log,rw,ROW,>>,GTID]
+ mysql-sredb06.yp:3306 [0s,ok,5.6.24-72.2-log,rw,ROW,>>,GTID]
+ mysql-sredb07.yp:3306 [0s,ok,5.6.24-72.2-log,rw,ROW,>>,GTID]

I've configured it like this:

"RecoveryIgnoreHostnameFilters": [
    "mysql-sredb06.yp",
    "mysql-sredb05.yp"
  ],

When I stop the MySQL server on mysql-sredb03.yp, mysql-sredb06.yp is promoted.

2018-03-05 14:33:48 Detected DeadMaster on mysql-sredb03.yp:3306. Affected replicas: 3
2018-03-05 14:33:48 Will recover from DeadMaster on mysql-sredb03.yp:3306
2018-03-05 14:33:50 Detected DeadMaster on mysql-sredb03.yp:3306. Affected replicas: 1
2018-03-05 14:33:50 Recovered from DeadMaster on mysql-sredb03.yp:3306. Failed: mysql-sredb03.yp:3306; Promoted: mysql-sredb06.yp:3306
2018-03-05 14:33:50 (for all types) Recovered from DeadMaster on mysql-sredb03.yp:3306. Failed: mysql-sredb03.yp:3306; Successor: mysql-sredb06.yp:3306

Is something wrong?

shlomi-noach commented 6 years ago

I must apologize for misleading you. RecoveryIgnoreHostnameFilters does not indicate servers which cannot be promoted. It indicates servers for which analysis is skipped/ignored.

https://github.com/github/orchestrator/blob/45e8e28cdec10d638b604a5fcb5b468179c299f2/go/config/config.go#L216

The parameter you're looking for is PromotionIgnoreHostnameFilters.

Regardless, I advise using a dynamic approach: orchestrator -c register-candidate -i mysql-sredb06.yp:3306 -promotion-rule must_not
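
The effect of a promotion filter can be illustrated with a small sketch. It assumes, for illustration only and not as a statement about orchestrator's internals, that each filter entry is treated as a regular expression matched against the hostname. The helper name `matches_any_filter` is hypothetical.

```python
import re


def matches_any_filter(hostname, filters):
    """Return True if hostname matches any pattern (assumed regex semantics)."""
    return any(re.search(pattern, hostname) for pattern in filters)


# Hostnames the reporter wants to keep from being promoted.
promotion_ignore = ["mysql-sredb06.yp", "mysql-sredb05.yp"]

# Replicas of mysql-sredb03.yp in the topology shown in this thread.
replicas = ["mysql-sredb05.yp", "mysql-sredb06.yp", "mysql-sredb07.yp"]

# Replicas that would remain promotion candidates under this assumed model.
eligible = [h for h in replicas if not matches_any_filter(h, promotion_ignore)]
print(eligible)
```

Under this model only mysql-sredb07.yp remains a candidate, which is the outcome the reporter expects from every failover.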

Xinglao4 commented 6 years ago

I see. Now I've configured it like this:

"PromotionIgnoreHostnameFilters": [
    "mysql-sredb06.yp",
    "mysql-sredb05.yp"
  ],

"FailureDetectionPeriodBlockMinutes": 1,
"RecoveryPeriodBlockSeconds": 10,

I tested twice. The first time it was OK, but the second time there was still a problem.

2018-03-05 15:38:14 Detected DeadMaster on mysql-sredb03.yp:3306. Affected replicas: 1
2018-03-05 15:38:14 Will recover from DeadMaster on mysql-sredb03.yp:3306
2018-03-05 15:38:25 Will recover from DeadMaster on mysql-sredb03.yp:3306
2018-03-05 15:38:26 Detected DeadMaster on mysql-sredb03.yp:3306. Affected replicas: 1
2018-03-05 15:38:26 Recovered from DeadMaster on mysql-sredb03.yp:3306. Failed: mysql-sredb03.yp:3306; Promoted: mysql-sredb07.yp:3306
2018-03-05 15:38:26 (for all types) Recovered from DeadMaster on mysql-sredb03.yp:3306. Failed: mysql-sredb03.yp:3306; Successor: mysql-sredb07.yp:3306
2018-03-05 15:38:27 Recovered from DeadMaster on mysql-sredb03.yp:3306. Failed: mysql-sredb03.yp:3306; Promoted: mysql-sredb07.yp:3306
2018-03-05 15:38:27 (for all types) Recovered from DeadMaster on mysql-sredb03.yp:3306. Failed: mysql-sredb03.yp:3306; Successor: mysql-sredb07.yp:3306

2018-03-05 15:43:08 Detected DeadMaster on mysql-sredb03.yp:3306. Affected replicas: 3
2018-03-05 15:43:08 Will recover from DeadMaster on mysql-sredb03.yp:3306
2018-03-05 15:43:10 Detected DeadMaster on mysql-sredb03.yp:3306. Affected replicas: 1
2018-03-05 15:43:10 Recovered from DeadMaster on mysql-sredb03.yp:3306. Failed: mysql-sredb03.yp:3306; Promoted: mysql-sredb05.yp:3306
2018-03-05 15:43:10 (for all types) Recovered from DeadMaster on mysql-sredb03.yp:3306. Failed: mysql-sredb03.yp:3306; Successor: mysql-sredb05.yp:3306

Are there any other parameters that need to be configured?

shlomi-noach commented 6 years ago

Your two experiments above look different, and so are not comparable. See the Affected replicas: 1 as compared to Affected replicas: 3.

Can you please repeat and dump the topology before the operation? Also, are you acknowledging the recoveries?

Xinglao4 commented 6 years ago

2018-03-05 15:43:08 Detected DeadMaster on mysql-sredb03.yp:3306. Affected replicas: 3
2018-03-05 15:43:08 Will recover from DeadMaster on mysql-sredb03.yp:3306
2018-03-05 15:43:10 Detected DeadMaster on mysql-sredb03.yp:3306. Affected replicas: 1

These log lines were printed during the same experiment, yet the Affected replicas value changed. The number of replicas is actually always three, and I don't know why the Affected replicas value changed.

Also, are you acknowledging the recoveries?

As I understand it, acknowledgement is part of the anti-flapping mechanism, and I only need to acknowledge recoveries when there is a block. I have set RecoveryPeriodBlockSeconds to 10. Is there still a block?

shlomi-noach commented 6 years ago

Can you please repeat and dump the topology before the operation? Also, are you acknowledging the recoveries?

Xinglao4 commented 6 years ago

[root@mysql-sredb03.xh ~]# orchestrator-client -c ack-cluster-recoveries -alias mysql-sredb03
orchestrator-client[4370]: reason must be provided
[root@mysql-sredb03.xh ~]# orchestrator-client -c ack-cluster-recoveries -alias mysql-sredb03 --reason="dba has taken taken necessary steps"
/usr/local/bin/orchestrator-client: illegal option -- -
0
[root@mysql-sredb03.xh ~]# orchestrator-client -c ack-cluster-recoveries -alias mysql-sredb03 -reason="dba has taken taken necessary steps"
0

Sorry, how can I acknowledge the recoveries? Is it OK when it returns '0'?

Xinglao4 commented 6 years ago

The initial topology of this cluster is like this:

[root@mysql-sredb03.xh ~]# orchestrator-client -c topology -i mysql-sredb03
mysql-sredb03.yp:3306   [0s,ok,5.6.24-72.2-log,rw,ROW,>>,GTID]
+ mysql-sredb05.yp:3306 [0s,ok,5.6.24-72.2-log,rw,ROW,>>,GTID]
+ mysql-sredb06.yp:3306 [0s,ok,5.6.24-72.2-log,rw,ROW,>>,GTID]
+ mysql-sredb07.yp:3306 [0s,ok,5.6.24-72.2-log,rw,ROW,>>,GTID]

Then I stopped MySQL on the master, and the topology changed as follows:

[root@mysql-sredb03.xh ~]# orchestrator-client -c topology -i mysql-sredb03
mysql-sredb03.yp:3306 [unknown,invalid,5.6.24-72.2-log,rw,ROW,>>,GTID]

[root@mysql-sredb03.xh ~]# orchestrator-client -c topology -i mysql-sredb07
mysql-sredb07.yp:3306   [0s,ok,5.6.24-72.2-log,rw,ROW,>>,GTID]
+ mysql-sredb05.yp:3306 [0s,ok,5.6.24-72.2-log,rw,ROW,>>,GTID]
+ mysql-sredb06.yp:3306 [0s,ok,5.6.24-72.2-log,rw,ROW,>>,GTID]

[root@mysql-sredb03.xh ~]# orchestrator-client -c topology -i mysql-sredb03
mysql-sredb03.yp:3306     [0s,ok,5.6.24-72.2-log,rw,ROW,>>,GTID]
- mysql-sredb07.yp:3306   [null,nonreplicating,5.6.24-72.2-log,rw,ROW,>>,GTID]
  + mysql-sredb05.yp:3306 [0s,ok,5.6.24-72.2-log,rw,ROW,>>,GTID]
  + mysql-sredb06.yp:3306 [0s,ok,5.6.24-72.2-log,rw,ROW,>>,GTID]

[root@mysql-sredb03.xh ~]# orchestrator-client -c topology -i mysql-sredb03
mysql-sredb03.yp:3306   [0s,ok,5.6.24-72.2-log,rw,ROW,>>,GTID]
+ mysql-sredb05.yp:3306 [0s,ok,5.6.24-72.2-log,rw,ROW,>>,GTID]
+ mysql-sredb06.yp:3306 [0s,ok,5.6.24-72.2-log,rw,ROW,>>,GTID]
+ mysql-sredb07.yp:3306 [0s,ok,5.6.24-72.2-log,rw,ROW,>>,GTID]

This time the process was OK (although Affected replicas still changed), and mysql-sredb07.yp was promoted.

2018-03-06 14:27:51 Detected DeadMaster on mysql-sredb03.yp:3306. Affected replicas: 3
2018-03-06 14:27:51 Will recover from DeadMaster on mysql-sredb03.yp:3306
2018-03-06 14:27:53 Detected DeadMaster on mysql-sredb03.yp:3306. Affected replicas: 1
2018-03-06 14:27:53 Recovered from DeadMaster on mysql-sredb03.yp:3306. Failed: mysql-sredb03.yp:3306; Promoted: mysql-sredb07.yp:3306
2018-03-06 14:27:53 (for all types) Recovered from DeadMaster on mysql-sredb03.yp:3306. Failed: mysql-sredb03.yp:3306; Successor: mysql-sredb07.yp:3306

Xinglao4 commented 6 years ago

Then I ran the test a second time. First, I acknowledged the recoveries via the command-line interface:

[root@mysql-sredb03.xh ~]# date
Tue Mar  6 14:29:15 CST 2018
[root@mysql-sredb03.xh ~]# orchestrator-client -c topology -i mysql-sredb03
mysql-sredb03.yp:3306   [0s,ok,5.6.24-72.2-log,rw,ROW,>>,GTID]
+ mysql-sredb05.yp:3306 [0s,ok,5.6.24-72.2-log,rw,ROW,>>,GTID]
+ mysql-sredb06.yp:3306 [0s,ok,5.6.24-72.2-log,rw,ROW,>>,GTID]
+ mysql-sredb07.yp:3306 [0s,ok,5.6.24-72.2-log,rw,ROW,>>,GTID]
[root@mysql-sredb03.xh ~]# orchestrator-client -c ack-cluster-recoveries -alias mysql-sredb03
orchestrator-client[4370]: reason must be provided
[root@mysql-sredb03.xh ~]# orchestrator-client -c ack-cluster-recoveries -alias mysql-sredb03 --reason="dba has taken taken necessary steps"
/usr/local/bin/orchestrator-client: illegal option -- -
0
[root@mysql-sredb03.xh ~]# rchestrator-client -c ack-cluster-recoveries -alias mysql-sredb03 -reason="dba has taken taken necessary steps"
-bash: rchestrator-client: command not found
[root@mysql-sredb03.xh ~]# orchestrator-client -c ack-cluster-recoveries -alias mysql-sredb03 -reason="dba has taken taken necessary steps"
0

Second, I stopped MySQL on the master, and mysql-sredb05.yp was promoted:

2018-03-06 14:34:34 Detected DeadMaster on mysql-sredb03.yp:3306. Affected replicas: 3
2018-03-06 14:34:34 Will recover from DeadMaster on mysql-sredb03.yp:3306
2018-03-06 14:34:36 Detected DeadMaster on mysql-sredb03.yp:3306. Affected replicas: 1
2018-03-06 14:34:36 Recovered from DeadMaster on mysql-sredb03.yp:3306. Failed: mysql-sredb03.yp:3306; Promoted: mysql-sredb05.yp:3306
2018-03-06 14:34:37 (for all types) Recovered from DeadMaster on mysql-sredb03.yp:3306. Failed: mysql-sredb03.yp:3306; Successor: mysql-sredb05.yp:3306

But the config is like this:

[root@mysql-sredb03.xh ~]# cat /etc/orchestrator.conf.json | grep -A3 PromotionIgnoreHostnameFilters
  "PromotionIgnoreHostnameFilters": [
    "mysql-sredb06.yp",
    "mysql-sredb05.yp"
  ],
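
A comparison like the one above can be automated. The following is a minimal sketch that extracts the promoted host from audit lines in the format shown in this thread and flags any that match the configured filters. The function names are hypothetical, and the filters are assumed to behave as regular expressions.

```python
import re

# Matches the "Promoted: host:port" part of a recovery audit line.
PROMOTED_RE = re.compile(r"Promoted: (?P<host>[^:\s]+):(?P<port>\d+)")


def promoted_hosts(log_text):
    """Return the hostnames reported as promoted, in order of appearance."""
    return [m.group("host") for m in PROMOTED_RE.finditer(log_text)]


def violations(log_text, ignore_filters):
    """Return promoted hosts that match any ignore filter (assumed regex semantics)."""
    return [
        host
        for host in promoted_hosts(log_text)
        if any(re.search(pattern, host) for pattern in ignore_filters)
    ]


# Audit lines copied from the two experiments on 2018-03-06.
log_text = """\
2018-03-06 14:27:53 Recovered from DeadMaster on mysql-sredb03.yp:3306. Failed: mysql-sredb03.yp:3306; Promoted: mysql-sredb07.yp:3306
2018-03-06 14:34:36 Recovered from DeadMaster on mysql-sredb03.yp:3306. Failed: mysql-sredb03.yp:3306; Promoted: mysql-sredb05.yp:3306
"""
ignore_filters = ["mysql-sredb06.yp", "mysql-sredb05.yp"]

print(violations(log_text, ignore_filters))
```

For these two lines the sketch reports mysql-sredb05.yp, the promotion from the second experiment.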

shlomi-noach commented 6 years ago

Sorry, how can I acknowledge the recoveries? Is it OK when it returns '0'?

Agreed that the response is unclear. There are two scenarios where 0 makes sense; let's assume it is fine for now.

Xinglao4 commented 6 years ago

Is there anything wrong with my two experiments? Or is it actually a bug?

shlomi-noach commented 6 years ago

is it actually a bug?

"it" being the acknowledgements? No, just the unclear output. "it" being the failovers? I haven't investigated yet.

Xinglao4 commented 6 years ago

"it" being the failovers? I haven't investigated yet.

Yep. Ok. Looking forward to the result.

shlomi-noach commented 6 years ago

Can you please clarify: does this behavior reproduce? Any time you run two successive failovers, one of the forbidden servers is promoted on the 2nd attempt?

shlomi-noach commented 6 years ago

Also can you please confirm you have restarted orchestrator after making configuration changes, or at least loaded /api/reload-configuration?
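
For reference, a reload through the HTTP API might be scripted as in the sketch below. Only the /api/reload-configuration path comes from this thread. The base URL, the use of a plain GET request, and the function names are assumptions to adapt to the actual deployment.

```python
import urllib.request


def reload_configuration_url(base_url):
    """Build the reload endpoint URL from a base URL (base URL is an assumption)."""
    return base_url.rstrip("/") + "/api/reload-configuration"


def reload_configuration(base_url, timeout=5.0):
    """Request a configuration reload and return the raw response body.

    Assumes the endpoint accepts a plain GET request without authentication.
    """
    url = reload_configuration_url(base_url)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Hypothetical usage, not executed here:
# print(reload_configuration("http://localhost:3000"))
```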

Xinglao4 commented 6 years ago

Can you please clarify: does this behavior reproduce? Any time you run two successive failovers, one of the forbidden servers is promoted on the 2nd attempt?

I will run more tests to confirm this. I have reproduced it several times, but I don't know whether it happens every second time. Do I need to acknowledge the recoveries every time? I think it's not necessary because the value of "RecoveryPeriodBlockSeconds" is 10.

Xinglao4 commented 6 years ago

Also can you please confirm you have restarted orchestrator after making configuration changes,

Yes, I have restarted orchestrator after making configuration changes.

shlomi-noach commented 6 years ago

Do I need to acknowledge the recoveries every time? I think it's not necessary because the value of "RecoveryPeriodBlockSeconds" is 10.

If you've waited a little beyond 10 seconds in between, that should be fine and you don't need to acknowledge.
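
The timing can be checked against the audit timestamps. The sketch below is a simplified model of the block period, not orchestrator's actual anti-flapping logic: it only compares the gap between the end of one recovery and the next detection with the configured number of seconds. The function names are hypothetical.

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def seconds_between(earlier, later):
    """Return the number of seconds from one audit timestamp to another."""
    start = datetime.strptime(earlier, TIMESTAMP_FORMAT)
    end = datetime.strptime(later, TIMESTAMP_FORMAT)
    return (end - start).total_seconds()


def outside_block_period(previous_recovery, next_detection, block_seconds):
    """True if the next detection falls after the block period (simplified model)."""
    return seconds_between(previous_recovery, next_detection) > block_seconds


# Timestamps from the two experiments on 2018-03-06.
previous_recovery = "2018-03-06 14:27:53"
next_detection = "2018-03-06 14:34:34"

gap = seconds_between(previous_recovery, next_detection)
print(gap, outside_block_period(previous_recovery, next_detection, 10))
```

With RecoveryPeriodBlockSeconds set to 10, the 401-second gap between the two experiments is well outside the block period in this model.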

Xinglao4 commented 6 years ago

I can't reproduce the problem now... If it happens again, I will tell you. Thanks.

shlomi-noach commented 6 years ago

Thank you. I'll try to investigate this nonetheless.