openark / orchestrator

MySQL replication topology management and HA
Apache License 2.0

Automated failover failed. #840

Closed bluven closed 5 years ago

bluven commented 5 years ago

Sorry for this simple question, but I couldn't figure it out. I have read similar issues.

I have a 3-node MySQL cluster and I just want failover: after the master goes down, one of the slaves should be promoted to master and the other slave should replicate from the new master.

Here is my environment: https://github.com/bluven/mysql-replica

I put all the information you asked for in other similar issues there. /tmp/recovery.log didn't have any related information, so I skipped it.

BTW, I originally wanted to open an issue about a different kind of failover problem, but it stopped happening after I changed the orchestrator configuration. Should I open a new issue for that, or just describe it here?

shlomi-noach commented 5 years ago

Hey, sorry to keep you waiting. I'm overloaded right now, and was hoping maybe someone from the community could chime in.

bluven commented 5 years ago

It's OK. I'm trying to read the source code to figure it out; I can see it's really a hard job.

cclose commented 5 years ago

It looks to me like Orchestrator is unable to connect to the cluster master.

The log just seems to show orchestrator being unable to read the master. Try looking through Audit -> Recovery and make sure previous recoveries are acknowledged: an unacknowledged recovery can block a new automated recovery if it falls within the configured blocking window.
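
If you prefer the command line over the GUI, recoveries can also be acknowledged with orchestrator-client. A rough sketch; the host:port is a placeholder for your master, and the flag spelling should be checked against your orchestrator-client version:

    # Find which cluster the master belongs to (host:port is a placeholder)
    orchestrator-client -c which-cluster -i 127.0.0.1:24801

    # Acknowledge outstanding recoveries for that cluster so they no longer
    # block a new automated recovery inside the blocking window
    # (--reason flag assumed; adjust to your orchestrator-client version)
    orchestrator-client -c ack-cluster-recoveries -i 127.0.0.1:24801 --reason "acknowledged manually"

Whether a previous recovery actually blocks a new one is governed by RecoveryPeriodBlockSeconds in the orchestrator configuration.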

Can you rebuild the environment so that the cluster is healthy, then trigger a failover again? If you can, try this:

  1. rotate the orchestrator log
  2. rebuild the cluster
  3. let it sit for a few minutes, being healthy
  4. trigger the failover
  5. let it be unhealthy for a bit

Then copy and upload the log file generated during these steps; that will help show what happened and possibly why. Right now the log file just indicates orchestrator is unable to connect, possibly because MySQL isn't running.
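
To make that concrete, the reproduction might look something like this from a shell; the log path, service name, and host:port are placeholders for whatever your environment (the linked repo) actually uses:

    # 1. rotate the orchestrator log (path and service name are assumptions)
    mv /var/log/orchestrator.log /var/log/orchestrator.log.old
    systemctl restart orchestrator

    # 2-3. rebuild the cluster, then let orchestrator see it healthy for a few minutes
    sleep 300

    # 4. trigger the failover by stopping the master (port 24801 in this thread)
    mysqladmin --host=127.0.0.1 --port=24801 --user=root -p shutdown

    # 5. leave it unhealthy for a couple of minutes, then grab the log
    sleep 120
    cp /var/log/orchestrator.log ./orchestrator-failover.log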

cclose commented 5 years ago

Ah, looking closer, I notice the following issues:

  1. Your master (24801) is downtimed. Downtiming prevents automated recovery; see the docs. In the GUI, this is the crossed-out megaphone, and it can be viewed/removed by clicking the gear for the server to bring up its details and then clicking the "End Downtime" button at the top.
  2. Your master is not writable and replica 24803 is writable (I'm not sure this is a problem, but it's not how a real cluster would be set up).
  3. Your replicas have GTID enabled but are not using it. Check how you are invoking CHANGE MASTER TO: you need to be using MASTER_AUTO_POSITION = 1 rather than file and position arguments.
  4. You do not have any designated candidates, along the lines of orchestrator-client -c register-candidate -i $(hostname) --promotion-rule prefer. I'm not sure this is required at the moment, but it's not a bad idea to set up a cron job to run that on master-candidate replicas. (A rough shell sketch covering these fixes follows this list.)
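
Roughly, addressing items 1-4 could look like this from a shell; the host/port values and the replication credentials are illustrative placeholders, and credentials are omitted (add them as your setup requires):

    # 1. End the downtime on the master so automated recovery is no longer suppressed
    orchestrator-client -c end-downtime -i 127.0.0.1:24801

    # 2. Make the master writable and the replicas read-only (super_read_only needs MySQL 5.7+)
    mysql --host=127.0.0.1 --port=24801 --user=root -e "SET GLOBAL read_only = 0;"
    mysql --host=127.0.0.1 --port=24803 --user=root -e "SET GLOBAL read_only = 1; SET GLOBAL super_read_only = 1;"

    # 3. Re-point each replica using GTID auto-positioning instead of file/position
    mysql --host=127.0.0.1 --port=24803 --user=root <<'SQL'
    STOP SLAVE;
    CHANGE MASTER TO
      MASTER_HOST = '127.0.0.1',
      MASTER_PORT = 24801,
      MASTER_USER = 'repl',          -- replication user/password are placeholders
      MASTER_PASSWORD = 'replpass',
      MASTER_AUTO_POSITION = 1;
    START SLAVE;
    SQL

    # 4. Mark promotable replicas as candidates (run periodically, e.g. from cron, on each one)
    orchestrator-client -c register-candidate -i $(hostname) --promotion-rule prefer
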
bluven commented 5 years ago

@cclose Sorry for replying so late, and thank you for your reply. I found that recovery was being skipped because the master was downtimed. After removing the downtime record manually, the recovery was executed. But there is another problem; I'll open another issue to describe it. I think this issue is OK to close.

jkthedba commented 4 years ago

@bluven I am facing a similar issue with orchestrator. I shut down my primary node, and orchestrator is not able to switch to one of my slaves.

(screenshot attached)

jkthedba commented 4 years ago

(screenshot attached)

shlomi-noach commented 4 years ago

@khattarjitender05 it's best to open a new issue; I don't think your case is the same. Regardless:

  1. Did you opt in to (enable) failovers for this cluster in the orchestrator configuration? (See the sketch below.)
  2. Can you run the service with --debug and provide the logs from around the failure detection time?
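
For reference on item 1, the opt-in is controlled by the recovery filters in the configuration. A minimal sketch; the config path is a placeholder, and the catch-all "*" pattern should be narrowed to your cluster in production:

    # Relevant keys in the orchestrator configuration (e.g. /etc/orchestrator.conf.json):
    #   "RecoverMasterClusterFilters": ["*"],
    #   "RecoverIntermediateMasterClusterFilters": ["*"],
    #   "RecoveryPeriodBlockSeconds": 3600
    # "*" opts every cluster in to automated recovery.

    # Run the service in the foreground with verbose logging around a test failover
    orchestrator --debug --stack --config=/etc/orchestrator.conf.json http 2>&1 | tee orchestrator-debug.log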