outbrain / orchestrator

MySQL replication topology manager/visualizer

When testing Master failover and reattaching failed master getting ReadTopologyInstance error #284

Open leeparayno opened 7 years ago

leeparayno commented 7 years ago

I have 3 Percona MySQL 5.6.29-76.2-log instances in separate VirtualBox VMs running CentOS 7.0.

The prior replication configuration was:

mysql56b-2
+ mysql56b-1
+ mysql56b-3

Upon blocking port 3306 on mysql56b-2, a failover was initiated and mysql56b-1 was made a slave of mysql56b-3. mysql56b-2 no longer showed up in the topology displayed on the Orchestrator UI for the "mysql56b-2" cluster.

I was attempting to allow the old master (mysql56b-2) to rejoin the cluster.

I set gtid_purged to the values that mysql56b-2 was showing in gtid_executed.
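
(A minimal sketch of that step; on MySQL 5.6, gtid_purged can only be assigned while gtid_executed is empty, so a RESET MASTER is typically needed first. The GTID set is the one shown later in this report; whether RESET MASTER was actually run here is an assumption.)

-- On mysql56b-2, the demoted master. RESET MASTER empties gtid_executed,
-- which is a precondition for assigning gtid_purged on 5.6.
RESET MASTER;
SET GLOBAL gtid_purged = '3d83956c-e8a3-11e5-ba83-080027da8259:1-5,
743902dd-97cf-11e6-b0c9-080027a97f61:1-9,
d1da7519-fdb9-11e5-8407-08002720ea52:1-11';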

After repointing it to mysql56b-3 with the CHANGE MASTER command, replication appeared, for some reason, to be trying to re-run transactions that had already been applied.
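
(The repointing step, sketched with GTID auto-positioning; the replication user and password here are placeholders, not taken from the report.)

-- On mysql56b-2: point the old master at the new master using GTID auto-positioning.
CHANGE MASTER TO
  MASTER_HOST = 'mysql56b-3',
  MASTER_USER = 'repl',          -- placeholder replication user
  MASTER_PASSWORD = 'repl_pass', -- placeholder
  MASTER_AUTO_POSITION = 1;
START SLAVE;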

For all the UUID/GTID combinations in gtid_executed, I created empty transactions, setting gtid_next up to the maximum transaction number already shown as executed by that slave for each UUID. This should essentially leave the slave ready to connect to the new master, retrieve any new transactions as necessary, and catch up with the other replicas and the new master.
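
(For a single GTID, the empty-transaction technique looks roughly like this; the GTID used is just an example from the set above. The important part is setting gtid_next back to AUTOMATIC after the commit, which is exactly what Error 1837 below complains about.)

-- Repeat for each GTID the slave should treat as already applied.
SET gtid_next = 'd1da7519-fdb9-11e5-8407-08002720ea52:11';
BEGIN;
COMMIT;
-- Without this, every later statement in the session fails with Error 1837.
SET gtid_next = 'AUTOMATIC';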

However, Orchestrator was stuck with this error:

ERROR ReadTopologyInstance(mysql56b-2:3306) show global status like 'Uptime': Error 1837: When @@SESSION.GTID_NEXT is set to a GTID, you must explicitly set it to a different value after a COMMIT or ROLLBACK. Please check GTID_NEXT variable manual page for detailed explanation. Current @@SESSION.GTID_NEXT is 'd1da7519-fdb9-11e5-8407-08002720ea52:111'.

mysql56b-2 was showing gtid_next as 'AUTOMATIC' and gtid_purged as the current set of transactions:

3d83956c-e8a3-11e5-ba83-080027da8259:1-5,
743902dd-97cf-11e6-b0c9-080027a97f61:1-9,
d1da7519-fdb9-11e5-8407-08002720ea52:1-11

Note: I tried a few failovers to different nodes and ran transactions, which is why there are received/executed transactions from each of the 3 nodes.

mysql56b-2 was showing now issues with replication in "show slave status" and all appeared to be in sync after reattaching to mysql56b-3.

I couldn't get Orchestrator to refresh the current state until I "forgot" mysql56b-2 and restarted Orchestrator to let it be rediscovered.

shlomi-noach commented 7 years ago

I'm not sure I understand if this is an orchestrator problem or a GTID problem. You say orchestrator said:

ERROR ReadTopologyInstance(mysql56b-2:3306) show global status like 'Uptime': Error 1837: When @@SESSION.GTID_NEXT is set to a GTID, you must explicitly set it to a different value after a COMMIT or ROLLBACK. Please check GTID_NEXT variable manual page for detailed explanation. Current @@SESSION.GTID_NEXT is 'd1da7519-fdb9-11e5-8407-08002720ea52:111'.

and then that message only went away when you forgot and rediscovered the host? Or were there further steps in between?

s/mysql56b-2 was showing now issues/mysql56b-2 was showing no issues/g -- correct?

leeparayno commented 7 years ago

After I reassigned the old master back as a slave of the new master, I originally got this error in SHOW SLAVE STATUS, but fixed the replication issue by creating empty transactions for all the transactions that had already been executed.

So at the time I was still seeing the ReadTopologyInstance errors, the show slave status on mysql56b-2 was no longer showing any issues.


Correct, there were no more issues with replication.

This makes it look like Orchestrator was caching a previous error and maintaining that state.


shlomi-noach commented 7 years ago

Thank you. I've never witnessed this kind of behavior before. I will do some digging.

shlomi-noach commented 7 years ago

Looking slightly more into this, a couple more questions:

  • I assume you saw this error on the orchestrator log, correct? And likely this also showed at the GUI's instance dialog?
  • Other than this error showing up, did orchestrator fail to read the instance? To show the topology?

leeparayno commented 7 years ago

Yes it was in the orchestrator log.

In the GUI, it was reporting the old replication error on the instance. It looked like Orchestrator was failing to read the instance's current state: the topology was updated to show the new position as a slave of the new master, but the instance was not showing as replicating correctly.
