ecortestws commented 7 years ago

Hi,

I am testing orchestrator with 5.7.17, Master and two slaves. Have moved one of the slaves to change the topology like A-B-C and then executed orchestrator -c graceful-master-takeover -alias myclusteralias

The issues found are:

GTID appears as disabled in the master, the web interface shows the button to enable it, when obviously it is enabled in all the replication chain (GTID_MODE=ON). Slaves are showed with GTID enabled.
This issue causes that the takeover doesn't use GTID (I guess)
Instance B was in read-only before the takeover, after the takeover, the read-only is not disabled, is this a feature or something that should I add via hooks? Should be nice to have a parameter to end the process in the status that you prefer, depending on the takeover reasons/conditions.
Also, for any reason the role change old-master-> new slave doesn't work. It executes a CHANGE MASTER but apparently the replication username in the old master is empty, failing the change master operation (orchestrator user has SELECT ON mysql.slave_master_info in the cluster).
Finally, should be nice to add a feature to force to refactor the topology when you have one master and several slaves below. It requires moving slaves below the new elected master, just before the master-takeover. The process will take a bit longer, moving the slaves, and waiting until they are ready.

Thanks for this amazing tool! Regards, Eduardo

shlomi-noach commented 7 years ago

GTID appears as disabled in the master, the web interface shows the button to enable it, when obviously it is enabled in all the replication chain (GTID_MODE=ON)

Can you please issue:

select @@global.gtid_mode, @@global.gtid_purged

on your master?

This issue causes that the takeover doesn't use GTID (I guess)

That makes sense. We need to solve the GTID recognition problem.

Instance B was in read-only before the takeover, after the takeover, the read-only is not disabled, is this a feature or something that should I add via hooks? Should be nice to have a parameter to end the process in the status that you prefer, depending on the takeover reasons/conditions.

By default orchestrator does not RESET SLAVE ALL and does not SET GLOBAL read_only=0. To do both, set ApplyMySQLPromotionAfterMasterFailover: true

I apologize for the inconvenient name. I'll be working to minimize the number of configuration params.

Also, for any reason the role change old-master-> new slave doesn't work. It executes a CHANGE MASTER but apparently the replication username in the old master is empty, failing the change master operation (orchestrator user has SELECT ON mysql.slave_master_info in the cluster).

As per https://github.com/github/orchestrator/pull/57:

What's missing in this story is the MASTER_USER and MASTER_PASSWORD, which are likely to not exist, because the old master was like not having replication info. So that leads to the case where even after positioning, the old master cant truly replicate from the promoted master. Nonetheless, it is placed in the correct position to assume replication once credential settings are applied.

The problem is orchestrator doesn't have the username & password of your replication user.

Finally, should be nice to add a feature to force to refactor the topology when you have one master and several slaves below. It requires moving slaves below the new elected master, just before the master-takeover. The process will take a bit longer, moving the slaves, and waiting until they are ready.

This can be easily scripted on the user's side. I really think that in the event of planned takeover the user should choose the identity of the new master. If orchestrator were to choose the identity -- fine, but no promises held that everything would work. Perhaps your setup is such where the promoted server would not be the one you'd expect. You may find such statement confusing. Your own setup may be simple enough, but there are various setups that are not as simple to deal with: servers with no log-slave-updates (can happen with 5.7 GTID), a mixture of 5.6 and 5.7 etc. Some servers may not be able to grab the VIP the current master has. Or are in an unreliable physical location. Please understand orchestrator has "seen it all" and much of its behavior is crafted by experiencing non-trivial scenarios.

To this end, when things go bad, orchestrator is very smart in making the best of a situation. But at planned failovers, it would very much like you to set up your topology in a way that makes sense to you and will guarantee survival of all servers you care about.

ecortestws commented 7 years ago

GTID data from the master:

mysql>  select @@global.gtid_mode, @@global.gtid_purged\G
*************************** 1. row ***************************
@@global.gtid_mode: ON
@@global.gtid_purged: 17255cd9-b2f6-11e6-b59d-005056946d8b:1-15546,
604d9088-a5c6-11e6-8f72-005056945836:1-8796013,
9cb4118b-a5c6-11e6-96c0-005056945189:1-37645
1 row in set (0.00 sec)

There are so many options that didn't see that. Will check it, thanks!
About username/password issue, orchestrator can read username/password from the new master just before the take over, and use them with the old master demote. Otherwise, some kind of warning about "no credentials found" or something else should be useful.
You are right about planned thing, is just that reading the documentation I read that orchestrator could require several steps to finish in the target state. This would be that scenario, and the only requirement could be that you specify the new master rather than let orchestrator to select it. As you said, can be done at user's side :-)

shlomi-noach commented 7 years ago

GTID

looking into!

About username/password issue, orchestrator can read username/password from the new master just before the take over

It cannot. You cannot reveal the password by SHOW SLAVE STATUS. There is a potential solution (utilized by orchestrator) in the event you use system tables for master-info.

is just that reading the documentation I read that orchestrator could require several steps to finish in the target state

More than anything, I'd appreciate help with documentation!

sjmudd commented 7 years ago

@shlomi-noach: slave username and password information are available via mysql.slave_master_info:

root@somehost [mysql]> select Host, Port,  User_name, User_password from slave_master_info;
+-----------------------+------+-----------+---------------+
| Host                  | Port | User_name | User_password |
+-----------------------+------+-----------+---------------+
| somehost.mydomain.com | 3306 | some_user | some_password |
+-----------------------+------+-----------+---------------+
1 row in set (0.00 sec)

So in theory you could try these credentials. However, depending on existing grants this may or many not work as expected as the grant may be for 'some_user'@'%' (any address), 'some_user'@'192.168.9.10' (specific address) or any combination between, some of which may work and others may not.

It may be worth having an option to try or check the configuration but specific site configs may vary.

For what it's worth for planned topology changes (of the master) I don't use orchestrator but custom scripts. This gives a bit more control and reduces downtime, but orchestrator is nearly always used manually both before and afterwards to arrange the topology as needed to minimise the impact of the master changeover. I guess I could use orchestrator and most of what's described here is what I do already but I have more freedom to check stuff both before and afterwards which makes me feel more comfortable. Maybe I need to look again at how well orchestrator handles this task as it simplifies things if the amount of software used is reduced.

ecortestws commented 7 years ago

That's right, orchestrator user has select permissions on slave_master_info and the information is there, just need to read it and use it. If you expect that orchestrator executes this task cleanly, the user should ensure that replication user has permissions in all the nodes involved.

shlomi-noach commented 7 years ago

Reading credentials from slave_master_info is already implemented for make-co-master so should be easy to apply to graceful-takeover

https://github.com/github/orchestrator/blob/55cedffe8da1163df6d6d2374207cae97ae375fe/go/inst/instance_topology.go#L900-L911

fuyar commented 7 years ago

Hello, just wanted to report the same issue with GTID based replication (Percona 5.7.16) not being recognized on a simple master - 4 slaves topology.

On the master :

+--------------------+----------------------+
| @@global.gtid_mode | @@global.gtid_purged |
+--------------------+----------------------+
| ON                 |                      |
+--------------------+----------------------+
1 row in set (0.00 sec)

shlomi-noach commented 7 years ago

@fuyar thank you!

fuyar commented 7 years ago

no problem @shlomi-noach :)

Seems like Orchestrator was finally able to detect GTIDs on the 4 slaves (I rechecked this morning while doing nothing previously).

oracle_gtid: 0 still for the master in the 'database_instance' table but yeah as the master is not a slave of anyone it should be ok I suppose ?

shlomi-noach commented 7 years ago

but yeah as the master is not a slave of anyone it should be ok I suppose ?

~~That's the very bug; because the master is not identified as gtid-enabled, orchestrator doesn't run a gtid-based failover.~~

shlomi-noach commented 7 years ago

OK, have taken a closer look into GTID recoveries:

The fact the master does not show as oracle_gtid in database_instance, or that it shows as GTID based replication: false is irrelevant to the failover mechanism.
The failover looks at the set of replicas and determines that they're using GTIDs ; GTID-based failover takes place when there's at least one GTID replica, and when all valid replicas (valid == responsive) are GTID. In other words, if a single valid replica is not a GTID based replica, then failover is not GTID based. This shouldn't happen in reality, but added as a safety mechanism in the unlikelihood that GTID->non-GID replication is made possible in the far far future.

shlomi-noach commented 7 years ago

@ecortestws my last comment suggests that:

This issue causes that the takeover doesn't use GTID (I guess)

is wrong. Are you able to show that the recovery was not based on GTID? I do mean it's a completely valid assumption on your side, but I believe is incorrect. The logs actually specify the type of recovery. Look for:

topology_recovery: RecoverDeadMaster: masterRecoveryType=...

I realize that was 15 days ago and you may not have the logs at this time.

ecortestws commented 7 years ago

Hi @shlomi-noach: from the logs: 2017-02-14 08:15:34 DEBUG topology_recovery: RecoverDeadMaster: masterRecoveryType=MasterRecoveryPseudoGTID

shlomi-noach commented 7 years ago

@ecortestws thank you. Then, indeed, orchestrator didn't recognize this to be a GTID recovery.

shlomi-noach commented 7 years ago

Applying replication-credentials on demoted master is addressed by https://github.com/github/orchestrator/pull/93

ecortestws commented 7 years ago

Hi @shlomi-noach, any progress on GTID issue? Thanks Eduardo

shlomi-noach commented 7 years ago

@ecortestws Perfect timing. I am setting up an environment for this now.

shlomi-noach commented 7 years ago

@ecortestws can you confirm your servers are Percona Server? If so, this is identified in https://github.com/github/orchestrator/issues/96 and solved via https://github.com/github/orchestrator/pull/98 (no release yet)

My current GTID testing environment is happily identifying GTID topologies.

https://github.com/github/orchestrator/pull/106 makes the web interface recognize a GTID master as "using GTID" -- but this is a visualization matter only; recoveries are using a lower level logic.

shlomi-noach commented 7 years ago

@ecortestws can you test https://github.com/github/orchestrator/releases/tag/v2.1.0 ?

ecortestws commented 7 years ago

@shlomi-noach my servers are Oracle MySQL. Will try the new release and let you know.

ecortestws commented 7 years ago

@shlomi-noach I have tested it but it didn't work as expected.

orchestrator -version
2.1.0
05241ab2608de7ed5dd66a363690a33db36e9954

2017-02-14 08:15:34 DEBUG topology_recovery: RecoverDeadMaster: masterRecoveryType=MasterRecoveryPseudoGTID
2017-02-14 08:15:35 INFO ChangeMasterTo: Changed master on 10.102.92.162:3306 to: 10.102.92.161:3306, bin-log.000256:80539410. GTID: false 10.102.92.161:3306

ecortestws commented 7 years ago

The web interface now shows GTID enabled in the master.

shlomi-noach commented 7 years ago

@ecortestws thank you. Are you again looking at a A-B-C chain with graceful-master-takeover?

I'll run some more checks and may come back with more questions.

ecortestws commented 7 years ago

@shlomi-noach yes, the same approach, the same topology. Moved C from A to B before the takeover, and verified that the replication chain was healthy. I have all the logs, let me know if you need anything else. I understand that #93 hasn't been merged yet, so the issue with the credentials after the takeover is expected. Thanks.

shlomi-noach commented 7 years ago

93 is now merged

shlomi-noach commented 7 years ago

@ecortestws I'm happy if you can share the logs. If they contain sensitive data, can you please share them with me via email? My address is shlomi-noach@-youknowhichcompany-.com

shlomi-noach commented 7 years ago

OK I'm able to reproduce this.

The reason this happens: the auto_position is not set by default, and orchestrator uses that to recognize GTID replication. I'm looking into improving this.

shlomi-noach commented 7 years ago

@ecortestws can you please confirm https://github.com/github/orchestrator/releases/tag/v2.1.1-BETA works for you?

Make sure that the replicas are on auto_position=1, as this is a requirement for a GTID-based recovery.

ecortestws commented 7 years ago

@shlomi-noach it worked, but the replication was not started in the demoted master. Is it a expected behavior? The credentials were in place, and after execute "START SLAVE" in the old master it started syncing with the new master.

shlomi-noach commented 7 years ago

@ecortestws This is expected behavior. I see advantages and reasons for both starting and not starting replication automatically; "not starting" is on the safer side.

openark / orchestrator

GTID not found properly (5.7) and some graceful-master-takeover issues #78

93 is now merged