Can you please issue the following on your master?
select @@global.gtid_mode, @@global.gtid_purged
That makes sense. We need to solve the GTID recognition problem.
By default orchestrator does not RESET SLAVE ALL and does not SET GLOBAL read_only=0. To do both, set ApplyMySQLPromotionAfterMasterFailover: true.
I apologize for the inconvenient name. I'll be working to minimize the number of configuration params.
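For reference, those are the statements you would otherwise run yourself on the promoted server when the option is left at its default; a minimal sketch:
-- on the server being promoted
RESET SLAVE ALL;          -- discard its old replication configuration
SET GLOBAL read_only = 0; -- make it writable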
As per https://github.com/github/orchestrator/pull/57:
What's missing in this story is the MASTER_USER and MASTER_PASSWORD, which are likely to not exist, because the old master, being a master, likely had no replication info of its own. So even after being positioned, the old master can't truly replicate from the promoted master. Nonetheless, it is placed at the correct position to resume replication once credential settings are applied.
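As a rough sketch of what applying those credential settings on the demoted master could look like ('repl_user' and 'repl_password' are placeholders for your actual replication credentials; since no host or coordinates are specified, the previously set position is kept):
-- on the demoted (old) master; credentials below are hypothetical placeholders
CHANGE MASTER TO MASTER_USER = 'repl_user', MASTER_PASSWORD = 'repl_password';
START SLAVE;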
The problem is orchestrator doesn't have the username & password of your replication user. This can be easily scripted on the user's side.
I really think that in the event of a planned takeover the user should choose the identity of the new master. If orchestrator were to choose the identity -- fine, but there is no promise that everything would work. Perhaps your setup is such that the promoted server would not be the one you'd expect.
You may find such a statement confusing. Your own setup may be simple enough, but there are various setups that are not as simple to deal with: servers without log-slave-updates (which can happen with 5.7 GTID), a mixture of 5.6 and 5.7, etc. Some servers may not be able to grab the VIP the current master holds, or may be in an unreliable physical location. Please understand that orchestrator has "seen it all" and much of its behavior is crafted by experiencing non-trivial scenarios.
To this end, when things go bad, orchestrator is very smart about making the best of the situation. But for planned failovers, it would very much like you to set up your topology in a way that makes sense to you and will guarantee survival of all the servers you care about.
mysql> select @@global.gtid_mode, @@global.gtid_purged\G
*************************** 1. row ***************************
@@global.gtid_mode: ON
@@global.gtid_purged: 17255cd9-b2f6-11e6-b59d-005056946d8b:1-15546,
604d9088-a5c6-11e6-8f72-005056945836:1-8796013,
9cb4118b-a5c6-11e6-96c0-005056945189:1-37645
1 row in set (0.00 sec)
There are so many options that I didn't see that one. I'll check it, thanks!
About the username/password issue: orchestrator could read the username/password from the new master just before the takeover and use them when demoting the old master. Otherwise, some kind of warning along the lines of "no credentials found" would be useful.
You are right about the planned takeover; it's just that, reading the documentation, I understood that orchestrator could require several steps to finish in the target state. This would be that scenario, and the only requirement would be that you specify the new master rather than letting orchestrator select it. As you said, it can be done on the user's side :-)
GTID: looking into it!
About the username/password issue: orchestrator could read the username/password from the new master just before the takeover
It cannot. You cannot reveal the password via SHOW SLAVE STATUS. There is a potential solution (utilized by orchestrator) in the event you use system tables for master-info.
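For what it's worth, a quick way to check whether that is the case (assuming MySQL 5.6/5.7, where the master_info_repository variable exists):
select @@global.master_info_repository;  -- 'TABLE' means master info is stored in mysql.slave_master_info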
it's just that, reading the documentation, I understood that orchestrator could require several steps to finish in the target state
More than anything, I'd appreciate help with documentation!
@shlomi-noach: slave username and password information are available via mysql.slave_master_info:
root@somehost [mysql]> select Host, Port, User_name, User_password from slave_master_info;
+-----------------------+------+-----------+---------------+
| Host | Port | User_name | User_password |
+-----------------------+------+-----------+---------------+
| somehost.mydomain.com | 3306 | some_user | some_password |
+-----------------------+------+-----------+---------------+
1 row in set (0.00 sec)
So in theory you could try these credentials. However, depending on existing grants this may or may not work as expected, as the grant may be for 'some_user'@'%' (any address), 'some_user'@'192.168.9.10' (a specific address), or any combination in between, some of which may work and others may not.
It may be worth having an option to try or check the configuration but specific site configs may vary.
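One way to check which user@host combinations actually exist for a replication user ('some_user' is a placeholder):
-- list the user@host combinations defined for the replication user
SELECT user, host FROM mysql.user WHERE user = 'some_user';
-- then inspect the grants of one specific combination returned above
SHOW GRANTS FOR 'some_user'@'%';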
For what it's worth, for planned topology changes (of the master) I don't use orchestrator but custom scripts. This gives a bit more control and reduces downtime, but orchestrator is nearly always used manually both before and afterwards to arrange the topology as needed to minimise the impact of the master changeover. I guess I could use orchestrator, and most of what's described here is what I do already, but I have more freedom to check things both before and afterwards, which makes me feel more comfortable. Maybe I need to look again at how well orchestrator handles this task, as it simplifies things if the amount of software used is reduced.
That's right, the orchestrator user has SELECT permission on slave_master_info and the information is there; it just needs to be read and used. If you expect orchestrator to execute this task cleanly, the user should ensure that the replication user has permissions on all the nodes involved.
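A minimal sketch of ensuring the replication user exists on a node (MySQL 5.7 syntax; the user name, password and host pattern are placeholders to adapt to your environment):
CREATE USER IF NOT EXISTS 'repl_user'@'%' IDENTIFIED BY 'repl_password';
GRANT REPLICATION SLAVE ON *.* TO 'repl_user'@'%';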
Reading credentials from slave_master_info is already implemented for make-co-master, so it should be easy to apply to graceful-takeover.
Hello, just wanted to report the same issue with GTID-based replication (Percona 5.7.16) not being recognized on a simple master + 4 slaves topology.
On the master:
+--------------------+----------------------+
| @@global.gtid_mode | @@global.gtid_purged |
+--------------------+----------------------+
| ON | |
+--------------------+----------------------+
1 row in set (0.00 sec)
@fuyar thank you!
no problem @shlomi-noach :)
Seems like orchestrator was finally able to detect GTIDs on the 4 slaves (I rechecked this morning, having changed nothing in the meantime).
oracle_gtid is still 0 for the master in the database_instance table, but as the master is not a slave of anyone it should be OK, I suppose?
but as the master is not a slave of anyone it should be OK, I suppose?
That's the very bug: because the master is not identified as GTID-enabled, orchestrator doesn't run a GTID-based failover.
OK, I have taken a closer look into GTID recoveries: oracle_gtid in database_instance, or the fact that it shows GTID based replication: false, is irrelevant to the failover mechanism.
@ecortestws my last comment suggests that:
This issue causes that the takeover doesn't use GTID (I guess)
is wrong. Are you able to show that the recovery was not based on GTID? I do mean it's a completely valid assumption on your side, but I believe it is incorrect. The logs actually specify the type of recovery. Look for:
topology_recovery: RecoverDeadMaster: masterRecoveryType=...
I realize that was 15 days ago and you may not have the logs at this time.
Hi @shlomi-noach:
from the logs:
2017-02-14 08:15:34 DEBUG topology_recovery: RecoverDeadMaster: masterRecoveryType=MasterRecoveryPseudoGTID
@ecortestws thank you. Then, indeed, orchestrator didn't recognize this to be a GTID recovery.
Applying replication-credentials on demoted master is addressed by https://github.com/github/orchestrator/pull/93
Hi @shlomi-noach, any progress on the GTID issue? Thanks, Eduardo
@ecortestws Perfect timing. I am setting up an environment for this now.
@ecortestws can you confirm your servers are Percona Server? If so, this is identified in https://github.com/github/orchestrator/issues/96 and solved via https://github.com/github/orchestrator/pull/98 (no release yet)
My current GTID testing environment is happily identifying GTID topologies.
https://github.com/github/orchestrator/pull/106 makes the web interface recognize a GTID master as "using GTID" -- but this is a visualization matter only; recoveries are using a lower level logic.
@ecortestws can you test https://github.com/github/orchestrator/releases/tag/v2.1.0 ?
@shlomi-noach my servers are Oracle MySQL. Will try the new release and let you know.
@shlomi-noach I have tested it but it didn't work as expected.
orchestrator -version
2.1.0
05241ab2608de7ed5dd66a363690a33db36e9954
2017-02-14 08:15:34 DEBUG topology_recovery: RecoverDeadMaster: masterRecoveryType=MasterRecoveryPseudoGTID
2017-02-14 08:15:35 INFO ChangeMasterTo: Changed master on 10.102.92.162:3306 to: 10.102.92.161:3306, bin-log.000256:80539410. GTID: false 10.102.92.161:3306
The web interface now shows GTID enabled in the master.
@ecortestws thank you. Are you again looking at an A-B-C chain with graceful-master-takeover?
I'll run some more checks and may come back with more questions.
@shlomi-noach yes, the same approach, the same topology. Moved C from A to B before the takeover, and verified that the replication chain was healthy. I have all the logs, let me know if you need anything else. I understand that #93 hasn't been merged yet, so the issue with the credentials after the takeover is expected. Thanks.
@ecortestws I'm happy if you can share the logs. If they contain sensitive data, can you please share them with me via email? My address is shlomi-noach@-youknowhichcompany-.com
OK I'm able to reproduce this.
The reason this happens: auto_position is not set by default, and orchestrator uses that to recognize GTID replication. I'm looking into improving this.
@ecortestws can you please confirm https://github.com/github/orchestrator/releases/tag/v2.1.1-BETA works for you?
Make sure that the replicas are on auto_position=1, as this is a requirement for a GTID-based recovery.
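A minimal sketch of switching a replica to auto-positioning (assumes gtid_mode=ON on both the replica and its master):
-- on each replica
STOP SLAVE;
CHANGE MASTER TO MASTER_AUTO_POSITION = 1;
START SLAVE;
SHOW SLAVE STATUS\G  -- the Auto_Position field should now read 1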
@shlomi-noach it worked, but replication was not started on the demoted master. Is that expected behavior? The credentials were in place, and after executing START SLAVE on the old master it started syncing with the new master.
@ecortestws This is expected behavior. I see advantages and reasons for both starting and not starting replication automatically; "not starting" is on the safer side.
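So resuming replication on the demoted master remains a manual step once the credentials are confirmed; a sketch:
-- on the demoted (old) master
START SLAVE;
SHOW SLAVE STATUS\G  -- confirm Slave_IO_Running and Slave_SQL_Running are both 'Yes'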
Hi,
I am testing orchestrator with 5.7.17, Master and two slaves. Have moved one of the slaves to change the topology like A-B-C and then executed orchestrator -c graceful-master-takeover -alias myclusteralias
The issues found are:
1. The takeover doesn't use GTID (I guess this issue causes that).
2. After the takeover, the demoted master is left without replication credentials (MASTER_USER/MASTER_PASSWORD are not set, even though they are available in mysql.slave_master_info in the cluster).
Thanks for this amazing tool! Regards, Eduardo