openark / orchestrator

MySQL replication topology management and HA
Apache License 2.0

After graceful failover, errant GTID on demoted master can't be fixed via the "FIX" button in orchestrator GUI #887

Closed hellracer closed 5 years ago

hellracer commented 5 years ago

Hi,

After a successful graceful failover everything works as expected, and the newly demoted master, now a slave, can keep up with the new master. However, the errant GTID is hard to fix because it keeps moving fast, especially on a busy system. That makes the "FIX" button in the GUI difficult to use, or even useless, in this situation.

This issue supersedes #885. I decided not to re-open it because I hadn't clearly understood the issue then; it has become clearer here.

[Screenshot: Screen Shot 2019-05-13 at 7 14 25 AM]

hellracer commented 5 years ago

Just to make clear, the GTID errant transaction range is constantly changing. The only way for me to fix this is to completely stop all DB traffic from the application and then click the "FIX" button.

Can this be improved? Or, if this is the intended behaviour, there should be a way to inform the user that the fix button only works when there's no traffic in the whole system, just to remove confusion.

Anyway, it's just my $0.02.

shlomi-noach commented 5 years ago

> Just to make clear the GTID errant transaction range is constantly changing

That's not an expected behavior. This would only happen if the old-master is still taking writes. According to your screenshot it's read_only. But is it also super_read_only? Perhaps there's a user with SUPER privileges still writing? A pt-heartbeat perhaps? An archiving job?

Your current situation is that the old-master is invalid. You should investigate what those errant GTIDs are. You can use orchestrator-client -c which-gtid-errant -i <old.master> and orchestrator-client -c locate-gtid-errant -i <old.master>.
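To illustrate why the errant range keeps moving while the old master still takes writes, here is a minimal plain-Python sketch of GTID set subtraction. It is not orchestrator's code; MySQL exposes the real operation as GTID_SUBTRACT(). The short UUIDs "aaaa" (new master) and "bbbb" (old master) and all transaction numbers are made up for the example.

```python
# Illustrative model only: errant GTIDs are the transactions a replica has
# executed that its master has not. The UUIDs used with it are hypothetical.

def parse_gtid_set(text):
    """Parse 'uuid:1-5:7,uuid2:3' into {uuid: set of transaction numbers}."""
    result = {}
    for entry in text.replace("\n", "").split(","):
        entry = entry.strip()
        if not entry:
            continue
        uuid, *intervals = entry.split(":")
        numbers = result.setdefault(uuid, set())
        for interval in intervals:
            start, _, end = interval.partition("-")
            # A single number such as '7' has no end; treat it as '7-7'.
            numbers.update(range(int(start), int(end or start) + 1))
    return result


def format_gtid_set(gtids):
    """Render {uuid: numbers} back into 'uuid:1-5:7' form, UUIDs sorted."""
    parts = []
    for uuid in sorted(gtids):
        numbers = sorted(gtids[uuid])
        if not numbers:
            continue
        intervals = []
        start = prev = numbers[0]
        for n in numbers[1:]:
            if n != prev + 1:
                intervals.append((start, prev))
                start = n
            prev = n
        intervals.append((start, prev))
        parts.append(uuid + ":" + ":".join(
            str(a) if a == b else f"{a}-{b}" for a, b in intervals))
    return ",".join(parts)


def errant_gtids(replica_executed, master_executed):
    """Return the replica's executed GTIDs minus the master's executed GTIDs."""
    replica = parse_gtid_set(replica_executed)
    master = parse_gtid_set(master_executed)
    diff = {u: nums - master.get(u, set()) for u, nums in replica.items()}
    return format_gtid_set(diff)


# Shortly after failover: the old master has three local writes of its own.
first = errant_gtids("aaaa:1-50,bbbb:1-103", "aaaa:1-50,bbbb:1-100")
# A moment later it has caught up on 'aaaa' but has written four more itself.
later = errant_gtids("aaaa:1-60,bbbb:1-107", "aaaa:1-60,bbbb:1-100")
print(first)
print(later)
```

In this model the errant set grows between the two snapshots only because the old master keeps generating transactions under its own UUID, so any fix computed from the first snapshot is already stale by the second.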

hellracer commented 5 years ago

@shlomi-noach

Indeed, pt-heartbeat is the culprit. When the previous master was demoted, pt-heartbeat still pointed to the old master and kept writing to the percona.heartbeat table. This wouldn't have happened if the user running pt-heartbeat didn't have the SUPER privilege, as you predicted.

That explains it :)

Again, please close this, as it is not an orchestrator issue but rather a user error :)
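The interaction between read_only, super_read_only and the SUPER privilege can be summarised with a small sketch. This is a simplified model written for this thread, not MySQL or orchestrator code; the function name can_write is hypothetical, and the model ignores replication threads and other exemptions.

```python
def can_write(read_only, super_read_only, has_super):
    """Simplified model: may a client connection write to this server?"""
    if super_read_only:
        # super_read_only blocks client writes even for SUPER accounts.
        return False
    if read_only:
        # Plain read_only only blocks accounts without SUPER.
        return has_super
    return True


# The situation in this issue: read_only is on, super_read_only is off,
# and the pt-heartbeat account has SUPER, so its writes still go through.
heartbeat_writes = can_write(read_only=True, super_read_only=False, has_super=True)
print(heartbeat_writes)
```

Under this model, either removing SUPER from the pt-heartbeat account or enabling super_read_only on the demoted master would stop the stray writes.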