openark / orchestrator

MySQL replication topology management and HA

Is graceful-master-takeover-auto checking for active transactions? #1380

Closed: laurent-indermuehle closed this issue 3 years ago

laurent-indermuehle commented 3 years ago

I'm not confident enough to let Orchestrator manipulate my MariaDB replication_sets. In the past, it crashed in the middle of a graceful-master-takeover-auto with too little explanation, leaving replication broken or with errant GTIDs.

To find out what went wrong, I started reading the code. I'm under the impression that Orchestrator doesn't check for active transactions at all. Am I looking at the right file? https://github.com/openark/orchestrator/blob/dddec836a64e7680fa731ba98473669b17041d37/go/logic/topology_recovery.go#L2022

It may be a problem with MariaDB rather than Orchestrator. I don't know of a reliable way to block all new write transactions other than locking all tables (FLUSH TABLES WITH READ LOCK).

What I'm really asking is for help finding where in the code the logic lives that determines whether the takeover is safe. Thanks.

shlomi-noach commented 3 years ago

I'm under the impression that Orchestrator doesn't check for active transactions at all.

You are correct.

I don't know of a reliable way to block all new write transactions other than locking all tables (FLUSH TABLES WITH READ LOCK).

Current behavior is to set read_only=1, which blocks (rejects) all new transactions and waits for existing transactions to complete. I will avoid using FLUSH TABLES WITH READ LOCK, as in my experience it is a longer, more resource-intensive operation.

where the logic for determining if the takeover is safe is in the code

Alas, as you suggested, there is no check for long running transactions.

laurent-indermuehle commented 3 years ago

As always, your help is quick and clear. Thanks for the reply!

In that case, it may be a job for a hook?

PreFailoverProcesses: executed immediately before orchestrator takes recovery action. Failure (non-zero exit code) of any of these processes aborts the recovery. Hint: this gives you the opportunity to abort recovery based on some internal state of your system.

I could add a loop that checks for active connections and returns either after they have all finished or after a timeout. Replication-Manager handles this in many steps; it even kills long-running transactions (after all, I notified the users... if they won't stop their applications, it's their fault).

But it may be easier to implement that in Go?
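
For instance, the hook could be a small Go program. This is only a sketch of the idea: the DSN, timeout, and threshold below are assumptions on my part, and the query only counts open InnoDB transactions.

```go
// prefailover_check.go — hypothetical hook, not part of orchestrator.
// It would be wired in via the PreFailoverProcesses setting in
// orchestrator.conf.json; a non-zero exit code aborts the recovery.
package main

import (
	"database/sql"
	"log"
	"os"
	"time"

	_ "github.com/go-sql-driver/mysql" // MySQL/MariaDB driver
)

const (
	dsn     = "orchestrator:secret@tcp(127.0.0.1:3306)/" // placeholder DSN
	timeout = 2 * time.Minute
	poll    = 2 * time.Second
)

// activeTransactions counts transactions still open inside InnoDB.
func activeTransactions(db *sql.DB) (int, error) {
	var n int
	err := db.QueryRow("SELECT COUNT(*) FROM information_schema.innodb_trx").Scan(&n)
	return n, err
}

func main() {
	db, err := sql.Open("mysql", dsn)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		n, err := activeTransactions(db)
		if err != nil {
			log.Fatal(err)
		}
		if n == 0 {
			os.Exit(0) // safe to proceed with the takeover
		}
		log.Printf("%d active transaction(s); waiting...", n)
		time.Sleep(poll)
	}
	log.Print("timeout reached; aborting takeover")
	os.Exit(1) // orchestrator aborts the recovery
}
```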

FLUSH TABLES WITH READ LOCK may be slow, but if it guarantees data integrity, I'd rather have it. In my case, speed is not an issue because I'm relying on a floating IP that takes up to 5 minutes to register in DNS.

One thing that is unclear to me is what happens after the read_only lock is in place if a very long transaction makes us hit the timeout. What would Orchestrator do? We could choose between leaving read_only = ON or rolling back the takeover. Again, maybe that's a job for the PostUnsuccessfulFailoverProcesses hook.
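
If we chose to roll back, that hook could simply re-open the old master for writes. A hypothetical sketch (the DSN is a placeholder):

```go
// postfailover_rollback.go — hypothetical PostUnsuccessfulFailoverProcesses
// hook: if the takeover was aborted, re-open the old master for writes.
package main

import (
	"database/sql"
	"log"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	db, err := sql.Open("mysql", "orchestrator:secret@tcp(127.0.0.1:3306)/") // placeholder DSN
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Undo the read_only flag that was set for the attempted takeover.
	if _, err := db.Exec("SET GLOBAL read_only = 0"); err != nil {
		log.Fatal(err)
	}
	log.Print("read_only disabled; old master accepts writes again")
}
```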

binwiederhier commented 3 years ago

I have not read your entire ticket, but in our environment consistency is also the top priority. We set a flag in Consul in the pre-failover hook. The flag takes care of stopping new traffic to the MySQL host and killing all active connections, so when we exit the script we know there is no more traffic.

We facilitate this through Consul -> consul-template -> script -> HAProxy. The "GitHub" solution.
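
In rough outline, the pre-failover hook does something like this. This is a simplified sketch; the key name and setup are illustrative, not our actual configuration.

```go
// Sketch of raising a "failover in progress" flag in Consul. consul-template
// watches the key and regenerates the HAProxy config to drain traffic from
// the master; the connection-killing step is handled elsewhere.
package main

import (
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig()) // honors CONSUL_HTTP_ADDR
	if err != nil {
		log.Fatal(err)
	}

	pair := &api.KVPair{
		Key:   "mysql/cluster1/failover_in_progress", // hypothetical key
		Value: []byte("1"),
	}
	if _, err := client.KV().Put(pair, nil); err != nil {
		log.Fatal(err)
	}
	log.Print("failover flag set; HAProxy will stop routing new connections to the master")
}
```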

shlomi-noach commented 3 years ago

Agreed, hooks are the way to solve this today.

As for orchestrator owning detection of queries, I'm not sure. One problem is giving orchestrator access to the queries: to determine whether there's a long-running query, orchestrator would need visibility into the details of all running queries.

Another problem is an inherent race condition: orchestrator may check whether there's a long-running query, determine there isn't, and just as it proceeds to begin the planned failover, a long-running query begins. The chances of that are slim, but if users "just trust" orchestrator to handle this on their behalf, they will think orchestrator is misbehaving when it does happen.

On planned failover, my thought is that since it is a planned operation, the user should prepare for it and run it when it makes sense and the system is in a good state. That may include any number of parameters, one of which is long-running queries. Delegating to a pre-failover hook seems like a reasonable approach.

laurent-indermuehle commented 3 years ago

@binwiederhier and @shlomi-noach thanks for your insights.

I'll investigate how to prepare the takeover with hooks then.

My use case is special in the sense that I host databases for many applications, which don't always have designated administrators. I have to kill connections and forbid new ones myself.

Since Orchestrator provides all the hooks I need, I'm closing the issue. Thanks again for your kind help!