vitessio / vitess

Vitess is a database clustering system for horizontal scaling of MySQL.
http://vitess.io
Apache License 2.0

Bug Report: Consistent stream of `failed_precondition` errors caused by `GET_LOCK` and failover #17251

Open arthurschreiber opened 3 days ago

arthurschreiber commented 3 days ago

Overview of the Issue

One of our applications makes use of MySQL's GET_LOCK functionality.

This seems to work fine in general with Vitess, but as soon as we run a failover (PlannedReparentShard or external failover via TabletExternallyReparented), we start seeing a stream of FailedPrecondition errors in our vtgate metrics.

We also see the following warning:

Locking heartbeat failed, held locks might be released: target: <keyspace>.0.primary: vttablet: rpc error: code = FailedPrecondition desc = wrong tablet type: PRIMARY, want: REPLICA or []

I tracked down the warning to *ScatterConn.runLockQuery here: https://github.com/vitessio/vitess/blob/216fd70be49fa14ddd22ea97d26a9434770c0ca2/go/vt/vtgate/scatter_conn.go#L292-L299

I think what happens is that during a failover, the lock heartbeat still checks the lock against the old primary, which is no longer serving as a primary but as a replica. The check fails, but because runLockQuery runs in a separate goroutine in *ScatterConn.StreamExecuteMulti, the error is not visible to the client. The lock session information is not cleared either, so the vtgate connection is stuck believing that a lock is still being held, and it unsuccessfully re-checks the lock on every follow-up query in the session.
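
To make the suspected failure mode concrete, here is a minimal, self-contained Go sketch of the pattern. It is not Vitess code: `gateway`, `heartbeat`, `execute`, and `runDemo` are hypothetical names, and the heartbeat fires on every query as a simplification. It only illustrates how an error raised in a fire-and-forget goroutine ends up in a log while the client sees success and the session keeps its lock flag.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errWrongTabletType mimics the FailedPrecondition error quoted in the warning.
var errWrongTabletType = errors.New("wrong tablet type: PRIMARY, want: REPLICA")

// gateway is a hypothetical stand-in for vtgate; none of these names are Vitess APIs.
type gateway struct {
	mu       sync.Mutex
	wg       sync.WaitGroup
	demoted  bool     // true once the old primary has become a replica
	lockHeld bool     // session state: set by GET_LOCK, never cleared on failure
	warnings []string // what ends up in the log instead of reaching the client
}

// heartbeat checks the lock against the tablet that originally granted it.
func (g *gateway) heartbeat() error {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.demoted {
		return errWrongTabletType
	}
	return nil
}

// execute runs a client query. The heartbeat is fired in its own goroutine,
// so its error can only be logged; the client always sees a nil error here.
func (g *gateway) execute(query string) error {
	g.mu.Lock()
	inLockSession := g.lockHeld
	g.mu.Unlock()
	if inLockSession {
		g.wg.Add(1)
		go func() {
			defer g.wg.Done()
			if err := g.heartbeat(); err != nil {
				g.mu.Lock()
				g.warnings = append(g.warnings, "Locking heartbeat failed: "+err.Error())
				g.mu.Unlock()
			}
		}()
	}
	return nil
}

// runDemo simulates three queries after a failover and reports what each side sees.
func runDemo() (clientErrs, warnings int, stillLocked bool) {
	g := &gateway{lockHeld: true, demoted: true}
	for i := 0; i < 3; i++ {
		if err := g.execute("SELECT 1"); err != nil {
			clientErrs++
		}
	}
	g.wg.Wait() // wait for the background heartbeats to finish
	g.mu.Lock()
	defer g.mu.Unlock()
	return clientErrs, len(g.warnings), g.lockHeld
}

func main() {
	clientErrs, warnings, stillLocked := runDemo()
	fmt.Println(clientErrs, warnings, stillLocked)
}
```

In this sketch the client sees zero errors, three warnings are logged, and the session still believes it holds the lock, which matches the behavior described above.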


I'm not sure what the best way to handle this is. In "regular" MySQL, the lock is held for as long as the connection to MySQL is open, and it is released only when RELEASE_LOCK is called on that same connection or when the connection is closed. There's no other way to signal to the client that a lock has been released.

On vtgate, the only way to simulate this would be to close the client connection when the runLockQuery call fails. 😞
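
As a rough illustration of that idea (again a hypothetical sketch, not a proposed patch: `clientConn`, `onHeartbeat`, and `errConnClosed` are made-up names), a failed heartbeat could clear the lock state and mark the connection closed, so that the next query fails loudly instead of silently continuing without the lock:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var (
	// errWrongTabletType mimics the heartbeat failure seen after a failover.
	errWrongTabletType = errors.New("wrong tablet type: PRIMARY, want: REPLICA")
	// errConnClosed is what the client would observe on its next query.
	errConnClosed = errors.New("connection closed: advisory lock may have been released")
)

// clientConn is a hypothetical client connection as seen by the proxy.
type clientConn struct {
	mu       sync.Mutex
	lockHeld bool
	closed   bool
}

// onHeartbeat handles a heartbeat result. On failure it drops the lock state
// and closes the connection, mirroring how MySQL signals a lost lock.
func (c *clientConn) onHeartbeat(err error) {
	if err == nil {
		return
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	c.lockHeld = false
	c.closed = true
}

// execute rejects queries on a closed connection.
func (c *clientConn) execute(query string) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.closed {
		return errConnClosed
	}
	return nil
}

func main() {
	c := &clientConn{lockHeld: true}
	c.onHeartbeat(nil)
	fmt.Println(c.execute("SELECT 1")) // heartbeat OK, query succeeds
	c.onHeartbeat(errWrongTabletType)
	fmt.Println(c.execute("SELECT 1")) // heartbeat failed, query is rejected
}
```

The trade-off is that the client loses its whole session, not just the lock, but it can no longer keep running under the false assumption that the lock is still held.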

Reproduction Steps

n/a

Binary Version

n/a

Operating System and Environment details

n/a

Log Fragments

n/a