Overview of the Issue
One of our applications makes use of MySQL's GET_LOCK functionality. This seems to work fine in general with Vitess, but as soon as we run a failover (PlannedReparentShard or an external failover via TabletExternallyReparented), we start seeing a stream of FailedPrecondition errors in our vtgate metrics. We also see the following warning:

Locking heartbeat failed, held locks might be released: target: <keyspace>.0.primary: vttablet: rpc error: code = FailedPrecondition desc = wrong tablet type: PRIMARY, want: REPLICA or []
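For context, the application's lock usage follows roughly this pattern (a minimal sketch; the DSN, lock name, and timeout are placeholders rather than our real values):

```go
package main

import (
	"context"
	"database/sql"
	"log"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	// Placeholder DSN pointing at vtgate's MySQL protocol port.
	db, err := sql.Open("mysql", "user:pass@tcp(vtgate:3306)/keyspace")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	ctx := context.Background()

	// GET_LOCK is connection-scoped, so pin one connection from the pool
	// and keep it for the lifetime of the lock.
	conn, err := db.Conn(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	var got sql.NullInt64
	if err := conn.QueryRowContext(ctx, "SELECT GET_LOCK('my_app_lock', 10)").Scan(&got); err != nil {
		log.Fatal(err)
	}
	if !got.Valid || got.Int64 != 1 {
		log.Fatal("could not acquire lock")
	}

	// ... do work while holding the lock ...

	// Release on the same connection; on a plain MySQL server, closing the
	// connection would also release the lock.
	var released sql.NullInt64
	if err := conn.QueryRowContext(ctx, "SELECT RELEASE_LOCK('my_app_lock')").Scan(&released); err != nil {
		log.Fatal(err)
	}
}
```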
I tracked down the warning to *ScatterConn.runLockQuery here: https://github.com/vitessio/vitess/blob/216fd70be49fa14ddd22ea97d26a9434770c0ca2/go/vt/vtgate/scatter_conn.go#L292-L299

I think what happens is that during a failover, the lock functionality still tries to check the lock against the old primary, which is no longer serving as a primary but as a replica. The lock check fails, but because runLockQuery runs in a separate goroutine in *ScatterConn.StreamExecuteMulti, the error is not visible to the client. The lock session information is not cleared either, so the vtgate connection is stuck believing that a lock is still being held (and will unsuccessfully re-check whether the lock is still held on every follow-up query in the session).
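To make the failure mode concrete, here is a heavily simplified sketch of the pattern as I understand it (illustrative only, not the actual Vitess code; see the permalink above): the lock re-check runs in a fire-and-forget goroutine, so its error is only logged, the session's lock state is never cleared, and the client keeps getting a successful stream.

```go
package main

import (
	"context"
	"errors"
	"log"
	"time"
)

// lockSession stands in for the vtgate session state that remembers a held lock.
type lockSession struct {
	lockHeld bool
}

// streamWithLockCheck mimics the shape of the problem: the streaming query and
// the lock re-check are decoupled, so a failing re-check never reaches the caller.
func streamWithLockCheck(
	ctx context.Context,
	session *lockSession,
	runLockQuery func(context.Context) error, // re-checks the held lock on the primary
	stream func(context.Context) error, // the actual streaming query
) error {
	if session.lockHeld {
		go func() {
			if err := runLockQuery(ctx); err != nil {
				// After a failover this fails with FailedPrecondition
				// ("wrong tablet type"), but the error is only logged:
				// session.lockHeld is not cleared and the client
				// connection stays open.
				log.Printf("Locking heartbeat failed, held locks might be released: %v", err)
			}
		}()
	}
	// The streaming query itself succeeds, so the caller sees no error at all.
	return stream(ctx)
}

func main() {
	session := &lockSession{lockHeld: true}
	failingLockQuery := func(context.Context) error {
		return errors.New("FailedPrecondition: wrong tablet type: PRIMARY, want: REPLICA")
	}
	okStream := func(context.Context) error { return nil }

	// Returns nil even though the lock re-check fails in the background.
	if err := streamWithLockCheck(context.Background(), session, failingLockQuery, okStream); err != nil {
		log.Fatal(err)
	}
	time.Sleep(100 * time.Millisecond) // give the background check a moment to log
	log.Printf("session still thinks lockHeld=%v", session.lockHeld)
}
```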
I'm not sure what the best approach to handle this could be. In "regular" MySQL, the lock is held for as long as the connection to MySQL is open, and it is only released when RELEASE_LOCK is called on that same connection or when the connection is closed. There is no other way to signal to the client that a lock has been released.

On vtgate, the only way to simulate this would be to close the client connection when the runLockQuery call fails. 😞
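If that is indeed the only option, the handling on the vtgate side might look roughly like this (hypothetical hook names, nothing that exists in vtgate today):

```go
package main

import "log"

// lockSession again stands in for the vtgate session state that remembers a held lock.
type lockSession struct {
	lockHeld bool
}

// onLockHeartbeatFailure sketches the idea: if the lock re-check fails, forget
// the lock and close the client connection, since a closed connection is the
// only signal a MySQL client understands for "your lock is gone".
func onLockHeartbeatFailure(session *lockSession, closeClientConn func() error) {
	session.lockHeld = false // stop re-checking the lock on every follow-up query
	if err := closeClientConn(); err != nil {
		log.Printf("failed to close client connection after losing lock: %v", err)
	}
}

func main() {
	session := &lockSession{lockHeld: true}
	onLockHeartbeatFailure(session, func() error {
		log.Print("closing client connection")
		return nil
	})
	log.Printf("lockHeld=%v", session.lockHeld)
}
```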
Reproduction Steps
n/a
Binary Version
Operating System and Environment details
Log Fragments