Open brendar opened 2 months ago
It looks like there's a race between servenv.OnClose(srv.rollbackAtShutdown)
https://github.com/vitessio/vitess/blob/60d8927af1402221bd130011d933cffa1a983bcb/go/vt/vtgate/vtgate.go#L369
and vtg.Gateway().Close(ctx)
https://github.com/vitessio/vitess/blob/60d8927af1402221bd130011d933cffa1a983bcb/go/cmd/vtgate/cli/cli.go#L192-L194
This causes txc.queryService(ctx, s.TabletAlias) in TxConn.ReleaseAll to fail with the error tablet: cell:"zone1" uid:101 is either down or nonexistent, because the healthcheck has already been closed.
https://github.com/vitessio/vitess/blob/60d8927af1402221bd130011d933cffa1a983bcb/go/vt/vtgate/tx_conn.go#L427
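To make the ordering problem concrete, here is a minimal sketch of the failure mode. It is a toy model and not the vtgate code: healthCheck, rollbackOpenTx, and shutdown are hypothetical names, and the two hooks run sequentially in a fixed order here, whereas in the real process the order is not guaranteed.

```go
package main

import (
	"errors"
	"sync"
)

// errTabletDown stands in for the "either down or nonexistent" error.
var errTabletDown = errors.New("tablet is either down or nonexistent")

// healthCheck is a hypothetical stand-in for the gateway healthcheck.
// It only tracks whether it has been closed.
type healthCheck struct {
	mu     sync.Mutex
	closed bool
}

// Close marks the healthcheck as closed.
func (hc *healthCheck) Close() {
	hc.mu.Lock()
	defer hc.mu.Unlock()
	hc.closed = true
}

// queryService mimics a tablet lookup that fails once the healthcheck is closed.
func (hc *healthCheck) queryService() error {
	hc.mu.Lock()
	defer hc.mu.Unlock()
	if hc.closed {
		return errTabletDown
	}
	return nil
}

// rollbackOpenTx tries to roll back openTx transactions and reports how many
// succeeded. It stops at the first lookup failure.
func rollbackOpenTx(hc *healthCheck, openTx int) (int, error) {
	rolledBack := 0
	for i := 0; i < openTx; i++ {
		if err := hc.queryService(); err != nil {
			return rolledBack, err
		}
		rolledBack++
	}
	return rolledBack, nil
}

// shutdown runs the two shutdown steps in one of the two possible orders.
func shutdown(closeFirst bool, openTx int) (int, error) {
	hc := &healthCheck{}
	if closeFirst {
		// The gateway close wins the race: every rollback fails.
		hc.Close()
		return rollbackOpenTx(hc, openTx)
	}
	// The rollback runs while the healthcheck is still usable.
	n, err := rollbackOpenTx(hc, openTx)
	hc.Close()
	return n, err
}
```

In this model, shutdown(true, 3) rolls back nothing and returns the tablet error, while shutdown(false, 3) rolls back all three transactions.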
As for fixing this, it seems like rollbackAtShutdown shouldn't be run OnClose, since the components it depends on are likely being shut down simultaneously.
Would it make sense to call rollbackAtShutdown at the end of shutdownMysqlProtocolAndDrain (which is called OnTermSync)? It might require setting a limit on how long shutdownMysqlProtocolAndDrain waits for connections to be drained/idle, to leave enough time for rollbacks to complete before onterm_timeout is reached, but it looks plausible.
The impact of this bug is likely reduced in v19+ due to https://github.com/vitessio/vitess/pull/14219
That change ensures that new transactions cannot be started during vtgate shutdown, which reduces the number of transactions left open at shutdown. At that point only transactions which were started before the shutdown period (onterm_timeout, which defaults to 10s) would still be open.
As suggested, it looks like rollbackAtShutdown would be best called inside shutdownMysqlProtocolAndDrain, but we would need some wait timeout before proceeding with it.
If #14219 reduces the issue for you, I would hold off on any changes in this area for this issue.
Overview of the Issue
On shutdown, vtgate will log messages indicating that it has rolled back open transactions, but they're not being rolled back.
The rollback-on-shutdown functionality was introduced in https://github.com/vitessio/vitess/pull/5659 but I don't know if it ever worked. The test accompanying that PR doesn't actually assert that the transaction was rolled back; it only asserts that a new transaction can't see rows written by the other open transaction, which is expected behavior at the default isolation level.
I've updated that test in https://github.com/vitessio/vitess/pull/16839 which now fails with
Reproduction Steps
1. Setup (shell terminal 1)
2. Start a transaction (mysql terminal)
3. Shut down vtgate (shell terminal 2)
Results
After 10 seconds, the vtgate-down script will exit once the vtgate process terminates
After 30 seconds you'll see a vttablet log like
But the vtgate log will contain a message like
Binary Version
Operating System and Environment details
Log Fragments
No response