vitessio / vitess

Vitess is a database clustering system for horizontal scaling of MySQL.
http://vitess.io
Apache License 2.0
18.73k stars 2.1k forks source link

Bug Report: VTGate does not rollback open transactions on shutdown #16840

Open brendar opened 2 months ago

brendar commented 2 months ago

Overview of the Issue

On shutdown, vtgate will log messages indicating that it has rolled back open transactions, but they're not being rolled back.

The rollback on shutdown functionality was introduced in https://github.com/vitessio/vitess/pull/5659 but I don't know if it ever worked. The test accompanying that PR doesn't actually assert transaction rollback, it's just asserting that a new transaction can't see rows written by the other open transaction, which is expected behavior at the default isolation level.

I've updated that test in https://github.com/vitessio/vitess/pull/16839 which now fails with

Error:          Received unexpected error:
                target: ks.0.primary: vttablet: Statement aborted because lock(s) could not be acquired immediately and NOWAIT is set. (errno 3572) (sqlstate HY000) (CallerID: userData1): Sql: "select * from buffer where id = :id /* INT64 */ for update nowait", BindVars: {#maxLimit: "type:INT64 value:\"10001\""id: "type:INT64 value:\"3\""} (errno 3572) (sqlstate HY000) during query: select * from buffer where id = 3 for update nowait
Test:           TestTransactionRollBackWhenShutDown
Messages:       for query: select * from buffer where id = 3 for update nowait

Reproduction Steps

1. Setup (shell terminal 1)

./101_initial_cluster.sh
mysql -e 'show vitess_tablets' # determine the primary tablet
tail -f vtdataroot/vt_0000000101/vttablet.out # or whichever tablet is the primary

2. Start a transaction (mysql terminal)

use commerce;
begin;
insert into customer (email) values ('foo');

3. Shut down vtgate (shell terminal 2)

../common/scripts/vtgate-down.sh

Results After 10 seconds the vtgate-down script will exit after the vtgate process terminates

After 30 seconds you'll see a vttablet log like

W0924 13:04:32.225585   85718 tx_pool.go:133] killing transaction (exceeded timeout: 30s): 1727197254734999002  'principal:"brendan" subcomponent:"VTGate MySQL Connector"' 'username:"userData1"'  Sep 24 13:04:02.185697  Jan  1 00:00:00.000000  -9223372036.854776      insert into customer(email) values ('foo');

But the vtgate log will contain a message like

I0924 13:04:16.362320   87658 plugin_mysql_server.go:708] All connections have been rolled back.

Binary Version

Tested on main (`83b37b8e8fbc2353dd511f0e1223271e5879556c`) and v18

Operating System and Environment details

Tested on linux and locally on macos

Log Fragments

No response

brendar commented 2 months ago

It looks like there's a race between servenv.OnClose(srv.rollbackAtShutdown) https://github.com/vitessio/vitess/blob/60d8927af1402221bd130011d933cffa1a983bcb/go/vt/vtgate/vtgate.go#L369

and vtg.Gateway().Close(ctx) https://github.com/vitessio/vitess/blob/60d8927af1402221bd130011d933cffa1a983bcb/go/cmd/vtgate/cli/cli.go#L192-L194

This causes txc.queryService(ctx, s.TabletAlias) in TxConn.ReleaseAll to get error tablet: cell:"zone1" uid:101 is either down or nonexistent because the healthcheck has been closed. https://github.com/vitessio/vitess/blob/60d8927af1402221bd130011d933cffa1a983bcb/go/vt/vtgate/tx_conn.go#L427

brendar commented 2 months ago

As far as fixing this, it seems like rollbackAtShutdown shouldn't be run OnClose since the components it depends on are likely being shut down simultaneously.

Would it make sense to call rollbackAtShutdown at the end of shutdownMysqlProtocolAndDrain (which is called OnTermSync)? It might require setting a limit on how long shutdownMysqlProtocolAndDrain waits for connections to be drained/idle, to leave enough time for rollbacks to complete before onterm_timeout is reached, but it looks plausible.

brendar commented 1 month ago

The impact of this bug is likely reduced in v19+ due to https://github.com/vitessio/vitess/pull/14219 That change ensures that new transactions cannot be started during vtgate shutdown, which reduces the number of transactions left open at shutdown. At that point only transactions which were started before the shutdown period (onterm_timeout which defaults to 10s) would still be open.

harshit-gangal commented 1 month ago

As suggested, looks like rollbackAtShutdown be best called inside shutdownMysqlProtocolAndDrain but we would need some wait timeout before proceeding with it.

If #14219 reduces the issue for you, I would hold off any changes in this area for this issue.