vitessio / vitess

Vitess is a database clustering system for horizontal scaling of MySQL.
http://vitess.io
Apache License 2.0

Dead connections in tablet pools can cause errors in some cases #7065

Open dweitzman opened 3 years ago

dweitzman commented 3 years ago

This is not the most helpful bug report, but my naive attempt to reproduce this by just setting wait_timeout and interactive_timeout to low values in MySQL and sending some queries didn't work.

We observed on 7.0.3 that if MySQL connections in the connection pool have been killed by some mysterious outside force, some vindex lookup insert statements sent from vtgate can fail at the tablet level without retries after trying to use a bad connection from the connection pool.

Ideally, when either starting a transaction or using an autocommit transaction from a transaction pool, a connection error would cause the connection to be silently recycled / reconnected without failing the tablet query server RPC. I suspect that in most cases that actually does work.

It seems like there may be some edge case involving autocommit transactions and DMLs where the tablet doesn't realize it can retry when the error was a connection error.

There does seem to be a certain amount of danger in doing retries after a connection error with a DML on an autocommit connection, for fear that WriteComQuery() didn't get a TCP ack (if you were using a tablet over the network with something like RDS) but the write actually did happen.

I wonder if it would make sense to have some sort of preemptive health check (a ComPing) on a connection when it's pulled from a pool if it hasn't been used at all in the last few minutes. For a tablet under load there would be essentially no extra cost, but for a low-qps tablet there'd be an extra layer of defense against this type of bug.

aquarapid commented 3 years ago

Related: #7290