Open Matthias247 opened 9 months ago
This is a very interesting feature in a production environment. It also offers the possibility of disabling the ping (check_idle_conn) to help the database under heavy load, and to always have a safeguard to avoid blocking in any situation.
Is your feature request related to a problem? Please describe.
Interactions with databases over a network are roughly structured in 3 phases:
During any of those phases the connection between client and server can get lost for a variety of reasons - e.g. networking issues, the server being turned off, etc. If that happens the client should quickly be informed about it with an error and move on (e.g. by retrying), and not be stuck.
By the nature of TCP/IP, the application (client) will not be guaranteed to be informed about each of those cases:
To avoid stuck applications (e.g. as described in https://github.com/launchbadge/sqlx/issues/2840), it is necessary that applications implement application side timeouts for all phases to avoid getting stuck between a very long time (for writes) and forever (for reads).
sqlx already implements these timeouts for phase 1) - the connection establishment - via the means of the timeout of acquiring a connection from the pool.
However sqlx does not seem to provide timeouts for the actual queries in phase 2) and 3).
Async Rust users can theoretically wrap every sqlx query in
tokio::timeout
and co to implement the desired behavior and to avoid stuck queries. HoweverFor this reason it would be great if sqlx could provide internal query timeouts for phases 2) and 3).
Theres probably various ways how timeouts could be defined: a) An overall timeout for the query. The deadline will be determined when the query is sent and each follow-up IO operation (sending data, reading peer data) is counted against that deadline. b) Individual timeouts for all phases of the query: Sending the query, streaming additional query data (rows), receiving the first part of the response, receiving follow-up rows. c) The sockets that are used for communicating with the peer can be wrapped in another
AsyncRead/Write
wrapper which implements a fixed read or write timeout. That at least guarantees that the application is alive and is able to exchange some bytes once in a while.With HTTP where streams can be very long-lived option b) is probably the way to go, but I'm not sure whether that would be required for SQL.
Option c) might be the easiest to implement - but I don't know enough about sqlx internals to know for sure. Compared to a) it would however still allow for queries which can overall take an indefinite amount of time.
Safe Defaults
I would also recommend to also enable query timeouts by default, so that unaware users still get safe behavior. While that might be technically a breaking change, a high timeout (like 3 minutes) is probably safe for nearly every application.
About TCP keepalives
Another option to avoid TCP connections from being stuck forever is by enabling TCP keepalives on the socket. While this option also exists, its unfortunately less reliable than implementing timeouts in the application. The main reason here is that keepalives are a TCP protocol extension and therefore is not universally available or might get filtered by middleboxes.