launchbadge / sqlx

🧰 The Rust SQL Toolkit. An async, pure Rust SQL crate featuring compile-time checked queries without a DSL. Supports PostgreSQL, MySQL, and SQLite.
Apache License 2.0
13.48k stars 1.28k forks source link

Query timeouts #3060

Open Matthias247 opened 9 months ago

Matthias247 commented 9 months ago

Is your feature request related to a problem? Please describe.

Interactions with databases over a network are roughly structured in 3 phases:

  1. The client connects the database (may include DNS lookup, TCP connection establishment, TLS handshake)
  2. The client sends a query (SQL statement)
  3. The clients receives the response from the server. That might actually be a stream with lots of results (rows)

During any of those phases the connection between client and server can get lost for a variety of reasons - e.g. networking issues, the server being turned off, etc. If that happens the client should quickly be informed about it with an error and move on (e.g. by retrying), and not be stuck.

By the nature of TCP/IP, the application (client) will not be guaranteed to be informed about each of those cases:

  1. If a client is actively trying to send data in phase 2 (outgoing TCP write buffer not empty) and does not get acknowledegments, the TCP connection will indeed be marked as failed by the OS after some long amount of time (about 20min, depends on kernel configs). It will also be marked as failed if the retransmission of a packet caused a peer to emit a TCP reset packet - e.g. because the server is no longer running and there is another component that knows about this and actually sends reset.
  2. If a client is just trying to receive data, nothing will happen. The client will just stick around forever trying to read data from the TCP connection. That is applicable in phase 3) of the query sequence.

To avoid stuck applications (e.g. as described in https://github.com/launchbadge/sqlx/issues/2840), it is necessary that applications implement application side timeouts for all phases to avoid getting stuck between a very long time (for writes) and forever (for reads).

sqlx already implements these timeouts for phase 1) - the connection establishment - via the means of the timeout of acquiring a connection from the pool.

However sqlx does not seem to provide timeouts for the actual queries in phase 2) and 3).

Async Rust users can theoretically wrap every sqlx query in tokio::timeout and co to implement the desired behavior and to avoid stuck queries. However

For this reason it would be great if sqlx could provide internal query timeouts for phases 2) and 3).

Theres probably various ways how timeouts could be defined: a) An overall timeout for the query. The deadline will be determined when the query is sent and each follow-up IO operation (sending data, reading peer data) is counted against that deadline. b) Individual timeouts for all phases of the query: Sending the query, streaming additional query data (rows), receiving the first part of the response, receiving follow-up rows. c) The sockets that are used for communicating with the peer can be wrapped in another AsyncRead/Write wrapper which implements a fixed read or write timeout. That at least guarantees that the application is alive and is able to exchange some bytes once in a while.

With HTTP where streams can be very long-lived option b) is probably the way to go, but I'm not sure whether that would be required for SQL.

Option c) might be the easiest to implement - but I don't know enough about sqlx internals to know for sure. Compared to a) it would however still allow for queries which can overall take an indefinite amount of time.

Safe Defaults

I would also recommend to also enable query timeouts by default, so that unaware users still get safe behavior. While that might be technically a breaking change, a high timeout (like 3 minutes) is probably safe for nearly every application.

About TCP keepalives

Another option to avoid TCP connections from being stuck forever is by enabling TCP keepalives on the socket. While this option also exists, its unfortunately less reliable than implementing timeouts in the application. The main reason here is that keepalives are a TCP protocol extension and therefore is not universally available or might get filtered by middleboxes.

ValentinColin commented 6 months ago

This is a very interesting feature in a production environment. It also offers the possibility of disabling the ping (check_idle_conn) to help the database under heavy load, and to always have a safeguard to avoid blocking in any situation.