sfackler / r2d2

A generic connection pool for Rust
Apache License 2.0
1.49k stars 80 forks source link

`get_timeout` exceeds timeout #116

Open ghost opened 3 years ago

ghost commented 3 years ago

First, thanks for providing the community with multiple database tools.

Sometimes get_timeout(timeout) exceeds timeout. Consider an application that connects to a distributed database with multiple nodes, any of which can serve requests. Suppose the application calls get_timeout, and the receiving database node dies. If the application configured tests on checkout, get_timeout awaits the return of the underlying is_valid call before timing out. is_valid likely executes a trivial SQL query, which encounters TCP errors and retries until the TCP retry limit of the operating system, often 15 minutes, expires. Meanwhile, the application could have obtained a connection to a different database node.

One solution involves modification of the Rust-Postgres and Tokio-Postgres crates. It would expose a function, perhaps is_valid_timeout(timeout), that executes a trivial SQL query or times out. Do you recommend a different approach? If not, would you be open to a pull request in the rust-postgres repository that introduces is_valid_timeout?

sfackler commented 3 years ago

How does something like Hikari handle validity checks?

ghost commented 3 years ago

During validation Hikari temporarily changes the network timeout of the connection here.

ghost commented 3 years ago

Similarly, during validation a PostgresConnectionManager from r2d2-postgres could temporarily change the network timeout of its tls_connector. I will check whether the underlying types expose the necessary functionality.

ghost commented 3 years ago

I will check whether the underlying types expose the necessary functionality.

Unfortunately, they do not. In r2d2-postgres the MakeTlsConnect wraps a tokio_postgres::Socket, which wraps a tokio::net::TcpStream, and neither has configurable timeouts.

sfackler commented 3 years ago

The Javadoc for Connection::setNetworkTimeout is kind of interesting in that it discusses the timeout only applying to reads and not writes. I guess in practice the outbound TCP buffer doesn't fill for a single query?

postgres could definitely grow support for read/write timeouts via tokio-postgres, though it is interesting to note that libpq does not offer that at all from what I can tell. I imagine they might just advise cranking the TCP keepalive interval down and relying on that.

sfackler commented 3 years ago

Thinking about this a bit more, it actually seems to me like TCP keepalive may be the better approach here compared to some kind of read/query timeout. It doesn't make assumptions about how long a query will legitimately take to run, and can detect a disconnect even while the connection is totally idle rather than only at the time of checkout.

ghost commented 3 years ago

My understanding is TCP keepalives depend on three parameters:

tokio::net::TcpStream exposes set_keepalive(duration), which sets the connection’s keepalive time to duration. I think it does not allow configuration of keepalive interval and keepalive probes, whose UNIX defaults are 75 seconds and 9, which would still allow long timeouts.

An alternative is configuring the parameters at the machine level, but doing so could affect other services running on the machine, or the machine may be inaccessible.

I appreciate general read and write timeouts might introduce complexity and agree with your point about query time. Instead, what about limiting timeouts to validity checks that only make trivial queries? The Java interface exposes that functionality through isValid(timeout).

sfackler commented 3 years ago

We could definitely add a variant of is_valid that takes a timeout, but I'd want to be pretty careful about what timeouts are actually passed to it. For example, if get_timeout is called with a 10 second timeout and we're 9.99995 seconds in when a connection becomes available, we probably wouldn't want to try calling is_valid with a 0.00005 second timeout, since that would almost assuredly fail.

ghost commented 3 years ago

Sounds good! I think is_valid(timeout) would ultimately require access to the Tokio runtime, so I will start with a pull request in the rust-postgres repository.

ghost commented 3 years ago

Now that Rust-Postgres exposes is_valid(timeout), I can think of two options that would maintain backward compatibility:

What do you think?

sfackler commented 3 years ago

I think adding ManageTimeout::is_valid_timeout with a default impl seems best.