scylladb / python-driver

ScyllaDB Python Driver, originally DataStax Python Driver for Apache Cassandra
https://python-driver.docs.scylladb.com
Apache License 2.0

Avoid failing requests when re-establishing connections to the cluster #273

Open kostja opened 9 months ago

kostja commented 9 months ago

Currently the Python driver frivolously fails network requests if, for whatever reason, there is no connection to the cluster or the socket is dead. However, there may be many valid reasons why a connection is temporarily unavailable, and if the request is not yet in progress, it should not be dropped.

Observed in https://github.com/scylladb/scylladb/issues/16110 and in https://github.com/scylladb/scylladb/issues/14746 , where a belated notification about a node restart leads to a spurious test failure.

How the driver should behave:

  • if there is no viable connection in the pool, wait until there is an available connection, and then try to execute the request.
  • if there is a write failure when trying to write the request data to the socket, read(0 bytes) from the socket, to see if there is an EOF. If there is an EOF, i.e. the physical connection is dead, try to open another connection, and retry the request.

The two steps above should significantly reduce the amount of spurious failures on topology changes.

avelanarius commented 9 months ago

Refs https://github.com/scylladb/java-driver/issues/236 (from a different perspective, but a similar issue)

avelanarius commented 7 months ago

How the driver should behave:

  • if there is no viable connection in the pool, wait until there is an available connection, and then try to execute the request.
  • if there is a write failure when trying to write the request data to the socket, read(0 bytes) from the socket, to see if there is an EOF. If there is an EOF, i.e. the physical connection is dead, try to open another connection, and retry the request.

The two steps above should significantly reduce the amount of spurious failures on topology changes.
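The EOF probe described in the second step can be sketched with plain sockets. This is a minimal, hypothetical illustration, not the driver's implementation: a `socketpair` stands in for a driver-to-node connection, and `MSG_PEEK` avoids consuming any response bytes that might be in flight.

```python
import socket

# A socketpair stands in for a driver-to-node connection (hypothetical).
client, server = socket.socketpair()
server.close()  # the node side goes away

# Probe for EOF without consuming data: a non-blocking MSG_PEEK read
# returns b'' if the peer has closed (the connection is dead), and
# raises BlockingIOError if the connection is alive but merely idle.
client.setblocking(False)
try:
    is_dead = client.recv(1, socket.MSG_PEEK) == b""
except BlockingIOError:
    is_dead = False  # still alive, just nothing to read yet

print(is_dead)  # → True: reopen a connection and retry the request
client.close()
```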

Regarding the proposed new behavior, wouldn't it be better if the request was sent to another replica instead of waiting for a connection to the replica we had connection issues with? However, what I've just described is essentially a retry policy, which the driver already supports...
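The "send it to another replica" decision is exactly what a retry policy expresses. The following is a hypothetical, standalone sketch of that decision; the names are illustrative and modeled loosely on the driver's `cassandra.policies.RetryPolicy`, which offers a `RETRY_NEXT_HOST` decision for this purpose.

```python
# Hypothetical sketch of the decision a retry policy makes on a connection
# error; names are illustrative, not the driver's actual API.
RETRY_NEXT_HOST = "retry_next_host"  # send the request to another replica
RETHROW = "rethrow"                  # surface the error to the caller

def on_connection_error(is_idempotent, retry_num, max_retries=2):
    if retry_num >= max_retries:
        return RETHROW
    # A request that may already have executed is only safe to resend
    # elsewhere if it is idempotent.
    return RETRY_NEXT_HOST if is_idempotent else RETHROW

print(on_connection_error(True, 0))   # → retry_next_host
print(on_connection_error(False, 0))  # → rethrow
```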

kbr-scylla commented 6 months ago

if there is a write failure when trying to write the request data to the socket, read(0 bytes) from the socket, to see if there is an EOF. If there is an EOF, i.e. the physical connection is dead, try to open another connection, and retry the request.

This is only safe for idempotent queries.

Regarding the proposed new behavior, wouldn't it be better if the request was sent to another replica instead of waiting for a connection to the replica that we had connection issues with?

The client may direct the query to a specific replica.


So I generally agree that we should avoid failing requests. This should be done by not closing connections that are live and currently running queries.

If the driver for whatever reason really really needs to reopen a connection, it should do it in phases:

  • open a new connection, but don't yet close the old one
  • start routing new queries to the new connection
  • drain queries from the old connection (i.e. wait for all existing queries to finish)
  • only then close the old connection

Retrying queries can only happen if the user requests such a policy, and only for idempotent queries.

Most of the problems we have in tests are caused by the driver sometimes closing connections for no apparent reason -- e.g. after a node restart, we verify on the test side that a SELECT directed to this node succeeds (which implies that the driver already managed to open a new connection to this node!), and then a subsequent query fails (because the driver decides to close the connection again). If we prevent that from happening, we'd be good.

kostja commented 6 months ago

if there is a write failure when trying to write the request data to the socket, read(0 bytes) from the socket, to see if there is an EOF. If there is an EOF, i.e. the physical connection is dead, try to open another connection, and retry the request.

This is only safe for idempotent queries.

This is useful, first of all, during testing. Your choice in a test: fail it right away because of a connection error, or retry and maybe fail it because of the lack of idempotency + double execution.

How likely is a double execution? If you get a write failure from a syscall, it means the OS failed to store the request in the kernel's TCP send buffer. It is in theory possible that the request is actually stored and then sent to the destination; in practice, however, implementations are careful enough not to do that. I have never seen this lead to a double execution (although I agree one would have to study the actual TCP stack implementation to be sure).
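The "write failure means the request never left the machine" argument can be made concrete with a socketpair, where the failure mode is deterministic (real TCP is less clear-cut, which is exactly the caveat above). The write syscall itself raises, so no bytes were queued and the peer cannot have executed the request:

```python
import socket

# Hypothetical stand-in for a driver connection whose peer just died.
client, server = socket.socketpair()
server.close()  # the peer goes away before we write the request

try:
    client.send(b"REQUEST")  # the syscall itself fails with EPIPE...
    delivered = True
except BrokenPipeError:
    # ...so the request bytes were never queued by the kernel, and the
    # server cannot possibly have executed it: a retry is safe here.
    delivered = False

print(delivered)  # → False
client.close()
```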

Regarding the proposed new behavior, wouldn't it be better if the request was sent to another replica instead of waiting for a connection to the replica that we had connection issues with?

The client may direct the query to a specific replica.

So I generally agree that we should avoid failing requests. This should be done by not closing connections that are live and currently running queries.

If the driver for whatever reason really really needs to reopen a connection, it should do it in phases:

  • open a new connection, but don't yet close the old one
  • start routing new queries to the new connection
  • drain queries from the old connection (i.e. wait for all existing queries to finish)
  • only then close the old connection
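The four phases above can be sketched as a small pool that swaps in a new connection and drains in-flight requests before closing the old one. All names here are hypothetical; this is an illustration of the scheme, not the driver's code.

```python
import threading

class ConnectionHandle:
    """Hypothetical wrapper that counts in-flight requests on one connection."""
    def __init__(self, name):
        self.name = name
        self.in_flight = 0
        self.closed = False
        self._cond = threading.Condition()

    def start_request(self):
        with self._cond:
            self.in_flight += 1

    def finish_request(self):
        with self._cond:
            self.in_flight -= 1
            self._cond.notify_all()

    def drain_and_close(self):
        # Phases 3-4: wait until every in-flight query has finished,
        # and only then close the connection.
        with self._cond:
            while self.in_flight > 0:
                self._cond.wait()
            self.closed = True

class Pool:
    """Hypothetical pool performing the phased replacement."""
    def __init__(self):
        self.active = ConnectionHandle("old")

    def replace_connection(self):
        new_conn = ConnectionHandle("new")             # phase 1: open, don't close old
        old_conn, self.active = self.active, new_conn  # phase 2: route new queries to it
        old_conn.drain_and_close()                     # phases 3-4: drain, then close
        return old_conn

pool = Pool()
old = pool.active
old.start_request()                  # a query is running on the old connection

swapper = threading.Thread(target=pool.replace_connection)
swapper.start()                      # replacement begins while the query runs
old.finish_request()                 # the in-flight query completes
swapper.join()

print(old.closed, pool.active.name)  # → True new
```

The running query is never interrupted: the old connection only closes once its in-flight count drops to zero, while new queries already go to the replacement.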

Retrying queries can only happen if the user requests such a policy, and only for idempotent queries.

Most of the problems we have in tests are caused by the driver sometimes closing connections for no apparent reason -- e.g. after a node restart, we verify on the test side that a SELECT directed to this node succeeds (which implies that the driver already managed to open a new connection to this node!), and then a subsequent query fails (because the driver decides to close the connection again). If we prevent that from happening, we'd be good.

I don't mind this suggestion at all (just more extensive than what I proposed).

kbr-scylla commented 4 months ago

Most of the problems we have in tests are caused by the driver sometimes closing connections for no apparent reason

Example of this behavior: https://github.com/scylladb/python-driver/issues/317