Implement retries for availability faults

ivansenic commented 1 year ago

Spec defines the Availability Faults. We need to add implementation for this.

Currently the Stargate V2 allows retries based on the gRPC status codes, we need to see if this is enough. It could be that we need to extend the bridge, so that we can recognize client-side and server-side timeouts. Or client side timeouts can be disabled for now.

ivansenic commented 1 year ago

I'll try to explain here the current situation and come up with conclusions on what has to be done.

From the Availability Faults, we said we want to have a single retry for Unavailable, WriteTimeout and ReadTimeout exceptions that we receive.

Unavailable

Stargate already retries all the UNAVAILABLE gRPC status codes on the client side, see gRPC CONFIGURATION.md. We receive UNAVAILABLE as gRPC status in following cases:

UnavailableException (includes explicit unavailable trailer)
UnhandledClientException
in case C* error code is IS_BOOTSTRAPPING (not sure what exception exactly has this)
in case bridge endpoint is not reachable

Can you @amorton confirm that retrying these cases once using existing mechanics we have in the Stargate gRPC client is OK?

Timeouts

When timeouts occur, we receive gRPC status code DEADLINE_EXCEEDED. Now, we need to distinguish between client and server timeouts. They will carry on same status code, but server side timeouts will have explicit trailers.

Client side

The client timeout is set by default to 30 sec in the gRPC CONFIGURATION.md. I think we can agree client timeouts should never be retried. However, question is if we want to have timeouts at all. Receiving a client timeout, means that it's unsure how the operation execution finished on the server. It could be successful. Here I see few options:

disable client timeouts
clearly distinguish between client timeouts and server timeouts and report to the user in the error message

Server side

The gRPC Bridge defines a default retry strategy in the DefaultRetryPolicy.java. Here by default read and write timeouts are already retried in the coordinator once, if certain conditions are met:

read - based on the ReadTimeoutException rte state: rte.received >= rte.blockFor && !rte.dataPresent
writes - only for write type BATCH_LOG

Thus, from client point of view, some timeouts are already retried. But since we don't even do batch logs, maybe we should only retry all the write timeouts?

However, this again opens a question of the client timeouts and how we should handle this. Since we have a chance in that single CQL to end up a timeout on the server side, in order to receive the server side timeout, we need to have client side timeout long enough.

Furthermore in some scenarios, we have a proxy between client and a bridge. Depending on what we agree on, we must ensure not to receive the timeout from the proxy as well. The proxy responds with HTML code and we must avoid this.

I would like a whole team to contribute to this, as this issue is a concern in the OSS as well. Pinging @jeffreyscarpenter @tatu-at-datastax @kathirsvn @maheshrajamani for discussion.

amorton commented 1 year ago

Can you @amorton confirm that retrying these cases once using existing mechanics we have in the Stargate gRPC client is OK?

Assuming the "Stargate already retries all the UNAVAILABLE gRPC status codes " is the client side configuration for retries, this makes sense for the actual Cassandra server raised UnavailableException. There others look like application errors from the bridge or gRPC .

UnhandledClientException I could not find this in the C* code base, when is this raised ?

in case C* error code is IS_BOOTSTRAPPING (not sure what exception exactly has this) Do you have a link for where this is raised from. Bootstrapping is something that will happen when a new node has joined the cluster and is streaming data from other nodes.

However, question is if we want to have timeouts at all. Receiving a client timeout, means that it's unsure how the operation execution finished on the server. It could be successful. Here I see few options:

For client timeouts I am assuming we are talking about TCP socket timeouts. We should have client timeouts, and they should be large enough so the C timeouts fail first. We need the client timeouts to handle cases where the C coordinator / node fails to take the frame of the TCP stack and start working on it.

Client side timeouts should be considered internal application errors, no retry, just fail. Yes we dont know the state of the request, if we fail to timeout our client connection we will end up hanging the client calling us. If the client calling us does not have a socket timeout then we could, in theory, have a lot hung requests that consume resources and result in denial of service type errors.

The gRPC Bridge defines a default retry strategy in the DefaultRetryPolicy.java. Here by default read and write timeouts are already retried in the coordinator once, if certain conditions are met:

Not a big fan of this, it is behaviour that is counter to the way C* works and IMHO should not be in the coordinator tier. We now have retry logic in two places (client and coordinator), and the policy for the coordinator is (i assume) applied to all request so cannot be tailored. Can it be removed from the coordinator? this logic should live in the client.

Thus, from client point of view, some timeouts are already retried. But since we don't even do batch logs, maybe we should only retry all the write timeouts?

Unsure on the ask, we want to retry write and read timeouts each once.

Furthermore in some scenarios, we have a proxy between client and a bridge. Is this something we have in Astra or for OSS deployments ?

In general:

we need socket (client) timeouts as a safety valve, they should be set 2X what we know or guess the C* server timeouts to be.
retry logic should be in the client (not the coordinator) as that is where it is expected , and we can change it with the least impact on other services.
anything not Unavailable or (read, write, or CAS) timeout can be considered an application error and we can fail without retry.

sync-by-unito[bot] commented 1 year ago

➤ Ivan Senic commented:

Needed impl in the OSS done here: https://github.com/stargate/stargate/pull/2517 ( https://github.com/stargate/stargate/pull/2517|smart-link )

sync-by-unito[bot] commented 1 year ago

➤ Ivan Senic commented:

jsonapi PR: https://github.com/stargate/jsonapi/pull/309 ( https://github.com/stargate/jsonapi/pull/309|smart-link )

stargate / data-api