grpc / grpc-go

The Go language implementation of gRPC. HTTP/2 based RPC
https://grpc.io
Apache License 2.0

Distinguish between ambiguous and unambiguous failures #1443

Closed: bdarnell closed this issue 3 years ago

bdarnell commented 7 years ago

What version of gRPC are you using?

[[projects]]
  name = "google.golang.org/grpc"
  packages = [".","codes","credentials","credentials/oauth","grpclb/grpc_lb_v1","grpclog","internal","keepalive","metadata","naming","peer","stats","status","tap","transport"]
  revision = "172ccacff3cfbddedcf1ba5ea06cda943bc976c1"
  version = "v1.5.0"

What version of Go are you using (go version)?

1.8.3

What operating system (Linux, Windows, …) and version?

Linux (ubuntu 16.04)

What did you do?

I'm trying to send an RPC with a side effect that should happen at most once. I have connections to multiple backends and use the FailFast option; if one backend is down, I want to try another. A minimal sketch of that setup is below.
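(A sketch only; pb.EchoClient, pb.EchoRequest, and the import path are hypothetical stand-ins for a generated stub, not code from our tree.)

import (
	"context"

	"google.golang.org/grpc"

	pb "example.com/echo/proto" // hypothetical generated package
)

// callWithFallback issues the RPC fail-fast against each backend in turn.
// Falling through to the next backend is only safe if the error guarantees
// the previous backend never executed the request, which is exactly what
// this issue asks to be able to verify.
func callWithFallback(ctx context.Context, conns []*grpc.ClientConn, req *pb.EchoRequest) (*pb.EchoReply, error) {
	var lastErr error
	for _, cc := range conns {
		// grpc.FailFast(true) is the call option spelling in this era of
		// grpc-go (fail-fast is also the default behavior).
		reply, err := pb.NewEchoClient(cc).Echo(ctx, req, grpc.FailFast(true))
		if err == nil {
			return reply, nil
		}
		lastErr = err
	}
	return nil, lastErr
}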

What did you expect to see?

This requires being able to distinguish the error returned when a backend is found to be unavailable before the request is sent from an error returned when the request fails later. (I had previously assumed that codes.Unavailable would be suitable for this, but I was wrong: it is also used when a server is draining, in which case the request may already have been processed.)

What did you see instead?

The fail-fast error for a down backend has the same error code as the one used for a draining server (and possibly other uses); it is impossible to distinguish them without looking at the text of the error message.

I would like some way to inspect the error to determine whether the request might have been processed by the server. This could be a distinct error code, string matching on the error message (if the string to use is documented and guaranteed for future versions), or some other attribute of the error. It's OK if this mechanism is not perfectly precise (my priority is at-most-once execution rather than maximizing the chances of success), but I would at least like to be able to distinguish the fail-fast error described above.
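For illustration, the only inspection possible today looks roughly like this (a sketch using the status and codes packages; the message text checked here is what current grpc-go happens to emit on the fail-fast path and is not guaranteed to be stable):

import (
	"strings"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// safeToRetryElsewhere guesses whether err means the request was never sent.
// Both the fail-fast rejection and a failure on a draining server surface as
// codes.Unavailable, so the guess falls back to fragile string matching.
func safeToRetryElsewhere(err error) bool {
	s, ok := status.FromError(err)
	if !ok || s.Code() != codes.Unavailable {
		return false
	}
	return strings.Contains(s.Message(), "connection is unavailable") // fragile
}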

(for further background, our incorrect assumption that an error with codes.Unavailable meant that the RPC had not been attempted on the server was the cause of https://github.com/cockroachdb/cockroach/issues/17491)

dfawley commented 7 years ago

Per the gRPC spec, Unavailable is expected to be returned when the server shuts down or when some data is transmitted before the connection breaks. However, the Wait For Ready spec does not specify what kind of failure should be returned if there is an error attempting to issue an RPC to a ClientConn for which there is no current connection available. Unavailable seems like the most reasonable choice for this condition, but it does prevent this kind of use case, which is important. It's also important to have a different error code (and not just string matching or a sentinel error), because gRPC's retry mechanism will be based upon the code alone. We'll follow up with other gRPC implementations and see how they handle this.

dfawley commented 7 years ago

Actually, I think the root cause of this issue is that our non-Wait For Ready / FailFast behavior is wrong. We are supposed to block if there is no active connection (i.e. the ClientConn's state is Idle or Connecting) until the context expires:

RPCs SHOULD NOT fail as a result of the channel being in other states (CONNECTING, READY, or IDLE).

At that time, the error returned should be either Canceled or DeadlineExceeded.

If you turn off Fail Fast (i.e. if you enable Wait For Ready), then RPCs will always block until the context is canceled or expires, or the ClientConn is closed.
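For example (a sketch; at the time of this issue the option is spelled grpc.FailFast(false), while later releases call it grpc.WaitForReady(true); the client and request types are hypothetical stand-ins):

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/status"

	pb "example.com/echo/proto" // hypothetical generated package
)

// callWaitForReady blocks until a connection becomes READY or the context ends.
func callWaitForReady(client pb.EchoClient) {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	_, err := client.Echo(ctx, &pb.EchoRequest{}, grpc.FailFast(false))
	if err != nil {
		// If no connection ever becomes READY, the failure surfaces as the
		// context error: codes.DeadlineExceeded or codes.Canceled.
		log.Printf("rpc failed: %v (code %v)", err, status.Code(err))
	}
}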

If you are maintaining multiple connections to multiple backends, you should do that within a single ClientConn, and gRPC should always pick one that is up for your RPCs.

dfawley commented 7 years ago

Let's track the wait for ready bug via #1463. Closing this as a duplicate of that, since it is the root cause.

bdarnell commented 7 years ago

I don't understand how #1463 is the root cause of this issue. For one thing, it looks like the code is already working as you say it should, and only returning fail-fast errors for the transient-failure state:

https://github.com/grpc/grpc-go/blob/fa59b779e8729b2ae0d0cfca12a91220f690fa6f/clientconn.go#L1186-L1190

But even if fail-fast were changed to return errors less often, the errors still wouldn't let us know whether the request was possibly executed or not.

If you are maintaining multiple connections to multiple backends, you should do that within a single ClientConn, and gRPC should always pick one that is up for your RPCs.

I don't think that's an option for us - we need to choose which backend(s) to use on an RPC-by-RPC basis. Is there some way I'm not seeing to configure gRPC to do this?

dfawley commented 7 years ago

I don't understand how #1463 is the root cause of this issue.

Are you using a load balancer? If so, depending on the implementation (our RoundRobin balancer does this), it will return an error when there are no backends here:

https://github.com/grpc/grpc-go/blob/fa59b779e8729b2ae0d0cfca12a91220f690fa6f/clientconn.go#L866

If you are not using a balancer, then it does seem that we follow the right behavior (though I was fairly sure I observed the opposite).

Is it possible your connection is actually in the transient-failure state?

Note that when retry support is added (ETA ~1 month), gRPC will transparently retry RPCs that didn't leave the client or were not seen by the server application. So this should not be something you need to distinguish externally after that time.

we need to choose which backend(s) to use on an RPC-by-RPC basis. Is there some way I'm not seeing to configure gRPC to do this?

You should be able to implement a custom balancer to do this if round-robin is not suitable for your needs. (Note that the API is going to change around here pretty significantly in the next month or two - see #1388.)
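As a rough illustration only, a custom pick policy could look something like the following. This uses the balancer and balancer/base packages from the newer API referenced in #1388, whose signatures have changed several times across releases; the policy name and the random pick are invented here.

package custombalancer

import (
	"math/rand"

	"google.golang.org/grpc/balancer"
	"google.golang.org/grpc/balancer/base"
)

const Name = "random-ready" // hypothetical policy name

type pickerBuilder struct{}

// Build is called whenever the set of READY SubConns changes.
func (pickerBuilder) Build(info base.PickerBuildInfo) balancer.Picker {
	scs := make([]balancer.SubConn, 0, len(info.ReadySCs))
	for sc := range info.ReadySCs {
		scs = append(scs, sc)
	}
	return &picker{subConns: scs}
}

type picker struct {
	subConns []balancer.SubConn
}

// Pick chooses a connection for each RPC; a real implementation would apply
// whatever per-RPC routing logic the application needs.
func (p *picker) Pick(balancer.PickInfo) (balancer.PickResult, error) {
	if len(p.subConns) == 0 {
		// Tells gRPC to block (or fail fast) until a connection is READY.
		return balancer.PickResult{}, balancer.ErrNoSubConnAvailable
	}
	return balancer.PickResult{SubConn: p.subConns[rand.Intn(len(p.subConns))]}, nil
}

func init() {
	balancer.Register(base.NewBalancerBuilder(Name, pickerBuilder{}, base.Config{}))
}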

bdarnell commented 7 years ago

Are you using a load balancer?

No. In part due to the documented experimental/unstable nature of the interface (as mentioned below).

Is your connection actually in the transient failure state, possibly?

Yes. We want to distinguish "the connection was already in the transient-failure state before this RPC was sent, so we never tried" from "the connection failed while this RPC was in flight".
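(For what it's worth, a state check is possible today via the experimental ClientConn.GetState, sketched below, but it can only say something about the channel before sending, not about an RPC that was already in flight.)

import (
	"google.golang.org/grpc"
	"google.golang.org/grpc/connectivity"
)

// knownBadBeforeSend reports whether the channel was already broken before we
// attempted the RPC. This is inherently racy (the state can change between the
// check and the send), which is why an attribute on the error itself would be
// more reliable.
func knownBadBeforeSend(cc *grpc.ClientConn) bool {
	return cc.GetState() == connectivity.TransientFailure
}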

Note that when retry support is added (ETA ~1 month)

Interesting. That in conjunction with a custom balancer might do the trick, if the interfaces let us do what we want. But in general our usage is atypical enough that we're more comfortable taking manual control here; my first choice is to just get a distinct error code.

dfawley commented 7 years ago

In theory, the errors we return are supposed to match the spec. (We are aware we don't get this right in all instances.) The spec doesn't specify what should be returned in this particular situation. I'm checking with C/Java.

This leaves us with either a gRFC to change the error (if it's UNAVAILABLE in the other languages as well), which could be tricky, or a Go-specific feature to help identify whether data was transmitted before any error occurred.

I'll reopen this, because it seems it's not related to our "wait for ready" implementation.

bdarnell commented 7 years ago

OK, so the spec clearly says we were wrong to assume that Unavailable means "didn't send anything" (I think it happened to be true of the Go implementation at the time we made this assumption). And there's no standard error code we could use to be unambiguous (indeed, the spec explicitly notes the ambiguity of DEADLINE_EXCEEDED).

I'm not sure if an error code is the right way to handle this, though. If I make an RPC to a server that makes another RPC, does the error code from the inner RPC get propagated back to my client or does it get turned into Unknown? It would be incorrect to allow the proposed "unambiguous did-not-send" error code to propagate, since it's only true of the client that generated it. So maybe this flag should be a property of the error object that is separate from the network-transmittable error codes.
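(A purely hypothetical sketch of what such a property could look like; nothing like this exists in grpc-go, and the names are invented here.)

// The client-side error could implement an extra, client-local interface that
// is checked by type assertion and never travels on the wire as a status code.
type notSentError interface {
	RequestNotSent() bool // true if the RPC provably never left this client
}

func requestNotSent(err error) bool {
	ns, ok := err.(notSentError)
	return ok && ns.RequestNotSent()
}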

dfawley commented 7 years ago

does the error code from the inner RPC get propagated back to my client or does it get turned into Unknown?

This is entirely up to the implementation of the middle service. If it does:

func MyHandler(...) (..., error) {
  if _, err := innerSvc.Call(...); err != nil {
    return err // the inner status code passes through to the outer client unchanged
  }
  // ... build and return a successful response
}

Then the status code will be propagated. This is not typically recommended; the middle service should translate the error into something that makes sense to its client -- usually by switching on the status code returned from the inner service.
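A sketch of that recommended pattern (innerSvc, server, and the request/reply types are hypothetical stand-ins; the code mapping shown is just one plausible translation):

import (
	"context"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"

	pb "example.com/svc/proto" // hypothetical generated package
)

func (s *server) MyHandler(ctx context.Context, req *pb.Request) (*pb.Reply, error) {
	inner, err := s.innerSvc.Call(ctx, &pb.InnerRequest{})
	if err != nil {
		// Translate the inner status into one that makes sense for this
		// service's own API instead of passing it through verbatim.
		switch status.Code(err) {
		case codes.NotFound:
			return nil, status.Errorf(codes.FailedPrecondition, "required record missing: %v", err)
		case codes.Unavailable, codes.DeadlineExceeded:
			return nil, status.Error(codes.Unavailable, "backend temporarily unreachable")
		default:
			return nil, status.Errorf(codes.Internal, "backend error: %v", err)
		}
	}
	return &pb.Reply{Value: inner.Value}, nil
}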

bdarnell commented 7 years ago

Is there any update here? We just upgraded from 1.6 to 1.7.1 and it apparently changed some of the implementation details we were relying on to distinguish ambiguous from unambiguous failures (cockroachdb/cockroach#19708). We're probably going to have to revert to 1.6 for now.

I see that the transparent retry change just landed, but remember that this is not enough for us - we also need control over routing. Have the balancer APIs mentioned in https://github.com/grpc/grpc-go/issues/1443#issuecomment-325050548 been stabilized yet? (note that #1388 is still open)

dfawley commented 3 years ago

Closing due to lack of activity and priority.