grpc / grpc-kotlin

Kotlin gRPC implementation. HTTP/2-based RPC
https://grpc.io/docs/languages/kotlin
Apache License 2.0

Cancellation race #318

Open bubenheimer opened 2 years ago

bubenheimer commented 2 years ago

There seems to be a race between the request and response flows on the client when cancelling the response flow in a bidi scenario. This can result in the response flow processing the server's "CANCELLED" StatusException, for example in a retry() or catch() operator, even though it should not see that exception after cancellation.

The culprit appears to be the code below, which liberally catches arbitrary Exceptions thrown by emit() inside a flow {} builder, but delays passing the exception on to the response flow:

https://github.com/grpc/grpc-kotlin/blob/5fb5efffb108d71e711ae92548c32c69b6439eff/stub/src/main/java/io/grpc/kotlin/ClientCalls.kt#L322-L330
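A minimal sketch of the hazard in plain kotlinx.coroutines (my own illustration, not grpc-kotlin's actual code): a broad `catch (e: Exception)` around emit() also traps the CancellationException that emit() throws once the collector's job has been cancelled, so the flow can substitute or delay a different failure in place of the cancellation.

```kotlin
import kotlinx.coroutines.*
import kotlinx.coroutines.flow.*

// Returns whatever the broad catch observed. CancellationException extends
// Exception, so it lands in the catch block just like any other failure.
fun observedByBroadCatch(): Throwable? {
    var observed: Throwable? = null
    runBlocking {
        val f = flow {
            try {
                emit(1)
                emit(2) // throws CancellationException: the collector was cancelled
            } catch (e: Exception) {
                observed = e // the cancellation is trapped here too
                throw e
            }
        }
        val job = launch {
            // Cancel the collecting job after the first value arrives.
            f.collect { coroutineContext.job.cancel() }
        }
        job.join()
    }
    return observed
}

fun main() {
    println(observedByBroadCatch() is CancellationException) // prints "true"
}
```

Because each emission performs an ensureActive check against the collecting coroutine, a cancelled collector surfaces as a CancellationException thrown out of emit() — exactly the exception a broad catch must rethrow untouched.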

I'd hope that the problem would be easy to spot for someone with an intimate knowledge of the codebase. What I am seeing is this:

  1. Response flow collection getting cancelled from the outside via Job.cancel().
  2. Request flow getting cancelled by grpc-kotlin machinery, resulting in RPC cancellation to the server.
  3. Server responds with CANCELLED error status.
  4. retry() block in the response flow sees and processes the server's CANCELLED StatusException. This is not valid, as the response flow is already cancelled with a CancellationException.

The server runs locally and the issue is intermittent, so I'd guess that the request-response cycle needs to be quite fast to make this happen, while the coroutines machinery runs slowly at the same time under heavy load. I am running the request and response flows on Dispatchers.IO because of associated database work, which may also be a factor in reproducing: IO dispatching might be queued.

bubenheimer commented 2 years ago

I imagine that in the absence of a retry() operator this might throw the StatusException rather than whatever original exception caused the cancellation, but I don't know for sure, as the whole thing is borked. I also cannot surface the original exception from inside the retry() operator, so the original exception is lost, unless perhaps it shows up again downstream for a catch block to handle.
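One client-side guard against this (my own workaround sketch, not grpc-kotlin API): use a retry variant whose predicate never treats cancellation as retryable, so an externally cancelled collection is not resumed even if a substituted exception shows up.

```kotlin
import kotlinx.coroutines.*
import kotlinx.coroutines.flow.*

// Retry only for failures the caller approves, and never for cancellation.
fun <T> Flow<T>.retryUnlessCancelled(shouldRetry: (Throwable) -> Boolean): Flow<T> =
    retryWhen { cause, _ -> cause !is CancellationException && shouldRetry(cause) }

// Demo: a flow that fails twice with a "transient" error, then succeeds.
fun demo(): List<Int> = runBlocking {
    var attempts = 0
    flow {
        attempts++
        if (attempts < 3) throw IllegalStateException("transient")
        emit(42)
    }.retryUnlessCancelled { it is IllegalStateException }
        .toList()
}

fun main() {
    println(demo()) // prints "[42]" after two retries
}
```

This does not recover the lost original exception, but it at least keeps a retry() block from acting on an exception that only exists because the flow was cancelled.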

bubenheimer commented 2 years ago

I will also point out that the current flow cancellation approach seems to take an undue amount of time. My logs indicate that even the shortened timeframe from request-flow completion to response-flow completion takes about 100 milliseconds, during which a complete two-way exchange about the client going away takes place. This implies that the job is blocked for about 100 milliseconds before it can finally complete. In the context of structured concurrency I'd consider this very slow: the overarching structure of jobs and scopes may be much larger than just this one scope, and potentially bringing a large client-side teardown regime to a halt for that long via a NonCancellable block seems awful.

When the response flow is cancelled on the client, it does not make much sense to wait for the server to send a confirmation whose result we throw away; it should be a one-way notification to the server, done. Perhaps that is the intention, but currently the flows seem to hang around long enough to make it look otherwise.
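The one-way teardown described above could look something like this (an assumption about a possible design, not grpc-kotlin code): hand the cancel notification to an independent scope so the caller's job completes immediately instead of sitting in a NonCancellable block waiting for the server's confirmation.

```kotlin
import kotlinx.coroutines.*

// Independent scope that outlives the cancelled call's own job.
private val teardownScope = CoroutineScope(SupervisorJob() + Dispatchers.IO)

// `sendCancelToServer` is a hypothetical callback standing in for the RPC
// that notifies the server; its result is deliberately ignored.
fun fireAndForgetCancel(sendCancelToServer: suspend () -> Unit) {
    teardownScope.launch { runCatching { sendCancelToServer() } }
}

fun main() = runBlocking {
    val confirmed = CompletableDeferred<Unit>()
    fireAndForgetCancel {
        delay(200)              // simulated server round trip
        confirmed.complete(Unit)
    }
    // fireAndForgetCancel returned immediately; the confirmation is still in flight
    println(confirmed.isCompleted) // prints "false"
    confirmed.await()              // teardown still completes in the background
}
```

The trade-off is that the caller loses visibility into whether the server acknowledged the cancel, which is presumably fine here since the result is discarded anyway.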

bubenheimer commented 1 year ago

This sounds related: https://github.com/grpc/grpc-kotlin/pull/344