twitter / finagle

A fault tolerant, protocol-agnostic RPC system
https://twitter.github.io/finagle
Apache License 2.0
8.79k stars 1.46k forks source link

Http client, occasionally throws `CancelledRequestException` when on high load #658

Closed szysas closed 7 years ago

szysas commented 7 years ago

Finagle based http client, occasionally throws CancelledRequestException when making requests to another Finagle based http server. (Both used with finatra)

Expected behavior

Http client does retries.

Actual behavior

Fails

Failure(request cancelled. Remote Info: Upstream Address: /10.164.15.66:64608, 
Upstream Client Id: Not Available, Downstream Address: /10.164.177.11:8170, 
Downstream Client Id: xxx, Trace Id: eb59a81eb71cb99a.a69dc19cf5903524<:eb59a81eb71cb99a, flags=0x02) 
with RemoteInfo -> Upstream Address: /10.164.15.66:39357, 
Upstream Client Id: Not Available, Downstream Address: /10.164.177.13:8170, 
Downstream Client Id: xxx, Trace Id: 8c5214be4c5974c9.b6b92bd49ae2df33<:8c5214be4c5974c9

Steps to reproduce the behavior

This happens only in hight load, not sure how else to reproduce it.

roanta commented 7 years ago

This means that your client is interrupting requests (usually because timeouts are triggering) and the interrupt is propagating to the server [1].

[1] https://twitter.github.io/finagle/guide/FAQ.html#what-are-cancelledrequestexception-and-cancelledconnectionexception

kevinoliver commented 7 years ago

One other thing to note if you want retries, you need to let Finagle know that it is ok. The reason is that Finagle does not know your application level behavior out of the box and cannot know whether or not retrying that request is safe. Good news is that you can tell Finagle about this using ResponseClassifiers.

szysas commented 7 years ago

I still would like to understand use case that this exception is thrown, is it only because of timeout (http response did not arrive in time)? In the source code I could only find here, that is related to client: https://github.com/twitter/finagle/blob/143f6f0f83765783808a11f990179d881daecb66/finagle-base-http/src/main/scala/com/twitter/finagle/http/exp/GenStreamingSerialServerDispatcher.scala#L128

What is the reason that retry on this exception is not set by default? (I'm guessing you guys had a good reason)

kevinoliver commented 7 years ago

What is the reason that retry on this exception is not set by default?

we can't know what the application behavior is for any particular request. lets say that this request increments a counter in a database table then gets killed. that is not safe to retry.

we choose to use safe defaults.

but maybe your request is idempotent and is safe. in that case, you can let finagle know.

szysas commented 7 years ago

Thanks. (BTW, great job with this library)