Open hawkw opened 6 years ago
I'm happy to work on implementing this, but I suspect @carllerche will have some opinions on how we ought to do it.
I think that this is a great idea.
My original thought was to not do backoff in the reconnect middleware, and instead keep backoff as part of Retry (which doesn't exist yet); Retry<Reconnect<...>> would then provide the backoff-reconnect behavior.
However, after thinking some more about it, I don't think that this is necessarily ideal, because Retry will incur some additional overhead to keep a handle to the original request and perform the retry. This overhead probably isn't necessary in the reconnect-with-backoff case.
Given this, implementing the logic directly in tower-reconnect is probably the right first (and maybe final) step. Once we tackle Retry, we can then think about what, if anything, can be shared between the two.
@hawkw if you want to take a stab at this, go for it. The one nit with your original list is:

> take a stream of durations to use as backoffs

This probably won't be a stream in the futures sense. Instead, the type (Backoff?) will probably implement IntoIterator such that Item = Duration.
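To make the "IntoIterator where Item = Duration" idea concrete, here is a minimal sketch of what such a backoff type could look like. The names (ExponentialBackoff, etc.) are hypothetical, not the tower API; this just illustrates the shape being proposed:

```rust
use std::time::Duration;

// Hypothetical backoff policy: not part of tower, just an illustration of
// a type implementing IntoIterator with Item = Duration.
struct ExponentialBackoff {
    base: Duration,
    max_retries: usize,
}

struct ExponentialBackoffIter {
    next: Duration,
    remaining: usize,
}

impl IntoIterator for ExponentialBackoff {
    type Item = Duration;
    type IntoIter = ExponentialBackoffIter;

    fn into_iter(self) -> Self::IntoIter {
        ExponentialBackoffIter {
            next: self.base,
            remaining: self.max_retries,
        }
    }
}

impl Iterator for ExponentialBackoffIter {
    type Item = Duration;

    fn next(&mut self) -> Option<Duration> {
        if self.remaining == 0 {
            return None;
        }
        self.remaining -= 1;
        let delay = self.next;
        // Double the delay for the next attempt, saturating on overflow.
        self.next = self.next.checked_mul(2).unwrap_or(self.next);
        Some(delay)
    }
}

fn main() {
    let delays: Vec<Duration> = ExponentialBackoff {
        base: Duration::from_millis(100),
        max_retries: 3,
    }
    .into_iter()
    .collect();
    // 100ms, 200ms, 400ms, then the iterator is exhausted.
    assert_eq!(
        delays,
        vec![
            Duration::from_millis(100),
            Duration::from_millis(200),
            Duration::from_millis(400),
        ]
    );
    println!("{:?}", delays);
}
```

The reconnect middleware would pull the next Duration before each reconnect attempt and fail once the iterator is exhausted.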
Maybe @olix0r or @danburkert have additional thoughts on the matter.
This relates a little to https://github.com/tower-rs/tower/issues/14.
> My original thought was to not do backoff in the reconnect middleware, instead keep backoff as part of Retry (which doesn't exist yet) and then Retry<Reconnect<...>> would provide the backoff reconnect behavior. However, after thinking some more about it, I don't think that this is necessarily ideal because Retry will incur some additional overhead to keep a handle to the original request and perform the retry. This overhead probably isn't necessary in the reconnect-with-backoff case.
My thought is that we provide a "unit" or "empty" backoff type; a Retry with the empty/unit backoffs would just do the current Reconnect behaviour of immediately failing the request on connect errors, because the backoffs are always exhausted.
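A small sketch of that "empty backoff" idea, under the assumption that backoffs are just an iterator of Durations (the function name here is illustrative, not a tower API): an empty iterator is exhausted from the start, so the first connect error fails immediately, matching today's Reconnect behavior.

```rust
use std::time::Duration;

// Hypothetical helper: on a connect error, ask the backoff iterator for the
// next delay; if it is exhausted, the error is propagated to the caller.
fn next_delay<I: Iterator<Item = Duration>>(
    backoffs: &mut I,
) -> Result<Duration, &'static str> {
    backoffs.next().ok_or("backoffs exhausted: fail the request")
}

fn main() {
    // "Empty" backoff: always exhausted, so this behaves like the
    // current Reconnect middleware and fails on the first connect error.
    let mut empty = std::iter::empty::<Duration>();
    assert!(next_delay(&mut empty).is_err());

    // Non-empty backoff: one retry after 50ms, then fail.
    let mut one = std::iter::once(Duration::from_millis(50));
    assert_eq!(next_delay(&mut one), Ok(Duration::from_millis(50)));
    assert!(next_delay(&mut one).is_err());
}
```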
> Instead, the type (Backoff?) will probably implement IntoIterator such that Item = Duration.
Yeah, that seems right.
A question that came up while working on this is: what timer should be used for the backoffs?
In Conduit, I've been working on an abstraction over timer implementations (runconduit/conduit#480) that allows a mock timer to be injected for testing, and we'd expect that if I configure Conduit to use a mock timer, backoffs will also wait based on the mock timer rather than the default timer. Furthermore, we'd ideally want users who are using tokio-timer to be able to use that timer for backoffs, without requiring the tokio-timer dependency. This implies to me that we might want to move the timer facade work I've been doing from Conduit to tower. Does that seem reasonable?
I would prefer to not introduce a Timer trait yet. Traits come with ergonomic overhead. The next iteration of tokio-timer can handle the requirement of being able to "mock" out timers.
I do have some thoughts on retry strategies. I think the strategy @hawkw outlined in the first comment is valid, but it does have the downside that every error becomes a timeout error. This can be mitigated with some careful error management, but it's pretty tricky to do in an ergonomic way and without throwing away the intermediate errors.
The retry strategy also needs to be designed with speculative / hedged connections in mind. E.g. if the service has three replicas which you can choose from, you may not want to spend your full timeout attempting to connect to one of them. Perhaps this is better solved at a higher-level, though.
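One way to read the hedging concern is as budget arithmetic: rather than spending the full connect timeout on one replica, each attempt gets a slice of the budget so the other replicas can still be tried. A toy sketch (the function and policy are illustrative, not a tower or Conduit API):

```rust
use std::time::Duration;

// Hypothetical policy: split the total connect budget evenly across the
// candidate replicas, so no single attempt can consume the whole timeout.
fn per_attempt_budget(total: Duration, replicas: u32) -> Duration {
    total / replicas
}

fn main() {
    let total = Duration::from_secs(3);
    // With three replicas, each connect attempt gets a 1s deadline.
    assert_eq!(per_attempt_budget(total, 3), Duration::from_secs(1));
    println!("{:?}", per_attempt_budget(total, 3));
}
```

A real implementation would likely be more dynamic (e.g. hedging after a latency percentile rather than an even split), which is part of why this may belong at a higher level, as suggested below.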
@danburkert I think that request-level retry strategies should be handled at a higher level. Reconnect should really only be addressing the layer 4 concern of establishing a transport. I'd expect this to be wrapped with a connection pool, the connection pool to be wrapped with a balancer, and then retries to wrap the balancer. I'm in the process of writing up some general plans for https://github.com/runconduit/conduit/issues/475 -- would love your input once that's up.
Ah ok, seems I misunderstood the requirements here. If you are reconnecting just the layer 4 transport (say, TCP), how do you handle application-level negotiations that need to take place, like TLS, SASL, and custom handshakes?
In general doing anything more complex than layer 4 means that the reconnect logic needs to be able to distinguish between retriable and non-retriable errors.
Currently, the tower-reconnect middleware tries to reconnect immediately if an error occurs while connected, but propagates all errors if an error occurs while connecting. We should modify this middleware to: