Add a timeout for service discovery

ThomWright commented 1 year ago

Context

As part of looking into timeouts in various parts of the stack in TrueLayer/ginepro#37 I found that we have no timeout for DNS resolution. As currently implemented, it's possible for requests to hang forever if DNS never resolves and no services are discovered.

Applications can add their own timeouts around requests, but I think it can be helpful for underlying libraries to offer timeouts for things like TCP connection and DNS resolution (or general service discovery), especially as fallbacks for when higher level timeouts are forgotten!

Changes

I've taken a shot at solving this, but I'm not very familiar with writing lower level async code using poll etc. rather than async/await. I can't say I'm confident that this is the best approach or that I haven't missed something.

With that in mind:

Is there a better way to solve this problem?
Is the change in the type of Balance::Future a problematic breaking change?
Are there any more docs or test cases I should consider adding?

I should probably have raised an issue first before diving into a solution, but it's Christmas and I got carried away 😅

seanmonstar commented 1 year ago

I've not had the time to review this, but I'm case it helps, I'd expect that the balance layer wouldn't need any special knowledge of timeouts, and instead you could fit it at a layer above or below.

ThomWright commented 1 year ago

I've not had the time to review this, but I'm case it helps, I'd expect that the balance layer wouldn't need any special knowledge of timeouts, and instead you could fit it at a layer above or below.

Thanks, I was kind of wondering if the timeout logic could be extracted somehow (it would be nice), but wasn't sure how to do it.

At a layer above I could imagine a generic timeout on poll_ready(), but 1. it was discussed in #648 and there didn't seem much appetite and 2. I don't think it would discriminate between service discovery having completed (e.g. DNS taking a long time) and some other cause of non-readiness (e.g. being rate limited at a lower layer). But maybe there's no good reason to discriminate and I got a bit too fixated on DNS timeouts!

I'm not sure how it would be applied at a layer below.

On another note: I just noticed that UnreadyService in tower/src/balance/p2c/service.rs looks like dead code. From I can see that concept now lives in ReadyCache as Pending. I can remove that in a separate PR, unless I'm missing something.

LucioFranco commented 1 year ago

I think dns timeouts should probably exist within the dns layer itself. So if its hyper we should add timeouts there which should then trickle back up. This would also help users that don't use tower.

olix0r commented 1 year ago

In linkerd, we use composable middlewares like failfast to enforce readiness timeouts.

I agree that it's not really appropriate to add this functionality to the balancer directly. It may be worth upstreaming our failfast/loadshed modules, though.

ThomWright commented 1 year ago

I think dns timeouts should probably exist within the dns layer itself. So if its hyper we should add timeouts there which should then trickle back up. This would also help users that don't use tower.

Yeah, I agree. Ideally this timeout would be elsewhere. I wonder if a better description of the problem I'm working around would be helpful in explaining why I ended up here.

Ginepro has two resolution strategies: Eager and Lazy. The Eager variant will resolve a set of initial IPs with a timeout, which works nicely for the initial DNS resolution. However, both strategies run a background task to do periodic DNS probes and send updates to the balancer. When the balancer ends up in a state with no services, there's currently no way to time out on DNS itself since it's outside of the request flow.

I first tried solving this problem within ginepro itself and wasn't happy with my solution. Since the consensus seems to be that the balancer isn't an appropriate place for this timeout (fair), I reckon I'll head back to ginepro and give it a second attempt. I learned a few things while writing this which I think will help.

In linkerd, we use composable middlewares like failfast to enforce readiness timeouts.

I agree that it's not really appropriate to add this functionality to the balancer directly. It may be worth upstreaming our failfast/loadshed modules, though.

The FailFast middleware looks like a good candidate for upstreaming to me.

Linkerd's LoadShed looks pretty similar to tower's own LoadShed.

Anyway, thank you all for the feedback. I'm happy to close this PR given this doesn't seem to be the right approach.

tower-rs / tower

Add a timeout for service discovery #715

Context

Changes