tower-rs / tower

async fn(Request) -> Result<Response, Error>
https://docs.rs/tower
MIT License
3.54k stars 282 forks source link

Is poll_ready worth it? #626

Open Skepfyr opened 2 years ago

Skepfyr commented 2 years ago

Summary

I don't think that poll_ready() provides enough value for it's extra complexity, and can be removed entirely.

Value of poll_ready

As I understand it (from Inventing the Service trait) poll_ready is a mechanism to provide back-pressure in the case that a service is not immediately available to handle a request. Each Poll::Ready(Ok(())) you receive gives you permission to send exactly one request in via call. Poll::Ready(Err(err)) indicates that this service will never be able to handle requests again.

This allows you to check if a service is able to handle a request before constructing it, which could be a more efficient if holding the request around is expensive. Note that it doesn't help if constructing the request is expensive because once you've called poll_ready you are committed to calling the service (unless the service has stopped but that's incredibly rare in my experience).

Issues with poll_ready

Difficult to adhere to the contract of poll_ready

Imagine a middleware service like Load Shed but with an additional queue to handle bursts of traffic. That service wants to: return ready from poll_ready when the queue is empty, return pending from poll_ready if the queue isn't full, and return ready again if the queue is full so that it can return an overload error from call. Similar surprisingly complex logic appears in services like Steer where it must keep track of which inner service it hasn't received a ready token from.

Not calling call after poll_ready

It's very easy to leak resources if you fail to call the Service and run that future to completion after receiving the go-ahead from poll_ready, see #408 for some existing discussion. Note that this is basically impossible to avoid at the moment, even if you write service.ready().await?.call(request).await?, the future might be dropped after ready has returned but before call has been polled. Additionally, if constructing the request is expensive, it probably can also fail (or is itself asynchronous), which would mean you have to build it before calling poll_ready!

I believe #412 would solve this issue, but at the same time adds more complexity to the API.

Implementing back-pressure without poll_ready

Fundamentally, I think that everything achieved by poll_ready can be achieved without it., and removing it would align much better with Rust's design philosophy of "make easy things easy and hard things possible"[^1].

[^1]: There's probably a better source for this but: Rust API Guidelines - Type Safety

Back-pressure

I am confused by the arguments in #3 and #112, so it's possible I'm missing something here, but moving the code from poll_ready to the start of (the future returned by) call in every Service would not change their behaviour. The only downside to doing this that I can see is that it would prevent you from delaying the construction of the request until the service is ready.

Delayed construction of requests

It's still possible to delay the construction of requests without poll_ready, using the MakeService pattern. You can implement a ReadyService: (()) -> Result<OneshotService, Error>, which gives you the ability to wait for something to be ready before constructing the request. It's easily composable with middleware on both the ReadyService and OneshotService side, and even gives you #412 for free because the OneshotService is, in essence, a token.

Conclusion

One slightly odd result of all this is that Service would become ~identical to FnMut(Request) -> F where F: Future<Output=Result<Response, Error>>, which makes me wonder slightly on what value it's providing. I'm interested to here what I've missed, why poll_ready was implemented in the first place, and whether this is a possible future (pun intended). Sorry this is so long but it's been sitting in my head for a while now and I needed to get it out.

hawkw commented 2 years ago

One significant benefit of poll_ready (and, a reason that not calling call after poll_ready needs to be permitted) is that it permits routing based on readiness. For example, tower::balance can load-balance over only ready replicas, rather than load-balancing over all replicas. Similarly, it permits patterns like sending requests to either a primary or fallback Service depending on whether or not the primary Service is ready (see this code from linkerd2-proxy as an example). These patterns are more challenging to implement in a composable, generic way if waiting for service readiness is moved into the call future.

Also, separating waiting for readiness from handling requests allows backpressure to be pushed to the network layer. For example, because all tower services expose poll_ready, it's pretty easy for a server to implement an accept loop that only tries to accept a new connection when the Service that handles requests on that connection is ready, by awaiting service.ready() before accepting the next connection. This is not easily implemented when waiting for readiness is moved into the future.

I think this may be a better way to explain the discussion around "expensive requests" --- by not accepting incoming connections until the Service is ready to handle them, we can avoid allocating memory for those connections, and any requests we receive on them, when they're not yet able to be handled. Without poll_ready, the service has to keep accepting new connections, and if connections are coming in faster than their requests are being handled, more and more futures waiting for the service to become ready will be created. This can result in performance degradation or the process getting OOM killed.

Does that help explain some of the motivation behind having poll_ready as an integral part of the Service trait?

olix0r commented 2 years ago

As @hawkw mentions, there are some places where we really, truly need poll_ready.

However, I'm definitely sympathetic to the issues you describe about the contract--that cloning invalidates any prior readiness, etc. I've found myself wanting to decompose Service into two traits: one to check readiness and obtain a handle and the other to actually process a request. Something like:

trait PollReady {
  type Call;
  type Error;

  fn poll_ready(&mut self, cx: Context<'_>) -> Poll<Result<Self::Call, Result>>;
}

trait Call<Req> {
  type Response;
  type Error;
  type Future: Future<Output = Result<Self::Response, Self::Error>>;

  fn call(self, req: Req) -> Self::Future;
}

This would use the type system to more elegantly solve the problem that Service::disarm tries to solve--dropping the call handle could release any resources associated with readiness. We've been talking about an approach like this for a while, so it probably warrants some more exploration; but I'm wary that this could end up being more complex.

hawkw commented 2 years ago

At a very high level, I might summarize this by saying that if tower only had the Service::call future, requests could still wait for readiness, but poll_ready allows requesters to make decisions based on readiness.

Potentially, we could also provide a mechanism for making decisions based on a service's readiness by simply exposing a fn ready(&self) -> bool --- this has been proposed previously, I think. However, unlike a simple fn ready, poll_ready also performs the work necessary to drive a service to readiness, and allows a task to be notified when the Service is ready. Performing work in poll_ready permits patterns where a service is made ready without having a request to send, such as having a background task that continually polls a service to readiness. It also provides a place to do work that is not tied to handling a specific request --- for example, tower::balance uses poll_ready to process service discovery updates.

hawkw commented 2 years ago

Related to the split traits approach @olix0r mentions, I remember that a long time ago, tower had a ReadyService trait that could be implemented by services that are always ready. I believe this was ultimately removed to simplify the API (https://github.com/tower-rs/tower/pull/68), but it might be worth looking at again...

davidpdrsn commented 2 years ago

Related to the split traits approach @olix0r mentions, I remember that a long time ago, tower had a ReadyService trait that could be implemented by services that are always ready. I believe this was ultimately removed to simplify the API (#68), but it might be worth looking at again...

Such a trait would be really nice for axum, which faces this exact problem. The workaround I ended on was to use oneshot liberally and assume that services behind routes are always ready.

I agree adding one more trait does add complexity though...

olix0r commented 2 years ago

@davidpdrsn with the split traits, Axum could simply depend on S: Call<Req> + Clone, documenting explicitly that its inner service doesn't do anything to readiness; and we could provide utilities (like Oneshot) that handle turning a Ready into Call. We'd still want most middlewares to implement both so they can be used in stacks that care about readiness. The split would also make it possible for utilities like the ReadyCache to be agnostic of the request type--it would only need a S: PollReady for its inner instances.

I think this is really worth pursuing a bit more deeply, but we'd want to understand more of the ergonomics tradeoffs before committing a substantial amount of time to this.

But I really do think that it's vital to model readiness explicitly. If we just consider HTTP/2, we have stream concurrency limits that we need to honor. A system like tower shouldn't force us to make the innermost part of the stack handle this situation: we explicitly want to surface this state higher up the stack so we can make intentional decisions about whether to eagerly fail requests, route them elsewhere, etc.

Skepfyr commented 2 years ago

Thanks! I'm reasonably convinced that some form of readiness needs to be built in now, but I going to work through things just to check (and hopefully it will help to coalesce some thoughts).

One thing I've realised from reading this is that readiness can only say "I definitely can't handle this" or "I might be able to handle this" (in the general case at least). While I don't think that drastically changes anything, it's useful context I hadn't quite internalised before.

hawkw: One significant benefit of poll_ready (and, a reason that not calling call after poll_ready needs to be permitted) is that it permits routing based on readiness. ... Also, separating waiting for readiness from handling requests allows backpressure to be pushed to the network layer.

I'm not convinced that these require poll_ready (although it would make them simpler), a MakeService like pattern would allow a router or network layer to wait for the actual service to be ready. However it does highlight that the MakeService option wouldn't be quite as transparent as I thought, the outermost service (be it network layer or a client) would have to be aware of it, and you'd probably want an "always ready" wrapper at the end so that everything looked the same.

I have just realised that the MakeService/ReadyService pattern is pretty hard to compose, as you can't just nest them, although that's similar to the awkwardness of composing MakeService's currently.

olix0r: I've found myself wanting to decompose Service into two traits: one to check readiness and obtain a handle and the other to actually process a request.

This is almost exactly composing the MakeService/ReadyService into the type system right? That's probably a plus but I have a suspicion that it's more complicated than poll_ready (with tokens).

hawkw: At a very high level, I might summarize this by saying that if tower only had the Service::call future, requests could still wait for readiness, but poll_ready allows requesters to make decisions based on readiness.

I'm not sure I agree with that, it's perfectly possible to encode readiness in the types as they exist today without poll_ready. However, that's pretty complicated and leads to a bunch of awkward code and it seems that there's many use-cases for readiness based decision making. So I'm leaning towards the answer to "Is poll_ready worth it?" being "Yes", that is, it's not necessary but the complexity of not having it is greater than the complexity of having it (or equivalent).

davidpdrsn: Such a trait would be really nice for axum, which faces this exact problem. The workaround I ended on was to use oneshot liberally and assume that services behind routes are always ready.

I'm not convinced that it would be nice, do you really not want to be able to route to services that can sometimes not be ready? I think this is just a case where you need to lean into readiness meaning "I definitely can't handle this" or "I might be able to handle this". You can return not ready if none of the routes are ready, and otherwise say ready. Otherwise, as that link mentions you'd need a horribly complicated system to tell the caller what's ready (essentially you'd need to see the request, which is what call does).

stuhood commented 2 years ago

A (maybe?) simpler alternative to two traits would be to add a generic parameter to Service which was its "call token" type, and to then return the token type from poll_ready, and take it as an argument to call. Services which didn't have any state could have a token type of unit (()), while a semaphore could directly use its permit type, or a wrapping struct. Perhaps that runs into trouble when stacking services though, since your token type would be a ... GAT?

hawkw commented 2 years ago

@stuhood that approach was proposed in https://github.com/tower-rs/tower/issues/412. At the time, some people looked into what an implementation of that design would look like, and you're correct that, unfortunately; it probably can't be implemented without GATs. Which is a shame, because it would be quite nice otherwise.

Perhaps that proposal can be re-visited if stable Rust ever gets GATs.

stuhood commented 2 years ago

Got it: thanks for the reference. I wonder how @Skepfyr and @olix0r would feel about closing this issue in favor of #412 then? A two trait approach would probably be best discussed as an alternative to waiting for GATs.

Skepfyr commented 2 years ago

poll_ready interacts really poorly with GATs, I had an attempt with a Service trait that looked like this: (ReadyToken = () gives you a pure GAT-ified version of the Service trait)

pub trait Service<Request> {
    type Response;
    type Error;
    type ReadyToken;
    type Future<'a>: Future<Output = Result<Self::Response, Self::Error>>
    where
        Self: 'a;
    fn poll_ready(&mut self, cx: &mut Context<'_>) -> Poll<Result<Self::ReadyToken, Self::Error>>;
    fn call(&mut self, token: Self::ReadyToken, req: Request) -> Self::Future<'_>;
}

ReadyToken itself can't be a GAT as that would prevent you from calling call as it would have an exclusive reference to self. poll_ready can't take &self as you shouldn't be able to call poll_ready in multiple places/times (what do you do if you get 2 different contexts?). call could take &self but you'd have to stock up on ReadyTokens first before starting to call call. As a &mut you can't have multiple concurrent calls without cloning the Service which feels like a really harsh limitation. (It also makes futures become self-referential a lot more making them harder and less safe to implement).

Talking to @hawkw on Discord suggests I misunderstood the GAT requirement, it seems that the idea was that ReadyToken needs to be a GAT but as above that doesn't work. Although I did dig out this suggestion from @jonhoo from 2021-04-01 that's incredibly similar to what's been mentioned already:

trait Service {
  type Token<'s>: ServiceToken;
  fn poll_ready<'s>(&'s mut self) -> Self::Token<'s>;
}
trait ServiceToken<Request> {
  fn call(self, req: Request) -> Self::Future;
}
Skepfyr commented 2 years ago

I had a go at implementing tokens (without GATs) in #631, and I'm much less convinced that the idea in #412 can work, I think the split traits (with GATs?) may be the only way of sensibly encoding readiness.

Skepfyr commented 2 years ago

I've been mulling over this for a while now and Amos' recent(-ish) article (https://fasterthanli.me/articles/futures-nostalgia) has pushed me to come back to it.

I've come to the conclusion that my main issue with the current design is that it doesn't work, specifically concurrency limit has a massive footgun:
If you run the code in this gist which uses the tower ConcurrencyLimit service to limit the number of concurrent requests to 2, it will appear to work correctly. However, if you open 2 connections (eg using ncat) but do not send a request on them, then any more connections will hang indefinitely because the first 2 have reserved but not used their permits, and the rest get stuck waiting for it to become ready.

This is kinda the scenario that disarm in #408 is designed for, but in this situation there isn't a very obvious point for hyper to disarm the service. It clearly should call poll_ready before it gets the request, otherwise poll_ready is mostly pointless, but it must disarm after a while or stalled connections withhold resources from active ones.

I realise that so far I've mostly just been pointing out issues and not really suggested anything constructive so here's a proposal for what the Service trait could look like, and some reasoning as to why I think it solves most of the problems.

pub trait Service<Request> {
    type Response;
    type Error;
    type Future: Future<Output = Result<Self::Response, Self::Error>>;

    fn is_ready(&self) -> Result<bool, Self::Error>;
    fn call(&mut self, req: Request) -> Self::Future;
}

is_ready is now just a normal function that returns if the service is probably able to accept a new request around the time that is_ready is called. You can call it as much or as little as you like (it doesn't reserve any resources); it's essentially a hint. It also can return an error if something has gone wrong although I'm not sure if it should do that. call now has to do some of the waiting that would have happened in poll_ready but I'm going to guess that there's no meaningful difference between waiting for a remote and to reply and waiting for something local to be ready.

On driving a service to readiness: this removes the ability for trivially "driving" a service while you don't have a request. In general you only call poll_ready when you're "about to have" a request, either you already have one and are calling poll_ready because you have to, or a connection is open and you're expecting one very shortly (as above if you're not expecting one shortly then you shouldn't call poll_ready). This means you can't trust poll_ready to be called regularly (what if all the service are ready and waiting), so poll_ready can't be used for anything time-sensitive like keepalives, but it also means that any work you do in poll_ready is probably delaying a request by the same amount as if you had done the work in call.

On backpressure: poll_ready is also used as a mechanism to provide backpressure, it enables you to not submit new requests if the service isn't ready to handle them. is_ready provide some of this but for clients being able to wait for readiness is helpful. However, none of this actually works; it's perfectly possible for a service to be "ready" but still processing requests slower than it's receiving them, so you can't rely on poll_ready providing back_pressure. If you want to implement backpressure properly, at the moment you must combine information on in-flight requests and readiness, which isn't obvious in the current API but is essentially the only option with is_ready.

mattklein123 commented 2 years ago

Sorry to hijack this issue with a question, but I wanted to come back to a comment in the original description:

Note that this is basically impossible to avoid at the moment, even if you write service.ready().await?.call(request).await?, the future might be dropped after ready has returned but before call has been polled.

I'm new to the ecosystem but I want to make sure I fully understand this. Is this possible because the executor isn't required to immediately poll the future returned by call()?. Does this happen in practice or do some executors (tokio specifically) always do an immediate poll to avoid this case?

The reason I'm asking this is I have written code like this (a shared upstream/egress service that is being called in response to downstream/ingress requests). There is definitely the chance that the downstream/ingress request may drop at any time, so now I'm concerned that there is an implicit leak in my code and I'm not quite sure what to do about it.

Is there a pattern I am supposed to use instead of service.ready().await?.call(request).await?? (It's clear to me why calling ready() and then pending for some other reason and not calling call() immediately can be problematic.)

(I originally found this issue as I too was confused about the existing trait design.)

davidpdrsn commented 2 years ago

I think it's a good pattern to follow. I haven't run into any leaks myself because of it. Rust code tends to be pretty good at freeing resources when they're dropped.

mattklein123 commented 2 years ago

I think it's a good pattern to follow. I haven't run into any leaks myself because of it. Rust code tends to be pretty good at freeing resources when they're dropped.

But dropping is not the issue here, right? The issue is if some shared state is acquired without calling call() it will leak. I don't think dropping fixes that unless each service is carefully written that when it is dropped it will release shared state. And maybe this is the case for officially supported layers?

For example: https://github.com/tower-rs/tower/blob/5064987ffec4420ef6ea6de649606286370e42bc/tower/src/limit/concurrency/service.rs#L18-L23

When dropped will release the permit, but if someone wrote a layer that say used a counter that expected to be balanced via call(), that call() might never happen even in the above call sequence?

I'm mostly just trying to make sure I 100% understand the implicit assumptions that are being made here and it sounds like it is a requirement that when a service drops it must release any resources from a missing call()? (Again assuming the pattern of fast clones of services that might be using shared underlying state like connection pools or whatever.)

davidpdrsn commented 2 years ago

But dropping is not the issue here, right? The issue is if some shared state is acquired without calling call() it will leak.

Tokio'a semaphore releases the permit when it's dropped so that's not an issue with ConcurrencyLimit.

I'm sure it's theoretically possible if you write your service poorly but I'm just saying I've never heard about it happening.

hawkw commented 2 years ago

@mattklein123:

Typically, any resource acquired in poll_ready and consumed in call will be moved into the returned call future. For example, in the ConcurrencyLimit service you referenced above, the permit is moved into the returned call future: https://github.com/tower-rs/tower/blob/5064987ffec4420ef6ea6de649606286370e42bc/tower/src/limit/concurrency/service.rs#L83-L92

If that future is canceled before it completes (say, by a timeout or something), the permit is still dropped when the future is dropped.

But dropping is not the issue here, right? The issue is if some shared state is acquired without calling call() it will leak. I don't think dropping fixes that unless each service is carefully written that when it is dropped it will release shared state. And maybe this is the case for officially supported layers?

If shared state is acquired in poll_ready, then the assumption is that the shared state is released when the service is dropped if the shared state is unconsumed by a call call. If the service is cloned and driven to readiness, but never called, that service clone will be dropped (e.g. if the future that drives the service to readiness and then calls it is cancelled), releasing the shared state.

I think the point @davidpdrsn is trying to make is that msot shared resource types will provide some form of RAII guard, such as a MutexGuard or a semaphore Permit, and dropping those RAII guards will release the resource. If the service must acquire the resource in poll_ready and consume it in call, then it follows that the acquired RAII value will be stored within the Service instance in between poll_ready and call, so it will be dropped when the Service clone is dropped.

If you're implementing your own shared resource (e.g. the balanced counter you described in your comment), yes, you would be expected to use a similar RAII construct to ensure that shared resource is released in drops. But, this is the general Rust idiom for such code, which is what I think David is getting at. You would need to do this regardless of the design of the Service trait --- if you were not using tower and implementing a similar request counter that's incremented when a request begins and decremented when a request completes, you would need to ensure the counter is decremented when the request Future is dropped, since Futures may be dropped without completing[^1]. This is less a consequence of tower's design and more a consequence of the design of Rust async in general.

[^1]: This is what allows us to do things like attach timeouts, graceful shutdown signals, etc to arbitrary Futures in a generic way.

mattklein123 commented 2 years ago

@hawkw thanks for the lengthy explanation. Yes, this all makes sense to me and now that I understand the contract it's not difficult to adhere to. I guess my main comment as someone new to the ecosystem is that it was not easy for me to get to this understanding of how poll_ready() interacts with call() and per the original issue description there are some implicit foot guns (especially as it was definitely not clear to me that service.ready().await?.call(request).await? could result in call() not getting polled after ready() resolves).

I can propose a documentation improvement PR for poll_ready() if that would be accepted?

hawkw commented 2 years ago

I can propose a documentation improvement PR for poll_ready() if that would be accepted?

Sure, we'd love any docs improvements!

mattklein123 commented 2 years ago

I took a stab here: https://github.com/tower-rs/tower/pull/662

Michael-J-Ward commented 2 years ago

It appears like tower has inspired an imitator. https://github.com/cloudwego/motore

I'm sharing here because they might have an interesting design option that I haven't seen considered here.

pub trait Service<Cx, Request> {
    /// Responses given by the service.
    type Response;
    /// Errors produced by the service.
    type Error;

    /// The future response value.
    type Future<'cx>: Future<Output = Result<Self::Response, Self::Error>> + Send + 'cx
    where
        Cx: 'cx,
        Self: 'cx;

    /// Process the request and return the response asynchronously.
    fn call<'cx, 's>(&'s mut self, cx: &'cx mut Cx, req: Request) -> Self::Future<'cx>
    where
        's: 'cx;
}
LucioFranco commented 2 years ago

I got to a similar spot

#![feature(type_alias_impl_trait)]

use std::future::Future;

pub trait Service<Req> {
    type Res;
    type Err;
    type Future<'a>: Future<Output = Result<Self::Res, Self::Err>> + 'a
    where
        Self: 'a;

    fn call<'a>(&'a self, req: Req) -> Self::Future<'a>;
}