penumbra-zone / tower-abci

Tower-based ABCI interface.
MIT License
75 stars 24 forks source link

services err handling with backoff when overloaded #39

Closed tzemanovic closed 10 months ago

tzemanovic commented 10 months ago

This PR adds:

To achieve this, the socket's stream and sink must be held across await points to keep them alive, so I ended up wrapping them in std::sync::Mutex though the access should always be exclusive anyway. Additionally, the request stream is made peekable so that we're not removing requests that must be handled and only popped once a service call is successful.

hdevalence commented 10 months ago

Why does this need to be added to the socket handling in tower-abci? How does tower-abci know what the correct behavior for the application should be? The original design was to allow applications to layer in additional behaviors with Tower middleware:

Because each category of requests is handled by a different service, request behavior can be customized on a per-category basis using Tower Layers. For instance, load-shedding can be added to InfoRequests but not ConsensusRequests, or different categories can have different timeout policies, or different types of instrumentation.

Does that not work? Should we document it better?

tzemanovic commented 10 months ago

@hdevalence unless I'm missing something, I don't think there is any way to handle this in an app. When a service call fails, the error ends up panicking in the unwrap in tokio::spawn(async move { conn.run(read, write).await.unwrap() });. For the mempool and consensus services, there is no other option then to retry later. If there is no response or the response is an exception, it makes CometBFT crash before it even gets to the app, so I think it makes sense to handle overload here. For the info service, there is more flexibility, so maybe we could provide some way to choose how to handle its errors.

Unrelated, I realized that the changes here will force flushing after every request which is not ideal, so I'll try to improve that.

xla commented 10 months ago

Surprised by this change as well, besides @hdevalence point that this is probably not the right point of integration I'd like to understand the motivation better. Could you outline a scenario where backoff/retry is improving QoS between the app and CometBFT while not violating the protocol?

tzemanovic commented 10 months ago

@xla Say you e.g. set .load_shed() on the Info service in an app. The call to the Service is issued by tower-abci and when it fails with tower::load_shed::error::Overloaded, the call to Connection.run returns the error and subsequently panics on the unwrap in conn.run(read, write).await.unwrap(). As this happens internally, there is no way for an app to handle it (even if it was catching panics, the sockets get closed so there is no way to recover from it)

tzemanovic commented 10 months ago

Looking back at the code, I think the most reasonable thing would probably be to only shed info request (and maybe also snapshot) on overload by returning exception response without doing any backoff. For mempool service, it seems that load_shed is not much use not responding to its requests makes cometBFT crash

tzemanovic commented 10 months ago

closing in favor of #40