libp2p / rust-libp2p

The Rust Implementation of the libp2p networking stack.
https://libp2p.io
MIT License
4.56k stars 949 forks source link

swarm: connection management behaviours can lead to inconsistent state #4870

Open thomaseizinger opened 11 months ago

thomaseizinger commented 11 months ago

In #4777, we discovered that a "connection management" behaviour like libp2p-connection-limits can lead to inconsistent state if it is not applied "first" in the tree of behaviours.

That is because the handle_ functions take &mut self and thus allow modifying the state of a NetworkBehaviour, potentially pre-loading a handler with state. If a behaviour deeper in the tree ends up rejecting the connection, that state is lost.

Preloading handlers is useful to ensure they don't "instant timeout" based on keep-alive after a connection has been established.

The underlying design issue here is that the composition of handle_ functions leads to an inconsistent view across the state of NetworkBehaviours. By having them take &self, this issue would not occur (unless the user users internal mutability but that we can't guard against).

Somehow, we'd need to find a way to still pass data to the handler but only in case the connection is fully established. Perhaps we should find a way to mitigate the "instant shutdown" in a different way by not starting the Connection task until the ConnectionEstablished event has been dispatched to the behaviour?

mxinden commented 11 months ago

Thank you for documenting this. Unfortunate that we did not catch this earlier.

Two alternative suggestions:

  1. Only allow denying a pending connection, not an established connection. In other words change the signature of handle_established_inbound_connection and handle_established_outbound_connection. Arguably a recess for connection management, though potentially worth the removal of this footgun.

    diff --git a/swarm/src/behaviour.rs b/swarm/src/behaviour.rs
    index c25b14e75..434fe5c3c 100644
    --- a/swarm/src/behaviour.rs
    +++ b/swarm/src/behaviour.rs
    @@ -149,13 +149,13 @@ pub trait NetworkBehaviour: 'static {
         fn handle_established_inbound_connection(
             &mut self,
             _connection_id: ConnectionId,
             peer: PeerId,
             local_addr: &Multiaddr,
             remote_addr: &Multiaddr,
    -    ) -> Result<THandler<Self>, ConnectionDenied>;
    +    ) -> THandler<Self>;
    
         /// Callback that is invoked for every outbound connection attempt.
         ///
         /// We have access to:
         ///
         /// - The [`PeerId`], if known. Remember that we can dial without a [`PeerId`].
    @@ -185,13 +185,13 @@ pub trait NetworkBehaviour: 'static {
         fn handle_established_outbound_connection(
             &mut self,
             _connection_id: ConnectionId,
             peer: PeerId,
             addr: &Multiaddr,
             role_override: Endpoint,
    -    ) -> Result<THandler<Self>, ConnectionDenied>;
    +    ) -> THandler<Self>;
  2. At a first glance, one could call ConnectionHandler::poll_close on a denied ConnectionHandler. This would allow the ConnectionHandler to pass the pre-loaded events back to the NetworkBehaviour. Though that is not possible, given that in Swarm we don't have access to each ConnectionHandler, but only the root ConnectionHandler, which never comes into existence, given that the connection is being denied. Documenting here for completeness sake.
thomaseizinger commented 11 months ago

I don't think (1) is a suitable option as it makes it impossible to implement block lists for example.

In combination with a default, non-zero idle timeout, perhaps just passing data to the handler in ConnectionEstablished is the best option?

Or, we simply specialise the "connection just got established"-case. Arguably, a user should be able to react to the SwarmEvent::ConnectionEstablished and use it, even if no NetworkBehaviour "needs" the connection.

I think you suggested something like this at some point @mxinden when we talked about the idle-connection-timeout.

What do you think of a 1(?)-second, unconfigurable delay before considering a just-established connection idle? It feels a bit wrong because picking a delay is inventiably sensitive to the environment we are running it. Yet, it seems the current situation is not good and I'd like to change the above to &self.