status-im / nim-websock

Websocket for Nim

Closing a session should not block #78

Menduist opened this issue 3 years ago

Menduist commented 3 years ago

If we follow the logic of TCP etc., a close call on a session should not block; it should instead close in the background (for instance with asyncSpawn).

It would also make sense to add a timeout to the close operation, since a non-responsive peer could otherwise block it indefinitely.

dryajov commented 3 years ago

closing is definitely non-blocking - what you see is the async recv pulling data from the socket until we read the response to the close message, but it's async and not blocking.

We've come to learn that timeouts are (almost) always better handled by the calling application and not internally. In other words, if a timeout for close is desired then the calling application can always do something like:

  await ws.close().wait(1.minutes)

This should cancel the close on expiration and tear down everything associated with the session.

That said, cancellation is not tested and it's definitely something to add sooner rather than later, along with chronos trackers (#72).

arnetheduck commented 3 years ago

does close raise? wait itself raises on timeout; a more convenient option is often withTimeout
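
For reference, a rough sketch of the two caller-side patterns in chronos (assuming ws is an open WSSession whose close() returns a Future[void]):

  import chronos
  # import websock/websock  # for WSSession; illustrative only

  proc closeRaising(ws: WSSession) {.async.} =
    # wait() cancels the close and raises AsyncTimeoutError on expiry
    try:
      await ws.close().wait(1.minutes)
    except AsyncTimeoutError:
      discard

  proc closeNonRaising(ws: WSSession) {.async.} =
    # withTimeout() never raises; it returns false if the deadline passed
    if not(await ws.close().withTimeout(1.minutes)):
      discard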

Menduist commented 3 years ago

Sorry, when I say "blocking", I mean in the context of async, but I should say "closing is not instantaneous"

So with the current setup of nim-websock, if an application wants to close a connection, it should do something like asyncSpawn(peer.disconnect(FaultOrError)) (which is actually done in some places in nimbus)

But it's not done everywhere: here it's simply awaited in a conn event handler, and the conn event handler is awaited inside the connection manager, which will now block whichever function called triggerConnEvent.

So we should either:

  1. State explicitly that closing a connection can take a few seconds, and that the application should handle it (using asyncSpawn)
  2. Handle it in libp2p
  3. Handle it in nim-websock

For me, 3 is the best option. 1 seems risky, because it's the type of issue that could clog up an application's main loop unexpectedly (especially since this only happens with user space transports!). 2 could also work, but imo it's better to use the same logic here as the other transports in the stack, which is to have an instantaneous close and clean up in the background.

And so if we go with 3, nim-websock will have to do something like:

proc close(s: Sock) =
  if s.closed: return
  s.closed = true
  asyncSpawn(s.actuallyClose())  # finish the close handshake in the background

So the end application won't be able to specify a timeout for actuallyClose (and it shouldn't: just like with TCP and other transports, the timeout should basically be a sockopt), so nim-websock should be doing it internally (see the sketch below).
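
A possible shape for that internal timeout, building on the sketch above and using chronos's withTimeout (closeTimeout, actuallyClose and stream are all hypothetical names; closeTimeout would be the "sockopt"-style setting):

  proc close(s: Sock) =
    if s.closed: return
    s.closed = true
    proc background() {.async.} =
      # try the graceful close handshake, bounded by an internal timeout
      if not(await s.actuallyClose().withTimeout(s.closeTimeout)):
        # handshake took too long: just tear down the raw transport
        await s.stream.closeWait()
    asyncSpawn background()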

To recap, I think this library should copy the behavior of kernel space protocols as much as possible, to avoid unexpected behavior for the applications using it. And in kernel space network protocols, close is always instantaneous, unless specific options are used (e.g. SO_LINGER, which waits for pending data to be sent).

arnetheduck commented 3 years ago

consider that the OS also doesn't close things at once - if you are waiting for a "read" event, you will receive the notification after the socket has closed if you close it from somewhere else. The one thing we've done in libp2p is that we tend to ignore errors during close and drop the future / spawn it, even if, say, there are writes queued etc.
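
That spawn-and-ignore pattern could look roughly like this (spawnAndForget is a made-up helper, not an existing libp2p api):

  import chronos

  proc spawnAndForget(fut: Future[void]) =
    # run `fut` to completion in the background and swallow any error
    proc wrap() {.async.} =
      try:
        await fut
      except CatchableError:
        discard  # errors during close are deliberately ignored
    asyncSpawn wrap()

  # usage (hypothetical): spawnAndForget(conn.close())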

dryajov commented 3 years ago

"closing is not instantaneous"

libp2p assumes half-closed (mplex/yamux/etc) streams, meaning that a full close is not performed until all data has been sent/read, so there is (almost) always a delay between calling close and the actual close completing. On top of that, the application doesn't have access to the actual underlying socket/connection, so it is always calling close on an LPChannel (or something else in the future, e.g. yamux, etc...)

When talking about closing the actual raw socket/connection, that only happens when the transport, connection manager or muxer is shutting down - nothing else has access to it right now.

On top of that, there is already a failsafe timeout at the Connection level - https://github.com/status-im/nim-libp2p/blob/master/libp2p/stream/connection.nim#L83-L87. The timeout can be hooked and cleanup can be performed there; for example, ChronosStream calls its own close explicitly once the inactivity timeout expires - https://github.com/status-im/nim-libp2p/blob/master/libp2p/stream/chronosstream.nim#L46-L54.
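
Hooking that failsafe would look roughly like this (a sketch based on the linked code; timeoutHandler is the callback the Connection's timeout monitor is assumed to invoke):

  import chronos
  # import libp2p/stream/connection  # for Connection / timeoutHandler

  proc hookIdleClose(conn: Connection) =
    conn.timeoutHandler = proc(): Future[void] {.async.} =
      # invoked once the inactivity timeout expires
      await conn.close()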

All this to say that fundamentally nothing has changed with the introduction of websockets; the only difference right now is that when calling close on the WSSession, the application should make sure that it's done with a timeout. So the actual websocket stream implementation needs to do the aforementioned close().wait()/close().withTimeout() to prevent it from hanging.

In the majority of cases, when there is a public api available, as is the case here with close, you want a timeout attached to the public api. This is mostly because the calling application usually has a lot more context to decide how and when (or if at all) to trigger the timeout and initiate a cancellation teardown.

There are some potential exceptions to this rule - for example, it's impossible to attach a timeout to an accept-type api, simply because it operates under a different set of assumptions (i.e. don't return unless ready or finished) - but for most other cases, 99% of the time you'd want a timeout attached to the public api rather than buried somewhere in the internals of the call sequence.

dryajov commented 3 years ago

As a complement, and to confirm that we're doing it right, here is an example of how to perform close using Berkeley sockets as an analogy - https://datatracker.ietf.org/doc/html/rfc6455#section-7.1.1

As an example of how to obtain a clean closure in C using Berkeley sockets, one would call shutdown() with SHUT_WR on the socket, call recv() until obtaining a return value of 0 indicating that the peer has also performed an orderly shutdown, and finally call close() on the socket.

This is exactly what close is doing right now.
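
Translated roughly into Nim with the posix module (blocking and illustrative only; fd is an already-connected socket):

  import posix

  proc cleanClose(fd: SocketHandle) =
    discard shutdown(fd, SHUT_WR)   # half-close: we won't send anything else
    var buf: array[1024, byte]
    while recv(fd, addr buf[0], buf.len, 0) > 0:
      discard                       # drain until the peer closes (recv returns 0)
    discard close(cint(fd))         # finally release the descriptor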

Menduist commented 3 years ago

Discussed at today's meeting. In the end it's just a naming misunderstanding: for me, close should be instantaneous, and there should be a closeWait which actually waits for the connection to be closed. We might just rename close to closeWait to stay consistent with TCP, but not sure.

And I'll add a timeout inside libp2p

dryajov commented 3 years ago

for me close should be instantaneous, and there should be a closeWait which actually waits for the connection to be closed.

Just to clarify: in chronos, calling close is synchronous, but you are expected to call join right after. In essence, closeWait() does exactly that: it calls close() and then awaits join().
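
For illustration, with a chronos StreamTransport that convention looks like this:

  import chronos

  proc teardown(transp: StreamTransport) {.async.} =
    transp.close()        # synchronous: only schedules the close
    await transp.join()   # completes once the transport is actually released

    # closeWait() bundles the two steps into a single call:
    # await transp.closeWait()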