Menduist opened this issue 3 years ago
closing is definitely non-blocking - what you see is the async recv pulling data from the socket until we read the response to the close message, but it's async and not blocking.
We've come to learn that timeouts are (almost) always better handled by the calling application and not internally. In other words, if a timeout for `close` is desired, the calling application can always do something like:

await ws.close().wait(1.minutes)

This should cancel the close on expiration and tear down everything associated with the session.
That said, cancellation is not tested and it's definitely something to add sooner rather than later, along with chronos trackers (#72).
does `close` raise?

`wait` itself raises on timeout; a more convenient option is often `withTimeout`.
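Roughly, the two call-site options look like this (sketch only; `closeFut` stands in for the future returned by `ws.close()`):

```nim
import chronos

# minimal sketch: `closeFut` stands in for the future returned by ws.close()
proc boundByWait(closeFut: Future[void]) {.async.} =
  try:
    await closeFut.wait(1.minutes)     # raises AsyncTimeoutError (and cancels) on expiry
  except AsyncTimeoutError:
    echo "close timed out"

proc boundByWithTimeout(closeFut: Future[void]) {.async.} =
  if not await closeFut.withTimeout(1.minutes):  # returns false on expiry, no raise
    echo "close timed out"
```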
Sorry, when I say "blocking", I mean in the context of async; I should have said "closing is not instantaneous".

So with the current setup of nim-websock, if an application wants to close a connection, it should do something like

asyncSpawn(peer.disconnect(FaultOrError))

(which is actually done in some places in nimbus). But it's not done everywhere: here it's simply awaited in a conn event handler, and the conn event handler is awaited inside the connection manager, which will now block whichever function called `triggerConnEvent`.
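For illustration, the difference between the two patterns (types and names here are hypothetical stand-ins for the nimbus/libp2p code, not actual APIs):

```nim
import chronos

# hypothetical stand-ins for the nimbus/libp2p pieces mentioned above
type
  DisconnectionReason = enum
    FaultOrError
  Peer = ref object

proc disconnect(p: Peer, reason: DisconnectionReason) {.async.} =
  discard  # would run the (possibly slow) websocket close handshake

# fire-and-forget: returns immediately, the close runs in the background
proc onConnEventNonBlocking(p: Peer) {.async.} =
  asyncSpawn p.disconnect(FaultOrError)

# awaited: whoever awaits this handler (e.g. via triggerConnEvent) is blocked
# until the close handshake completes
proc onConnEventBlocking(p: Peer) {.async.} =
  await p.disconnect(FaultOrError)
```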
So we should either:
For me, 3 is the best option. 1 seems risky, because it's the type of issue that could clog up an application main loop unexpectedly (especially since this only happens with user-space transports!). 2 could also work, but imo it's better to use the same logic here as the other transport stacks, which is to have an instantaneous close and clean up in the background.
And so if we go with 3, nim-websock will have to do something like

proc close(s: Sock) =
  if s.closed: return
  s.closed = true
  asyncSpawn(s.actuallyClose())  # run the close handshake in the background
So the end application won't be able to specify a timeout to `actuallyClose` (and it shouldn't; just like TCP or other transports, the timeout should basically be a sockopt), so nim-websock should be doing it internally.
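Something along these lines, purely as a sketch (`Sock`, `closeHandshake` and the 5-second bound are illustrative, not the actual nim-websock internals):

```nim
import chronos

# illustrative only; not the actual nim-websock implementation
type Sock = ref object
  closed: bool

proc closeHandshake(s: Sock) {.async.} =
  discard  # send the close frame, then drain until the peer's close frame arrives

proc actuallyClose(s: Sock) {.async.} =
  # bound the handshake internally, the way a kernel sockopt/timeout would
  if not await s.closeHandshake().withTimeout(5.seconds):
    discard  # peer unresponsive: give up and tear down the underlying transport
```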
To recap, I think this library should copy as much as possible the behavior of kernel-space protocols, to avoid unexpected behavior for the applications using it. And in kernel-space network protocols, `close` is always instantaneous, unless specific options are used (e.g. `SO_LINGER`, which waits for pending data to be sent).
consider that the OS also doesn't close things at once - if you are waiting for a "read" event, you will still receive the notification after the socket has been closed from somewhere else. The one thing we've done in libp2p is that we tend to ignore errors during close and drop the future / spawn it, even if, say, there are writes queued, etc.
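That pattern, as a minimal sketch (names are placeholders):

```nim
import chronos

# placeholder types; the point is the pattern: swallow errors during close
# and detach the resulting future instead of awaiting it
type Conn = ref object

proc rawClose(c: Conn) {.async.} =
  discard  # underlying close, possibly with writes still queued

proc dropClose(c: Conn) =
  proc worker() {.async.} =
    try:
      await c.rawClose()
    except CatchableError:
      discard  # errors during close are ignored
  asyncSpawn worker()
```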
"closing is not instantaneous"
libp2p assumes half-closed streams (mplex/yamux/etc), meaning that a full close is not performed until all data has been sent/read, so there is (almost) always a delay between calling `close` and the actual close completing. On top of that, the application doesn't have access to the actual underlying socket/connection, so it is always calling `close` on an `LPChannel` (or something else in the future, e.g. yamux, etc.).
When talking about closing the actual raw socket/connection, closing happens only when the transport, connection manager or muxer is shutting down - nothing else has access to it right now.

On top of that, there is already a failsafe timeout at the `Connection` level - https://github.com/status-im/nim-libp2p/blob/master/libp2p/stream/connection.nim#L83-L87. The timeout can be hooked and cleanup can be performed there; for example, `ChronosStream` calls its own `close` explicitly once the inactivity timeout expires - https://github.com/status-im/nim-libp2p/blob/master/libp2p/stream/chronosstream.nim#L46-L54.
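Hooking that timeout could look roughly like this - treat the exact field name and signature as an assumption based on the linked code:

```nim
# assumption: `conn.timeoutHandler` is the callback fired by the inactivity
# timeout in connection.nim; the exact signature may differ
conn.timeoutHandler = proc(): Future[void] {.async.} =
  # do any extra cleanup here, then close the stream, as ChronosStream does
  await conn.close()
```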
All this to say that fundamentally nothing has changed with the introduction of websockets; the only difference right now is that when calling `close` on the `WSSession`, the application should make sure that it's done with a timeout. So the actual websocket stream implementation needs to do the aforementioned `close().wait()`/`close().withTimeout()` to prevent it from hanging.
In the majority of cases, when there is a public api available, as is the case here with `close`, you want a timeout attached to the public api. This is mostly because the calling application usually has a lot more context to decide how and when (or if at all) to trigger the timeout and initiate a cancellation teardown.

There are some potential exceptions to this rule; for example, it's impossible to attach a timeout to an `accept`-type api, simply because it operates under a different set of assumptions - i.e. don't return unless ready or finished. But for most other cases, 99% of the time you'd want a timeout attached to the public api rather than buried somewhere in the internals of the call sequence.
As a complement, and to confirm that we're doing it right, here is an example of how to perform close using Berkeley sockets as an analogy - https://datatracker.ietf.org/doc/html/rfc6455#section-7.1.1

> As an example of how to obtain a clean closure in C using Berkeley sockets, one would call shutdown() with SHUT_WR on the socket, call recv() until obtaining a return value of 0 indicating that the peer has also performed an orderly shutdown, and finally call close() on the socket.

This is exactly what `close` is doing right now.
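For reference, the same recipe as a short Nim sketch over the raw posix calls (error handling omitted):

```nim
import std/posix

# sketch of the RFC 6455 clean-closure recipe on a raw, already-connected socket
proc cleanClose(fd: SocketHandle) =
  discard shutdown(fd, SHUT_WR)          # stop sending; the peer sees EOF
  var buf: array[1024, char]
  while recv(fd, addr buf[0], buf.len, 0) > 0:
    discard                              # drain until the peer also shuts down (recv returns 0)
  discard close(fd.cint)                 # finally release the descriptor
```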
Discussed at today's meeting:

In the end it's just a naming misunderstanding: for me `close` should be instantaneous, and there should be a `closeWait` which actually waits for the connection to be closed.

We might just rename `close` to `closeWait` to stay consistent with TCP, but not sure. And I'll add a timeout inside libp2p.
> for me `close` should be instantaneous, and there should be a `closeWait` which actually waits for the connection to be closed.
Just to clarify: in chronos, calling `close` is synchronous, but you are expected to call `join` right after. In essence, `closeWait()` does exactly that: it calls `close()` and then `await join()`.
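For example, with a chronos `StreamTransport`:

```nim
import chronos

# the chronos convention described above, spelled out
proc shutDownLong(transp: StreamTransport) {.async.} =
  transp.close()            # synchronous: starts the close
  await transp.join()       # wait until the close has actually completed

proc shutDownShort(transp: StreamTransport) {.async.} =
  await transp.closeWait()  # shortcut for close() + join()
```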
If we follow the logic of TCP etc., a `close` call on a session should not block, and should instead close in the background (for instance with `asyncSpawn`).

It would also make sense to add a timeout to the close operation, since a non-responsive peer could block indefinitely.