Closed RemiCardona closed 5 years ago
Great analysis @RemiCardona.
I have a very similar issue where `recv()` raises the same `AssertionError` instead of `CancelledError` when the task is cancelled. The issue does not always occur, and it is still present in v6.0. Let me know if there is a way I can help.
Would the pattern of using a single flow of logic with finally and except blocks, to ensure that the state of things is correct in the face of various types of errors, help resolve (and, in the future, reason about) these issues? IIRC, a while back I was suggesting this pattern instead of using independent, parallel tasks, which give you fewer guarantees with respect to ordering and completion.
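For illustration, here is a minimal sketch of that single-task pattern in plain asyncio. Everything here (`run_connection`, the queue-based fake receiver, the intervals) is hypothetical, not websockets code; the point is that receive handling, liveness pings, and cleanup all sit in one flow of logic guarded by `try`/`finally`, instead of racing across sibling tasks:

```python
import asyncio

async def run_connection(recv_one, send_ping, events):
    """Single task owns the whole lifecycle: receive loop, liveness
    pings, and cleanup all live in one flow of logic."""
    try:
        while True:
            try:
                msg = await asyncio.wait_for(recv_one(), timeout=0.05)
            except asyncio.TimeoutError:
                events.append("ping")      # quiet for too long: probe liveness
                await send_ping()
                continue
            if msg is None:                # peer closed the connection
                events.append("closed")
                break
            events.append(("msg", msg))
    finally:
        events.append("cleanup")           # runs exactly once, on any exit path

async def demo():
    events = []
    queue = asyncio.Queue()
    await queue.put("hello")
    await queue.put(None)                  # sentinel: peer closed

    async def recv_one():
        return await queue.get()

    async def send_ping():
        pass

    await run_connection(recv_one, send_ping, events)
    return events

events = asyncio.run(demo())
print(events)  # -> [('msg', 'hello'), 'closed', 'cleanup']
```

Cancellation also lands in exactly one place with this shape: it unwinds through the single `try`, and the `finally` still runs.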
A few thoughts -- I'll investigate further at not-11pm.
If the TCP connection is open (i.e. the kernel doesn't consider that it timed out; kernel timeouts are large) but bytes don't come through, then you can expect that closing the connection will be slow and will only happen after a timeout. This is especially true on the client side because it's supposed to wait for the server to close the TCP connection (search for TIME_WAIT in RFC 6455 if you're curious).
You can reduce the close timeout with the `timeout` argument if you'd like connections to be closed faster in these circumstances. At worst you will get a 1006 close code instead of something more proper; most likely you don't care. A very small timeout will result in connections always being closed forcibly.
If you want to get hacky, you can set `self.timeout = 0.001` on protocol instances that you want to kill quickly. Or you can grab a reference to the socket and abort it; that will probably be messier.
I suspect websockets doesn't track the state sufficiently precisely during the closing handshake. It only tracks the OPEN / CLOSING / CLOSED states described by the RFC. But there's a state machine with more states hidden in the current implementation.
`fail_connection` is a private API. You should be using `close`. That said, `fail_connection` could be called in other circumstances and create a similar problem.
Your expectations about how `fail_connection` handles the close code and reason aren't correct. When you call `close()` or `fail_connection()`, you define the close frame that is *sent*. Perhaps the other endpoint will echo the same code and reason, but that isn't guaranteed.
Per the spec, the connection close code and reason are defined by the close frame that is *received*. If no close frame is received, the close code is 1006.
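That rule is small enough to state as code. This helper is purely illustrative (the name `effective_close_code` is mine, not a websockets API), but it encodes exactly the behavior described above:

```python
from typing import Optional, Tuple

def effective_close_code(received_close: Optional[Tuple[int, str]]) -> Tuple[int, str]:
    """Per RFC 6455, the connection close code and reason come from the
    close frame that was *received*; if no close frame arrived at all,
    the code is 1006 (abnormal closure) with an empty reason."""
    if received_close is None:
        return 1006, ""
    return received_close

print(effective_close_code((1000, "bye")))  # -> (1000, 'bye')
print(effective_close_code(None))           # -> (1006, '')
```

Note that the code you pass when *sending* the close frame never appears here: only what the peer sends back (or the absence of it) matters.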
It would be interesting to test if the built-in implementation of "ping every N seconds and disconnect if the pong is lost" is also vulnerable to this problem. I'm obviously luring you into running master in production because, unlike you, I don't have a large number of devices connected to a 3G network in developing countries.
Hitting an AssertionError is bad. It's a bug.
You don't have to serialize `ping` and `recv`. websockets must be coroutine-safe; if it isn't, that's a bug.
Based on code inspection, here's what I think is happening:

1. one task is blocked in `recv`
2. another task calls `fail_connection` with code 1006
3. this cancels `transfer_data_task` while it's waiting for `read_message`
4. `transfer_data_task` terminates as "cancelled"
5. the `wait` call in https://github.com/aaugustin/websockets/blob/5.0.1/websockets/protocol.py#L336-L338 returns because `transfer_data_task` was cancelled
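The mechanics of that last step can be reproduced with plain asyncio, without any websockets code. This is a minimal sketch (the `transfer_data` coroutine is a stand-in, not the real `transfer_data_task`): `asyncio.wait()` returns normally when the awaited task finishes by cancellation, so the awaiting coroutine has to check for that outcome explicitly or it will trip over an unexpected state:

```python
import asyncio

async def demo():
    async def transfer_data():
        # Stand-in for transfer_data_task blocked in read_message().
        await asyncio.sleep(3600)

    task = asyncio.create_task(transfer_data())
    await asyncio.sleep(0)   # let the task start and block
    task.cancel()            # roughly what fail_connection() triggers

    # recv() awaits the task via asyncio.wait(); wait() returns normally
    # even though the task finished by *cancellation*, so the caller must
    # handle that case, not assume a message arrived.
    done, pending = await asyncio.wait([task])
    return task.cancelled(), len(done), len(pending)

cancelled, n_done, n_pending = asyncio.run(demo())
print(cancelled, n_done, n_pending)  # -> True 1 0
```

If the code after `wait()` asserts on some invariant that only holds when the task completed normally, a cancelled task lands exactly on that assert.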
Conclusion: don't call `fail_connection()` with the default code (1006) unless:

a. you changed the connection state to `CLOSING`, typically by writing a close frame, or
b. you know that the connection is dead already, typically because you hit `ConnectionError`
All calls to `fail_connection` in version 5.0.1 and in master seem to meet these criteria.
So I stand by this advice from my initial reaction: `fail_connection` is a private API. You should be using `close`.
@levesquejf I think you're hitting a different issue because, unlike @RemiCardona, you aren't calling `fail_connection`.
I'd like to look into this, but I need more information about "the task is cancelled": which task are you talking about?
It would be best to open a separate issue if you're willing to provide more details.
@cjerdonek I'm still 80 / 20 in favor of the current design with a separate task to close the connection :-)
The problem discussed here is independent from that design: it only involves the "transfer_data" task and code run by the user. We don't reach the stage where the "close_connection" task kicks in.
This is an ongoing investigation but I think I have it narrowed down.
Some context first: in two programs (one client and one server, but I'll give examples from the client one), there are 2 tasks that do the following:

- one loops over `recv()`
- the other sends a `ping()` and awaits a pong every few seconds (without any sort of logic, it's incredibly naive); if the pong takes too long to arrive, the websocket connection is forcefully closed

The ping loop looks like this (with `self.protocol` being the `WebSocketClientProtocol` instance):

With this bit of code running on an embedded device with the flakiest 3G connectivity you can imagine, packets are lost all the time and the timeout is thus hit very often. And when `fail_connection()` is run, the `recv()` running in another task hits the now-infamous `assert` over in https://github.com/aaugustin/websockets/blob/5.0.1/websockets/protocol.py#L349.

Debug logging looks like this (with my slightly enhanced websockets logging code from https://github.com/RemiCardona/websockets/tree/try_to_debug_assert):
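The ping-loop snippet referenced above did not survive in this copy of the thread. As a rough sketch of a loop matching the description (the names, intervals, and the `FakeProtocol` stand-in are all illustrative, not the original code; the real `self.protocol` would be a `WebSocketClientProtocol`):

```python
import asyncio

PING_INTERVAL = 0.01  # illustrative values, not the original ones
PING_TIMEOUT = 0.05

async def ping_loop(protocol):
    """Ping every PING_INTERVAL seconds; if the pong doesn't arrive
    within PING_TIMEOUT, forcefully close the connection."""
    while True:
        await asyncio.sleep(PING_INTERVAL)
        pong_waiter = await protocol.ping()
        try:
            await asyncio.wait_for(pong_waiter, PING_TIMEOUT)
        except asyncio.TimeoutError:
            protocol.fail_connection()
            return

class FakeProtocol:
    """Minimal stand-in for WebSocketClientProtocol so the sketch runs."""
    def __init__(self):
        self.failed = False

    async def ping(self):
        # Pong waiter that never resolves: the pong is "lost" in transit.
        return asyncio.get_running_loop().create_future()

    def fail_connection(self):
        self.failed = True

proto = FakeProtocol()
asyncio.run(ping_loop(proto))
print(proto.failed)  # -> True
```

With the real protocol, that `fail_connection()` call is exactly what races against the `recv()` running in the other task.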
There are some thoughts and questions, in no particular order:

- The connection never seems to go through the `CLOSING` state (as per https://github.com/aaugustin/websockets/blob/5.0.1/websockets/protocol.py#L939); it takes 20 seconds to go directly from `OPEN` to `CLOSED`. I could reduce the timeout (caveat emptor: didn't try) but it looks as though that won't fix the race condition.
- The code and reason passed to `fail_connection()` are silently ignored, because the whole teardown path expects the code and reason to come from a close frame (which doesn't exist in this case), as can be seen here: https://github.com/aaugustin/websockets/blob/5.0.1/websockets/protocol.py#L653 (that's the only place where `close_code` and `close_reason` are set).
- Should `fail_connection(1006)` close the socket immediately instead of waiting for the 2 timeouts (closing and aborting the connection)? IOW, make error code 1006 even more of a special case than it already is?
- Should users avoid calling `recv()` while a `ping()` is ongoing (effectively admitting that websockets is not "thread"-safe)?

If anyone got this far down this wall of text, I thank you from the bottom of my heart :)
Cheers