python-websockets / websockets

Library for building WebSocket servers and clients in Python
https://websockets.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

race condition between fail_connection() and recv() #465

Closed. RemiCardona closed this issue 5 years ago.

RemiCardona commented 6 years ago

This is an ongoing investigation but I think I have it narrowed down.

Some context first: in two programs (one client and one server, but I'll give examples from the client one), two tasks run concurrently: one calls recv() in a loop, and the other runs a ping loop.

The ping loop looks like this (with self.protocol being the WebSocketClientProtocol instance):

while True:
    yield from asyncio.sleep(ping_interval)
    pong = yield from self.protocol.ping()
    try:
        yield from asyncio.wait_for(pong, timeout=ping_timeout)
    except asyncio.TimeoutError:
        yield from self.protocol.fail_connection(1006, "ping timeout")
        yield from self.protocol.close()

This code runs on an embedded device with the flakiest 3G connectivity you can imagine: packets are lost all the time, so the timeout is hit very often. And when fail_connection() runs, recv(), running in another task, hits the now-infamous assert over in https://github.com/aaugustin/websockets/blob/5.0.1/websockets/protocol.py#L349.
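
For reference, here's a minimal, self-contained sketch of the setup described above, rewritten with await syntax (the URL, intervals, and function names are made up; the ping logic is the loop quoted above):

import asyncio
import websockets

async def recv_loop(protocol):
    # Runs in its own task; this is where the AssertionError surfaces when
    # fail_connection() changes the connection state under recv().
    while True:
        message = await protocol.recv()
        print("received:", message)

async def ping_loop(protocol, ping_interval=5, ping_timeout=3):
    # Condensed version of the loop quoted above.
    while True:
        await asyncio.sleep(ping_interval)
        pong = await protocol.ping()
        try:
            await asyncio.wait_for(pong, timeout=ping_timeout)
        except asyncio.TimeoutError:
            await protocol.fail_connection(1006, "ping timeout")
            await protocol.close()

async def main():
    protocol = await websockets.connect("wss://example.invalid/")
    await asyncio.gather(recv_loop(protocol), ping_loop(protocol))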

Debug logging looks like this (with my slightly enhanced websockets logging code from https://github.com/RemiCardona/websockets/tree/try_to_debug_assert):

Aug 28 14:40:24 : INFO bluelink.transport: Connecting to wss://URL/
Aug 28 14:40:26 : DEBUG websockets.protocol: [0x759dc8d0] client - state = CONNECTING
Aug 28 14:40:27 : DEBUG websockets.protocol: [0x759dc8d0] client - event = connection_made(<asyncio.sslproto._SSLProtocolTransport object at 0x7596fe90>)
Aug 28 14:40:28 : DEBUG websockets.protocol: [0x759dc8d0] client - state = OPEN
Aug 28 14:40:28 : INFO bluelink.transport: Connected to wss://URL/
Aug 28 14:40:28 : DEBUG websockets.protocol: [0x759dc8d0] client > Frame(fin=True, opcode=10, data=b'', rsv1=False, rsv2=False, rsv3=False)
Aug 28 14:40:33 : DEBUG websockets.protocol: [0x759dc8d0] client > Frame(fin=True, opcode=9, data=b'\x94\x8eO\xbe', rsv1=False, rsv2=False, rsv3=False)
Aug 28 14:40:34 : DEBUG websockets.protocol: [0x759dc8d0] client < Frame(fin=True, opcode=10, data=b'\x94\x8eO\xbe', rsv1=False, rsv2=False, rsv3=False)
Aug 28 14:40:34 : DEBUG websockets.protocol: [0x759dc8d0] client - received solicited pong: 948e4fbe
Aug 28 14:40:34 : DEBUG websockets.protocol: [0x759dc8d0] client > Frame(fin=True, opcode=10, data=b'', rsv1=False, rsv2=False, rsv3=False)
Aug 28 14:40:39 : DEBUG websockets.protocol: [0x759dc8d0] client > Frame(fin=True, opcode=9, data=b'\x0eYO\xb2', rsv1=False, rsv2=False, rsv3=False)
Aug 28 14:40:42 : DEBUG websockets.protocol: [0x759dc8d0] client ! failing WebSocket connection: 1006 ping timeout
Aug 28 14:40:42 : DEBUG websockets.protocol: [0x759dc8d0] client - cancelled pending ping: 0e594fb2
Aug 28 14:40:42 : INFO bluelink.transport: Wrong internal websocket state, closing connection exception=AssertionError('wrong state <State.OPEN: 1>',)
Aug 28 14:40:52 : DEBUG websockets.protocol: [0x759dc8d0] client ! timed out waiting for TCP close
Aug 28 14:40:52 : DEBUG websockets.protocol: [0x759dc8d0] client x closing TCP connection
Aug 28 14:41:02 : DEBUG websockets.protocol: [0x759dc8d0] client ! timed out waiting for TCP close
Aug 28 14:41:02 : DEBUG websockets.protocol: [0x759dc8d0] client x aborting TCP connection
Aug 28 14:41:02 : DEBUG websockets.protocol: [0x759dc8d0] client - event = connection_lost(None)
Aug 28 14:41:02 : DEBUG websockets.protocol: [0x759dc8d0] client - state = CLOSED
Aug 28 14:41:02 : DEBUG websockets.protocol: [0x759dc8d0] client x code = 1006, reason = [empty]

There are some thoughts and questions, in no particular order:

If anyone got this far down this wall of text, I thank you from the bottom of my heart :)

Cheers

levesquejf commented 6 years ago

Great analysis @RemiCardona.

I have a very similar issue where recv() raises the same AssertionError instead of CancelledError when the task is cancelled. It does not happen every time, and it is still present in v6.0. Let me know if there is a way I can help.

cjerdonek commented 6 years ago

Would it help to use a single flow of logic, with finally and except blocks ensuring that state stays correct in the face of various types of errors, both for resolving these issues and, in the future, for reasoning about them? IIRC, a while back I suggested this pattern instead of independent, parallel tasks, which give you fewer guarantees about ordering and completion.


aaugustin commented 6 years ago

A few thoughts -- I'll investigate further at not-11pm.


If the TCP connection is open (i.e. the kernel doesn't consider it timed out; kernel timeouts are large) but bytes don't come through, then you can expect that closing the connection will be slow and will only happen after a timeout. This is especially true on the client side because the client is supposed to wait for the server to close the TCP connection (search for TIME_WAIT in RFC 6455 if you're curious).

You can reduce the close timeout with the timeout argument if you'd like connections to be closed faster in these circumstances. At worst you will get a 1006 close code instead of something more proper; most likely you don't care. A very small timeout will result in connections always being closed forcibly.

If you want to get hacky, you can set self.timeout = 0.001 on protocol instances that you want to kill quickly. Or you can grab a reference to the socket and abort it; that will probably be messier.
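
For example, here is a sketch of both options, assuming websockets 5.x/6.x where the keyword argument is named timeout (later renamed close_timeout) and using a made-up URL:

import websockets

async def example():
    # timeout=2 bounds the closing handshake and the waits for TCP close,
    # so a dead connection is torn down within a few seconds.
    ws = await websockets.connect("wss://example.invalid/", timeout=2)
    try:
        ...
    finally:
        # Hacky variant: shrink the timeout on this one protocol instance
        # so close() gives up almost immediately.
        ws.timeout = 0.001
        await ws.close()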


I suspect websockets doesn't track the state sufficiently precisely during the closing handshake. It only tracks the OPEN / CLOSING / CLOSED states described by the RFC. But there's a state machine with more states hidden in the current implementation.


fail_connection is a private API. You should be using close. That said, fail_connection could be called in other circumstances and create a similar problem.

Your expectations about how fail_connection handles the close code and reason aren't correct.

When you call close() or fail_connection(), you define the close frame that is sent. Perhaps the other endpoint will echo the same code and reason but that isn't guaranteed.

Per the spec, the connection close code and reason are defined by the close frame that is received. If no close frame is received, the close code is 1006.
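
A small sketch to illustrate (ws is any connected protocol instance): the arguments to close() describe the frame you send, while close_code and close_reason record the frame you received.

async def close_and_report(ws):
    # The code/reason passed here describe the close frame *we* send.
    await ws.close(code=1001, reason="going away")
    # close_code / close_reason reflect the close frame that was received;
    # if the peer never sent one, close_code is 1006 and close_reason is empty.
    print(ws.close_code, ws.close_reason)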


It would be interesting to test if the built-in implementation of "ping every N seconds and disconnect if the pong is lost" is also vulnerable to this problem. I'm obviously luring you into running master in production because, unlike you, I don't have a large number of devices connected to a 3G network in developing countries.
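
For anyone tempted, a minimal sketch assuming a version of websockets that ships the built-in keepalive (the ping_interval and ping_timeout arguments) and a made-up URL:

import websockets

async def connect_with_keepalive():
    # The library sends a Ping every ping_interval seconds and fails the
    # connection itself if no Pong arrives within ping_timeout seconds,
    # replacing the hand-rolled ping loop shown earlier.
    return await websockets.connect(
        "wss://example.invalid/", ping_interval=5, ping_timeout=3
    )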


Hitting an AssertionError is bad. It's a bug.


You don't have to serialize ping and recv. websockets must be coroutine-safe. If it isn't, it's a bug.

aaugustin commented 5 years ago

Based on code inspection, here's what I think is happening.

Conclusion: don't call fail_connection() with the default code (1006) unless:

a. you changed the connection state to CLOSING, typically by writing a Close frame, or
b. you know that the connection is dead already, typically because you hit ConnectionError.

All calls to fail_connection in version 5.0.1 and in master seem to meet these criteria.


So I stand by this advice in my initial reaction: fail_connection is a private API. You should be using close.
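
A possible rewrite of the ping loop along those lines (a sketch, using await syntax; the 1011 close code is an arbitrary but legal choice, since 1006 must never be sent on the wire):

import asyncio

async def ping_loop(protocol, ping_interval=5, ping_timeout=3):
    while True:
        await asyncio.sleep(ping_interval)
        pong = await protocol.ping()
        try:
            await asyncio.wait_for(pong, timeout=ping_timeout)
        except asyncio.TimeoutError:
            # close() performs the full closing handshake, bounded by the
            # close timeout, and is safe to call concurrently with recv().
            await protocol.close(code=1011, reason="ping timeout")
            return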

aaugustin commented 5 years ago

@levesquejf I think you're hitting a different issue because, unlike @RemiCardona, you aren't calling fail_connection.

I'd like to look into this; however, I need more information about "the task is cancelled": which task are you talking about?

It would be best to open a separate issue if you're willing to provide more details.

aaugustin commented 5 years ago

@cjerdonek I'm still 80 / 20 in favor of the current design with a separate task to close the connection :-)

The problem discussed here is independent of that design: it only involves the "transfer_data" task and code run by the user. We don't reach the stage where the "close_connection" task kicks in.