python-websockets / websockets

Library for building WebSocket servers and clients in Python
https://websockets.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

race condition between fail_connection() and recv() #465

Closed. RemiCardona closed this issue 5 years ago.

RemiCardona commented 6 years ago

This is an ongoing investigation but I think I have it narrowed down.

Some context first: in two programs (one client and one server, but I'll give examples from the client one), two tasks run concurrently: one calls recv() in a loop, and the other runs a ping loop.

The ping loop looks like this (with self.protocol being the WebSocketClientProtocol instance):

while True:
    yield from asyncio.sleep(ping_interval)
    pong = yield from self.protocol.ping()
    try:
        yield from asyncio.wait_for(pong, timeout=ping_timeout)
    except asyncio.TimeoutError:
        yield from self.protocol.fail_connection(1006, "ping timeout")
        yield from self.protocol.close()

This code runs on an embedded device with the flakiest 3G connectivity you can imagine: packets are lost all the time, so the timeout is hit very often. And when fail_connection() runs, recv(), running in another task, hits the now-infamous assert over in https://github.com/aaugustin/websockets/blob/5.0.1/websockets/protocol.py#L349.
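
For reference, here's a minimal, self-contained sketch of the setup described above, rewritten with await syntax (the URL, intervals, and function names are made up; the ping logic is the loop quoted above):

import asyncio
import websockets

async def recv_loop(protocol):
    # Runs in its own task; this is where the AssertionError surfaces when
    # fail_connection() changes the connection state under recv().
    while True:
        message = await protocol.recv()
        print("received:", message)

async def ping_loop(protocol, ping_interval=5, ping_timeout=3):
    # Condensed version of the loop quoted above.
    while True:
        await asyncio.sleep(ping_interval)
        pong = await protocol.ping()
        try:
            await asyncio.wait_for(pong, timeout=ping_timeout)
        except asyncio.TimeoutError:
            await protocol.fail_connection(1006, "ping timeout")
            await protocol.close()

async def main():
    protocol = await websockets.connect("wss://example.invalid/")
    await asyncio.gather(recv_loop(protocol), ping_loop(protocol))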

Debug logging looks like this (with my slightly enhanced websockets logging code from https://github.com/RemiCardona/websockets/tree/try_to_debug_assert):

Aug 28 14:40:24 : INFO bluelink.transport: Connecting to wss://URL/
Aug 28 14:40:26 : DEBUG websockets.protocol: [0x759dc8d0] client - state = CONNECTING
Aug 28 14:40:27 : DEBUG websockets.protocol: [0x759dc8d0] client - event = connection_made(<asyncio.sslproto._SSLProtocolTransport object at 0x7596fe90>)
Aug 28 14:40:28 : DEBUG websockets.protocol: [0x759dc8d0] client - state = OPEN
Aug 28 14:40:28 : INFO bluelink.transport: Connected to wss://URL/
Aug 28 14:40:28 : DEBUG websockets.protocol: [0x759dc8d0] client > Frame(fin=True, opcode=10, data=b'', rsv1=False, rsv2=False, rsv3=False)
Aug 28 14:40:33 : DEBUG websockets.protocol: [0x759dc8d0] client > Frame(fin=True, opcode=9, data=b'\x94\x8eO\xbe', rsv1=False, rsv2=False, rsv3=False)
Aug 28 14:40:34 : DEBUG websockets.protocol: [0x759dc8d0] client < Frame(fin=True, opcode=10, data=b'\x94\x8eO\xbe', rsv1=False, rsv2=False, rsv3=False)
Aug 28 14:40:34 : DEBUG websockets.protocol: [0x759dc8d0] client - received solicited pong: 948e4fbe
Aug 28 14:40:34 : DEBUG websockets.protocol: [0x759dc8d0] client > Frame(fin=True, opcode=10, data=b'', rsv1=False, rsv2=False, rsv3=False)
Aug 28 14:40:39 : DEBUG websockets.protocol: [0x759dc8d0] client > Frame(fin=True, opcode=9, data=b'\x0eYO\xb2', rsv1=False, rsv2=False, rsv3=False)
Aug 28 14:40:42 : DEBUG websockets.protocol: [0x759dc8d0] client ! failing WebSocket connection: 1006 ping timeout
Aug 28 14:40:42 : DEBUG websockets.protocol: [0x759dc8d0] client - cancelled pending ping: 0e594fb2
Aug 28 14:40:42 : INFO bluelink.transport: Wrong internal websocket state, closing connection exception=AssertionError('wrong state <State.OPEN: 1>',)
Aug 28 14:40:52 : DEBUG websockets.protocol: [0x759dc8d0] client ! timed out waiting for TCP close
Aug 28 14:40:52 : DEBUG websockets.protocol: [0x759dc8d0] client x closing TCP connection
Aug 28 14:41:02 : DEBUG websockets.protocol: [0x759dc8d0] client ! timed out waiting for TCP close
Aug 28 14:41:02 : DEBUG websockets.protocol: [0x759dc8d0] client x aborting TCP connection
Aug 28 14:41:02 : DEBUG websockets.protocol: [0x759dc8d0] client - event = connection_lost(None)
Aug 28 14:41:02 : DEBUG websockets.protocol: [0x759dc8d0] client - state = CLOSED
Aug 28 14:41:02 : DEBUG websockets.protocol: [0x759dc8d0] client x code = 1006, reason = [empty]

There are some thoughts and questions, in no particular order:

If anyone got this far down this wall of text, I thank you from the bottom of my heart :)

Cheers

levesquejf commented 6 years ago

Great analysis @RemiCardona.

I have a very similar issue where recv() raises the same AssertionError instead of CancelledError when the task is cancelled. It does not happen every time, and it is still present in v6.0. Let me know if there is a way I can help.

cjerdonek commented 6 years ago

Would it help to use a single flow of logic, with finally and except blocks ensuring that state stays correct in the face of various types of errors, both for resolving these issues and, in the future, for reasoning about them? IIRC, a while back I suggested this pattern instead of independent, parallel tasks, which give you fewer guarantees about ordering and completion.


aaugustin commented 6 years ago

A few thoughts -- I'll investigate further at not-11pm.


If the TCP connection is open (i.e. the kernel doesn't consider it timed out; kernel timeouts are large) but bytes don't come through, then you can expect that closing the connection will be slow and will only happen after a timeout. This is especially true on the client side because the client is supposed to wait for the server to close the TCP connection (search for TIME_WAIT in RFC 6455 if you're curious).

You can reduce the close timeout with the timeout argument if you'd like connections to be closed faster in these circumstances. At worst you will get a 1006 close code instead of something more proper; most likely you don't care. A very small timeout will result in connections always being closed forcibly.

If you want to get hacky, you can set self.timeout = 0.001 on protocol instances that you want to kill quickly. Or you can grab a reference to the socket and abort it; that will probably be messier.
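
For example, here is a sketch of both options, assuming websockets 5.x/6.x where the keyword argument is named timeout (later renamed close_timeout) and using a made-up URL:

import websockets

async def example():
    # timeout=2 bounds the closing handshake and the waits for TCP close,
    # so a dead connection is torn down within a few seconds.
    ws = await websockets.connect("wss://example.invalid/", timeout=2)
    try:
        ...
    finally:
        # Hacky variant: shrink the timeout on this one protocol instance
        # so close() gives up almost immediately.
        ws.timeout = 0.001
        await ws.close()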


I suspect websockets doesn't track the state sufficiently precisely during the closing handshake. It only tracks the OPEN / CLOSING / CLOSED states described by the RFC. But there's a state machine with more states hidden in the current implementation.


fail_connection is a private API. You should be using close. That said, fail_connection could be called in other circumstances and create a similar problem.

Your expectations about how fail_connection handles the close code and reason aren't correct.

When you call close() or fail_connection(), you define the close frame that is sent. Perhaps the other endpoint will echo the same code and reason but that isn't guaranteed.

Per the spec, the connection close code and reason are defined by the close frame that is received. If no close frame is received, the close code is 1006.
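
A small sketch to illustrate (ws is any connected protocol instance): the arguments to close() describe the frame you send, while close_code and close_reason record the frame you received.

async def close_and_report(ws):
    # The code/reason passed here describe the close frame *we* send.
    await ws.close(code=1001, reason="going away")
    # close_code / close_reason reflect the close frame that was received;
    # if the peer never sent one, close_code is 1006 and close_reason is empty.
    print(ws.close_code, ws.close_reason)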


It would be interesting to test if the built-in implementation of "ping every N seconds and disconnect if the pong is lost" is also vulnerable to this problem. I'm obviously luring you into running master in production because, unlike you, I don't have a large number of devices connected to a 3G network in developing countries.
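
For anyone tempted, a minimal sketch assuming a version of websockets that ships the built-in keepalive (the ping_interval and ping_timeout arguments) and a made-up URL:

import websockets

async def connect_with_keepalive():
    # The library sends a Ping every ping_interval seconds and fails the
    # connection itself if no Pong arrives within ping_timeout seconds,
    # replacing the hand-rolled ping loop shown earlier.
    return await websockets.connect(
        "wss://example.invalid/", ping_interval=5, ping_timeout=3
    )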


Hitting an AssertionError is bad. It's a bug.


You don't have to serialize ping and recv. websockets must be coroutine-safe. If it isn't, it's a bug.

aaugustin commented 5 years ago

Based on code inspection, here's what I think is happening.

Conclusion: don't call fail_connection() with the default code (1006) unless:

a. you changed the connection state to CLOSING, typically by writing a Close frame, or
b. you know that the connection is dead already, typically because you hit ConnectionError.

All calls to fail_connection in version 5.0.1 and in master seem to meet these criteria.


So I stand by this advice in my initial reaction: fail_connection is a private API. You should be using close.
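
A possible rewrite of the ping loop along those lines (a sketch, using await syntax; the 1011 close code is an arbitrary but legal choice, since 1006 must never be sent on the wire):

import asyncio

async def ping_loop(protocol, ping_interval=5, ping_timeout=3):
    while True:
        await asyncio.sleep(ping_interval)
        pong = await protocol.ping()
        try:
            await asyncio.wait_for(pong, timeout=ping_timeout)
        except asyncio.TimeoutError:
            # close() performs the full closing handshake, bounded by the
            # close timeout, and is safe to call concurrently with recv().
            await protocol.close(code=1011, reason="ping timeout")
            return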

aaugustin commented 5 years ago

@levesquejf I think you're hitting a different issue because, unlike @RemiCardona, you aren't calling fail_connection.

I'd like to look into this; however, I need more information about "the task is cancelled": which task are you talking about?

It would be best to open a separate issue if you're willing to provide more details.

aaugustin commented 5 years ago

@cjerdonek I'm still 80 / 20 in favor of the current design with a separate task to close the connection :-)

The problem discussed here is independent of that design: it only involves the "transfer_data" task and code run by the user. We don't reach the stage where the "close_connection" task kicks in.