python-trio / trio-websocket

WebSocket client and server implementation for Python Trio
MIT License
70 stars 26 forks source link

ConnectionClosed exception from serve_websocket() #96

Closed belm0 closed 5 years ago

belm0 commented 5 years ago

Hit a case where serve_websocket is leaking a ConnectionClosed exception. My handler doesn't appear to be involved in this trace. What am I missing?

  ...
  File "/.../site-packages/trio_websocket/_impl.py", line 327, in serve_websocket
    await server.run(task_status=task_status)
  File "/.../site-packages/trio_websocket/_impl.py", line 1088, in run
    await trio.sleep_forever()
  File "/.../site-packages/trio/_core/_run.py", line 397, in __aexit__
    raise combined_error_from_nursery
  File "/.../site-packages/trio/_highlevel_serve_listeners.py", line 129, in serve_listeners
    task_status.started(listeners)
  File "/.../site-packages/trio/_core/_run.py", line 397, in __aexit__
    raise combined_error_from_nursery
  File "/.../site-packages/trio/_highlevel_serve_listeners.py", line 27, in _run_handler
    await handler(stream)
  File "/.../site-packages/trio_websocket/_impl.py", line 1116, in _handle_connection
    await connection.aclose()
  File "/.../site-packages/trio/_core/_run.py", line 397, in __aexit__
    raise combined_error_from_nursery
  File "/.../site-packages/trio_websocket/_impl.py", line 921, in _reader_task
    await handler(event)
  File "/.../site-packages/trio_websocket/_impl.py", line 808, in _handle_connection_closed_event
    await self._write_pending()
  File "/.../site-packages/trio_websocket/_impl.py", line 959, in _write_pending
    raise ConnectionClosed(self._close_reason) from None
  trio_websocket._impl.ConnectionClosed: <CloseReason code=1006 name=ABNORMAL_CLOSURE reason=None>
belm0 commented 5 years ago

From the trace, it's clearly related to a newly incoming connection. There is probably server code using _write_pending but not handling ConnectionClosed. @mehaase would you take a look?

This is a trio-websocket reliability issue for my use case, happening enough to be a problem.

(Note that async stack traces can mislead by incorrectly attributing code location to the last line of an async block. E.g. the exception in _handle_connection could well be anywhere in the nursery scope, such as the nursery.start_soon(connection._reader_task) call, yet Python attributes it to the last line, 1116, await connection.aclose().)

mehaase commented 5 years ago

I'll look at it this week.

mehaase commented 5 years ago

I think I've managed to reproduce this. Here is an implementation of a faulty server:

from random import random

import trio
from trio_websocket import serve_websocket, ConnectionClosed

async def main():
    ''' Main entry point. '''
    print('Starting websocket server…')
    await serve_websocket(handler, 'localhost', 8000, ssl_context=None)

async def handler(request):
    ''' Reverse incoming websocket messages and send them back. '''
    try:
        ws = await request.accept()
        await ws.send_message('foo')
        await trio.sleep(random()/1000)
        await trio.aclose_forcefully(ws._stream)
    except ConnectionClosed:
        pass

if __name__ == '__main__':
    trio.run(main)

The client is this:

import trio
from trio_websocket import connect_websocket, ConnectionClosed

async def main():
    async with trio.open_nursery() as nursery:
        for n in range(1000):
            print('connection {}'.format(n))
            try:
                connection = await connect_websocket(nursery, 'localhost', 8000,
                    '/', use_ssl=False)
                await connection.get_message()
                await connection.aclose()
            except ConnectionClosed:
                print('connection closed')

if __name__ == '__main__':
    trio.run(main)

The client connects to the server repeatedly. Within a small number of iterations (often less than 10), the server crashes with the following exception.

$ python bad-server.py
Starting websocket server…
Traceback (most recent call last):
  File "bad-server.py", line 24, in <module>
    trio.run(main)
  File "/home/mhaase/code/trio-websocket/venv/lib/python3.6/site-packages/trio/_core/_run.py", line 1337, in run
    raise runner.main_task_outcome.error
  File "bad-server.py", line 10, in main
    await serve_websocket(handler, 'localhost', 8000, ssl_context=None)
  File "/home/mhaase/code/trio-websocket/trio_websocket/_impl.py", line 327, in serve_websocket
    await server.run(task_status=task_status)
  File "/home/mhaase/code/trio-websocket/trio_websocket/_impl.py", line 1091, in run
    await trio.sleep_forever()
  File "/home/mhaase/code/trio-websocket/venv/lib/python3.6/site-packages/trio/_core/_run.py", line 397, in __aexit__
    raise combined_error_from_nursery
  File "/home/mhaase/code/trio-websocket/venv/lib/python3.6/site-packages/trio/_highlevel_serve_listeners.py", line 129, in serve_listeners
    task_status.started(listeners)
  File "/home/mhaase/code/trio-websocket/venv/lib/python3.6/site-packages/trio/_core/_run.py", line 397, in __aexit__
    raise combined_error_from_nursery
  File "/home/mhaase/code/trio-websocket/venv/lib/python3.6/site-packages/trio/_highlevel_serve_listeners.py", line 27, in _run_handler
    await handler(stream)
  File "/home/mhaase/code/trio-websocket/trio_websocket/_impl.py", line 1119, in _handle_connection
    await connection.aclose()
  File "/home/mhaase/code/trio-websocket/venv/lib/python3.6/site-packages/trio/_core/_run.py", line 397, in __aexit__
    raise combined_error_from_nursery
  File "/home/mhaase/code/trio-websocket/trio_websocket/_impl.py", line 924, in _reader_task
    await handler(event)
  File "/home/mhaase/code/trio-websocket/trio_websocket/_impl.py", line 811, in _handle_connection_closed_event
    await self._write_pending()
  File "/home/mhaase/code/trio-websocket/trio_websocket/_impl.py", line 962, in _write_pending
    raise ConnectionClosed(self._close_reason) from None
trio_websocket._impl.ConnectionClosed: <CloseReason code=1006 name=ABNORMAL_CLOSURE reason=None>

The root cause is that the reader task is getting a connection closed event, and it wants to call _write_pending(), but before it can do that the stream is already broken. The reader task does not catch ConnectionClosed, which causes the exception to propagate out and cancel the entire nursery.

Keeping mind that I want to overhaul the close logic in #90 at a later date, I'll try to keep the fix for this bug as small as possible. The simplest fix is that the reader task should catch ConnectionClosed when it calls _write_pending() directly or when it calls it indirectly through a handler function. While chasing this down, I also noticed that WebSocketConnection.aclose() can also leak a ConnectionClosed exception, so I'll fix that, too.

With these changes, the scripts above run without error. I'll work on a test case tomorrow. The bug is triggered by a race (it only occurs if the server has a random delay added), so building a test case should be a matter of carefully sequencing events.

belm0 commented 5 years ago

Thank you! That does seem to match, and my app has a non-trivial amount of stuff on the trio scheduler which could cause the delay.