Strange crash: KeyError while trying to 'del runner.deadlines[...]' on a cancel scope with -inf deadline

Zwork101 commented 5 years ago

What's the issue Trio, after a long period of time, will stop working and raise an internal error.

Steps to reproduce

My code can be found here, you'll need a discord bot token to run it which can be created with a discord account here
Set a the environment variable 'TOKEN' to it the bot's token, and run the examples/test.py file.
Wait a multitude of hours, I'd say this took around 16-18 hours.
Traceback is raised

Traceback (It's a long one)

GOT: {'t': 'PRESENCE_UPDATE', 's': 25582, 'op': 0, 'd': {'us'....}
Sent: Opcodes.Heartbeat 25582
(Above result is from my program)
Traceback (most recent call last):
  File "C:\Users\User\PycharmProjects\pycord\venv\lib\site-packages\trio\_core\_run.py", line 1323, in run
    run_impl(runner, async_fn, args)
  File "C:\Users\User\PycharmProjects\pycord\venv\lib\site-packages\trio\_core\_run.py", line 1471, in run_impl
    runner.task_exited(task, final_outcome)
  File "C:\Users\User\PycharmProjects\pycord\venv\lib\site-packages\trio\_core\_run.py", line 943, in task_exited
    task._cancel_stack[-1]._remove_task(task)
  File "C:\Users\User\PycharmProjects\pycord\venv\lib\site-packages\trio\_core\_run.py", line 202, in _remove_task
    self._tasks.remove(task)
  File "C:\Users\User\AppData\Local\Programs\Python\Python37-32\lib\contextlib.py", line 119, in __exit__
    next(self.gen)
  File "C:\Users\User\PycharmProjects\pycord\venv\lib\site-packages\trio\_core\_run.py", line 156, in _might_change_effective_deadline
    del runner.deadlines[old, id(self)]
  File "C:\Users\User\PycharmProjects\pycord\venv\lib\site-packages\sortedcontainers\sorteddict.py", line 259, in __delitem__
    self._dict_delitem(key)
KeyError: (-inf, 60341456)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:/Users/User/PycharmProjects/pycord/pycord/examples/test.py", line 7, in <module>
    client.run()
  File "C:\Users\User\PycharmProjects\pycord\pycord\pycord\client\client.py", line 70, in run
    self.gateway.start()
  File "C:\Users\User\PycharmProjects\pycord\pycord\pycord\gateway\gate.py", line 351, in start
    trio.run(self._start)
  File "C:\Users\User\PycharmProjects\pycord\venv\lib\site-packages\trio\_core\_run.py", line 1329, in run
    ) from exc
trio.TrioInternalError: internal error in trio - please file a bug!
Exception ignored in: <function Nursery.__del__ at 0x0306DC48>
Traceback (most recent call last):
  File "C:\Users\User\PycharmProjects\pycord\venv\lib\site-packages\trio\_core\_run.py", line 529, in __del__
AssertionError: 
Exception ignored in: <coroutine object TrioGateway.run at 0x033E96F0>
RuntimeError: coroutine ignored GeneratorExit
Exception ignored in: <function Nursery.__del__ at 0x0306DC48>
Traceback (most recent call last):
  File "C:\Users\User\PycharmProjects\pycord\venv\lib\site-packages\trio\_core\_run.py", line 529, in __del__
AssertionError: 
Exception ignored in: <coroutine object WebSocketConnection._reader_task at 0x039532B8>
Traceback (most recent call last):
  File "C:\Users\User\PycharmProjects\pycord\venv\lib\site-packages\trio_websocket\_impl.py", line 928, in _reader_task
  File "C:\Users\User\PycharmProjects\pycord\venv\lib\site-packages\trio\_ssl.py", line 646, in receive_some
  File "C:\Users\User\PycharmProjects\pycord\venv\lib\site-packages\trio\_ssl.py", line 560, in _retry
  File "C:\Users\User\PycharmProjects\pycord\venv\lib\site-packages\trio\_sync.py", line 98, in __aexit__
  File "C:\Users\User\PycharmProjects\pycord\venv\lib\site-packages\trio\_core\_ki.py", line 165, in wrapper
  File "C:\Users\User\PycharmProjects\pycord\venv\lib\site-packages\trio\_sync.py", line 585, in release
  File "C:\Users\User\PycharmProjects\pycord\venv\lib\site-packages\trio\_core\_run.py", line 1534, in current_task
RuntimeError: must be called from async context
Exception ignored in: <function Nursery.__del__ at 0x0306DC48>
Traceback (most recent call last):
  File "C:\Users\User\PycharmProjects\pycord\venv\lib\site-packages\trio\_core\_run.py", line 529, in __del__
AssertionError: 
Exception ignored in: <function AsyncGenerator.__del__ at 0x00EFDF18>
Traceback (most recent call last):
  File "C:\Users\User\PycharmProjects\pycord\venv\lib\site-packages\async_generator\_impl.py", line 415, in __del__
  File "C:\Users\User\PycharmProjects\pycord\venv\lib\site-packages\async_generator\_impl.py", line 363, in step
ValueError: async generator already executing

Expected Result The program should continue running, displaying the output such as the first 2 lines provided in the traceback.

njsmith commented 5 years ago

😕 yeah that's not supposed to happen. Let's see if we can track it down.

Fortunately the traceback contains a lot of information. The relevant part is this:

  File "C:\Users\User\PycharmProjects\pycord\venv\lib\site-packages\trio\_core\_run.py", line 1471, in run_impl
    runner.task_exited(task, final_outcome)
  File "C:\Users\User\PycharmProjects\pycord\venv\lib\site-packages\trio\_core\_run.py", line 943, in task_exited
    task._cancel_stack[-1]._remove_task(task)
  File "C:\Users\User\PycharmProjects\pycord\venv\lib\site-packages\trio\_core\_run.py", line 202, in _remove_task
    self._tasks.remove(task)
  File "C:\Users\User\AppData\Local\Programs\Python\Python37-32\lib\contextlib.py", line 119, in __exit__
    next(self.gen)
  File "C:\Users\User\PycharmProjects\pycord\venv\lib\site-packages\trio\_core\_run.py", line 156, in _might_change_effective_deadline
    del runner.deadlines[old, id(self)]
  File "C:\Users\User\PycharmProjects\pycord\venv\lib\site-packages\sortedcontainers\sorteddict.py", line 259, in __delitem__
    self._dict_delitem(key)
KeyError: (-inf, 60341456)

I think what it's saying is:

A task exited
The task had a cancel scope active at the time when it exited. (Normally, this means it was inherited from the parent task, but it could also happen if somehow some code called __enter__ and never called __exit__)
Trio updated its internal state to record that this task should not be associated with this cancel scope any more
This caused the cancel scope's effective deadline to change. This is weird, because AFAICT the only way _remove_task could change the effective deadline is if this was the last task associated with this cancel scope. But like we said above, the only way we should have been able to get here is if this task inherited the scope from its parent, and the parent should still be running, so this shouldn't be the last task in the scope.
Since the effective deadline changed, Trio tries to update its global deadline table by removing the old deadline, before it inserts the new one.
But, the old deadline wasn't in the table, so this crashes with KeyError. This is also weird, because the invariant is supposed to be that any cancel scope that has at least one task, isn't cancelled, and whose deadline is not +inf, should have an entry in the table.
And finally, from the exception message, we learn that the cancel scope's old deadline was -inf, which is odd and seems like it should be a clue, but I'm not immediately sure what to make of it. The built-in checkpoint function creates cancel scopes with a deadline of -inf, but those should exit immediately; I don't see how one could be attached to an exiting task. The current_effective_deadline function uses -inf to indicate that some enclosing cancel scope has already been cancelled, but that doesn't change the actual deadline of any specific cancel scope, unless you do something like with move_on_at(current_effective_deadline()).

So, the first questions that come to mind are:

Are you ever, to your knowledge, creating cancel scopes with a deadline of -inf?
Are you using cancel scopes or nurseries inside a generator or async generator? That can cause very weird things to happen, since yielding out of a with block is one way to accidentally break the normal with block rules.

oremanj commented 5 years ago

I downloaded your code and trio_websocket and looked for red flags. The only suspicious thing I found was that your code appears to create a websocket connection in one call to trio.run() and try to close it in another call:

    async def run(self):
        async with trio_websocket.open_websocket_url(self.gateway_url()) as conn:
            self._conn = conn
            ...

    async def _start(self):
        async with trio.open_nursery() as nursery:
            nursery.start_soon(self.run)
            nursery.start_soon(self.heartbeat)
            print("Started Nursery")

    def start(self):
        trio.run(self._start)

    async def _close(self):
        if self._conn is None:
            raise GatewayError("You tried to close the gateway connection before it was established.")
        await self._conn.aclose(1001)
        self._closed = True

    def close(self):
        trio.run(self._close)

In general, different calls to trio.run() are in different universes and it won't work to share state between them. I wasn't able to reconstruct a story for how you could get this specific error by doing that, but it's a thing to look out for.

Also from your code it looked like maybe it was possible to call close() from inside the start() loop? i.e., a run() inside a run(). That won't work either, but trio.run() is supposed to have a check for that which provides a useful error. Unless there are threads (real OS threads, not trio tasks) somewhere that I'm missing -- each thread can have its own run(), and trying to use the same websocket connection from multiple threads simultaneously could explain the behavior you're seeing.

Thanks for the bug report, and sorry not to be able to provide a more definitive answer!

Zwork101 commented 5 years ago

Thanks! I might be doing a lot of things wrong with trio I don't know about. To answer some questions:

I don't think I ever cancle any calls at all, and definitely not with -inf.
The only nursery created is in an async function, however that's about it.
I'm not using any threads at all.

I think I ran it using PyCharms's debugger. Could that have caused issues?

Fuyukai commented 5 years ago

You've pinned a very old version of trio_websocket in your requirements. Update it and see if the issue stays?

Zwork101 commented 5 years ago

Will do, didn't realize it was old.

njsmith commented 5 years ago

Oh, nice catch. But I don't see any mention of inf or yield or current_effective_deadline in the old trio-websocket 0.2.0 either, so still no smoking gun...

oremanj commented 5 years ago

Closing since I don't see how we could realistically make progress on this with the information we have, and the multiple calls to trio.run() create lots of opportunity for some unexpected and unsupported interaction. Feel free to reopen if it reoccurs.

njsmith commented 5 years ago

(To be clear in case anyone stumbles across this from a search engine years from now: Doing multiple calls to trio.run is fine and totally supported. What's not supported is creating Trio objects like network connections, events, locks, etc. inside one call to trio.run, and then re-using the same objects in a different call to trio.run.)

python-trio / trio

Strange crash: KeyError while trying to 'del runner.deadlines[...]' on a cancel scope with -inf deadline #837