Cancelling a `fail_after` results in TooSlowError, even though the timeout didn't expire

joernheissler commented 6 years ago

I've got some code that uses fail_after and cancels itself, example below. My assumption was that the context manager just exits without any error. Yet it raises a Cancelled and TooSlowError. When I don't use fail_after but open_cancel_scope, things work as expected. I can't find anything in the docs that explain the difference. And I can't think of any good reason either. TooSlowError is especially strange because the timeout wasn't reached.

Version: ce131cd519a4ac3cb936ffb9e34e0a4f7e70bc24 on cpython3.6.6

#!/usr/bin/env python

import trio

async def main_a():
    with trio.CancelScope() as cancel_scope:
        await trio.sleep(1)
        cancel_scope.cancel()
        await trio.sleep(1)

    print('Hello!')

async def main_b():
    with trio.fail_after(10) as cancel_scope:
        await trio.sleep(1)
        cancel_scope.cancel()
        await trio.sleep(1)

    print('Hello!')

trio.run(main_a)
print("---")
trio.run(main_b)

Result is:

Hello!
---
Traceback (most recent call last):
  File "/home/joern/.venv/trio/lib/python3.6/site-packages/trio/_timeouts.py", line 116, in fail_at
    yield scope
  File "./bug", line 19, in main_b
    await trio.sleep(1)
  File "/home/joern/.venv/trio/lib/python3.6/site-packages/trio/_timeouts.py", line 85, in sleep
    await sleep_until(_core.current_time() + seconds)
  File "/home/joern/.venv/trio/lib/python3.6/site-packages/trio/_timeouts.py", line 66, in sleep_until
    await sleep_forever()
  File "/home/joern/.venv/trio/lib/python3.6/site-packages/trio/_timeouts.py", line 51, in sleep_forever
    await _core.wait_task_rescheduled(lambda _: _core.Abort.SUCCEEDED)
  File "/home/joern/.venv/trio/lib/python3.6/site-packages/trio/_core/_traps.py", line 165, in wait_task_rescheduled
    return (await _async_yield(WaitTaskRescheduled(abort_func))).unwrap()
  File "/home/joern/.venv/trio/lib/python3.6/site-packages/outcome/_sync.py", line 107, in unwrap
    raise self.error
  File "/home/joern/.venv/trio/lib/python3.6/site-packages/outcome/_async.py", line 19, in capture
    return Value(sync_fn(*args, **kwargs))
  File "/home/joern/.venv/trio/lib/python3.6/site-packages/trio/_core/_run.py", line 629, in raise_cancel
    raise exc
trio.Cancelled

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./bug", line 26, in <module>
    trio.run(main_b)
  File "/home/joern/.venv/trio/lib/python3.6/site-packages/trio/_core/_run.py", line 1334, in run
    raise runner.main_task_outcome.error
  File "./bug", line 19, in main_b
    await trio.sleep(1)
  File "/usr/lib/python3.6/contextlib.py", line 99, in __exit__
    self.gen.throw(type, value, traceback)
  File "/home/joern/.venv/trio/lib/python3.6/site-packages/trio/_timeouts.py", line 118, in fail_at
    raise TooSlowError
trio.TooSlowError

smurfix commented 6 years ago

On 28.09.2018 18:28, Jörn Heissler wrote:

|TooSlowError| is especially strange because the timeout wasn't reached.

The problem is that fail_after cannot discover that it was cancelled manually by yourself instead of the timeout.

Fix: create a secondary scope inside the fail_after, and cancel that.

-- -- Matthias Urlichs

joernheissler commented 6 years ago

Thanks! But is that a fix or is it a workaround? :-)

belm0 commented 6 years ago

Likely no one has ever thought to manually cancel a fail_after(). Would you describe your use case?

Behavior seems completely in line with docs, though I understand not what you're wanting:

Creates a cancel scope with the given timeout, and raises an error if it is actually cancelled.

This function and move_on_after() are similar in that both create a cancel scope with a given timeout, and if the timeout expires then both will cause Cancelled to be raised within the scope. The difference is that when the Cancelled exception reaches move_on_after(), it’s caught and discarded. When it reaches fail_after(), then it’s caught and TooSlowError is raised in its place.

Raises TooSlowError – if a Cancelled exception is raised in this scope and caught by the context manager

There are two things that can be cancelled: 1) the cancel scope, 2) the notion that an exception should be raised. The cancel() method can only be bound to one, and since it's a method on the CancelScope object it understandably is the former.

Possibly need a ~cancel_fail_after() method~ new property (see below).

belm0 commented 6 years ago

Once we head down this road, I think it should be possible to dynamically transition between plain cancel scope, move_on_after, and fail_after semantics while the cancel scope is in flight. It's already possible to do that with the first two via deadline. There should be an additional raise_on_cancelled read/write property which is a bool, or perhaps an exception object.

joernheissler commented 6 years ago

Likely no one has ever thought to manually cancel a fail_after(). Would you describe your use case?

In a nutshell: There's a long running task (might run infinitely). When a certain event happens, I want to finish it without error, so I'd cancel it. If the event does not occur within N seconds, I want the task to abort with an error.

That task receives messages from a stream and runs a callback for each message. The callback might then cancel the task.

My code and design are flawed and I'm going to rewrite it without using callbacks. Then the above use case won't exist anymore. I once learned that I need to use state machines and callbacks, unlearning this is hard.

Creates a cancel scope with the given timeout, and raises an error if it is actually cancelled.

I must have missed this. So it looks like the behaviour does match the docs.

Still I find it strange that cancelling one CancelScope causes it to return silently while cancelling another CancelScope causes it to fail with TooSlowError while in fact nothing was too slow.

njsmith commented 6 years ago

Huh, interesting case.

So in general, trio doesn't track the reason why a scope was cancelled – a timeout is exactly the same as calling cancel. This is because of you do try to track the difference then there are all kinds of ambiguous cases, e.g. what do you do if it timed out and then someone called cancel as well, or even vice-versa, what if the timeout expires while it's in the process of handling the manual cancel.

And fail_after is an extremely simple wrapper around a cancel scope: https://github.com/python-trio/trio/blob/master/trio/_timeouts.py

But yeah the end result is kind of surprising here!

Some options:

Leave the code as is, but document it better
Remove fail_* entirely; when people ask about them treat it as an opportunity to teach about the wonders of .cancelled_caught
Start tracking cancel scope cancellation reason, maybe just in a minimal way like "was the first call to cancel from a timeout or not"
Have fail_after check the current time at the moment when it decides whether to raise TooSlowError, and skip it if the timeout couldn't possibly have expired

njsmith commented 6 years ago

Hmm, above message was written this morning and then I forgot to hit post, so the discussion has overtaken it a bit.

I have trouble telling how useful fail_after really is, once you get used to doing things "the trio way". In other systems that don't have nestable cancel scopes, like asyncio, fail_after is kind of the only way to do things, so I partly just threw it in to make new migrants feel more at home. But that's probably not the only use case? It's hard to tell. We're all unlearning the old ways and trying to invent the new ways together :-).

If you're rewriting your code anyway, then maybe we should leave this be for the moment, and just keep this thread in mind in case it comes up again in the future?

smurfix commented 5 years ago

My use case (close enough): I'm connecting to a host and my connection-setup handshake is protected by a fail_after because, well, if that takes too long the host isn't useable. You can imagine what happens when the whole subtask is cancelled for completely unrelated reasons. I don't see an obviously-correct way to handle that case.

So, yes, I'm all for some way to fix this issue, preferably one that doesn't require two additional syscalls to gettimeinfo or similar.

oremanj commented 5 years ago

AFAICT, this issue only applies to cancelling the fail_after scope itself. Cancelling a scope that encloses the fail_after should work fine. The Happy Eyeballs example you mention seems to me like it would be of the latter species.

Are you seeing that cancelling a scope that encloses the fail_after results in a TooSlowError? That would definitely be a bug, and code to reproduce it would be appreciated.

smurfix commented 5 years ago

Yes, that's what I'm seeing. I don't have a test case (a trivial attempt to reproduce this failed) but in my code the fail_after's scope is not even saved.

oremanj commented 5 years ago

Are you using a released Trio, or master? There were some recent cancellation changes that could theoretically have bugs... I'm not able to reproduce the outer-cancellation-causes-inner-TooSlowError on master with a simple case either, though.

python-trio / trio

Cancelling a `fail_after` results in TooSlowError, even though the timeout didn't expire #698