evgeny-osipenko opened this issue 1 year ago
This feels very similar to gh-102847, which was just closed. Since that has much more discussion we should probably keep that one open instead?
I think this issue shares some context with gh-102847, but targets a more specific use case: safe synchronous cancellation. Could we reopen it? Then it may be better to collect the discussion there. I'd love to see some improvements on the "canceller" side APIs and ergonomics, as the OP says.
I don't think that adding a new API is going to help much. Realistic use cases of cancellation (e.g. as used in task groups and the timeout context manager) wouldn't be able to use it. Most use cases are better off using something like timeout anyway.
You may have made a case for updating the example, but I resent the suggestion that the author of the example "probably made a mistake". As you mention, in the example there is no way `main()` itself can be cancelled. But if you submit a PR that updates the example and explains (in not too many words) why the `cancelling()` check is needed, it will probably be accepted. Alternatively, you could update the docs for `cancelling()` to include the improved example, and just plant a link there in the text accompanying the basic `cancel()` example. If you go that way, the `cancel_me()` function should probably be reduced to just `await sleep(3600)`, since the rest of the code is just there to demonstrate the execution flow through a coroutine that catches `CancelledError`. (That's also true for `main()` BTW -- so it's probably better to update the `cancelling()` docs than the `cancel()` docs.)
Let me describe the context where I encountered this question.
I am building a system composed of multiple services that run in parallel in separate Python instances. To provide synchronization between the services where necessary, I use a shared database with a "locks" table, where the lock's name field, constructed from a resource's global id, is covered with a unique index. When a service takes a lock, it tries to create a row in this table, and to release the lock, the row is deleted.
In addition, each lock is equipped with a watchdog timer, so that if a service bugs out, crashes or "forgets" to unlock the resource in any other way, the lock would time out by itself. When a service genuinely needs to hold the lock for a long time, however, it can periodically reset the watchdog, thus preventing it from timing out for as long as the service is actually functional.
And so, what I wanted to have is a context manager that takes care of both creating/removing the row and resetting the watchdog timer, by spawning an asyncio task for the duration of the "managed" code. An important caveat is that the "resetter" task must be stopped and joined before the row is removed: if the "resetter" discovers that the row has been removed and there is nothing to update anymore, it treats that as the lock having timed out, and since the resource may already be in the middle of another process, it's no longer safe to continue, so it raises an exception and interrupts the parent task.
In the end, though, I implemented this behavior by constructing a TaskGroup from within the lock manager, calling its `__aenter__` and `__aexit__` manually, and putting the "resetter" task under its control:
```python
class LockConflict(RuntimeError):
    pass


class LockLost(RuntimeError):
    pass


class Lock:
    def __init__(self, name):
        self.name = name
        self.nonce = random.randbytes(8)
        self.task_group = None
        self.resetter_task = None

    async def take(self):
        created_successfully = await db.transaction(
            lambda c: db_try_create_lock(c, self.nonce, self.name),
        )
        if not created_successfully:
            raise LockConflict(
                f"Lock {self.name} is already taken. Are you trying to run "
                f"multiple instances of a non-multiplicable service at once?"
            )

    async def touch(self):
        updated_successfully = await db.transaction(
            lambda c: db_update_lock_watchdog(c, self.nonce, self.name),
        )
        if not updated_successfully:
            raise LockLost(f"Lock {self.name} timed out")

    async def release(self):
        await db.transaction(
            lambda c: db_release_lock(c, self.nonce, self.name),
        )

    async def async_touch_loop(self):
        while True:
            await asyncio.sleep(settings.service_watchdog_timer_seconds // 2)
            await self.touch()

    async def __aenter__(self):
        if self.task_group is not None:  # was `self.task`, an attribute that doesn't exist
            raise RuntimeError('Lock context manager entered twice')
        try:
            await self.take()
            self.task_group = await asyncio.TaskGroup().__aenter__()
            self.resetter_task = self.task_group.create_task(self.async_touch_loop())
            return self
        except:
            # handle KeyboardInterrupt during initialization
            await self.release()
            raise

    async def __aexit__(self, *exc_info):
        try:
            self.resetter_task.cancel()
            await self.task_group.__aexit__(None, None, None)
        finally:
            await self.release()


# example usage:
async with Lock(f"record:{my_record.id}"):
    await process_record(my_record)
```
So, in the end, it might indeed be better to just add a note to the `cancel` example saying stuff like "check `cancelling` before swallowing `CancelledError`" and "actually, just use a `TaskGroup`, it already has all that covered for you".
The same suggestion from another person: https://discuss.python.org/t/asyncio-cancel-a-cancellation-utility-as-a-coroutine-this-time-with-feeling/26304
My talk at PyCon US 2022 (https://youtu.be/XW7yv6HuWTE) brought this up as an issue I see a lot at Meta: people screwing up cancelling tasks. I even wrote my own utility into `later`: https://github.com/facebookincubator/later/blob/main/later/task.py#L59

Though looking at my implementation, you can't cancel it if the task it's trying to cancel won't finish :P So yeah, this is a work in progress. It would be nice to have this in the stdlib. And I can't just tell developers to use task groups, especially if they are interfacing with 3rd party code.
I suspect this will need a PEP, since so many people have broken their heads over this (even @fried admits his solution isn't perfect. :-) See https://discuss.python.org/t/asyncio-cancel-a-cancellation-utility-as-a-coroutine-this-time-with-feeling/26304/3 (I'm not sure where we should discuss this -- here on GitHub, or there on Discourse.)
Speaking from my experience (yes, it's limited), this is an issue that should be solved while dealing with the whole task life-cycle, i.e. on a level above asyncio, not inside asyncio.

Very closely related to cancellation are other problems I've encountered in my work: task monitoring (essential tasks may not exit, or else the whole app should shut down) and shutdown management itself (task cancelling order, cleanup actions, etc.). I have solved these "well enough" for my projects, i.e. far from being suitable for general use, but that work is the reason I'm now quite sure that a fully fledged tool for managing tasks and taking care of all the details, edge cases, etc. would amount to a complete asyncio add-on library, and this particular issue belongs there.
The root problem here is that `await task` could raise either that task's exception or the current task's own one. But you're only calling it to wait till the task is done -- you can check the result later by looking at `task.cancelled()`. In fact you probably don't even care if it ends that way (rather than e.g. catching `CancelledError` and returning).

So what you all really need is a way to just wait for the task to finish. Something like `await task.wait_done()`. Then your synchronous cancellation is two steps:

```python
task.cancel()
await task.wait_done()
```

`await task.wait_done()` would never raise regardless of how `task` finished, but (like any `await`) would raise `CancelledError` if the current task is cancelled. This is safe even if the task is already cancelled or already done. `await task.wait_done()` would also be a lot more generally useful than something tailored to cancellation.
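`wait_done()` is not an existing `Task` method, but the semantics described here can be sketched today with `asyncio.wait()`, which waits for tasks without consuming their results or exceptions (`wait_done` is a hypothetical name):

```python
import asyncio

async def wait_done(task):
    """Wait until *task* finishes, without raising its exception.

    asyncio.wait() never retrieves the awaited task's result or
    exception, so this raises CancelledError only if the *current*
    task is cancelled while waiting.  Safe if *task* is already done.
    """
    await asyncio.wait({task})

async def demo():
    async def worker():
        await asyncio.sleep(3600)

    task = asyncio.create_task(worker())
    task.cancel()
    await wait_done(task)   # does not raise, even though task was cancelled
    return task.cancelled()

print(asyncio.run(demo()))  # True
```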
For the record, I don't have any need for this. This is just some drive-by API design 😊
```python
async def cancel_and_wait(task, msg=None):
    task.cancel(msg)
    try:
        await task
    except asyncio.CancelledError:
        if asyncio.current_task().cancelling() == 0:
            raise
        else:
            return  # this is the only non-exceptional return
    else:
        raise RuntimeError("Cancelled task did not end with an exception")
```
Isn't the `if` condition the wrong way round? If `asyncio.current_task().cancelling() == 0` then that means the outer task is NOT being cancelled, so the cancellation is from the inner task (`await task`), so you would want to just return in that case.

[edit: there was a second half to this comment that had a stupid idea which I have now deleted]
Feature or enhancement

I propose adding a function to `asyncio` to cancel a task and then safely wait for the cancellation to complete.

The main difficulty with this is deciding whether to swallow or re-raise the `CancelledError` once we catch one. If we call the task that we're waiting on the "dependency", and the task doing the waiting the "controller", then there are two distinct possibilities:

1. The "dependency" was cancelled by the "controller": the `CancelledError` propagates all the way up the dependency's own stack, and then further up through the `await` in the "controller". In this case, the `CancelledError` should be swallowed, and the "controller" can continue its work normally.
2. The "controller" itself was cancelled while blocked on the `await`. In this case, the `CancelledError` is the signal to cancel the "controller" itself, and should be either re-raised further up, or its swallowing should be accompanied by a call to `uncancel`.

The documentation for `asyncio.Task.cancel`, in fact, does not make this decision correctly, and thus would be a bug: the example there swallows the `CancelledError` unconditionally. If I'm not missing anything, the correct procedure would check `asyncio.current_task().cancelling()` before swallowing.
Thus, I propose to make these changes:

1. Introduce a function on `asyncio` or `Task` that cancels a task and safely waits for the cancellation to finish. Having a specialized function would reduce the possibility of someone making this mistake in their code (like the author of the example probably did :) ), and allow the implementation to be changed or improved in the future. One such possible enhancement, for example, could be adding a `repeat` parameter to instruct the function, in case the task uncancels itself, to keep cancelling it again in a loop.
2. In the documentation for `asyncio`, add a warning about this kind of mistake, in the "Task Cancellation" section or in the description of `asyncio.Task.cancel`.
3. Change the code example for `asyncio.Task.cancel` to account for cancellation of `main`. I know that, in this specific snippet, it is impossible for `main` itself to be cancelled; but a developer unsuspecting of this issue may copy this example into a situation where the controller is indeed cancellable, and end up with a bug in their code.