Zac-HD opened 2 months ago
I'm not sure how I feel about adding a keyword argument to `trio.Lock`. Does `trio.Semaphore` with a limit of 1 really have differences other than this? I like that `trio.Lock` is more easily debuggable and harder to shoot yourself in the foot with. (And if we make `acquire` error when the owner is dead, even better!)
Perhaps we should raise an error or at least emit a warning when a task which owns a lock finishes?
I think this could error when you run `lock.acquire_nowait()`, i.e. `acquire_nowait` could check whether `self._owner` is still running. This doesn't handle the following, though:

```python
lock.acquire
lock.acquire
```
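A rough sketch of that owner-liveness check, using toy stand-ins rather than trio's real `Task`/`Lock` internals (the `running` flag and the error message are hypothetical):

```python
class WouldBlock(Exception):
    """Stand-in for trio.WouldBlock: lock is held, caller should park."""

class Task:
    """Toy task; `running` models "has not exited yet"."""
    def __init__(self, name):
        self.name = name
        self.running = True

class Lock:
    """Toy lock whose acquire_nowait refuses to wait on a dead owner."""
    def __init__(self):
        self._owner = None

    def acquire_nowait(self, task):
        if self._owner is None:
            self._owner = task  # uncontended: take ownership
        elif not self._owner.running:
            # Owner exited without releasing, so the lock can never be
            # released: fail loudly instead of deadlocking.
            raise RuntimeError(
                f"lock owner {self._owner.name!r} exited without releasing"
            )
        else:
            raise WouldBlock  # owner alive: normal contention
```

As noted above, this only catches the problem at acquire time; a task already parked when the owner exits never calls `acquire_nowait` again.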
Let's set `Semaphore` aside then; I'm more concerned that you can easily get a deadlock from `Lock`, e.g.
```python
lock = trio.Lock()
async with trio.open_nursery() as nursery:
    nursery.start_soon(lock.acquire)
    nursery.start_soon(lock.acquire)
```
IMO instead of hanging forever, we should raise `BrokenResourceError` in the task(s) waiting on a lock which can never be released, and perhaps also when exiting the task which holds that lock.
I agree; `BrokenResourceError` (or `RuntimeError` if we want) is better than a deadlock. Though I think that if a lock gets acquired and never released, but nobody else wants it, it's OK. (Though it's definitely a sign of a bug? Should that warn?)
I think so long as we get a `BrokenResourceError` instead of hanging, and can identify the offending lock + task if that happens, the other decisions are less crucial - e.g. I'd prefer to warn-or-error on task exit, but only if that's not going to cost us performance in the happy case.
Erroring on

```python
lock = trio.Lock()
async with trio.open_nursery() as nursery:
    nursery.start_soon(lock.acquire)
    await lock.acquire()
```

is easy & straightforward: add a check in `acquire_nowait` for `self._owner not in self._owner._parent_nursery._children`.
But the other cases are definitely more complicated, since the other task(s) might start waiting before the owning task exits. I think the way to go is a check in `Runner.task_exited` for whether the exiting task is holding any lock, which then errors the other holders (an `error_other_holders` step). I don't think this requires a big performance hit, but we'd likely need to add some overhead whenever acquiring/releasing a lock, which could perhaps be partially traded off with more logic in `task_exited`.
I think checking whether the task is still running is better than a parent-nursery check. But yeah, I'm not sure how to handle the other cases. Should there be a way to "poison" a parking lot?
Another way of implementing it (inspired by @mikenerone) is having an `Event` that fires when a `Task` exits, and that event would wake up any other tasks waiting inside `acquire`. This initially sounded conceptually simple, but when I started thinking about how to actually implement it (`acquire` waiting on either that `Event` or its `ParkingLot.park()`, and the lock's event changing each time the lock owner changes), I'm not sure how clean it would be.
Here's my latest idea: we can have a global dictionary of task -> a list (not a set, because we need to allow multiple of the same element) of parking lots, with an interface like `trio.lowlevel.add_parking_lot_breaker(task, parking_lot)` and `trio.lowlevel.remove_parking_lot_breaker(task, parking_lot)`.

Then the lock uses that to have its parking lot broken if the task gets killed without releasing the lock. We can document this for use by anything that expects a task to release something but might not.
> Then lock uses that to have its parking lot broken if the task gets killed without releasing the lock

The lock initiates it?

Trying to flesh this out:

1. `_LockImpl.acquire_nowait` calls `trio.lowlevel.add_parking_lot_breaker(self._owner, self._lot)`
2. `_LockImpl.release` calls `trio.lowlevel.remove_parking_lot_breaker(self._owner, self._lot)`
3. `Runner.task_exited` calls `trio.lowlevel.break_parking_lots(task)`
   1. This calls `ParkingLot.break()` on all associated lots
   2. `ParkingLot.break` calls `Runner.reschedule(..., next_send=Outcome.Error(BrokenResourceError(...)))` for each parked entity (if that's how you inject errors into other `Task`s).

> The lock initiates it?

Yeah, if I understand what you're asking.

> 3. `Runner.task_exited` calls `trio.lowlevel.break_parking_lots(task)`

We don't have to expose `trio.lowlevel.break_parking_lots`; it's fine if the runner just uses the global dict.
`ParkingLot.break` also:

1) needs a different name, because `break` is a keyword
2) should make any future `ParkingLot.park` calls error.

The error should probably contain a reference to the task that caused things to break. Maybe a subclass of `BrokenResourceError`.
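A toy sketch of those semantics; the `break_lot` name, the `LotBrokenError` subclass, and the `broken_by` attribute are all illustrative stand-ins, not trio's actual API:

```python
class BrokenResourceError(Exception):
    """Stand-in for trio.BrokenResourceError."""

class LotBrokenError(BrokenResourceError):
    """Hypothetical subclass carrying the task that broke the lot."""
    def __init__(self, task):
        super().__init__(f"parking lot broken by {task!r}")
        self.task = task

class ParkingLot:
    def __init__(self):
        self.broken_by = None  # task that broke this lot, if any
        self._parked = []

    def park(self, task):
        # Once broken, any future park() call errors immediately.
        if self.broken_by is not None:
            raise LotBrokenError(self.broken_by)
        self._parked.append(task)

    def break_lot(self, task):
        # Not named `break` because that's a Python keyword.
        self.broken_by = task
        woken, self._parked = self._parked, []
        # trio's runner would reschedule each of these with an error
        # outcome; here we just hand them back to the caller.
        return woken
```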
How do we handle the planned future improvement where multiple guest runs can run at once?

The `Task`s shouldn't compare equal, though if they do then ouch, that's something we should fix (given there's no way to start a task on two loops). So it's fine to use the task as a key.
> we can have a global dictionary of task -> a list (not a set because need to allow multiple of the same element) of parking lots

Why is it necessary to allow multiple of the same element? It's not required for lock, so I'm guessing you're thinking of some external scenario?
Opened #3081 with an initial implementation. It needs some polish but the overall approach looks solid
> we can have a global dictionary of task -> a list (not a set because need to allow multiple of the same element) of parking lots
>
> why is it necessary to allow multiple of the same element? It's not required for lock, so I'm guessing you're thinking of some external scenario?

Yeah, I was thinking of a semaphore (because it's still bad form to forget to release) and re-entering within the same task.
This made me look more into the other sync primitives, and `CapacityLimiter` might be a pretty good fit for an optional parameter that enables the same functionality (optional because of the `_on_behalf_of` methods). `Semaphore` would be a worse fit, but I suppose it could also have it. (At that point we should probably move the logic from `_LockImpl` to `AsyncContextManagerMixin`, or an intermediate class between those.)
Some time ago - never mind how long precisely - having a little coordination between tasks and nothing particular to interest me in semaphores, I saw our `RuntimeError("can't release a Lock you don't own")`.

I approve pretty strongly of keeping things local by default; it prevents all sorts of bugs, and `trio.Semaphore` is right there if you need something without this limitation. But two further observations prompted this issue: `trio.Semaphore()` is considerably more powerful and flexible in general. By the principle of least power, it might be nice to have `trio.Lock(*, allow_outside_release: bool = False)` or something along those lines.
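A sketch of what that opt-out might look like; the keyword name comes from the suggestion above, while the synchronous, single-owner bookkeeping (with the task passed in explicitly) is a simplification for illustration:

```python
class Lock:
    """Toy lock with the proposed allow_outside_release escape hatch."""
    def __init__(self, *, allow_outside_release: bool = False):
        self._allow_outside_release = allow_outside_release
        self._owner = None

    def acquire(self, task):
        # Toy model: no waiting/parking implemented.
        assert self._owner is None, "toy model supports one holder only"
        self._owner = task

    def release(self, task):
        # Default behaviour matches trio.Lock: only the owner may
        # release.  The flag opts into semaphore-like looseness.
        if task is not self._owner and not self._allow_outside_release:
            raise RuntimeError("can't release a Lock you don't own")
        self._owner = None
```

Keeping the strict behaviour as the default preserves the principle-of-least-power argument: you only get the foot-gun if you ask for it.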