[Hardening] Some edge case issues with AsyncRLock

Joshuaalbert commented 1 year ago

The following observations are made about AsyncRLock. These observations are intended to highlight a potential set of improvements, to harden code. The reason I'm looking into this is because we need an RLock for asyncio and discovered that python plans not to support it.

Treatment of `asyncio.CancelledError`

In the _acquire_non_blocking method, if the asyncio.sleep(0) call gets cancelled, the CancelledError exception is re-raised, potentially leading to improper cleanup.

The re-raising of the CancelledError will interrupt the execution flow, leaving no chance for the lock to be released if it was previously acquired by this task. However, I do note that the logic surrounding this operation seems to consider this case: if the lock is owned by the current task (hence the cancellation can impact the lock's state), the _count counter will be incremented, meaning the lock is not released. The code seems designed to avoid silent cancellation.

That said, this logic can indeed be risky, especially when it comes to proper handling of resources in case of cancellation. So, the concurrency control might benefit from a more robust way of handling cancellation that ensures resources (locks) are properly released.

Simple fix

To prevent this, we could modify the exception handling to check if the lock was acquired and release it before re-raising the exception.

try:
    await asyncio.sleep(0)
except asyncio.CancelledError:
    # Check if the lock was acquired and release it if necessary
    if self.is_owner(task=me) and self._count > 0:
        self._release(me)
    # This is emulating non-blocking. There is no cancelling this!
    # Still, we don't want to silently swallow the cancellation.
    # Hence, we flag this task as cancelled again, so that the next
    # `await` will raise the CancelledError.
    asyncio.current_task().cancel()

Missing error handling in `_acquire_non_blocking`

In _acquire_non_blocking, if the task.done() condition is met but the task completed with an exception (task.exception() is not None), it seems that neither the lock gets acquired nor the exception is handled in any way, which might lead to lost exceptions and the lock never being acquired. If task.exception() is not None, there's no fallback or handling for this scenario. The exception that occurred during the task is neither logged nor re-raised, which can make debugging difficult. Furthermore, this scenario will result in False being returned, indicating that the lock was not acquired, but without any further explanation or information.

Simple fix

To fix this, we can add an else clause to the condition checking task.done() and task.exception() is None. This clause would handle the case where an exception has occurred.

if task.done() and task.exception() is None:
    self._owner = me
    self._count = 1
    return True
elif task.done() and task.exception() is not None:
    # handle the case where the task has an exception
    raise task.exception()

Joshuaalbert commented 1 year ago

By the way here is our fair asyncIO RLock implementation.

https://gist.github.com/Joshuaalbert/36dabe4f7f9648763520d19e57fcce22

robsdedude commented 1 year ago

First of all, thank you very much for reaching out after having a close look at our async concurrency primitives. And especially for such a detailed analysis.

Sorry for my late response, but I was on vacation the past few weeks. I will try to take a close look at your remarks soon and might come back with a PR or further questions.

Joshuaalbert commented 1 year ago

Hi @robsdedude my pleasure. Also, we've decided to release our implementation, in case others need a fair async reentrant lock: https://pypi.org/project/fair-async-rlock/

robsdedude commented 1 year ago

Alright, so I'm investigating now.

First of all, _acquire_non_blocking is currently not even used other than in tests. But that is no excuse for it to be incorrect.

Point 1: Treatment of `asyncio.CancelledError`

[...] if the asyncio.sleep(0) call gets cancelled, the CancelledError exception is re-raised, potentially leading to improper cleanup. I does not get re-raised, I think. Calling asyncio.current_task().cancel() will mark the current task canceled, but won't raise anything until the current task decides to yield to the event loop (which we don't do in the rest of the method nor in the calling acquire) at which point the loop will throw a CancelledError into the co-routine.

But you are right in that the sleep could've been canceled and yet the enqueued super().acquier() could've already succeeded. But in that case, the following lines under the except clause will take care of reflecting that state change in the RLock (setting _owner and _count).

Also note how _owner and _count have not yet been touched in case of the sleep being cancelled. The idea is that _acquire_non_blocking should ignore any cancellation and acquire the lock nevertheless, but at the same time it should not swallow any cancellations. Hence, we defer the cancellation to the next await by marking the task as cancelled again.

Am I missing something in my assessment of the code here?

Point 2: Missing error handling in `_acquire_non_blocking`

:100: spot on! Thanks for spotting. I'll get a PR up for that. Even though, I'm uncertain under which circumstances asynio.Lock.acquire() would raise any exceptions.

robsdedude commented 1 year ago

Since I've not heard back in a month, I'll close the issue assuming my assessment of point 1 is correct :crossed_fingers:

Should you disagree, feel free to keep commenting or even re-open the issue. Thanks once more for reviewing our code! :bow: :raised_hands:

Joshuaalbert commented 1 year ago

@robsdedude Hi! Had a few Holidays in July and it slipped my mind to check back here, and on the issues you created in FairAsyncRLock (much appreciated btw). I think your assessment of point one is right, my mistake.

neo4j / neo4j-python-driver