python / cpython

The Python programming language
https://www.python.org
Other
63.06k stars 30.2k forks source link

asyncio.Queue.put_nowait(), followed get() task cancellation leads to item being lost #68000

Closed ea7c4c07-553b-4446-b959-72669dcdd94f closed 9 years ago

ea7c4c07-553b-4446-b959-72669dcdd94f commented 9 years ago
BPO 23812
Nosy @gvanrossum, @vstinner, @1st1
Files
  • bug-test.diff: test case patch
  • Issue23812.diff: A patch that fixes the problem
  • queue_bug.py
  • p.diff: patch to refactor Queue.get()
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields: ```python assignee = None closed_at = created_at = labels = ['deferred-blocker', 'type-bug', 'expert-asyncio'] title = 'asyncio.Queue.put_nowait(), followed get() task cancellation leads to item being lost' updated_at = user = 'https://bugs.python.org/gustavo' ``` bugs.python.org fields: ```python activity = actor = 'yselivanov' assignee = 'none' closed = True closed_date = closer = 'yselivanov' components = ['asyncio'] creation = creator = 'gustavo' dependencies = [] files = ['38740', '38741', '38820', '40142'] hgrepos = [] issue_num = 23812 keywords = ['patch'] message_count = 24.0 messages = ['239611', '239624', '239625', '239627', '239628', '239679', '239738', '240006', '246762', '248055', '248056', '248137', '248138', '248144', '248145', '248146', '248148', '248150', '248157', '248181', '248187', '248203', '248206', '248207'] nosy_count = 5.0 nosy_names = ['gvanrossum', 'gustavo', 'vstinner', 'python-dev', 'yselivanov'] pr_nums = [] priority = 'deferred blocker' resolution = 'fixed' stage = 'resolved' status = 'closed' superseder = None type = 'behavior' url = 'https://bugs.python.org/issue23812' versions = ['Python 3.4', 'Python 3.5'] ```

    ea7c4c07-553b-4446-b959-72669dcdd94f commented 9 years ago

    I have a pattern where I read from a queue with a timeout, generally like this:

    while True:
      reader = asyncio.async(wait_for(queue.get(), 0.1))
      try:
        item = (yield from reader)
      except asyncio.TimeoutError:
        reader.cancel()
        continue

    This is to have a loop where we try to get items from the queue with a timeout. Prior to Python 3.5, wait_for() doesn't automatically cancel the reading task, so I have to do it explicitly above.

    The code has a strange race condition where, if a taks calls put_nowait on a reader task that is just about to be cancelled due to timeout, then the item that wast put onto the queue gets lost.

    In the tests framework, the minimal test case I can come up with is mainly this code:

      q = asyncio.Queue(loop=loop)
      reader = loop.create_task(q.get())
      loop.run_until_complete(asyncio.sleep(0.01, loop=loop))
    
      q.put_nowait(1)
      q.put_nowait(2)
      reader.cancel()

    When the reader gets cancelled, the item 1 is lost into the ether, and when you create another reader for the same queue, it only gets the second item 2. I would expect that, if a reader task is cancelled, the item at the head of the queue doesn't get lost. Either the reader succeeds and the item is removed, or it doesn't succeed and the item is left alone.

    I attach a patch to the tulip tests that reproduces the problem in both Python 3.4.2 and tulip hg.

    gvanrossum commented 9 years ago

    Looks like a valid bug report, I like the test you provided, and the fix seems on the right track. Comments on the fix:

    ea7c4c07-553b-4446-b959-72669dcdd94f commented 9 years ago

    I created a codereview issue: https://codereview.appspot.com/222930043

    ea7c4c07-553b-4446-b959-72669dcdd94f commented 9 years ago
    • Are there other places where a cancellation can have a similar effect? Maybe the same logic in put()?

    Hm.. I didn't look, but yes, it does look like it might be affected by the same issue. I'll try to create a test for that to confirm.

    how can the yield-from in get() receive a CancelledError without the waiter being cancelled?

    Well, it definitely gets a CancelledError in the return (yield from waiter). I added extensive logging and could confirm this is the case.

    What happens is that

    1. Test creates a task q.get()
    2. The task coroutine, inside Queue.get, blocks on the yield from waiter
    3. a couple of items are put in the queue
    4. Test code calls task.cancel() (where task is running the coroutine q.get)
    5. The Queue.get task receives a CancelledError because it was cancelled. The exception appears to come from the yield from waiter only because this is the place where the Queue.get task was suspended, but waiter itself is not cancelled.
    gvanrossum commented 9 years ago

    Make sense. I'll be waiting for your updated patch. Thanks for both the bug report and the fix!

    ea7c4c07-553b-4446-b959-72669dcdd94f commented 9 years ago

    So I uploaded a new patch version fixing a similar problem in put().

    gvanrossum commented 9 years ago

    I'm sorry, I don't have time to review this (and it's subtle enough that I don't want to approve it without understanding).

    Maybe Victor understands?

    vstinner commented 9 years ago

    queue_bug.py: script to reproduce the bug.

    I confirm that Queue.get() sometimes looses items when it is cancelled. The waiter contains the result, but the waiter is lost when get() is cancelled.

    Queue.get() waiter got a result, but Queue.get() wakeup is only *scheduled* yet. It may even be executed in the same iteration, or maybe in the next loop iteration.

    I didn't review patches yet.

    ea7c4c07-553b-4446-b959-72669dcdd94f commented 9 years ago

    Don't know if it helps, but I made a github pull request for this:

    https://github.com/python/asyncio/pull/256

    1762cc99-3127-4a62-9baf-30c3d0f51ef7 commented 9 years ago

    New changeset 7aa2d3e1c885 by Yury Selivanov in branch '3.4': Issue bpo-23812: Fix asyncio.Queue.get() to avoid loosing items on cancellation. https://hg.python.org/cpython/rev/7aa2d3e1c885

    New changeset d5644d7e222d by Yury Selivanov in branch '3.5': Issue bpo-23812: Fix asyncio.Queue.get() to avoid loosing items on cancellation. https://hg.python.org/cpython/rev/d5644d7e222d

    New changeset 8f581da70ccd by Yury Selivanov in branch 'default': Issue bpo-23812: Fix asyncio.Queue.get() to avoid loosing items on cancellation. https://hg.python.org/cpython/rev/8f581da70ccd

    1st1 commented 9 years ago

    The fix is committed. Closing the issue. Thanks a lot, Gustavo!

    1762cc99-3127-4a62-9baf-30c3d0f51ef7 commented 9 years ago

    New changeset 024d4f4011c9 by Yury Selivanov in branch '3.4': Issue bpo-23812: Fix getter-cancellation with many pending getters code path https://hg.python.org/cpython/rev/024d4f4011c9

    New changeset 2752fe734bfb by Yury Selivanov in branch '3.5': Merge 3.4 (issue bpo-23812) https://hg.python.org/cpython/rev/2752fe734bfb

    New changeset e54c684e6788 by Yury Selivanov in branch 'default': Merge 3.5 (issue bpo-23812) https://hg.python.org/cpython/rev/e54c684e6788

    1st1 commented 9 years ago

    Guido, Victor,

    I've just pushed a commit to fix a misspelled method call in queues.py (related to the previous commit in this issue).

    Along with fixing the bug and writing a unittest for it, I discovered an issue with the current queues design.

    Here's an outline of the new unittest:

    1. we have a queue with three queue.get() coroutines waiting for its items: g1, g2, g3

    2. we do push_nowait(i1) and push_nowait(i2); we cancel g1

    3. the loop gets the chance to run

    4. g1 detects that it was cancelled and pushes the item i1 in front of the queue (otherwise we just lose it).

    Now, the problem is that g2 has already received i2 (by push_nowait); and i1 will end up consumed by g3!

    This all means that with getter coroutines being cancelled, it's possible that getters might receive queue items out of order.

    So the question is: is this a matter of documenting this behaviour, or we should consider some other design for queues implementation?

    gvanrossum commented 9 years ago

    Honestly, I've lost track of the queue design. Maybe the push-back on cancellation is just wrong? After all, if a coroutine has received an item, it's out of the queue, even if it gets cancelled before it can do anything with the item.

    On Thu, Aug 6, 2015 at 8:19 PM, Yury Selivanov \report@bugs.python.org\ wrote:

    Yury Selivanov added the comment:

    Guido, Victor,

    I've just pushed a commit to fix a misspelled method call in queues.py (related to the previous commit in this issue).

    Along with fixing the bug and writing a unittest for it, I discovered an issue with the current queues design.

    Here's an outline of the new unittest:

    1. we have a queue with three queue.get() coroutines waiting for its items: g1, g2, g3

    2. we do push_nowait(i1) and push_nowait(i2); we cancel g1

    3. the loop gets the chance to run

    4. g1 detects that it was cancelled and pushes the item i1 in front of the queue (otherwise we just lose it).

    Now, the problem is that g2 has already received i2 (by push_nowait); and i1 will end up consumed by g3!

    This all means that with getter coroutines being cancelled, it's possible that getters might receive queue items out of order.

    So the question is: is this a matter of documenting this behaviour, or we should consider some other design for queues implementation?

    ----------


    Python tracker \report@bugs.python.org\ \http://bugs.python.org/issue23812\


    gvanrossum commented 9 years ago

    Honestly, I've lost track of the queue design. Maybe the push-back on cancellation is just wrong? After all, if a coroutine has received an item, it's out of the queue, even if it gets cancelled before it can do anything with the item.

    1st1 commented 9 years ago

    Honestly, I've lost track of the queue design. Maybe the push-back on cancellation is just wrong? After all, if a coroutine has received an item, it's out of the queue, even if it gets cancelled before it can do anything with the item.

    I think the push-back is a right thing. At least, I wouldn't expect wait_for(queue.get(), 0.1) code to lose queue items sometimes in big codebases.

    ea7c4c07-553b-4446-b959-72669dcdd94f commented 9 years ago

    I don't think the order for multiple concurrent getters matters that much. With analogy with the threading case, if multiple threads are blocked get()ing an item from the same queue, I would not presume to expect anything about the ordering which thread receives which item.

    That being said, I'm sorry patch caused this problem, but losing items is an infinitely worse problem.

    That being said, I think the problem stems from the way queue get is designed. Currently it is something like:

    1. get(): in case the queue is empty, create a Future, add it to "_getters", then begin waiting for this future;
    2. When an item is being put(), take the first getter and set its result to the item being put;
    3. the get() coroutine resumes and the item is found in the futures's result.

    A better design is to make it so the future that get() is waiting for doesn't actually receive the item, it is only used to "wake up" the get() coroutine. I would be something like:

    1. get(): in case the queue is empty, create a Future, add it to "_getters", then begin waiting for this future;
    2. When an item is being put(): 2.1. add the item to the internal queue 2.2. take the first getter and set its result to None
    3. the get() coroutine resumes, and then takes an item from the internal queue, returning it.

    Clearly, in this design, handling cancellation in get is much simpler: you don't need to do anything, just let it cancel, no cleanup action needed.

    1st1 commented 9 years ago

    A better design is to make it so the future that get() is waiting for doesn't actually receive the item, it is only used to "wake up" the get() coroutine. I would be something like:

    1. get(): in case the queue is empty, create a Future, add it to "_getters", then begin waiting for this future;
    2. When an item is being put(): 2.1. add the item to the internal queue 2.2. take the first getter and set its result to None
    3. the get() coroutine resumes, and then takes an item from the internal queue, returning it.

    Clearly, in this design, handling cancellation in get is much simpler: you don't need to do anything, just let it cancel, no cleanup action needed.

    I like this a lot. Do you want to draft a patch? :)

    ea7c4c07-553b-4446-b959-72669dcdd94f commented 9 years ago

    Sure, just give me a couple of days.

    gvanrossum commented 9 years ago

    So this looks like it will miss 3.5.0rc1. How confident are we that the new patch won't introduce new bugs? This late in the release process that would be awkward.

    On Fri, Aug 7, 2015 at 12:47 AM, Gustavo J. A. M. Carneiro \< report@bugs.python.org> wrote:

    Gustavo J. A. M. Carneiro added the comment:

    Sure, just give me a couple of days.

    ----------


    Python tracker \report@bugs.python.org\ \http://bugs.python.org/issue23812\


    ea7c4c07-553b-4446-b959-72669dcdd94f commented 9 years ago

    I was wrong, there still needs to be some cleanup in cancellation, even with the new approach. But it does solve the out-of-order problem.

    I don't know if it should be applied to rc1. I wish I had more time to test. Up to you guys.

    1st1 commented 9 years ago

    Guido, I agree, let's not push the updated implementation in 3.5.0.

    Gustavo, could you please generate the patch with "hg diff", so that code review here works? And I think we need a new issue to track the new patch.

    ea7c4c07-553b-4446-b959-72669dcdd94f commented 9 years ago

    I am not using hg anymore, since asyncio migrated to git.

    Here's a github PR, does that help? https://github.com/python/asyncio/pull/260

    1st1 commented 9 years ago

    Yep, GH works. Thanks!