python / cpython

The Python programming language
https://www.python.org
Other
62.39k stars 29.96k forks source link

multiprocessing.pool methods imap()[_unordered()] deadlock #67240

Closed 05e7202a-a74f-46a6-845a-af7a608ad541 closed 9 years ago

05e7202a-a74f-46a6-845a-af7a608ad541 commented 9 years ago
BPO 23051
Nosy @pitrou, @serhiy-storchaka, @MojoVampire, @applio, @indygreg
Files
  • Issue_23051_reproducer_v2_7.py: Reproducer of issue for Python 2.7
  • Issue_23051_fix_v2_7.patch: Fix for Python 2.7
  • Issue_23051_reproducer_v3_4.py: Reproducer of issue for Python 3.4
  • Issue_23051_fix_v3_4.patch: Fix for Python 3.4
  • issue_23051_fix_and_tests_v35_and_v34.patch: Ignore -- replaced by newer patch file.
  • issue_23051_fix_and_tests_v27.patch: Ignore -- replaced by newer patch file.
  • issue_23051_revised_fix_and_tests_v35_and_v34.patch: Revised, combined patch and unit tests for default/3.5 and 3.4 both (supercedes earlier patch files)
  • issue_23051_revised_fix_and_tests_v27.patch: Revised, combined patch and unit tests for 2.7 (supercedes earlier patch files)
  • issue_23051_4-3.4.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields: ```python assignee = 'https://github.com/serhiy-storchaka' closed_at = created_at = labels = ['type-bug', 'library'] title = 'multiprocessing.pool methods imap()[_unordered()] deadlock' updated_at = user = 'https://bugs.python.org/advance512' ``` bugs.python.org fields: ```python activity = actor = 'indygreg' assignee = 'serhiy.storchaka' closed = True closed_date = closer = 'serhiy.storchaka' components = ['Library (Lib)'] creation = creator = 'advance512' dependencies = [] files = ['37449', '37450', '37451', '37453', '38220', '38221', '38362', '38363', '38377'] hgrepos = [] issue_num = 23051 keywords = ['patch'] message_count = 13.0 messages = ['232643', '232644', '234039', '234116', '236427', '236477', '237358', '237386', '237458', '238003', '238007', '238008', '248366'] nosy_count = 8.0 nosy_names = ['pitrou', 'python-dev', 'sbt', 'serhiy.storchaka', 'josh.r', 'davin', 'advance512', 'indygreg'] pr_nums = [] priority = 'normal' resolution = 'fixed' stage = 'resolved' status = 'closed' superseder = None type = 'behavior' url = 'https://bugs.python.org/issue23051' versions = ['Python 2.7', 'Python 3.4', 'Python 3.5'] ```

    05e7202a-a74f-46a6-845a-af7a608ad541 commented 9 years ago

    When imap() or imap_unordered() are called with the iterable parameter set as a generator function, and when that generator function raises an exception, then the _task_handler thread (running the method _handle_tasks) dies immediately, without causing the other threads to stop and without reporting the exception to the main thread (that one that called imap()).

    I saw this issue in Python 2.7.8, 2.7.9 and 3.4.2. Didn't check other versions, but I assume this is a bug in all Python versions since 2.6.

    I will be attaching examples that reproduce this issue, as well as patches for both Python 2.7 and Python 3.4.

    05e7202a-a74f-46a6-845a-af7a608ad541 commented 9 years ago

    The patches I attached do 2 things:

    1. A deadlock is prevented, wherein the main thread waits forever for the Pool thread/s to finish their execution, while they wait for instructions to terminate from the _task_handler thread which has died. Instead, the exception are caught and handled and termination of the pool execution is performed.
    2. The exception that was raised is caught and passed to the main thread, and is re-thrown in the context of the main thread - hence the user catch it and handle it, or - at the very least - be aware of the issue.

    I tested the patch to the best of my abilities, and am almost certain nothing was changed performance wise nor anything broken.

    Further eyes would, of course, only help for confirming this.

    applio commented 9 years ago

    Successfully reproduced the behavior playing through variations with the supplied examples (very helpful) on OS X v10.9.5 both in 2.7.9 and the default branch -- adding version 3.5 to issue.

    Also successfully verified that the supplied patches do just as advertised on same: OS X v10.9.5, both 2.7.9 and default branch.

    The patches' addition of a try around that for block looks like a sensible choice. At the moment, I do not yet fully understand what the patches' exception block is doing with the 'cache[job]._set' call the way it is, but I would like to spend more time to digest it better -- I hope to do so later this week.

    pitrou commented 9 years ago

    The patch would at least need to add a unit test in order to avoid regressions.

    applio commented 9 years ago

    For my part, I'm now good with all aspects of the patch supplied by Alon.

    I have a working set of tests that will be attached in the next day or two after seeing them behave on more than one platform.

    applio commented 9 years ago

    Attaching updated patch files that combine both Alon's fix with unit tests for it. Review of the unit tests especially would be welcomed.

    My one misgiving about these unit tests is that if at some future date there were a regression, the unhandled issue would potentially cause this unit test to hang -- it would be nicer if that did not have to be so.

    These two combined patches (one for default/3.5 and 3.4, one for 2.7) have been tested on OS X 10.10, Ubuntu Linux 12.04.5 64-bit, and Windows 7 64-bit (though only for the default/3.5 and 3.4 patch on Win7).

    serhiy-storchaka commented 9 years ago

    Added comments on Rietveld.

    applio commented 9 years ago

    Updated (1) the patch for default/3.5 and 3.4 and (2) the patch for 2.7 to reflect recommendations from the review.

    Thanks goes to Serhiy for the helpful review and especially the suggestion on better future-proofing in the tests.

    serhiy-storchaka commented 9 years ago

    May be the code would cleaner when convert the "for" loop to the "while" loop and wrap in try/except only next()?

    applio commented 9 years ago

    After pondering it for two days and coming back to it with hopefully "fresh eyes", I believe that changing the for-loop to a while-loop is not overall easier to understand -- I would lean towards keeping the for-loop.

    I do think the change to the while-loop very much made the exception handling logic clearer but seeing the while-loop with the "manual" invocation of next and incrementing of the variable i and re-use of i as a signal to break out of a loop (setting i = None) made other things less clear. My belief is that someone reading this method's code for the first time will read the for-loop version as, "try to loop through the enumerated tasks and if anything goes wrong then set the next position in the cache to 'failed'". That top-level reading is, I think, not quite as easy with the while-loop. Without the exception handling that we add in this patch, the original code used the for-loop and would, I think, have looked weird if it had tried to use a while-loop -- I think that's a sign that the for-loop is likely to be more easily understood by a first-time reader.

    Though I am not sure it really matters, the while-loop version would only help end the processing of further jobs if an exception occurs in the iterator whereas the for-loop version might help if exceptions occur in a couple of other places. We do not have a clear motivation for needing that however.

    1762cc99-3127-4a62-9baf-30c3d0f51ef7 commented 9 years ago

    New changeset 525ccfcc55f7 by Serhiy Storchaka in branch '3.4': Issue bpo-23051: multiprocessing.Pool methods imap() and imap_unordered() now https://hg.python.org/cpython/rev/525ccfcc55f7

    New changeset 7891d084a9ad by Serhiy Storchaka in branch 'default': Issue bpo-23051: multiprocessing.Pool methods imap() and imap_unordered() now https://hg.python.org/cpython/rev/7891d084a9ad

    New changeset 311d52878a65 by Serhiy Storchaka in branch '2.7': Issue bpo-23051: multiprocessing.Pool methods imap() and imap_unordered() now https://hg.python.org/cpython/rev/311d52878a65

    serhiy-storchaka commented 9 years ago

    Thank you for your contribution Alon and Davin.

    236914ab-5504-4492-add8-6907a1140f5c commented 9 years ago

    For posterity, I think we ran into a similar problem in https://bugzilla.mozilla.org/show_bug.cgi?id=1191877, where our code uses apply_async():

    11:09:47     INFO -  Exception in thread Thread-2:
    11:09:47     INFO -  Traceback (most recent call last):
    11:09:47     INFO -    File "/tools/python/lib/python2.7/threading.py", line 551, in __bootstrap_inner
    11:09:47     INFO -      self.run()
    11:09:47     INFO -    File "/tools/python/lib/python2.7/threading.py", line 504, in run
    11:09:47     INFO -      self.__target(*self.__args, **self.__kwargs)
    11:09:47     INFO -    File "/tools/python/lib/python2.7/multiprocessing/pool.py", line 319, in _handle_tasks
    11:09:47     INFO -      put(task)
    11:09:47     INFO -  RuntimeError: dictionary changed size during iteration

    This resulted in deadlock, just like this issue.

    The added try..except around the iteration of taskseq likely fixes this as well.