uqfoundation / multiprocess

better multiprocessing and multithreading in Python
http://multiprocess.rtfd.io
Other
631 stars 64 forks source link

Returning certain types of custom exceptions from Pool.map callbacks causes a hang #33

Open joshdk opened 7 years ago

joshdk commented 7 years ago

Observed this behavior when Pool.maping a callback that either returned a value or a custom exception type. Returning certain types of exceptions will cause the Pool.map call to never return (hang), and in the case of Python 2.7, you are unable to SIGINT the process (possibly because something handles KeyboardInterrupt).

I believe that the mishandling of these custom exceptions is caused by pickle internally, but there is still the issue that Pool.map does not return when this occurs.

I have written some Nose tests in a Gist to reproduce this issue, and to determine exactly which cases trigger this issue. The tests that are decorated with @skip will cause the tests to hang if they are run.

I am using Python version 2.7.10/3.5.2, and multiprocess version 0.70.5.

TLDR; If, from a Pool.map callback, you return a custom exception whose constructor is of arity n, but you pass n + 1 arguments to that exception's parent constructor, Pool.map will never return, and you may need to kill Python to regain control.

mmckerns commented 7 years ago

Thanks for your gist, it's a nice collections of tests. If you think it's a pickling issue, then the thing to do is to check if dill can serialize the objects -- see dill.check and dill.copy.

mmckerns commented 7 years ago

In python2.7, I see this:

>>> import dill
>>> dill.dill._trace(True)   # turn on pickle tracing
>>>
>>> class CustomError3(Exception):
...   def __init__(self):
...     msg = 'oops'
...     Exception.__init__(self, msg)
... 
>>> exception = CustomError3()
>>> dill.check(exception)
T2: <class '__main__.CustomError3'>
F2: <function _create_type at 0x10ec206e0>
# F2
T1: <type 'type'>
F2: <function _load_type at 0x10ec20668>
# F2
# T1
T4: <type 'exceptions.Exception'>
# T4
D2: <dict object at 0x10ed01398>
F1: <function __init__ at 0x10ed46aa0>
F2: <function _create_function at 0x10ec20758>
# F2
Co: <code object __init__ at 0x10e835db0, file "<stdin>", line 2>
T1: <type 'code'>
# T1
# Co
D1: <dict object at 0x10e75a168>
# D1
D2: <dict object at 0x10ed03168>
# D2
# F1
# D2
# T2
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/Users/mmckerns/lib/python2.7/site-packages/dill-0.2.7.dev0-py2.7.egg/dill/dill.py", line 278, in loads
    return load(file)
  File "/Users/mmckerns/lib/python2.7/site-packages/dill-0.2.7.dev0-py2.7.egg/dill/dill.py", line 267, in load
    obj = pik.load()
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 864, in load
    dispatch[key](self)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 1139, in load_reduce
    value = func(*args)
TypeError: __init__() takes exactly 1 argument (2 given)

...which is exactly what is seen in the traceback from the Pool.map before it hangs.

From the pickle trace above, you can see that it appears to complete the dump (starts at T2 and ends at #T2), but fails on load due to the bad number of arguments in calling the parent.

I don't think it has anything to do with the class being a custom exception... and more that it's throwing an error due to the bad number of arguments when trying to create an instance.

Should throwing an exception cause a Pool.map to hang? Probably not. So there seems that there is some mishandling in there somewhere. I'll take some investigation. I'd imagine this has been seen in multiprocessing before as well.

joshdk commented 7 years ago

Thank you for looking into this! I completely agree with what you observed.

I'd imagine this has been seen in multiprocessing before as well.

I had to modify the test slightly, but I can confirm that this behavior is identical in multiprocessing (both the TypeError and the Python-2/3-specific hanging behavior).

from nose.plugins.skip import SkipTest
from nose.tools import assert_raises
from multiprocessing import Pool

class CustomError3(Exception):
    def __init__(self):
        msg = 'Something went wrong.'
        Exception.__init__(self, msg)

def callback(_):
    return CustomError3()

def test_5():
    """Test returning an extension that passes an additional argument to standard Exception."""

    error = CustomError3()
    assert isinstance(error, Exception)

    results = pool.map(callback, ['A', 'B', 'C'])
    assert len(results) == 3

pool = Pool(1)

I suspect that the behavior in multiprocess may be inherited from the original multiprocessing fork, but I would still be very interested in seeing the results of your investigation.