uqfoundation / pathos

parallel graph management and execution in heterogeneous computing
http://pathos.rtfd.io

make multiprocess work properly in jupyterlab on windows #219

Open simonnier opened 3 years ago

simonnier commented 3 years ago

It is well known that the multiprocessing module has some severe issues in JupyterLab on Windows. Unfortunately, multiprocess solves only a limited subset of these cases at the moment.

Below I provide several cases for discussion. The code is placed in a single JupyterLab cell to run.

Some of my package versions are:

jupyterlab 3.0.14
multiprocess 0.70.12.2
cloudpickle 1.6.0


1st case:

def foo(x):
    return x

def bar(z):
    return [foo(z)]

from multiprocess import Pool
with Pool(2) as p:
    print(p.map(bar,[1,2]))

Running it pops up an error message:

---------------------------------------------------------------------------
RemoteTraceback                           Traceback (most recent call last)
RemoteTraceback: 
"""
Traceback (most recent call last):
  File "C:\Users\qq\anaconda3\lib\site-packages\multiprocess\pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "C:\Users\qq\anaconda3\lib\site-packages\multiprocess\pool.py", line 48, in mapstar
    return list(map(*args))
  File "<ipython-input-14-39de361cd185>", line 5, in bar
NameError: name 'foo' is not defined
"""

The above exception was the direct cause of the following exception:

NameError                                 Traceback (most recent call last)
<ipython-input-14-39de361cd185> in <module>
      7 from multiprocess import Pool
      8 with Pool(2) as p:
----> 9     print(p.map(bar,[1,2]))

~\anaconda3\lib\site-packages\multiprocess\pool.py in map(self, func, iterable, chunksize)
    362         in a list that is returned.
    363         '''
--> 364         return self._map_async(func, iterable, mapstar, chunksize).get()
    365 
    366     def starmap(self, func, iterable, chunksize=None):

~\anaconda3\lib\site-packages\multiprocess\pool.py in get(self, timeout)
    769             return self._value
    770         else:
--> 771             raise self._value
    772 
    773     def _set(self, i, obj):

NameError: name 'foo' is not defined

Apparently, multiprocess cannot recognize foo when it is called inside bar.
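For context, this matches how by-reference function serialization works in general: a top-level function is stored as just its module and qualified name, so any helper it calls must be re-resolved by name in the receiving process, where the notebook's __main__ namespace does not exist. A minimal sketch with stdlib pickle only (multiprocess actually uses dill, but the worker-side name lookup is the same idea):

```python
import pickle

def foo(x):
    return x

def bar(z):
    return [foo(z)]

# Plain pickle serializes a top-level function by *reference*:
# the payload records only the module and qualified name.
blob = pickle.dumps(bar)

# The name "bar" is in the payload, but none of foo's code is --
# foo will be looked up by name in whatever process unpickles this.
print(b"bar" in blob, b"foo" in blob)  # True False
```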

As suggested in https://stackoverflow.com/a/16891169/1911722, cloudpickle is "able to pickle a function, method, class, or even a lambda, as well as any dependencies." Let us try it.

2nd case:

import cloudpickle

def foo(x):
    return x

def bar(z):
    return [foo(z)]

x = cloudpickle.dumps(bar)
del foo
del bar

import pickle

f = pickle.loads(x)
print(f(3))

from multiprocess import Pool
with Pool(2) as p:
    print(p.map(f,[1,2]))

it outputs

[3]
[[1], [2]]

First, print(f(3)) prints the correct result, so cloudpickle seems to be "pickling" those dependencies quite well. Second, p.map also prints the correct result. At this point I almost thought cloudpickle was a perfect tool to work around the limitation of multiprocess. But let us go on.

3rd case:

import cloudpickle

def h(x):
    return [x]

def foo(x):
    return h(x)

def bar(z):
    return [foo(z)]

x = cloudpickle.dumps(bar)
del foo
del bar
del h

import pickle

f = pickle.loads(x)
print(f(3))

from multiprocess import Pool
with Pool(2) as p:
    print(p.map(f,[1,2]))

Now bar calls foo, and foo calls h, so that is a chain of three functions. You will notice that print(f(3)) still gives the correct result, which suggests cloudpickle is still pickling well. But p.map raises an error:

---------------------------------------------------------------------------
RemoteTraceback                           Traceback (most recent call last)
RemoteTraceback: 
"""
Traceback (most recent call last):
  File "C:\Users\qq\anaconda3\lib\site-packages\multiprocess\pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "C:\Users\qq\anaconda3\lib\site-packages\multiprocess\pool.py", line 48, in mapstar
    return list(map(*args))
  File "<ipython-input-2-4643d366747a>", line 10, in bar
  File "<ipython-input-2-4643d366747a>", line 7, in foo
NameError: name 'h' is not defined
"""

The above exception was the direct cause of the following exception:

NameError                                 Traceback (most recent call last)
<ipython-input-2-4643d366747a> in <module>
     22 from multiprocess import Pool
     23 with Pool(2) as p:
---> 24     print(p.map(f,[1,2]))

~\anaconda3\lib\site-packages\multiprocess\pool.py in map(self, func, iterable, chunksize)
    362         in a list that is returned.
    363         '''
--> 364         return self._map_async(func, iterable, mapstar, chunksize).get()
    365 
    366     def starmap(self, func, iterable, chunksize=None):

~\anaconda3\lib\site-packages\multiprocess\pool.py in get(self, timeout)
    769             return self._value
    770         else:
--> 771             raise self._value
    772 
    773     def _set(self, i, obj):

NameError: name 'h' is not defined

p.map cannot find the definition of h.

Conclusion

From the above cases, it seems that cloudpickle indeed pickles a function and its dependencies well, but multiprocess still has some problems:

  1. Without cloudpickle, multiprocess does not support even a chain of two functions.
  2. With cloudpickle, multiprocess does not support chains of three or more functions. Still, it seems promising to me that if multiprocess were properly combined with cloudpickle, it could solve all of these problems in JupyterLab on Windows.