uqfoundation / pathos

parallel graph management and execution in heterogeneous computing
http://pathos.rtfd.io

pathos.multiprocessing does not support nested and hierarchical maps when distributing large arrays #145

Closed gobbedy closed 6 years ago

gobbedy commented 6 years ago

Hello,

This code runs as fast as expected (~0.2 seconds):

#!/usr/bin/env python3.6
from time import time
from pathos.multiprocessing import Pool as ThreadPool
import numpy as np

def my_top_function():

    global my_array # Comment out this line to make it run 10x slower
    my_array = np.random.rand(10000000)

    def my_worker_function(number):
        my_array - my_array[0] # some arbitrary work

    k_list = np.arange(10)

    ts = time()
    pool = ThreadPool(4)
    pool.map(my_worker_function, k_list)
    pool.close()
    pool.join()
    te = time()
    print('Finished multiprocess: took %2.4f seconds.' % (te - ts))

my_top_function()

However, comment out global my_array and it slows down by 10x (takes ~2 seconds).

My actual code is part of a large class, and using a global would cause all sorts of problems.

Is there any way to avoid having to use global?

Also, why does my_array have to be global anyway? Isn't my_array global from the point of view of my_worker_function()?

I've been at this for two days and I'd be very grateful for some help.

gobbedy commented 6 years ago

Possibly a long shot, but could this issue be related to https://github.com/uqfoundation/dill/issues/56?

If I make a small change to my code, it fails with the exact same error as that issue:

#!/usr/bin/env python3.6
from time import time
from pathos.multiprocessing import Pool as ThreadPool
import numpy as np

def my_top_function():

    #global cls # Comment out this line to make it fail -- DIFFERENT THAN CODE IN ORIGINAL COMMENT
    cls = type('NewCls', (object,), dict())  # -- DIFFERENT THAN CODE IN ORIGINAL COMMENT

    def my_worker_function(number):
        cls.test = 'bla' # some arbitrary work   -- DIFFERENT THAN CODE IN ORIGINAL COMMENT

    k_list = np.arange(10)

    ts = time()
    pool = ThreadPool(4)
    pool.map(my_worker_function, k_list)
    pool.close()
    pool.join()
    te = time()
    print('Finished multiprocess: took %2.4f seconds.' % (te - ts))

my_top_function()

Make cls global (uncomment global cls) and it passes again.

I'm not familiar enough with the package's code to be sure, but does this point to an issue in dill?

gobbedy commented 6 years ago

@mmckerns, could you please give your thoughts on whether the two issues are related?

mmckerns commented 6 years ago

@gobbedy: I haven't had time to check your code yet, but I will. It's an interesting problem. My quick assessment is that I'm unsurprised. Items in globals, in __main__, imported from other files, and in locals... are all serialized very differently. Some variants (depending on where the object is referenced) can see the pickle increase dramatically in size and in the scope of what has to be serialized (due to references).

I suggest trying your code above, but with dill.settings['recurse'] = True (or you can directly set recurse=True on dump/dumps). Let me know if that makes a difference. The biggest thing to watch for is objects that reference globals, which can end up forcing a pickle of all of the items in globals; recurse tries to avoid some of that.
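
For reference, a minimal sketch (not from this thread) of the two ways to enable recurse mentioned above; my_func here is just a placeholder function:

import dill

# Option 1: set it globally, so anything that serializes through dill's defaults
# (as pathos does) picks it up
dill.settings['recurse'] = True

def my_func(x):
    return x + 1

# Option 2: pass recurse directly to dump/dumps for a single call
payload = dill.dumps(my_func, recurse=True)
assert dill.loads(payload)(1) == 2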

gobbedy commented 6 years ago

Hi @mmckerns thanks for your quick reply.

I tried adding dill.settings['recurse'] = True and unfortunately that didn't work. To make sure we're on the same page, the imports for the code in both of my comments above are now the following:

#!/usr/bin/env python3.6
from time import time
import dill # added this
dill.settings['recurse'] = True # added this
from pathos.multiprocessing import Pool as ThreadPool
import numpy as np

This didn't change the behaviour of either testcase. That is, the code in https://github.com/uqfoundation/pathos/issues/145#issue-340060352 still runs slow when global my_array is commented out, and the code in https://github.com/uqfoundation/pathos/issues/145#issuecomment-404501230 still errors out in the same way.

Note that my code itself doesn't do any pickling. The pickling happens behind the scenes in the pathos.multiprocessing library.

gobbedy commented 6 years ago

@mmckerns are you saying that when my_array is global, only the reference is pickled and passed to the workers, whereas when my_array is local the entire array is pickled?

That would make sense to me, because pickling the entire array would be very expensive.

Sorry if I'm way off the mark. I'm fairly new to pickling/serialization, especially in the context of multiprocessing.

mmckerns commented 6 years ago

I'm saying something like that, and there are a lot of variants on what dill will do, depending on the situation. But, yes, the worst case is it pickles the entire array and everything it refers to. If you want to see what's pickled in each case, turn on trace.

I believe it's: dill.detect.trace(True)
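
As an aside, here is a rough sketch (not from this thread) of what that comparison could look like; the names are illustrative, and the trace call shown is the one used later in this thread (newer dill releases expose tracing differently):

import dill
import numpy as np

dill.detect.trace(True)   # print what dill pickles, as suggested above

big_array = np.random.rand(1000000)

def uses_global(i):
    return big_array[0] - i        # refers to a module-level name

def make_local_worker():
    local_array = np.random.rand(1000000)
    def uses_local(i):
        return local_array[0] - i  # closes over a local: the array is held in a closure cell
    return uses_local

# The closure's pickle should be far larger, since the cell drags the whole array in
print(len(dill.dumps(uses_global)))
print(len(dill.dumps(make_local_worker())))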

gobbedy commented 6 years ago

Thanks @mmckerns that makes sense.

I ran dill.detect.trace(True). The output is a little too cryptic for me to fully understand, and it's 250+ lines, so I won't paste it all here.

But I do see numpy-related differences.

For example, in the global version I see this (i.e. no numpy objects in this particular section):

D2: <dict object at 0x2b2dda364948>
# D2
# F1
B3: <built-in function scalar>

Whereas at the same point in the output, the local version shows this (a reference to a numpy object, followed by further differences from the global version):

Ce: <cell at 0x2b71501a42b8: numpy.ndarray object at 0x2b7145869e90>
F2: <function _create_cell at 0x2b714e2f6f28>
# F2
B3: <built-in function _reconstruct>

mmckerns commented 6 years ago

The first of the snippets says:

starting to pickle a type 'D2' dict
finished pickling type 'D2' dict
finished pickling type 'F1' function
starting to pickle a type 'B3' function 'scalar'

and the second:

starting to pickle a cell 'Ce' pointing at a 'numpy.ndarray'
starting to pickle a type 'F2' function '_create_cell'
finished pickling type 'F2' function
starting to pickle a type 'B3' function '_reconstruct'

So, yes, the latter is pickling the numpy array.

gobbedy commented 6 years ago

Hi @mmckerns thanks for your reply.

Any chance you could dig into your hat to find a magic incantation to speed up the code?

As it stands, this issue means that pathos.multiprocessing doesn't support nested worker functions or use in object-oriented code.

To clarify:

1) A nested worker function cannot use its parent's variables (without incurring a dramatic slowdown), rendering it nested in name only.
2) A worker function also cannot use object variables (again without a dramatic slowdown) -- something I didn't show here, but can reproduce with another small testcase (a sketch of that case follows below). This renders it incompatible with object orientation.
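
For concreteness, here is a hypothetical sketch of the class-based variant described in point 2 (it is not the testcase from the thread, and the class and method names are made up). Pickling the bound method serializes self, and therefore self.my_array, for every map:

#!/usr/bin/env python3.6
from time import time
from pathos.multiprocessing import Pool as ThreadPool
import numpy as np

class MyJob:
    def __init__(self):
        self.my_array = np.random.rand(10000000)  # large instance attribute

    def my_worker_function(self, number):
        return self.my_array[0] - number          # some arbitrary work

    def run(self):
        ts = time()
        pool = ThreadPool(4)
        # pickling the bound method drags self (and the large array) along with it
        pool.map(self.my_worker_function, np.arange(10))
        pool.close()
        pool.join()
        print('Finished multiprocess: took %2.4f seconds.' % (time() - ts))

MyJob().run()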

mmckerns commented 6 years ago

Well, I disagree. I think if you have large numpy arrays, then maybe, yes, there's a slowdown in the cases you mention... but if you aren't passing big arrays, then either of those cases works just fine. There are a lot of cases that demonstrate that it is supported, and works well. I'll give you that if you are using a large numpy array as a local variable, things can get slow if you end up having to serialize and pass the numpy array around. I'm more than happy to take suggestions on what to do about that.

gobbedy commented 6 years ago

@mmckerns ah ok, fair enough. I'm coming from scientific computing, where passing large arrays to the workers will often be the modus operandi. So my statement was a bit too broad.

And yes I'd be grateful for suggestions for passing large arrays from parent to worker function. My real code passes large torch arrays rather than numpy, so hopefully I can adapt your suggestions to pytorch.

mmckerns commented 6 years ago

I totally understand the drivers in scientific computing -- it's my home as well. Anyway, I meant that if you have any suggestions on how to avoid serializing and passing large arrays around, I'd be happy to see what I can do to edit dill and/or pathos to better support that. Right now, it's one of those things that I'd like to have better supported, but haven't had time to research a better solution for.

Also, please feel free to close this issue when you feel I've sufficiently answered your query.
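
For readers looking for a stopgap, one possible workaround (not proposed in the thread, and it relies on a fork-based start method such as the Linux default) is to publish the array under a module-level name just before creating the pool, so the forked workers inherit it rather than receiving it through a pickle. This is essentially the global trick from the first comment, adapted so it can be driven from class code; the names here are illustrative:

import numpy as np
from pathos.multiprocessing import Pool as ThreadPool

_shared = None   # module-level slot; forked workers inherit whatever is bound here

def _worker(number):
    return _shared[0] - number   # references the module-level name, so no large pickle

class MyJob:
    def __init__(self):
        self.my_array = np.random.rand(10000000)

    def run(self):
        global _shared
        _shared = self.my_array   # publish the array at module level before forking
        pool = ThreadPool(4)
        results = pool.map(_worker, np.arange(10))
        pool.close()
        pool.join()
        return results

print(len(MyJob().run()))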

gobbedy commented 6 years ago

@mmckerns oh I misread your sentence about suggestions, my bad.

Unfortunately I don't have that kind of expertise. My suggestions would be on the order of "make it work as if the arrays were global," which is not what you're looking for.

As for closing, I'd like others to be warned about this so they can learn from my issue. To make it easier to find, I'll rename the issue to "pathos.multiprocessing does not support nested and hierarchical maps when distributing large arrays".

Sound good to you?

To be clear, I'm not trying to throw shade; I still think this is an awesome library and a major improvement over the core multiprocessing library. I'd just like to help other developers in the future who try what I did.

mmckerns commented 6 years ago

Change the title to whatever you like. Note that large arrays are supported, they are just slow... as you have experienced. Thanks for opening the issue, and inquiring. I'll definitely let you know if a solution to make them faster happens.

gobbedy commented 6 years ago

Thanks @mmckerns, changed it.

With large arrays, multiprocessing is slower than executing the code sequentially, which defeats the purpose of multiprocessing. I continue to hold that this makes them effectively unsupported, but admittedly that's semantics.