uqfoundation / pathos

parallel graph management and execution in heterogeneous computing
http://pathos.rtfd.io

Code corruption with Pool? #254

Closed: cbehren closed this 1 year ago

cbehren commented 1 year ago

I have an issue in code that I cannot share, unfortunately. The problem is that when running the code with a pool of workers, it throws a strange exception on a line that definitely should work. The line is `t = t.to(astropy.units.s)`, with `t` being an astropy quantity, a time in years in this case. The exception thrown says that astropy cannot convert years to seconds, which of course it can, and does when I run the code in serial. I have checked that even when using the Pool the object looks okay: it has the right type, and I can print its value and unit, so something really weird must be going on here. `t` is part of the argument vector that each process gets.

Furthermore, I have figured out that I can get rid of the problem if I take out an import statement in code that is executed before the Pool is created.

Has anyone seen such behavior before? It looks like some sort of code corruption to me.

mmckerns commented 1 year ago

In short, yes. What you are describing is a common error. If you are using `pathos.pools.ProcessPool`, then the objects are serialized and passed to the worker processes. Serialization doesn't always capture all of the dependencies needed to reconstruct the object on the worker process, so import statements and other tweaks that make the code more self-encapsulated can help an object serialize correctly. You can also alter the serialization with `dill.settings['recurse'] = True` and other dill serialization settings; `recurse` changes how dill serializes the dependencies.

If you are using a `pathos.pools.ParallelPool`, then the pool extracts the source code and passes it to the new process, and the code is compiled on the worker. You can also use a `pathos.pools.ThreadPool`, which doesn't serialize the code at all, and instead uses multiple threads within a single process.
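For instance, a minimal sketch of turning on recursive serialization before creating the pool (the worker function here is a stand-in, not from this issue):

```python
import dill
dill.settings['recurse'] = True  # recurse into globals the function references

from pathos.pools import ProcessPool

def convert(x):
    return x * 2  # placeholder for the real worker function

if __name__ == "__main__":
    pool = ProcessPool(nodes=4)
    print(pool.map(convert, range(8)))
```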

What pool are you using? What version of python, pathos, dill, multiprocess, and ppft?

cbehren commented 1 year ago

Thanks for your prompt reply! So, you are suggesting to play around with the dill settings, and/or to change the way stuff is imported, do I understand that correctly? Would it help to import less to the global scope (i.e. encapsulate more consistently)? Do you have a link or something where this kind of behavior is explained in a little bit more detail? It is hard for me to understand how a failure to serialize an object can lead to this...

With regard to your questions: we are using the ProcessPool. I will try with the ThreadPool today. Versions:

- pathos 0.2.9
- python 3.8.12
- dill 0.3.5.1
- multiprocess 0.70.13
- ppft 1.7.6.5

Funny thing: the code runs well on Windows, with the same environment.yaml.

mmckerns commented 1 year ago

ProcessPool makes sense; in that case, you need to make code changes to better support serialization.

> So, you are suggesting to play around with the dill settings, and/or to change the way stuff is imported, do I understand that correctly? Would it help to import less to the global scope (i.e. encapsulate more consistently)?

Yes, and a strong yes. If you are working with classes, you can even add a `__reduce__` method, as in the sketch below.
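A minimal sketch of a class that controls its own serialization via `__reduce__` (the class and its field are hypothetical, not from this issue):

```python
import astropy.units as u

class Epoch:
    """Hypothetical wrapper around a time quantity."""
    def __init__(self, years):
        self.t = years * u.yr

    def __reduce__(self):
        # Return (callable, args): the worker rebuilds the object by
        # calling Epoch(value) on a plain float, instead of unpickling
        # astropy unit internals directly.
        return (Epoch, (self.t.value,))
```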

> Funny thing: the code runs well on Windows, with the same environment.yaml.

With multiprocessing, the major OSes (Windows, macOS, Linux) all use different "contexts" (see `multiprocess.context`), so that's often the result of a "fork" versus a "spawn" context or similar.
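A quick way to see which start method is in play (a sketch; `multiprocess` mirrors the stdlib `multiprocessing` API here):

```python
import multiprocess as mp

if __name__ == "__main__":
    # the default differs by OS: 'fork' on Linux, 'spawn' on Windows
    print(mp.get_start_method())
```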

> Do you have a link or something where this kind of behavior is explained in a little bit more detail? It is hard for me to understand how a failure to serialize an object can lead to this...

Here's an old classic: https://matthewrocklin.com/blog/work/2013/12/05/Parallelism-and-Serialization

If you feel your question is answered, you can go ahead and close the issue.

cbehren commented 1 year ago

This was very useful. I will go ahead and try out what I learnt. Thank you!

cbehren commented 1 year ago

Me again; I found the culprit by bisecting the code and have a minimal working example. I am not sure whether this should be reported to the astropy folks, or whether it is really a genuine multiprocessing problem.

Run on my Linux machine, the code below runs into `astropy.units.core.UnitConversionError: 'yr' (time) and 's' (time) are not convertible`.

The problem is `cds.enable()`: it enables some additional units, but apparently messes with the existing units as well. The code can be found here: astropy doc

What do you think, is that a problem in astropy, or a general problem in multiprocessing?


```python
from pathos.pools import ProcessPool
from astropy import units as u

def run_me(t):
    t_ = t.to(u.s)
    print(t_)

if __name__ == "__main__":
    from astropy.units import cds
    # the culprit is in the next line; commenting it out fixes the problem
    cds.enable()
    p = ProcessPool(nodes=8)
    t = 1.*u.yr
    args = [t, t, t, t, t, t, t, t]
    result = p.map(run_me, args)
```
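A possible mitigation, assuming the CDS units are only needed transiently in the parent process: per the astropy documentation, `enable()` can also be used as a context manager, so the registry change is undone before the quantities are created and shipped to the workers. A sketch (not verified against this exact failure):

```python
from pathos.pools import ProcessPool
from astropy import units as u
from astropy.units import cds

def run_me(t):
    print(t.to(u.s))

if __name__ == "__main__":
    # scope the CDS registry change so the default units are restored
    # before any quantities are created and passed to the pool
    with cds.enable():
        pass  # work that genuinely needs CDS units goes here
    p = ProcessPool(nodes=8)
    p.map(run_me, [1.*u.yr] * 8)
```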
mmckerns commented 1 year ago

I'd report this to astropy.