Closed: cbehren closed this issue 1 year ago
In short, yes. What you are describing is a common error. If you are using `pathos.pools.ProcessPool`, then the objects are serialized and passed to the worker processes. Serialization doesn't always capture all of the dependencies needed to reconstruct the object in the worker process, so import statements and other tweaks that make the code more self-encapsulated can help an object serialize correctly. You can alter the serialization with `dill.settings['recurse'] = True` and other `dill` serialization settings; `recurse` changes how `dill` serializes an object's dependencies. If you are using a `pathos.pools.ParallelPool`, then the pool extracts the source code and passes it to the new processor, where it is compiled on the worker. You can also use a `pathos.pools.ThreadPool`, which doesn't serialize the code at all and instead uses multiple threads on a single processor.
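As a small stdlib illustration of why self-encapsulation matters (plain `pickle` is stricter than `dill`, but the failure mode is the same kind): a callable built only from importable pieces serializes by reference, while a closure that depends on its enclosing frame does not:

```python
import pickle
from functools import partial
from operator import mul

# A callable built only from importable pieces serializes fine: pickle
# stores references to `partial` and `operator.mul`, both recoverable
# by import on the worker side.
double = partial(mul, 2)
ok = pickle.loads(pickle.dumps(double))
assert ok(3) == 6

# A closure depends on its enclosing frame, which plain pickle cannot
# capture, so serialization fails (dill can often handle this case,
# especially with dill.settings['recurse'] = True).
def make_closure(factor):
    return lambda x: x * factor

try:
    pickle.dumps(make_closure(2))
    failed = False
except Exception:
    failed = True
assert failed  # the closure did not serialize with plain pickle
```

The same principle applies to `ProcessPool` arguments: the more an object can be rebuilt purely from importable names, the more reliably it crosses the process boundary.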
What pool are you using? And what versions of Python, pathos, dill, multiprocess, and ppft?
Thanks for your prompt reply! So, you are suggesting to play around with the dill settings, and/or to change the way stuff is imported, do I understand that correctly? Would it help to import less to the global scope (i.e. encapsulate more consistently)? Do you have a link or something where this kind of behavior is explained in a little bit more detail? It is hard for me to understand how a failure to serialize an object can lead to this...
With regard to your questions: we are using the ProcessPool. I will try the ThreadPool today. Versions: pathos 0.2.9, Python 3.8.12, dill 0.3.5.1, multiprocess 0.70.13, ppft 1.7.6.5.
Funny thing: the code runs well on Windows, with the same environment.yaml.
ProcessPool makes sense; in that case, you need to make code changes to better support serialization.
> So, you are suggesting to play around with the dill settings, and/or to change the way stuff is imported, do I understand that correctly? Would it help to import less to the global scope (i.e. encapsulate more consistently)?
Yes, and a strong yes. If you are working with classes, you can even add a `__reduce__` method.
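A minimal sketch of the `__reduce__` approach, using the standard pickle protocol (the `Converter` class is a hypothetical example, not from the code in question): instead of letting pickle walk an instance's `__dict__` (which here contains an unpicklable lambda), `__reduce__` tells it to rebuild the object from its constructor arguments.

```python
import pickle

class Converter:
    """Holds an unpicklable callable; __reduce__ tells pickle how to rebuild it."""
    def __init__(self, factor):
        self.factor = factor
        # The lambda itself cannot be pickled by the plain protocol.
        self._fn = lambda x: x * factor

    def __call__(self, x):
        return self._fn(x)

    def __reduce__(self):
        # Recreate the object from its constructor arguments instead of
        # trying to serialize the lambda stored in __dict__.
        return (Converter, (self.factor,))

c = pickle.loads(pickle.dumps(Converter(3)))
assert c(4) == 12
```

Without the `__reduce__` method, `pickle.dumps(Converter(3))` would fail on the lambda; with it, the instance round-trips cleanly to a worker process.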
> Funny thing: the code runs well on Windows, with the same environment.yaml.
With multiprocessing, the major OSes (Windows, Mac, Linux) all use different default "contexts" (see `multiprocess.context`), so that's often the result of a "fork" versus a "spawn" context or similar.
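The difference can be inspected directly with the stdlib `multiprocessing` module (which `multiprocess` mirrors): Linux defaults to "fork", where the child inherits a copy of the parent's memory including any module-level state already set up, while Windows and recent macOS default to "spawn", where a fresh interpreter re-imports the main module instead.

```python
import multiprocessing as mp  # the `multiprocess` fork mirrors this API

# Which start methods this platform supports; "fork" is Unix-only,
# "spawn" is available everywhere (and is the only option on Windows).
methods = mp.get_all_start_methods()
assert "spawn" in methods

# A specific context can be requested explicitly, which pins behaviour
# across platforms instead of relying on the per-OS default:
ctx = mp.get_context("spawn")
assert ctx.get_start_method() == "spawn"
```

Pinning the context explicitly is often the quickest way to check whether a platform-dependent bug is really a fork-versus-spawn difference.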
> Do you have a link or something where this kind of behavior is explained in a little bit more detail? It is hard for me to understand how a failure to serialize an object can lead to this...
Here's an old classic: https://matthewrocklin.com/blog/work/2013/12/05/Parallelism-and-Serialization
If you feel your question is answered, you can go ahead and close the issue.
This was very useful. I will go ahead and try out what I learnt. Thank you!
Me again; I found the culprit by bisecting the code, and have a minimal working example. I am not sure whether this should be reported to the astropy folks, or whether it is really a genuine multiprocessing problem.
Run on my Linux machine, the code below will run into

```
astropy.units.core.UnitConversionError: 'yr' (time) and 's' (time) are not convertible
```
The problem is `cds.enable()`: it enables some additional units, but apparently messes with the existing units as well. The code can be found here: astropy doc
What do you think, is that a problem in astropy, or a general problem in multiprocessing?
```python
from pathos.pools import ProcessPool
from astropy import units as u

def run_me(t):
    t_ = t.to(u.s)
    print(t_)

if __name__ == "__main__":
    from astropy.units import cds
    # the culprit is in the next line. Commenting it out will fix the problem
    cds.enable()
    p = ProcessPool(nodes=8)
    t = 1. * u.yr
    args = [t, t, t, t, t, t, t, t]
    result = p.map(run_me, args)
```
I'd report this to astropy.
I have an issue in code that I cannot share, unfortunately. The problem is that when running code with a pool of workers, it throws a strange exception on a line that definitely should work. The line is `t = t.to(astropy.units.s)`, with `t` being an astropy quantity, a time in years in this case. The exception thrown says that astropy cannot convert years to seconds, which of course it can, and does if I run the code serially. I have checked that even when using the Pool, the object looks okay: it has the right type, and I can print its value and unit, so something really weird must be going on here. `t` is part of the argument vector that each process gets. Furthermore, I have figured out that I can get rid of the problem if I remove an import statement from code that is executed before the Pool.
Has anyone seen such behavior before? It looks like some sort of code corruption to me.