uqfoundation / pathos

parallel graph management and execution in heterogeneous computing
http://pathos.rtfd.io

Support of PyTorch Tensors on CPU #250

Closed (mattiasmar closed 1 year ago)

mattiasmar commented 2 years ago

Hello,

Does Pathos' ProcessPool support PyTorch tensors on CPU? If not, does any other Pathos class support them, or can I combine Pathos with torch.multiprocessing?

In the meantime, I got this error when calling map on a ProcessPool while sending a class that contains PyTorch tensors:

Exception has occurred: PicklingError
Can't pickle <built-in method tanh of type object at 0x7fde60f20da0>: it's not found as torch._VariableFunctionsClass.tanh

During handling of the above exception, another exception occurred:

mmckerns commented 2 years ago

What versions of Python, dill, multiprocess, pathos, PyTorch, and any other relevant dependencies are you using? This may be resolved by updating to the master version of dill, or by using a serialization variant (i.e. changing dill.settings). You could also find out which serialization method torch.multiprocessing uses (or extract the relevant function from the pickle registry) and then register it with dill. It would also help to see the entire traceback, along with a minimal code example that reproduces the error.
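One way to narrow a failure like this down to a minimal example is to try serializing each attribute of the object on its own. The sketch below is not from pathos or dill: find_unpicklable and Holder are hypothetical names, and it defaults to the stdlib pickle.dumps so it runs without extra packages.

```python
import pickle


def find_unpicklable(obj, dumps=pickle.dumps):
    """Return the names of attributes in obj.__dict__ that `dumps` cannot serialize."""
    failures = []
    for name, value in vars(obj).items():
        try:
            dumps(value)
        except Exception:
            # Picklers raise PicklingError, AttributeError or TypeError
            # depending on the object, so catch broadly in this diagnostic.
            failures.append(name)
    return failures


class Holder:
    """Hypothetical stand-in for the class passed to map in the report."""

    def __init__(self):
        self.weights = [0.1, 0.2]
        self.activation = lambda x: x  # stdlib pickle cannot serialize a lambda
```

With dill installed, passing dumps=dill.dumps checks the serializer that pathos relies on. dill should handle the lambda in Holder, so any names still reported for the real object point at the attribute to include in a bug report.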

mattiasmar commented 2 years ago

Versions:

Python        3.9.7
dill          0.3.4
pathos        0.2.8
multiprocess  0.70.12.2
pytorch       1.12.0  py3.9_cpu_0

I don't have a minimal example. However, when I use pool = multiprocessing.get_context('spawn').Pool(args.num_workers), my code runs smoothly, but when I use pool = pathos.pools.ProcessPool(args.num_workers), I get a pickle error:

<very long stack trace> .... _pickle.PicklingError: Can't pickle <built-in method tanh of type object at 0x7f401db9fda0>: it's not found as torch._VariableFunctionsClass.tanh

Also, if I remove the PyTorch model that the subprocesses would have access to, pathos.pools.ProcessPool does not fail.

ddelange commented 2 years ago

On Unix systems, the default start method in multiprocess is fork (including on macOS, in contrast to the stdlib). Keeping that in mind, torch.multiprocessing does not use multiprocess/dill and documents spawn only. I seem to remember it even raising a warning or error about using fork in the past, but I think that was when my subprocesses were trying to talk to CUDA in parallel, something the PyTorch docs explicitly discourage for fork.

You could try multiprocess.set_start_method('spawn'), which will be slower due to pickling but is probably more stable for PyTorch. Starting with https://github.com/uqfoundation/pathos/pull/252, you can also pass the context explicitly, like pathos.ProcessPool(4, context=multiprocess.get_context('spawn')).
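As a rough sketch of the spawn suggestion, the snippet below builds a pool from a spawn context instead of changing the global start method. spawn_context and tanh_all are hypothetical helper names, and the fallback to the stdlib multiprocessing module is an assumption made only so the sketch runs where multiprocess is not installed.

```python
import math

try:
    # multiprocess is the dill-based fork of multiprocessing that pathos builds on.
    import multiprocess as mp
except ImportError:
    # Assumption for this sketch: fall back to the stdlib module,
    # which exposes the same get_context / Pool API.
    import multiprocessing as mp


def spawn_context():
    """Return a context whose workers start via 'spawn', leaving the global default untouched."""
    return mp.get_context("spawn")


def tanh_all(values, workers=2):
    """Map math.tanh over values in a pool of spawned worker processes."""
    with spawn_context().Pool(workers) as pool:
        return pool.map(math.tanh, values)
```

Because spawned workers re-import the main module, a call such as tanh_all([0.0, 0.5, 1.0]) belongs under an if __name__ == "__main__": guard in a script. The same context object is what would be handed to the context argument mentioned above.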

mmckerns commented 1 year ago

I'm closing this as answered. Please reopen if you feel there's more to discuss.