nalepae / pandarallel

A simple and efficient tool to parallelize Pandas operations on all available CPUs
https://nalepae.github.io/pandarallel
BSD 3-Clause "New" or "Revised" License

Processes stopped when passing large objects to function to be parallelized #68

Open alberduris opened 4 years ago

alberduris commented 4 years ago

Problem:

Apply an NLP deep learning model for text generation over the rows of a Pandas Series. The function call is:

out = text_column.parallel_apply(lambda x: generate_text(args, model, tokenizer, x))

where args and tokenizer are light objects, but model is a heavy one: a PyTorch model that weighs more than 6 GB on disk and takes up ~12 GB of RAM when loaded.

I have been doing some tests, and the problem arises only when I pass the heavy model to the function (even without actually running it inside the function). So it seems the problem is passing an argument that takes up a lot of memory (maybe related to the shared-memory strategy used for parallel computing).
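
Concretely, the tests looked roughly like this (noop is a hypothetical stub; the other names are from my script):

def noop(args, model, tokenizer, x):
    # Never touches `model`; just echoes the row back.
    return x

# Works: the lambda does not capture the heavy model.
out = text_column.parallel_apply(lambda x: noop(args, None, tokenizer, x))

# Hangs at 0.00%: the lambda captures `model`, even though noop never uses it.
out = text_column.parallel_apply(lambda x: noop(args, model, tokenizer, x))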

After running parallel_apply, the output I get is:

INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use standard multiprocessing data tranfer (pipe) to transfer data between the main process and workers.
   0.00%                                          |        0 /      552 |
   0.00%                                          |        0 /      552 |
   0.00%                                          |        0 /      551 |
   0.00%                                          |        0 /      551 |
   0.00%                                          |        0 /      551 |
   0.00%                                          |        0 /      551 |
   0.00%                                          |        0 /      551 |
   0.00%                                          |        0 /      551 |

And it gets stuck there forever. Indeed, there are two processes spawned and both are stopped:

ablanco+  85448  0.0  4.9 17900532 12936684 pts/27 Sl 14:41   0:00 python3 text_generation.py --input_file input.csv --model_type gpt2  --output_file out.csv --no_cuda --n_cpu 8
ablanco+  85229 21.4 21.6 61774336 57023740 pts/27 Sl 14:39   2:26 python3 text_generation.py --input_file input.csv --model_type gpt2  --output_file out.csv --no_cuda --n_cpu 8
nalepae commented 4 years ago

Hello,

I guess it won't solve the problem, but maybe it will give me more information about the topic.

To serialize lambda functions, pandarallel uses dill. Because dill is very slow compared to the standard Python pickler, pandarallel uses dill only to serialize the function to apply; the rest (the dataframe and so on) is serialized with standard Python pickling.

But unfortunately, in your case the function to apply is huge, because it captures model.
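
Here is a minimal sketch of the effect (a plain list stands in for your model; the exact sizes are only illustrative):

import dill

def make_fn(model):
    # The returned lambda closes over `model`, just like
    # lambda x: generate_text(args, model, tokenizer, x) closes over your model.
    return lambda x: model[0] + x

light = make_fn([0])
print(len(dill.dumps(light)))      # tiny: only the code and a one-element list are serialized

heavy = make_fn([0] * 10_000_000)  # a big list standing in for the ~12 GB model
print(len(dill.dumps(heavy)))      # tens of MB and noticeably slow: the captured list is pickled too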

Could you also tell me how much RAM you have and what the RAM usage is during your pandarallel call? And if you have time, could you try with only 2 workers (or even 1 worker)? Of course 1 worker is useless compared to classical pandas, but at least it exercises the pandarallel mechanism.

My guesses are the following:

alberduris commented 4 years ago

Hi @nalepae, thank you for your detailed and fast answer.

  • First, could you tell me if this issue also arises with classical pandas? (If not, we are sure it is exclusively a pandarallel issue.)

Yes, if I replace parallel_apply with the standard apply function, everything works correctly (but slowly).

  • Could you also please try without the progress bar and without using the memory file system? (pandarallel.initialize(use_memory_fs=False)).

Thanks for the suggestions. Same behaviour.

Could you also tell me how much RAM you have and what the RAM usage is during your pandarallel call?

This is the output of free -m during the pandarallel call. I think that free RAM is not the problem.

              total        used        free      shared  buff/cache   available
Mem:         257672       52909        9537          51      195225      203649
Swap:          4095         590        3505

And if you have time, could you try with only 2 workers (or even 1 worker)? Of course 1 worker is useless compared to classical pandas, but at least it exercises the pandarallel mechanism.

I have just tried setting nb_workers=1 and nothing changes.

INFO: Pandarallel will run on 1 workers.
INFO: Pandarallel will use standard multiprocessing data tranfer (pipe) to transfer data between the main process and workers.
   0.00%                                          |        0 /        6 |

Please tell me whatever else you need, and thanks again.

tehkirill commented 4 years ago

Also ran into this issue; it took forever to debug, as the argument itself was actually part of self... I still have lots of free RAM, so the serialization guess seems to be spot on, considering that on KeyboardInterrupt the traceback mostly goes into dill and pickle.

Here is reproducible code:

import numpy as np
import pandas as pd
from pandarallel import pandarallel

pandarallel.initialize(nb_workers=1, use_memory_fs=False)

class A:
    def __init__(self, var1):
        self.var1 = var1

    def f(self, *args):
        pass

    def run(self):
        df = pd.DataFrame(dict(a=np.random.rand(100)))
        df.apply(lambda x: self.f(x), axis=1)
        print("apply is ok")
        df.parallel_apply(lambda x: self.f(x), axis=1)  # hangs if self.var1 is too big
        print("parallel is ok")

if __name__ == "__main__":
    a_list = [1] * 1024 * 1024 * 1024  # ~1 billion elements, so self.var1 is huge
    a = A(a_list)
    a.run()

Produces:

INFO: Pandarallel will run on 1 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.
apply is ok

And hangs...

Appreciate your work! @nalepae

biebiep commented 4 years ago

Currently fixed by upgrading Python from 3.7.4 to 3.7.6; apparently the problem was with pickle.

Lolologist commented 3 years ago

For those wondering why a single process runs indefinitely with no results: I was on Python 3.6.4, and upgrading to 3.7.6 fixed the issue. Still no luck with progress bars, sadly.

tshu-w commented 2 years ago

I got around this by setting the function parameters to global variables.
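
Roughly like this, reusing the names from the original report (generate_text, args, model, tokenizer; load_everything is a hypothetical setup step):

# Heavy objects live at module level instead of being captured by the lambda,
# so the function pandarallel has to dill-serialize stays tiny.
ARGS, MODEL, TOKENIZER = load_everything()  # hypothetical: however your script builds them

def generate(x):
    # References the module-level globals instead of closed-over arguments.
    return generate_text(ARGS, MODEL, TOKENIZER, x)

out = text_column.parallel_apply(generate)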

shermansiu commented 7 months ago

For me, it also hangs with tehkirill's example, due to the following line of code: https://github.com/nalepae/pandarallel/blob/261a652cddb219ac353ff803e81646c08b72fc6f/pandarallel/core.py#L366

You can reduce the slow code to the following snippet:

import dill

a_list = [1]*1024*1024*1024
a_list_str = dill.dumps(a_list)  # this is the call that takes forever

I'm not sure why the closure needs to use the pickled version of the function and not the original one.

shermansiu commented 7 months ago

(Okay, I get that you need to pickle the function when passing the function using pipe or the memory file system, but why not just keep the function in-memory and call multiprocessing directly over that? What are the limitations of that approach?)
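
For clarity, this is the kind of pattern I mean: a rough sketch assuming a fork start method (Linux/macOS), with a big list standing in for the model, not pandarallel's actual internals.

import multiprocessing as mp
import pandas as pd

heavy_model = [1] * 10_000_000  # inherited by forked workers; never sent over the pipe

def worker(chunk: pd.Series) -> pd.Series:
    # A top-level function is pickled by reference (just its name), not by value,
    # and it can reach `heavy_model` through the inherited module globals.
    return chunk.apply(lambda x: x + len(heavy_model) % 2)

if __name__ == "__main__":
    s = pd.Series(range(1_000))
    chunks = [s[i::4] for i in range(4)]  # naive positional split into 4 chunks
    with mp.get_context("fork").Pool(4) as pool:
        out = pd.concat(pool.map(worker, chunks)).sort_index()
    print(out.head())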