alberduris opened this issue 4 years ago
Hello,
- First, could you tell me if this issue also arises with classical pandas? (If not, we are sure it is exclusively a pandarallel issue.)
- Could you also please try without the progress bar and without using the memory filesystem (pandarallel.initialize(use_memory_fs=False))? I guess it won't work, but maybe it could give me more information about the topic.
Actually, to serialize lambda functions, pandarallel uses dill. Because dill is very slow compared to classical Python serialization, pandarallel uses dill only to serialize the function to apply; the rest (dataframe and all) is serialized with standard Python serialization. But, unfortunately, in your case the function to apply is huge, because it contains model.
Could you also tell me how much RAM you have, and the RAM usage during your pandarallel call?
And if you have time, could you try with only 2 workers? (Or even 1 worker. Of course 1 worker is useless compared to classical pandas, but at least it uses the pandarallel mechanism.)
My guesses are the following:
- pandarallel is working, but the serialization of your model takes a long time, so the function to apply has not yet been fully received by the worker processes (progress bars only really start moving once some data is treated by the workers; during (de)serialization they stay at 0%).
- pandarallel is optimized to consume as little RAM as possible for the dataframe, but the function to apply is copied n times in memory if you have n workers. Usually the function itself is very light.
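For illustration, a minimal sketch of that difference (not pandarallel's actual code, just the underlying reason dill is needed for lambdas; the exact error text depends on the Python version):
import pickle
import dill
square = lambda x: x * x
# The standard library pickle serializes functions by name, so it refuses
# lambdas and other non-importable functions.
try:
    pickle.dumps(square)
except Exception as exc:
    print("pickle failed:", exc)
# dill serializes the lambda's code object by value, so this works, but it
# also drags along everything the function references or closes over.
payload = dill.dumps(square)
print("dill payload:", len(payload), "bytes")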
Hi @nalepae, thank you for your detailed and fast answer.
- First, could you tell me if this issue also arises with classical pandas (if not, we are sure it is exclusively a pandarallel issue)
Yes, if I replace the parallel_apply function with the standard apply function, everything works correctly (but slowly).
- Could you also please try without progress bar and without using memory filesystem ? (pandarallel.initialize(use_memory_fs=False)).
Thanks for the suggestions. Same behaviour.
Could you also tell me how much RAM do you have, the RAM usage during your pandarallel call.
This is the output of free -m during the pandarallel call. I think that free RAM is not the problem.
              total        used        free      shared  buff/cache   available
Mem:         257672       52909        9537          51      195225      203649
Swap:          4095         590        3505
And if you have time, could you try with only 2 workers ? (or even 1 worker. Of course 1 worker is useless compared to classical pandas, but at least it uses pandarallel mechanism).
I have just tried setting nb_workers=1 and nothing changes.
INFO: Pandarallel will run on 1 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.
0.00% | 0 / 6 |
Please, tell me whatever you need and thanks again.
Also ran into this issue; it took forever to debug, as the argument itself was actually part of self...
Still have lots of RAM, so the serialization guess seems to be spot on, considering that on KeyboardInterrupt the traceback mostly goes into dill and pickle.
Here is reproducible code:
import numpy as np
import pandas as pd
from pandarallel import pandarallel

pandarallel.initialize(nb_workers=1, use_memory_fs=False)

class A:
    def __init__(self, var1):
        self.var1 = var1

    def f(self, *args):
        pass

    def run(self):
        df = pd.DataFrame(dict(a=np.random.rand(100)))
        df.apply(lambda x: self.f(x), axis=1)
        print("apply is ok")
        df.parallel_apply(lambda x: self.f(x), axis=1)  # hangs if self.var1 is too big
        print("parallel is ok")

if __name__ == "__main__":
    a_list = [1] * 1024 * 1024 * 1024
    a = A(a_list)
    a.run()
Produces:
INFO: Pandarallel will run on 1 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.
apply is ok
And hangs...
Appreciate your work! @nalepae
Currently fixed by upgrading Python to 3.7.6 from 3.7.4; apparently the problem was with pickle.
For those wondering why a single process runs indefinitely with no results: I was on 3.6.4 and upgrading to 3.7.6 fixed the issue. Still no luck with progress bars, sadly.
I got around this by setting the function parameters to global variables.
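A minimal sketch of the shape of that workaround (names and sizes are illustrative, and whether it fully avoids serialization depends on how dill pickles the function's globals, but this is the pattern described above): the heavy object lives at module level and is only referenced from inside the applied function, instead of being captured in a lambda or passed as an argument.
import numpy as np
import pandas as pd
from pandarallel import pandarallel

pandarallel.initialize(nb_workers=2, use_memory_fs=False)

# The heavy object is a module-level global, not a lambda argument or a
# `self` attribute, so it is not captured in the closure of the function
# handed to pandarallel.
heavy = [0] * (10**7)

def f(row):
    # Only this small function is handed to pandarallel; `heavy` is resolved
    # through the module globals at call time.
    return row["a"] + len(heavy)

df = pd.DataFrame(dict(a=np.random.rand(100)))
out = df.parallel_apply(f, axis=1)
print(out.head())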
For me, it also hangs with tehkirill's example, due to the following line of code: https://github.com/nalepae/pandarallel/blob/261a652cddb219ac353ff803e81646c08b72fc6f/pandarallel/core.py#L366
You can reduce the slow code to the following code snippet:
import dill
a_list = [1]*1024*1024*1024
a_list_str = dill.dumps(a_list)
I'm not sure why the closure needs to use the pickled version of the function and not the original one.
(Okay, I get that you need to pickle the function when passing the function using pipe or the memory file system, but why not just keep the function in-memory and call multiprocessing directly over that? What are the limitations of that approach?)
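For what it's worth, a rough sketch of that in-memory approach (this is not how pandarallel works; it assumes the POSIX fork start method, which is itself one limitation, since it would not help on Windows or with spawn):
import multiprocessing as mp
import numpy as np
import pandas as pd

heavy = [0] * (10**7)                      # stand-in for a large captured object
func = lambda row: row["a"] + len(heavy)   # the standard pickle module cannot serialize this
df = pd.DataFrame(dict(a=np.random.rand(100)))

def run_chunk(bounds):
    # With fork, the worker inherits `df`, `func` and `heavy` from the parent's
    # memory (copy-on-write), so nothing heavy crosses the pipe: only the
    # (start, stop) tuple and the resulting Series are pickled.
    start, stop = bounds
    return df.iloc[start:stop].apply(func, axis=1)

if __name__ == "__main__":
    with mp.get_context("fork").Pool(2) as pool:
        parts = pool.map(run_chunk, [(0, 50), (50, 100)])
    print(pd.concat(parts).shape)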
Problem:
Apply an NLP Deep Learning model for Text Generation over the rows of a Pandas Series. The function call is:
out = text_column.parallel_apply(lambda x: generate_text(args, model, tokenizer, x))
where args and tokenizer are light objects, but model is a heavy object: it stores a PyTorch model which weighs more than 6 GB on disk and takes up ~12 GB of RAM when running.
I have been doing some tests and the problem arises only when I pass the heavy model to the function (even without actually running it inside the function), so it seems the problem is passing an argument that takes up a lot of memory. (Maybe related to the shared-memory strategy for parallel computing.)
After running the parallel_apply, the output I get is:
And it gets stuck there forever. Indeed, there are two processes spawned and both are stopped:
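For anyone hitting the same wall with a large model: one pattern that avoids serializing the model at all is to load it lazily inside each worker instead of capturing it in the lambda. A rough, self-contained sketch, where load_model() is only a stand-in for the real (multi-gigabyte) model loading:
import pandas as pd
from pandarallel import pandarallel

pandarallel.initialize(nb_workers=2, use_memory_fs=False)

_model = None  # one lazily-created copy per worker process

def load_model():
    # Stand-in for the real, expensive model loading.
    return [0] * (10**6)

def generate(text):
    # The model is created inside the worker the first time it is needed, so it
    # never has to be serialized and shipped from the parent process.
    global _model
    if _model is None:
        _model = load_model()
    return f"{text[:10]}... ({len(_model)})"

text_column = pd.Series(["some input text"] * 20)
out = text_column.parallel_apply(generate)
print(out.head())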