hogan-roblox opened this issue 4 months ago (status: Open)
I have some updates on this: it seems that `pandarallel.initialize(progress_bar=True, nb_workers=120)` has to be re-executed between different data frames. Is this expected?
The updated code below somehow resolves the issue for me.
```python
for file_path in file_paths:
    # Re-initializing before every task appears to release the workers' memory
    pandarallel.initialize(progress_bar=True, nb_workers=120)
    df = pd.read_csv(file_path)
    df = pd.DataFrame.from_dict(
        df.sample(frac=1.0).parallel_apply(SOME_FUNCTION, axis=1).to_dict(),
        orient="columns",
    )
```
This issue is no longer a blocker for me, but I would like to leave it open for a while to see whether someone else hits the same issue and whether this is expected behavior.
Could you please attach a sample CSV and the simplest SOME_FUNCTION for which you can reproduce the error? I'm unable to reproduce your memory-usage problem.
Environment: Python 3.10.13, Pandarallel 1.6.5, Pandas 2.2.0
```python
import pandas as pd
import pandarallel

pandarallel.pandarallel.initialize(progress_bar=True, nb_workers=120)

for _ in range(10):
    df = pd.DataFrame({"foo": range(100_000)})
    df = pd.DataFrame.from_dict(
        df.sample(frac=1.0).parallel_apply(lambda x: x + 1, axis=1).to_dict(),
        orient="columns",
    )
```
You mentioned that this issue is no longer a blocker for you, so if you don't reply in a while, this issue should probably be closed.
General

Acknowledgement: my issue is NOT present when using pandas alone (without pandarallel).

Bug description

If I run consecutive data-processing tasks, each with a huge DataFrame processed via `parallel_apply`, their memory footprints somehow accumulate.

Observed behavior
My code logic looks like the code below.
All tasks should have similar memory footprints. However, as the image below shows, memory drops after the first task finishes but soon climbs back up once the second task is loaded.
Expected behavior

Given that the two tasks have similar memory footprints, I would expect the memory pattern to repeat, not accumulate.
Minimal but working code sample to ease bug fix for the pandarallel team

See the pseudocode attached above.
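If re-initializing pandarallel per task turns out not to be enough, a general (hypothetical, not pandarallel-specific) way to guarantee that each task's memory is returned to the OS is to run it in a short-lived child interpreter. The snippet string below is a placeholder for the real `read_csv` + `parallel_apply` pipeline:

```python
import subprocess
import sys

def run_isolated(snippet: str) -> str:
    """Run a Python snippet in a fresh interpreter and return its stdout.

    All memory the task allocates is reclaimed by the OS when the child
    exits, so leaks cannot accumulate across tasks.
    """
    result = subprocess.run(
        [sys.executable, "-c", snippet],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

# Placeholder task; substitute the real read_csv + parallel_apply pipeline.
out = run_isolated("data = list(range(100_000)); print(len(data))")
print(out.strip())  # -> 100000
```

The cost is one interpreter startup per task, which is usually negligible next to reading and processing a huge CSV.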