nalepae / pandarallel

A simple and efficient tool to parallelize Pandas operations on all available CPUs
https://nalepae.github.io/pandarallel

Memory usage increases across multiple `parallel_apply` #264

Open hogan-roblox opened 4 months ago

hogan-roblox commented 4 months ago

Bug description

If I run consecutive data processing tasks, each running `parallel_apply` over a huge DataFrame, their memory footprints somehow accumulate.

Observed behavior

My code logic looks like the following.

import pandas as pd
from pandarallel import pandarallel

pandarallel.initialize(progress_bar=True, nb_workers=120)

for file_path in file_paths:
    df = pd.read_csv(file_path)
    # Shuffle, apply SOME_FUNCTION row-wise in parallel, and rebuild the DataFrame.
    df = pd.DataFrame.from_dict(
        df.sample(frac=1.0).parallel_apply(SOME_FUNCTION, axis=1).to_dict(),
        orient="columns",
    )

All tasks should have similar memory footprints. However, as the image below shows, memory drops after the first task finishes but soon climbs back up once the second task is loaded.

[screenshot: memory usage over time across the two tasks]

Expected behavior

Given that the two tasks have similar memory footprints, I would expect the memory pattern to repeat rather than accumulate.

Minimal but working code sample to ease bug fix for pandarallel team

See the pseudocode attached above.
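
For reference, a minimal way to quantify the per-task footprint (not part of the original code; it assumes the optional psutil package and the same file_paths / SOME_FUNCTION placeholders as above) is to print the parent process's resident memory after each task:

import os

import pandas as pd
import psutil  # assumption: psutil is installed for memory measurement
from pandarallel import pandarallel

pandarallel.initialize(progress_bar=True, nb_workers=120)
process = psutil.Process(os.getpid())

for file_path in file_paths:
    df = pd.read_csv(file_path)
    df = pd.DataFrame.from_dict(
        df.sample(frac=1.0).parallel_apply(SOME_FUNCTION, axis=1).to_dict(),
        orient="columns",
    )
    # Resident set size of the parent process after each task; if the bug is
    # present, this number grows from one file to the next.
    print(f"{file_path}: RSS = {process.memory_info().rss / 1024 ** 2:.0f} MiB")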

hogan-roblox commented 3 months ago

I have some updates on this: it seems that `pandarallel.initialize(progress_bar=True, nb_workers=120)` has to be re-executed between DataFrames. Is this expected?

The updated code below works around the issue for me.

for file_path in file_paths:
    # Re-initialize pandarallel before each task (workaround for the memory accumulation).
    pandarallel.initialize(progress_bar=True, nb_workers=120)
    df = pd.read_csv(file_path)
    df = pd.DataFrame.from_dict(
        df.sample(frac=1.0).parallel_apply(SOME_FUNCTION, axis=1).to_dict(),
        orient="columns",
    )

[screenshot: memory usage with the updated code]

This issue is no longer a blocker for me, but I would like to leave it open for a while to see whether anyone else hits the same issue and whether this is expected behavior.
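
As a side note for anyone hitting the same pattern: if the retained memory is held by references in the parent process rather than by the worker processes, explicitly dropping the previous result and forcing a garbage collection between tasks might also keep the footprint flat. This is only a sketch, assuming the same imports, initialization, and placeholders as above, and it has not been verified against this issue:

import gc

for file_path in file_paths:
    df = pd.read_csv(file_path)
    result = df.sample(frac=1.0).parallel_apply(SOME_FUNCTION, axis=1)
    # ... use result ...
    # Drop references and force a full collection before loading the next file.
    del df, result
    gc.collect()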

shermansiu commented 2 months ago

Could you please attach a sample CSV and the simplest SOME_FUNCTION for which you can reproduce your error?

I'm unable to reproduce your problems with the memory usage.

Python: 3.10.13
Pandarallel: 1.6.5
Pandas: 2.2.0

import pandas as pd
import pandarallel

pandarallel.pandarallel.initialize(progress_bar=True, nb_workers=120)

for _ in range(10):
    df = pd.DataFrame({"foo": range(100_000)})
    df = pd.DataFrame.from_dict(
        df.sample(frac=1.0).parallel_apply(lambda x: x+1, axis=1).to_dict(),
        orient="columns",
    )

You mentioned that this issue is no longer a blocker for you, so if there is no reply for a while, this issue should probably be closed.