nalepae / pandarallel

A simple and efficient tool to parallelize Pandas operations on all available CPUs
https://nalepae.github.io/pandarallel

Memory usage increases across multiple `parallel_apply` #264

Open hogan-roblox opened 4 months ago

hogan-roblox commented 4 months ago

Bug description

If I run consecutive data processing tasks, each running `parallel_apply` over a huge DataFrame, their memory footprints somehow accumulate.

Observed behavior

My code logic looks like the following.

import pandas as pd
from pandarallel import pandarallel

pandarallel.initialize(progress_bar=True, nb_workers=120)

for file_path in file_paths:
    df = pd.read_csv(file_path)
    # Shuffle, apply SOME_FUNCTION row-wise in parallel, and rebuild the DataFrame.
    df = pd.DataFrame.from_dict(
        df.sample(frac=1.0).parallel_apply(SOME_FUNCTION, axis=1).to_dict(),
        orient="columns",
    )

All tasks should have similar memory footprints. However, as the image below shows, memory drops after the first task finishes but soon climbs back up once the second task is loaded.

[screenshot: memory usage over time across the two tasks]

Expected behavior

Given that the two tasks have similar memory footprints, I would expect the memory pattern to repeat rather than accumulate.

Minimal but working code sample to ease bug fix for pandarallel team

See the pseudocode attached above.
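
For reference, a minimal way to quantify the per-task footprint (not part of the original code; it assumes the optional psutil package and the same file_paths / SOME_FUNCTION placeholders as above) is to print the parent process's resident memory after each task:

import os

import pandas as pd
import psutil  # assumption: psutil is installed for memory measurement
from pandarallel import pandarallel

pandarallel.initialize(progress_bar=True, nb_workers=120)
process = psutil.Process(os.getpid())

for file_path in file_paths:
    df = pd.read_csv(file_path)
    df = pd.DataFrame.from_dict(
        df.sample(frac=1.0).parallel_apply(SOME_FUNCTION, axis=1).to_dict(),
        orient="columns",
    )
    # Resident set size of the parent process after each task; if the bug is
    # present, this number grows from one file to the next.
    print(f"{file_path}: RSS = {process.memory_info().rss / 1024 ** 2:.0f} MiB")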

hogan-roblox commented 3 months ago

I have some updates on this: it seems that `pandarallel.initialize(progress_bar=True, nb_workers=120)` has to be re-executed between DataFrames. Is this expected?

The updated code below works around the issue for me.

for file_path in file_paths:
    # Re-initialize pandarallel before each task (workaround for the memory accumulation).
    pandarallel.initialize(progress_bar=True, nb_workers=120)
    df = pd.read_csv(file_path)
    df = pd.DataFrame.from_dict(
        df.sample(frac=1.0).parallel_apply(SOME_FUNCTION, axis=1).to_dict(),
        orient="columns",
    )

[screenshot: memory usage with the updated code]

This issue is no longer a blocker for me, but I would like to leave it open for a while to see whether anyone else hits the same issue and whether this is expected behavior.
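
As a side note for anyone hitting the same pattern: if the retained memory is held by references in the parent process rather than by the worker processes, explicitly dropping the previous result and forcing a garbage collection between tasks might also keep the footprint flat. This is only a sketch, assuming the same imports, initialization, and placeholders as above, and it has not been verified against this issue:

import gc

for file_path in file_paths:
    df = pd.read_csv(file_path)
    result = df.sample(frac=1.0).parallel_apply(SOME_FUNCTION, axis=1)
    # ... use result ...
    # Drop references and force a full collection before loading the next file.
    del df, result
    gc.collect()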

shermansiu commented 2 months ago

Could you please attach a sample CSV and the simplest SOME_FUNCTION for which you can reproduce your error?

I'm unable to reproduce your problems with the memory usage.

Python: 3.10.13
Pandarallel: 1.6.5
Pandas: 2.2.0

import pandas as pd
import pandarallel

pandarallel.pandarallel.initialize(progress_bar=True, nb_workers=120)

for _ in range(10):
    df = pd.DataFrame({"foo": range(100_000)})
    df = pd.DataFrame.from_dict(
        df.sample(frac=1.0).parallel_apply(lambda x: x+1, axis=1).to_dict(),
        orient="columns",
    )

You mentioned that this issue is no longer a blocker for you, so if there is no reply for a while, this issue should probably be closed.