nalepae / pandarallel

A simple and efficient tool to parallelize Pandas operations on all available CPUs
https://nalepae.github.io/pandarallel
BSD 3-Clause "New" or "Revised" License

Pandarallel very slow after loading huge dataframe #256

Open lpuglia opened 7 months ago

lpuglia commented 7 months ago

Bug description

Initialization:

```python
import pandas as pd
from pandarallel import pandarallel

pandarallel.initialize(progress_bar=True)

def custom_function(row):
    return row['column1'] + row['column2']

df2 = pd.read_csv('small.csv')  # a few MB
```

The following code:

```python
df2['column2'] = df2.parallel_apply(custom_function, axis=1)
```

takes about 5 seconds to run. However, if I instead do:

```python
import pandas as pd
from pandarallel import pandarallel

pandarallel.initialize(progress_bar=True)

def custom_function(row):
    return row['column1'] + row['column2']

df1 = pd.read_csv('huge.csv')   # hundreds of GB
df2 = pd.read_csv('small.csv')  # a few MB
```

then the same code:

```python
df2['column2'] = df2.parallel_apply(custom_function, axis=1)
```

takes about 1 minute to run.

Is this by design? Why would loading a huge dataframe impact the runtime of `parallel_apply` on the second, small dataframe? Is there some hidden state that I'm not considering? Is there a way to avoid the issue?
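(A plausible but unconfirmed explanation, not taken from the pandarallel source: on Linux, worker processes are typically started by forking the parent, and forking a process that holds a huge dataframe is more expensive, since the kernel must copy page tables and Python's reference counting quickly breaks copy-on-write sharing. The standard-library sketch below illustrates the general effect with a plain `multiprocessing.Pool` and a synthetic "ballast" list standing in for the huge dataframe; pandarallel and the CSV files are not involved.)

```python
# Sketch (assumption): pool workers are created via fork, so the parent's
# memory footprint can affect worker startup cost. Stdlib only; the
# "ballast" list stands in for a huge in-memory dataframe.
import gc
import multiprocessing as mp
import time

def double(x):
    # trivial worker, so the measured time is dominated by process setup
    return x * 2

def timed_pool_map(values):
    # fork a small pool, run a map, and report the wall-clock time
    start = time.perf_counter()
    with mp.Pool(2) as pool:
        out = pool.map(double, list(values))
    return time.perf_counter() - start, out

if __name__ == "__main__":
    baseline, _ = timed_pool_map(range(8))
    ballast = [b"x" * 1024 for _ in range(200_000)]  # ~200 MB of parent memory
    with_ballast, _ = timed_pool_map(range(8))
    del ballast
    gc.collect()
    print(f"without ballast: {baseline:.3f}s, with ballast: {with_ballast:.3f}s")
```

If fork cost or memory pressure is the culprit, `del df1; gc.collect()` before calling `parallel_apply` on `df2`, or loading `huge.csv` only after the parallel work is done, would be things worth trying.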

nalepae commented 5 months ago

Pandaral·lel is looking for a maintainer! If you are interested, please open a GitHub issue.

shermansiu commented 2 months ago

Can you check your RAM usage with htop? It's possible that your computer slowed down because it was holding a huge dataframe in memory.
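(For reference, a standard-library way to capture the same information programmatically, assuming a Unix system since the `resource` module is unavailable on Windows, is to log the process's peak resident set size around each step:)

```python
# Sketch: report this process's peak resident set size using only the stdlib.
import resource
import sys

def peak_rss_mb():
    # ru_maxrss is reported in kilobytes on Linux but in bytes on macOS
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    if sys.platform == "darwin":
        peak /= 1024
    return peak / 1024  # megabytes

print(f"peak RSS so far: {peak_rss_mb():.1f} MB")
```

If the peak after `pd.read_csv('huge.csv')` is close to physical RAM, the machine is likely swapping, which alone could explain a large slowdown on otherwise identical work.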