nalepae / pandarallel

A simple and efficient tool to parallelize Pandas operations on all available CPUs
https://nalepae.github.io/pandarallel
BSD 3-Clause "New" or "Revised" License

Pandarallel very slow after loading huge dataframe #256

Open lpuglia opened 7 months ago

lpuglia commented 7 months ago

Bug description

Initialization:

```python
import pandas as pd
from pandarallel import pandarallel

pandarallel.initialize(progress_bar=True)

def custom_function(row):
    return row['column1'] + row['column2']

df2 = pd.read_csv('small.csv')  # a few MB
```

The following code:

```python
df2['column2'] = df2.parallel_apply(custom_function, axis=1)
```

takes about 5 seconds to run. However, if I instead do:

```python
import pandas as pd
from pandarallel import pandarallel

pandarallel.initialize(progress_bar=True)

def custom_function(row):
    return row['column1'] + row['column2']

df1 = pd.read_csv('huge.csv')   # hundreds of GB
df2 = pd.read_csv('small.csv')  # a few MB
```

then the same code:

```python
df2['column2'] = df2.parallel_apply(custom_function, axis=1)
```

takes about 1 minute to run.

Is this by design? Why would loading a huge dataframe impact the runtime of `parallel_apply` on the second, small dataframe? Is there some hidden state that I'm not considering? Is there a way to avoid the issue?
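(A plausible but unconfirmed explanation, not taken from the pandarallel source: on Linux, worker processes are typically started by forking the parent, and forking a process that holds a huge dataframe is more expensive, since the kernel must copy page tables and Python's reference counting quickly breaks copy-on-write sharing. The standard-library sketch below illustrates the general effect with a plain `multiprocessing.Pool` and a synthetic "ballast" list standing in for the huge dataframe; pandarallel and the CSV files are not involved.)

```python
# Sketch (assumption): pool workers are created via fork, so the parent's
# memory footprint can affect worker startup cost. Stdlib only; the
# "ballast" list stands in for a huge in-memory dataframe.
import gc
import multiprocessing as mp
import time

def double(x):
    # trivial worker, so the measured time is dominated by process setup
    return x * 2

def timed_pool_map(values):
    # fork a small pool, run a map, and report the wall-clock time
    start = time.perf_counter()
    with mp.Pool(2) as pool:
        out = pool.map(double, list(values))
    return time.perf_counter() - start, out

if __name__ == "__main__":
    baseline, _ = timed_pool_map(range(8))
    ballast = [b"x" * 1024 for _ in range(200_000)]  # ~200 MB of parent memory
    with_ballast, _ = timed_pool_map(range(8))
    del ballast
    gc.collect()
    print(f"without ballast: {baseline:.3f}s, with ballast: {with_ballast:.3f}s")
```

If fork cost or memory pressure is the culprit, `del df1; gc.collect()` before calling `parallel_apply` on `df2`, or loading `huge.csv` only after the parallel work is done, would be things worth trying.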

nalepae commented 5 months ago

Pandaral·lel is looking for a maintainer! If you are interested, please open a GitHub issue.

shermansiu commented 2 months ago

Can you check your RAM usage with htop? It's possible that your computer slowed down because it was holding a huge dataframe in memory.
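(For reference, a standard-library way to capture the same information programmatically, assuming a Unix system since the `resource` module is unavailable on Windows, is to log the process's peak resident set size around each step:)

```python
# Sketch: report this process's peak resident set size using only the stdlib.
import resource
import sys

def peak_rss_mb():
    # ru_maxrss is reported in kilobytes on Linux but in bytes on macOS
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    if sys.platform == "darwin":
        peak /= 1024
    return peak / 1024  # megabytes

print(f"peak RSS so far: {peak_rss_mb():.1f} MB")
```

If the peak after `pd.read_csv('huge.csv')` is close to physical RAM, the machine is likely swapping, which alone could explain a large slowdown on otherwise identical work.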