nalepae / pandarallel

A simple and efficient tool to parallelize Pandas operations on all available CPUs
https://nalepae.github.io/pandarallel
BSD 3-Clause "New" or "Revised" License
3.65k stars, 211 forks

parallel_apply never starts processing #122

Open pablokvitca opened 3 years ago

pablokvitca commented 3 years ago

ISSUE: Progress on the parallel_apply never starts going up.

I am trying to use parallel_apply to populate new columns on a data frame. This takes about 50 minutes with normal apply, but every column is independent so it should be easily parallelizable.

I am using the following to initialize:

pandarallel.initialize(nb_workers=8, progress_bar=True, use_memory_fs=False)

OUTPUT:

INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.

and this is my parallel_apply call:

allowed_types_list = ['...', '...', ..., '...']
data["allowed"] = data["type"].parallel_apply(lambda x: 1 if x in allowed_types_list else 0)

The shape of my dataframe is: (4717892, 8)
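
For this particular membership test, a vectorized `Series.isin` call may sidestep `apply` altogether. The sketch below uses a small synthetic frame and placeholder type names, both assumptions, since the real list is elided above. It does not address the hang itself.

```python
import pandas as pd

# Synthetic stand-in for the real 4.7M-row frame (an assumption).
data = pd.DataFrame({"type": ["a", "b", "c", "a"]})

# Placeholder values; the real allowed types are not given in the post.
allowed_types_list = ["a", "c"]

# isin checks membership for the whole column at once, without a Python-level
# lambda per row; astype(int) turns True/False into 1/0.
data["allowed"] = data["type"].isin(allowed_types_list).astype(int)

print(data)
```

This only helps when the per-row function is a simple lookup. A function with real per-row work would still need `apply` or `parallel_apply`.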

I tried the same thing with a different function that takes around 5 seconds with apply, and the same thing happens. I tried it on my local computer (running macOS with an i9, using pipe for data transfer) and on Google Colab (where I had 4 cores, using the memory file system for data transfer). Same behavior on both.

Am I missing something?

As a side note, is it possible to get the progress bars working on Google Colab?

BrannonKing commented 3 years ago

For your last question: https://stackoverflow.com/questions/64754814/pandarallel-widgets-dont-work-on-google-colab

MohitJuneja commented 3 years ago

@pablokvitca Could you try initializing without progress_bar? I faced a similar issue and was able to run pandarallel once I dropped the progress bar. If you are using a Jupyter notebook (since you asked about Colab), you can use the %time magic to see how long the process takes.

pandarallel.initialize(nb_workers=8, use_memory_fs=False)
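
Outside a notebook, where `%time` is not available, the same measurement can be taken with `time.perf_counter`. This is a minimal sketch: `timed_apply` is a hypothetical helper, and the synthetic frame and `allowed` set are assumptions standing in for the real data.

```python
import time

import pandas as pd


def timed_apply(series, func, parallel=False):
    """Return (result, elapsed_seconds) for apply or parallel_apply."""
    start = time.perf_counter()
    # parallel_apply only exists after pandarallel.initialize(...) has run.
    result = series.parallel_apply(func) if parallel else series.apply(func)
    return result, time.perf_counter() - start


# Small synthetic frame standing in for the real data (an assumption).
data = pd.DataFrame({"type": ["a", "b", "c"] * 1000})
allowed = {"a", "c"}

# Serial baseline; this path needs only pandas.
flags, seconds = timed_apply(data["type"], lambda x: 1 if x in allowed else 0)
print(f"serial apply: {seconds:.3f}s")
```

To time the parallel path, run `from pandarallel import pandarallel` and the `initialize` call above first, then pass `parallel=True`.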

MSDuncan82 commented 3 years ago

Thanks @MohitJuneja. Setting progress_bar=False fixed the issue for me. This is annoying though because the progress bars are extremely useful. I'm just running this in the terminal. Does anyone know why the progress bars cause the program to hang?

Lolologist commented 3 years ago

I am having the same issue; with progress bars I never actually get the processing to work (checking htop to see CPU usage, there's an immediate spike and then it all drops away). Turning off progress bars (a bummer) does let it work.

slayerjain commented 3 years ago

I'm facing the same problem on a 13-inch M1 MacBook Pro. Turning off the progress bar doesn't help.

RicardoHS commented 2 years ago

Same problem here. Turning off the progress bar works.

It looks like the problem starts with big dataframes. If I use fewer rows, the process (with progress bars) works.
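
Given that observation, one possible way to keep some progress feedback with progress_bar=False is to process the data in chunks and print a line after each one. This is a sketch only: `apply_in_chunks` and its `chunk_rows` default are hypothetical, and it has not been verified against the hang described in this thread.

```python
import pandas as pd


def apply_in_chunks(series, func, chunk_rows=500_000, parallel=True):
    """Apply func to series chunk by chunk, printing coarse progress."""
    total = len(series)
    if total == 0:
        return series.apply(func)
    parts = []
    for start in range(0, total, chunk_rows):
        chunk = series.iloc[start:start + chunk_rows]
        # parallel_apply only exists after pandarallel.initialize(...) has run.
        part = chunk.parallel_apply(func) if parallel else chunk.apply(func)
        parts.append(part)
        done = min(start + chunk_rows, total)
        print(f"{done}/{total} rows done")
    # Each chunk keeps its original index labels, so concat rebuilds the
    # full result in the original order.
    return pd.concat(parts)
```

Assuming `pandarallel.initialize(nb_workers=8, progress_bar=False)` has already run, a call such as `apply_in_chunks(data["type"], func)` would report progress once per chunk. Passing `parallel=False` falls back to plain `apply`, which is handy for checking the function on a small sample first.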

liujiajun commented 2 years ago

Same issue. Any idea why?

mateuspestana commented 2 years ago

Same problem on M1 Pro.

Collonville commented 1 year ago

Same issue using pandarallel==1.6.3 on Jupyter Notebook. progress_bar=False worked for me, but it hurts usability.

yangyxt commented 1 year ago

Same issue here using pandarallel==1.6.1, Python 3.9.5, pandas 1.4.2. In my case I noticed it because the CPU time on the compute node stopped increasing. I had set progress_bar=True, use_memory_fs=False.

shermansiu commented 5 months ago

Possibly related to #75, #88, and #108.

This should be fixed now, but please comment if this issue still applies to you.