nalepae / pandarallel

A simple and efficient tool to parallelize Pandas operations on all available CPUs
https://nalepae.github.io/pandarallel
BSD 3-Clause "New" or "Revised" License

.parallel_apply fails to complete and hangs with all threads at >99% completion when progress_bar = True #88

Open maxshowarth opened 4 years ago

maxshowarth commented 4 years ago

Python 3.7.7 Pandarallel 1.4.8

I am attempting to use .parallel_apply on a fairly large dataframe (13450799 rows x 8 columns) to copy the index value of each row into a new column. Initially, I ran tests on a subset of the original df without setting nb_workers or progress_bar, and the test was successful.

When running the code on the larger dataframe, I wanted to monitor progress and set progress_bar = True. The operation began and progress proceeded as expected until the % complete for each worker was >99.5%. After that, progress stops indefinitely.

To look a little deeper, I monitored the system resources on the remote box using htop. I noticed that once progress seemed to stop, there was no activity on any of the CPUs, and the memory allocated dropped down to a level comparable to when the data frame was loaded, but prior to the operation commencing. Eventually, I interrupted the operation.

Non Functional Code

from pandarallel import pandarallel
pandarallel.initialize(nb_workers = 3, progress_bar=True)

def getIndexName(row):
    return row.name

df['indexName'] = df.parallel_apply(getIndexName, axis=1)

After removing any options when initializing pandarallel, the operation completed successfully.

Functional Code

from pandarallel import pandarallel
pandarallel.initialize()

def getIndexName(row):
    return row.name

df['indexName'] = df.parallel_apply(getIndexName, axis=1)
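As a side note for readers who only need the end result: if the goal is just to copy the index into a column, a row-wise apply (parallel or not) is not required. The sketch below uses a tiny made-up dataframe (the column name `value` and the index labels are hypothetical) to show that assigning `df.index` directly gives the same values as `getIndexName`. It does not address the hang itself; it only sidesteps it for this particular operation.

```python
import pandas as pd

# Hypothetical stand-in for the real 13M-row dataframe.
df = pd.DataFrame({"value": [10, 20, 30]}, index=["a", "b", "c"])

def getIndexName(row):
    return row.name

# Row-wise version, equivalent to the apply call in the report.
via_apply = df.apply(getIndexName, axis=1)

# Vectorized version: no per-row Python call, no worker processes.
df["indexName"] = df.index

# Both approaches produce the same values.
assert via_apply.tolist() == df["indexName"].tolist()
```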

Because this dataframe is quite large and I noticed that most of my memory was being consumed, I decided to limit the number of workers and re-test. I found that when a smaller number of workers than the default is specified, an OSError: [Errno 12] Cannot allocate memory is eventually thrown. The process does not fail, but it does not progress any further either. It exhibits the same behaviour as the test where both nb_workers and progress_bar are set.

Non Functional Code - Setting Just Workers

from pandarallel import pandarallel
pandarallel.initialize(nb_workers = 3)

def getIndexName(row):
    return row.name

df['indexName'] = df.parallel_apply(getIndexName, axis=1)

I re-did these tests looking at memory usage and noticed that whenever nb_workers or progress_bar is set, a massive amount of memory is being used regardless of the number of workers.

Here are some back-of-the-envelope figures for peak sustained memory consumption:

df.info(memory_usage = 'deep') shows about 2GB for the entire dataframe for reference.

Doing this same operation with a simple pandas.apply never consumes more than 20GB of memory.
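For anyone trying to reproduce the dataframe-size figure above programmatically, `df.info(memory_usage='deep')` prints a report, while `DataFrame.memory_usage(deep=True)` returns the per-column byte counts as a Series. A minimal sketch, assuming a small made-up dataframe (the helper name `deep_size_bytes` and the columns `a` and `b` are hypothetical):

```python
import pandas as pd

def deep_size_bytes(df: pd.DataFrame) -> int:
    """Total bytes used by the frame, including the index and the
    actual contents of object (e.g. string) columns."""
    return int(df.memory_usage(index=True, deep=True).sum())

# Small illustrative frame; the real one in this report is about 2GB.
df = pd.DataFrame({"a": range(1000), "b": ["x"] * 1000})

print(f"{deep_size_bytes(df) / 1e9:.6f} GB")
```

This only measures the dataframe itself. The peak memory of the worker processes still has to be observed externally, for example with htop as described above.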

I know that similar issues have been opened (e.g. #75, #77), but taking their suggested approaches (e.g. updating to Python 3.7.7) does not resolve the issue.

To summarize:

  1. When nb_workers and progress_bar are set, operations fail to complete, but do not actually error out, they just hang.
  2. When nb_workers and progress_bar are not set, operation completes as expected.
  3. Memory usage seems unusually high whenever nb_workers and progress_bar are set regardless of the number of workers specified.
  4. When nb_workers is set but progress_bar is not, OSError: [Errno 12] Cannot allocate memory is thrown, but the failure is not fatal and the process hangs.

I am wondering whether there is some kind of memory issue occurring when either nb_workers or progress_bar is set, but the OSError is being suppressed when progress_bar is set.

jparr721 commented 4 years ago

Experienced this today: the progress bar causes all threads to hang and perform no work. It works just fine when the progress bar is disabled.

daxid commented 3 years ago

Same here.

What is weird is that progress bars were working for a couple of runs and then started causing all threads to hang.

Well, I applied some modifications to the code between runs (didn't keep track...), but it still runs OK without progress_bar...

daxid commented 3 years ago

It seems to be a duplicate of #75 ...

I'm using Python 3.8.6 on Manjaro.

shermansiu commented 6 months ago

Is there a minimally working code example, but with a (small-ish) CSV available?

If this issue is a duplicate of #75, then it should be fixed for now.
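One possible starting point for such a reproduction, using synthetic data instead of a CSV so that nothing needs to be attached: the sketch below builds a random 8-column dataframe and runs the same `parallel_apply` call with the settings from the original report. The helper names (`make_frame`, `get_index_name`, `run_repro`), the column names, and the default row count are assumptions for illustration; the row count would likely need to be raised toward the original 13450799 to trigger the hang, and it has not been confirmed here that this script reproduces it.

```python
import numpy as np
import pandas as pd

def make_frame(n_rows: int = 100_000, n_cols: int = 8, seed: int = 0) -> pd.DataFrame:
    """Build a synthetic float dataframe with the same column count as the report."""
    rng = np.random.default_rng(seed)
    data = rng.random((n_rows, n_cols))
    return pd.DataFrame(data, columns=[f"c{i}" for i in range(n_cols)])

def get_index_name(row):
    # Same logic as getIndexName in the original report.
    return row.name

def run_repro(n_rows: int = 100_000, nb_workers: int = 3, progress_bar: bool = True) -> pd.DataFrame:
    # Imported lazily so the helpers above work without pandarallel installed.
    from pandarallel import pandarallel

    pandarallel.initialize(nb_workers=nb_workers, progress_bar=progress_bar)
    df = make_frame(n_rows)
    df["indexName"] = df.parallel_apply(get_index_name, axis=1)
    return df

# Uncomment to run the reproduction attempt:
# run_repro()
```

Calling `run_repro(progress_bar=False)` with the same row count gives a direct comparison against the configuration that reportedly completes.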