maxshowarth opened 4 years ago
Experienced this today: the progress bar causes all threads to hang and perform no work. It works just fine when the progress bar is disabled.
Same here.
What is weird is that progress bars were working for a couple of runs and then started causing all threads to hang.
Well, I applied some modifications to the code between runs (didn't keep track...), but it still runs fine without `progress_bar`...
It seems to be a duplicate of #75 ...
I'm using Python 3.8.6 on Manjaro.
Is there a minimal working code example available, with a (small-ish) CSV?
If this issue is a duplicate of #75, then it should be fixed for now.
Python 3.7.7, Pandarallel 1.4.8
Attempting to use `.parallel_apply` on a fairly large dataframe (13450799 rows x 8 columns) to copy the index value of each row into a new column. Initially, I ran tests on a subset of the original df without setting `nb_workers` or `progress_bar`, and the test was successful.

When running the code on the larger dataframe, I wanted to monitor progress and set `progress_bar = True`. The operation began and progress proceeded as expected until the % complete for each worker was >99.5%. After that, progress stops indefinitely.

To look a little deeper, I monitored the system resources on the remote box using htop. I noticed that once progress seemed to stop, there was no activity on any of the CPUs, and the memory allocated dropped down to a level comparable to when the dataframe was loaded, but prior to the operation commencing. Eventually, I interrupted the operation.
Non Functional Code
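The original code block didn't survive in this copy of the thread; a hedged reconstruction of the failing configuration follows. The worker count, frame contents, and column names are assumptions, and the pandarallel import is guarded so the sketch runs even where the library isn't installed:

```python
import pandas as pd

df = pd.DataFrame({"val": range(100)})  # stand-in for the 13M-row frame

def copy_index(row):
    # The operation from the report: copy each row's index into a new column.
    return row.name

try:
    from pandarallel import pandarallel
    # Both options set -- the combination that hung at >99.5% per worker.
    pandarallel.initialize(nb_workers=8, progress_bar=True)
    df["idx"] = df.parallel_apply(copy_index, axis=1)
except ImportError:
    df["idx"] = df.apply(copy_index, axis=1)  # fallback keeps the sketch runnable
```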
After removing any options when initializing pandarallel, the operation completed successfully.
Functional Code
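As I understand it, the working version differed only in passing no options to `initialize`. Again a reconstruction, guarded the same way so it runs without pandarallel:

```python
import pandas as pd

df = pd.DataFrame({"val": range(100)})  # stand-in frame

def copy_index(row):
    return row.name  # row.name is the row's index label

try:
    from pandarallel import pandarallel
    pandarallel.initialize()  # no nb_workers, no progress_bar -- this completed
    df["idx"] = df.parallel_apply(copy_index, axis=1)
except ImportError:
    df["idx"] = df.apply(copy_index, axis=1)  # plain-pandas fallback
```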
Because this dataframe is quite large, and I noticed that most of my memory was being consumed, I decided to try limiting the number of workers and re-test. What I found was that when specifying a smaller number of workers than the default, eventually an `OSError: [Errno 12] Cannot allocate memory` is thrown. The process does not fail, but it does not progress further. It exhibits the same behaviour as the test where both `nb_workers` and `progress_bar` are set.

Non Functional Code - Setting Just Workers
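A reconstruction of the workers-only variant (the worker count is taken from the memory tests below; the import is guarded so the sketch stays runnable):

```python
import pandas as pd

df = pd.DataFrame({"val": range(100)})  # stand-in frame

def copy_index(row):
    return row.name

try:
    from pandarallel import pandarallel
    # progress_bar left unset -- this variant threw Errno 12 and then hung.
    pandarallel.initialize(nb_workers=3)
    df["idx"] = df.parallel_apply(copy_index, axis=1)
except ImportError:
    df["idx"] = df.apply(copy_index, axis=1)  # plain-pandas fallback
```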
I re-did these tests looking at memory usage and noticed that whenever `nb_workers` or `progress_bar` is set, a massive amount of memory is being used regardless of the number of workers. Here is some back-of-the-envelope peak sustained memory consumption info:

- `nb_workers = NOT SET` & `progress_bar = NOT SET` = 54GB [`nb_workers` default=16]
- `nb_workers = 3` & `progress_bar = NOT SET` = 59GB
- `nb_workers = 1` & `progress_bar = NOT SET` = 53GB
- `nb_workers = 3` & `progress_bar = True` = 60GB
- `nb_workers = 8` & `progress_bar = True` = 60GB

Note the system has 64GB of RAM with ~62GB available at the time each test was run. `df.info(memory_usage='deep')` shows about 2GB for the entire dataframe, for reference. Doing this same operation with a simple `pandas.apply` never consumes more than 20GB of memory.

I know that similar issues have been opened (e.g. #75, #77), but taking their suggested approaches (e.g. updating to Python 3.7.7) does not resolve the issue.
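For reference, measuring the frame's in-memory size and running the plain-pandas baseline looks like this (a small stand-in frame here, not the real 13M-row data):

```python
import pandas as pd

df = pd.DataFrame({"a": range(1000), "b": range(1000)})

# Deep memory usage counts object (string) contents too, not just pointers;
# this is what df.info(memory_usage='deep') reports, as a number.
mem_bytes = int(df.memory_usage(deep=True).sum())
print(f"{mem_bytes / 1e6:.2f} MB")

# The same index-copying operation with plain pandas.apply -- the ~20GB
# baseline in my tests.
df["idx"] = df.apply(lambda row: row.name, axis=1)
```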
To summarize:

- When both `nb_workers` and `progress_bar` are set, operations fail to complete, but do not actually error out; they just hang.
- When `nb_workers` and `progress_bar` are not set, the operation completes as expected.
- Massive amounts of memory are consumed whenever `nb_workers` or `progress_bar` is set, regardless of the number of workers specified.
- When `nb_workers` is set but `progress_bar` is not, `OSError: [Errno 12] Cannot allocate memory` is thrown, but the failure is not fatal and the process hangs.

I am wondering whether there is some kind of memory issue occurring when either `nb_workers` or `progress_bar` is set, but the `OSError` is being suppressed when `progress_bar` is set.
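A minimal reproducer along the lines asked for above might look like the following. This is a sketch, not the exact failing code: the CSV is generated inline rather than loaded from disk, the settings are ones that hang for me, and the pandarallel import is guarded so the script still runs where the library isn't installed:

```python
import io

import pandas as pd

# Small-ish CSV stand-in, generated inline instead of shipped as a file.
csv_text = "a,b\n" + "\n".join(f"{i},{i * 2}" for i in range(1000))
df = pd.read_csv(io.StringIO(csv_text))

def copy_index(row):
    # The operation from the report: copy each row's index into a new column.
    return row.name

try:
    from pandarallel import pandarallel
    # Settings that reproduce the hang on my machine.
    pandarallel.initialize(nb_workers=3, progress_bar=True)
    df["idx"] = df.parallel_apply(copy_index, axis=1)
except ImportError:
    df["idx"] = df.apply(copy_index, axis=1)  # plain-pandas fallback
```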