
Preprocessing error: joblib.externals.loky.process_executor.BrokenProcessPool: A process in the executor was terminated abruptly, the pool is not usable anymore #14

Closed: supritashankar closed this issue 5 years ago

supritashankar commented 5 years ago

When I run the preprocessing step, the job always fails at this point (after processing 64,000 questions).

Has anybody else faced this issue?

[Parallel(n_jobs=8)]: Batch computation too slow (2.0377s.) Setting batch_size=2.
exception calling callback for <Future at 0x7f3526427e48 state=finished raised BrokenProcessPool>
Traceback (most recent call last):
  File "/root/anaconda3/envs/local_nmt/lib/python3.5/site-packages/joblib/externals/loky/_base.py", line 322, in _invoke_callbacks
    callback(self)
  File "/root/anaconda3/envs/local_nmt/lib/python3.5/site-packages/joblib/parallel.py", line 375, in __call__
    self.parallel.dispatch_next()
  File "/root/anaconda3/envs/local_nmt/lib/python3.5/site-packages/joblib/parallel.py", line 797, in dispatch_next
    if not self.dispatch_one_batch(self._original_iterator):
  File "/root/anaconda3/envs/local_nmt/lib/python3.5/site-packages/joblib/parallel.py", line 825, in dispatch_one_batch
    self._dispatch(tasks)
  File "/root/anaconda3/envs/local_nmt/lib/python3.5/site-packages/joblib/parallel.py", line 782, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/root/anaconda3/envs/local_nmt/lib/python3.5/site-packages/joblib/_parallel_backends.py", line 506, in apply_async
    future = self._workers.submit(SafeFunction(func))
  File "/root/anaconda3/envs/local_nmt/lib/python3.5/site-packages/joblib/externals/loky/reusable_executor.py", line 151, in submit
    fn, *args, **kwargs)
  File "/root/anaconda3/envs/local_nmt/lib/python3.5/site-packages/joblib/externals/loky/process_executor.py", line 990, in submit
    raise BrokenProcessPool(self._flags.broken)
joblib.externals.loky.process_executor.BrokenProcessPool: A process in the executor was terminated abruptly, the pool is not usable anymore.
supritashankar commented 5 years ago

Any ideas @kimiyoung?

qipeng commented 5 years ago

One possibility is that you're running out of memory at a certain step. Could you try reducing the number of parallel workers (in the Parallel call) and/or monitoring your RAM usage when this happens?
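
For concreteness, a minimal sketch of that change, assuming the Parallel call in the preprocessing script looks like the one quoted in the next comment (the psutil check is an optional extra for watching RAM, not something the repo itself uses):

# Hypothetical tweak: fewer workers means fewer copies of article data
# resident at once, at the cost of wall-clock time.
from joblib import Parallel, delayed
import psutil  # optional dependency, only for the RAM printout below

print('available RAM: %.1f GiB' % (psutil.virtual_memory().available / 2 ** 30))

# _process_article, config and data come from the repo's preprocessing script
outputs = Parallel(n_jobs=4, verbose=10)(
    delayed(_process_article)(article, config) for article in data
)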

supritashankar commented 5 years ago

Hi Peng, thank you for your reply! You are right: when I check top, I see a sharp drop in available memory.

top - 11:40:25 up 2 days, 18:35,  4 users,  load average: 7.09, 7.56, 7.13
Tasks:  32 total,   3 running,  29 sleeping,   0 stopped,   0 zombie
%Cpu(s):  7.6 us,  2.6 sy,  0.0 ni, 89.8 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 33554432 total,       40 free, 33520996 used,    33396 buff/cache
KiB Swap:        0 total,        0 free,        0 used.       40 avail Mem

I tried running it with parallelism 2:

outputs = Parallel(n_jobs=2, verbose=10)(delayed(_process_article)(article, config) for article in data)

but it still fails!

Surprisingly, when I run it on my Mac, it is much slower but it does not get killed and gets through the whole preprocessing. My Mac only has 8 cores, whereas the GPU machine looks like this:

CPU(s):                80
Thread(s) per core:    2
Core(s) per socket:    20
Socket(s):             2
qipeng commented 5 years ago

In that case I would try something larger than 2 but smaller than the number of jobs you originally tried (which failed) -- that should give a good balance between speed and memory usage!
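
One way to pick that middle ground, sketched here as a rough heuristic rather than anything from the repo (psutil and the per-worker footprint estimate are assumptions you would calibrate by watching top while a single worker runs):

import psutil
from joblib import Parallel, delayed

PER_WORKER_GIB = 3.0  # assumed peak footprint of one worker; calibrate on your data
avail_gib = psutil.virtual_memory().available / 2 ** 30
# stay between 2 workers and the original 8, within what RAM can hold
n_jobs = max(2, min(8, int(avail_gib / PER_WORKER_GIB)))

outputs = Parallel(n_jobs=n_jobs, verbose=10)(
    delayed(_process_article)(article, config) for article in data
)

With 32 GiB of RAM on that machine, n_jobs=8 plus the per-worker data is plausibly what tipped it over, so even 4-6 workers may already fit.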

Closing for now since we have identified the issue; feel free to reopen or open a new issue if something else in preprocessing fails for you!