inspirehep / beard

Bibliographic Entity Automatic Recognition and Disambiguation

clustering: race condition #37

Closed MSusik closed 9 years ago

MSusik commented 9 years ago

When ScipyHierarchicalClustering is run on many cores on a machine with Intel MKL, it sometimes hits a race condition. All the cores sit idle, but the worker processes keep holding their memory.

After sending ctrl+c I receive:

Traceback (most recent call last):
  File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
  File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 102, in worker
    self.run()
  File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
    task = get()
  File "/home/inspire/.virtualenvs/beard/local/lib/python2.7/site-packages/joblib/pool.py", line 361, in get
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 102, in worker
    racquire()
KeyboardInterrupt
    task = get()
  File "/home/inspire/.virtualenvs/beard/local/lib/python2.7/site-packages/joblib/pool.py", line 361, in get
    racquire()

The issue is also mentioned here: https://github.com/joblib/joblib/issues/138

The trick with setting environment variables is not helping much. Note that the error appears whether or not the machine uses Anaconda.
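
For reference, a minimal sketch of the kind of workaround meant here, assuming it refers to capping the BLAS/OpenMP thread count through environment variables before NumPy/SciPy are imported. The thread does not say which variables were actually tried, so the variable names below are an assumption, and `limit_blas_threads` is a hypothetical helper that is not part of beard or joblib.

```python
import os

# Assumption: these are the usual thread-limiting variables for MKL,
# OpenMP and OpenBLAS; the thread does not name the ones that were tried.
THREAD_LIMIT_VARS = ("MKL_NUM_THREADS", "OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS")


def limit_blas_threads(n_threads=1, environ=None):
    """Set the thread-limiting variables in `environ` (default: os.environ).

    Hypothetical helper for illustration only. Returns the values it set.
    """
    environ = os.environ if environ is None else environ
    for name in THREAD_LIMIT_VARS:
        # Environment values must be strings.
        environ[name] = str(n_threads)
    return {name: environ[name] for name in THREAD_LIMIT_VARS}


if __name__ == "__main__":
    # Call this before importing numpy/scipy: the libraries typically read
    # these variables when they are first loaded.
    limit_blas_threads(1)
```

Exporting the same variables in the shell before starting the script should have the same effect. As noted above, this did not reliably avoid the hang here.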

glouppe commented 9 years ago

Have you stumbled upon the same issue using Python 3 instead?

MSusik commented 9 years ago

Not yet, but I will try running on 10 processes.

MSusik commented 9 years ago

I ran the clustering a few times on Python 3 with n_jobs=-1 and didn't have any difficulties. It seems to be the way to go. All the cores were used. The issue should remain open, IMO.

glouppe commented 9 years ago

Cool, one more reason to switch to Python 3 :)

MSusik commented 9 years ago

Unfortunately, I stumbled upon this today:

Process PoolWorker-8:
Traceback (most recent call last):
  File "/home/msusik/anaconda/envs/py3k/lib/python3.3/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
Traceback (most recent call last):
  File "/home/msusik/anaconda/envs/py3k/lib/python3.3/site-packages/joblib-0.8.4-py3.3.egg/joblib/parallel.py", line 512, in retrieve
    self._output.append(job.get())
  File "/home/msusik/anaconda/envs/py3k/lib/python3.3/multiprocessing/pool.py", line 562, in get
    self.wait(timeout)
  File "/home/msusik/anaconda/envs/py3k/lib/python3.3/multiprocessing/pool.py", line 559, in wait
    self._event.wait(timeout)
  File "/home/msusik/anaconda/envs/py3k/lib/python3.3/threading.py", line 547, in wait
    signaled = self._cond.wait(timeout)
  File "/home/msusik/anaconda/envs/py3k/lib/python3.3/threading.py", line 284, in wait
    waiter.acquire()
KeyboardInterrupt

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "strategy3.py", line 794, in <module>
    n_jobs=args.n_jobs).fit(X, y)
  File "/home/msusik/beard/beard/clustering/blocking.py", line 220, in fit
    return self._fit(X, y, blocks)
  File "/home/msusik/beard/beard/clustering/blocking.py", line 185, in _fit
    b, X_mask, y_mask, clusterer in self._blocks(X, y, blocks)))
  File "/home/msusik/anaconda/envs/py3k/lib/python3.3/site-packages/joblib-0.8.4-py3.3.egg/joblib/parallel.py", line 660, in __call__
    self.retrieve()
  File "/home/msusik/anaconda/envs/py3k/lib/python3.3/site-packages/joblib-0.8.4-py3.3.egg/joblib/parallel.py", line 523, in retrieve
    self._pool.terminate()
  File "/home/msusik/anaconda/envs/py3k/lib/python3.3/site-packages/joblib-0.8.4-py3.3.egg/joblib/pool.py", line 586, in terminate
    super(MemmapingPool, self).terminate()
  File "/home/msusik/anaconda/envs/py3k/lib/python3.3/multiprocessing/pool.py", line 465, in terminate
    self._terminate()
  File "/home/msusik/anaconda/envs/py3k/lib/python3.3/multiprocessing/util.py", line 188, in __call__
    res = self._callback(*self._args, **self._kwargs)
  File "/home/msusik/anaconda/envs/py3k/lib/python3.3/multiprocessing/pool.py", line 513, in _terminate_pool
    p.terminate()
  File "/home/msusik/anaconda/envs/py3k/lib/python3.3/multiprocessing/process.py", line 119, in terminate
    self._popen.terminate()
AttributeError: 'NoneType' object has no attribute 'terminate'

The exception at the bottom was caused by a ^C from me, but the one at the top shows the race condition.

glouppe commented 9 years ago

Fixed by #65