soravux / scoop

SCOOP (Scalable COncurrent Operations in Python)
https://github.com/soravux/scoop
GNU Lesser General Public License v3.0

scoop locks up if out of memory #55

Open joernhees opened 7 years ago

joernhees commented 7 years ago

I'm running a series of experiments with scoop on a Slurm cluster.

Tonight some of my tasks seem to have run out of memory:

Traceback (most recent call last):
  File "/software/python/2.7.12/lib/python2.7/logging/__init__.py", line 872, in emit
Bad address (bundled/zeromq/src/tcp.cpp:244)
    stream.write(ufs % msg)
  File "/home/hees/graph-pattern-learner/venv/lib/python2.7/codecs.py", line 706, in write
    return self.writer.write(data)
  File "/home/hees/graph-pattern-learner/venv/lib/python2.7/codecs.py", line 370, in write
    self.stream.write(data)
IOError: [Errno 12] Cannot allocate memory
...
Traceback (most recent call last):
  File "/software/python/2.7.12/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/software/python/2.7.12/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/home/hees/graph-pattern-learner/venv/lib/python2.7/site-packages/scoop/bootstrap/__main__.py", line 302, in <module>
    b.main()
  File "/home/hees/graph-pattern-learner/venv/lib/python2.7/site-packages/scoop/bootstrap/__main__.py", line 92, in main
    self.run()
  File "/home/hees/graph-pattern-learner/venv/lib/python2.7/site-packages/scoop/bootstrap/__main__.py", line 290, in run
    futures_startup()
  File "/home/hees/graph-pattern-learner/venv/lib/python2.7/site-packages/scoop/bootstrap/__main__.py", line 271, in futures_startup
    run_name="__main__"
  File "/home/hees/graph-pattern-learner/venv/lib/python2.7/site-packages/scoop/futures.py", line 64, in _startup
    result = _controller.switch(rootFuture, *args, **kargs)
  File "/home/hees/graph-pattern-learner/venv/lib/python2.7/site-packages/scoop/_control.py", line 231, in runController
    future = execQueue.pop()
  File "/home/hees/graph-pattern-learner/venv/lib/python2.7/site-packages/scoop/_types.py", line 320, in pop
    self.updateQueue()
  File "/home/hees/graph-pattern-learner/venv/lib/python2.7/site-packages/scoop/_types.py", line 343, in updateQueue
    for future in self.socket.recvFuture():
  File "/home/hees/graph-pattern-learner/venv/lib/python2.7/site-packages/scoop/_comm/scoopzmq.py", line 279, in recvFuture
    received = self._recv()
  File "/home/hees/graph-pattern-learner/venv/lib/python2.7/site-packages/scoop/_comm/scoopzmq.py", line 188, in _recv
    thisFuture = pickle.loads(msg[1])
IndexError: list index out of range

The main issue is that scoop apparently did not terminate completely: it has remained running in a locked-up state (0 load) for hours.
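
For illustration: the last traceback ends in an IndexError on `msg[1]`, which suggests the received message had fewer than two parts. Below is a minimal sketch of a guard for that case. It is not SCOOP's actual code; `decode_future` and `MalformedMessageError` are hypothetical names, and the two-frame layout (header, pickled payload) is an assumption read off the traceback. It only turns the bare IndexError into an explicit error and does not by itself address the lock-up.

```python
import pickle


class MalformedMessageError(RuntimeError):
    """Hypothetical error for a multipart message with too few frames."""


def decode_future(frames):
    """Unpickle the payload of an assumed [header, payload] message.

    `frames` is a list of byte strings, e.g. what a multipart receive
    would return. A truncated message raises a descriptive error
    instead of a bare IndexError.
    """
    if len(frames) < 2:
        raise MalformedMessageError(
            "expected at least 2 frames, got %d" % len(frames)
        )
    return pickle.loads(frames[1])


if __name__ == "__main__":
    # Well-formed message: header frame plus pickled payload.
    print(decode_future([b"TASK", pickle.dumps({"task": 1})]))

    # Truncated message, as might arrive after an out-of-memory failure.
    try:
        decode_future([b"TASK"])
    except MalformedMessageError as exc:
        print("malformed message: %s" % exc)
```

A caller could catch this error at the top level and exit with a non-zero status, so that the scheduler sees a failed job rather than an idle one.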