Setting up JobMonitor Hanging

cancan101 commented 9 years ago

=====================================
========   Submit and Wait   ========
=====================================

sending function jobs to cluster

2015-07-21 18:15:13,077 - gridmap.job - INFO - Setting up JobMonitor on tcp://X.X.X.X:37905

Do I need to make sure to open certain ports?

cancan101 commented 9 years ago

With more logging:

sending function jobs to cluster

2015-07-22 07:16:39,356 - gridmap.job - INFO - Setting up JobMonitor on tcp://10.1.3.165:35470
2015-07-22 07:16:39,771 - gridmap.job - DEBUG - Starting local hearbeat
2015-07-22 07:16:39,773 - gridmap.job - DEBUG - Starting ZMQ event loop
2015-07-22 07:16:39,773 - gridmap.job - DEBUG - 0 out of 4 jobs completed
2015-07-22 07:16:39,773 - gridmap.job - DEBUG - Waiting for message
2015-07-22 07:16:39,776 - gridmap.runner - DEBUG - Connecting to JobMonitor (tcp://10.1.3.165:35470)
2015-07-22 07:16:39,777 - gridmap.runner - DEBUG - Sending message: {u'command': u'heart_beat', u'ip_address': '10.1.3.165', u'job_id': -1, u'data': {}, u'host_name': 'ip-10-1-3-165'}
2015-07-22 07:16:39,777 - gridmap.job - DEBUG - Received message: {u'host_name': 'ip-10-1-3-165', u'ip_address': '10.1.3.165', u'command': u'heart_beat', u'job_id': -1, u'data': {}}
2015-07-22 07:16:39,777 - gridmap.job - DEBUG - Checking if jobs are alive
2015-07-22 07:16:39,777 - gridmap.job - DEBUG - Sending reply: 
2015-07-22 07:16:39,778 - gridmap.job - DEBUG - 0 out of 4 jobs completed
2015-07-22 07:16:39,778 - gridmap.job - DEBUG - Waiting for message

dan-blanchard commented 9 years ago

Then it's not hanging; JobMonitor is running and waiting to hear from the workers. Check qstat to make sure the jobs actually started. If they start and then disappear right away, there is probably an unpickling problem with the job, or a firewall issue where the workers can't talk to JobMonitor. You can investigate by logging into those machines and looking in whatever directory you have specified for temp_dir (it defaults to /scratch/).
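
To rule out the firewall case, a quick TCP check run from a worker node against the JobMonitor address in the log can help. This is only a rough sketch: `can_reach` is a hypothetical helper, not part of gridmap, and the host and port in the usage comment are the ones from the log above (the port changes on every run).

```python
import socket


def can_reach(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except (socket.error, socket.timeout):
        # Connection refused, unreachable host, or timed out.
        return False
    sock.close()
    return True


# Example, run on a worker while the submit script is still waiting:
#   can_reach("10.1.3.165", 35470)
```

A False result only says the TCP connection could not be made; it does not say whether a security group, a host firewall, or a wrong address is the cause.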

cancan101 commented 9 years ago

It looks like the workers are trying to load drmaa and failing to do so:

Traceback (most recent call last):
  File "/home/ec2-user/.pyenv/versions/2.7.10/lib/python2.7/runpy.py", line 151, in _run_module_as_main
    mod_name, loader, code, fname = _get_module_details(mod_name)
  File "/home/ec2-user/.pyenv/versions/2.7.10/lib/python2.7/runpy.py", line 101, in _get_module_details
    loader = get_loader(mod_name)
  File "/home/ec2-user/.pyenv/versions/2.7.10/lib/python2.7/pkgutil.py", line 464, in get_loader
    return find_loader(fullname)
  File "/home/ec2-user/.pyenv/versions/2.7.10/lib/python2.7/pkgutil.py", line 474, in find_loader
    for importer in iter_importers(fullname):
  File "/home/ec2-user/.pyenv/versions/2.7.10/lib/python2.7/pkgutil.py", line 430, in iter_importers
    __import__(pkg)
  File "/home/ec2-user/.pyenv/versions/venv2710/lib/python2.7/site-packages/gridmap/__init__.py", line 69, in <module>
    from gridmap.conf import (CHECK_FREQUENCY, CREATE_PLOTS, DEFAULT_QUEUE,
  File "/home/ec2-user/.pyenv/versions/venv2710/lib/python2.7/site-packages/gridmap/conf.py", line 76, in <module>
    import drmaa
  File "/home/ec2-user/.pyenv/versions/venv2710/lib/python2.7/site-packages/drmaa/__init__.py", line 63, in <module>
    from .session import JobInfo, JobTemplate, Session
  File "/home/ec2-user/.pyenv/versions/venv2710/lib/python2.7/site-packages/drmaa/session.py", line 39, in <module>
    from drmaa.helpers import (adapt_rusage, Attribute, attribute_names_iterator,
  File "/home/ec2-user/.pyenv/versions/venv2710/lib/python2.7/site-packages/drmaa/helpers.py", line 36, in <module>
    from drmaa.wrappers import (drmaa_attr_names_t, drmaa_attr_values_t,
  File "/home/ec2-user/.pyenv/versions/venv2710/lib/python2.7/site-packages/drmaa/wrappers.py", line 56, in <module>
    _lib = CDLL(libpath, mode=RTLD_GLOBAL)
  File "/home/ec2-user/.pyenv/versions/2.7.10/lib/python2.7/ctypes/__init__.py", line 365, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: /opt/sge/lib//libdrmaa.so: cannot open shared object file: No such file or directory

which is odd since the env looks like DRMAA_LIBRARY_PATH=/opt/sge/lib/lx-amd64/libdrmaa.so (and that file is available on the workers).
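
The path in the traceback (/opt/sge/lib//libdrmaa.so) is not the one in DRMAA_LIBRARY_PATH, so the environment the job runs in may differ from the one in an interactive shell. One way to check is to run something like the sketch below as a job on a worker. `check_drmaa_library` is a hypothetical helper written for this thread, not something gridmap or drmaa provides.

```python
import ctypes
import os


def check_drmaa_library(env=None):
    """Report what DRMAA_LIBRARY_PATH points to and whether it can be loaded.

    Returns a (path, status) tuple. `env` defaults to os.environ and is only
    a parameter so the function can be exercised with a fake environment.
    """
    env = os.environ if env is None else env
    path = env.get("DRMAA_LIBRARY_PATH")
    if not path:
        return (None, "DRMAA_LIBRARY_PATH is not set")
    if not os.path.isfile(path):
        return (path, "file does not exist")
    try:
        # Same kind of call that fails in drmaa/wrappers.py in the traceback.
        ctypes.CDLL(path, mode=ctypes.RTLD_GLOBAL)
    except OSError as exc:
        return (path, "dlopen failed: %s" % exc)
    return (path, "ok")


if __name__ == "__main__":
    print(check_drmaa_library())
```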

Why do the workers need to load drmaa?

dan-blanchard commented 9 years ago

Weird. I've never seen the drmaa import fail with an OSError before. I've updated the code (5883a778d3f620727b8b9a60966d832e010e6be5) so that won't happen in the future.
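
For anyone hitting the same thing on an older version, the general pattern is to treat the import as optional and catch OSError alongside ImportError. The sketch below illustrates that pattern only; it is not a copy of the commit, and `optional_import` and `HAVE_DRMAA` are made-up names.

```python
import importlib
import logging


def optional_import(module_name):
    """Import module_name, returning None instead of raising if it fails.

    OSError is caught as well as ImportError because a ctypes-based package
    can fail at import time when its shared library cannot be opened, as in
    the traceback above.
    """
    logger = logging.getLogger(__name__)
    try:
        return importlib.import_module(module_name)
    except (ImportError, OSError, RuntimeError) as exc:
        logger.warning("Could not import %s: %s", module_name, exc)
        return None


# Hypothetical usage: workers can carry on without drmaa.
drmaa = optional_import("drmaa")
HAVE_DRMAA = drmaa is not None
```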

As you point out, the workers shouldn't need drmaa.

cancan101 commented 9 years ago

Okay. That explained the hanging.

That said, I am now not seeing any warnings in the log because of this message:

No handlers could be found for logger "gridmap.conf"
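
In Python 2 that message is printed when a logger emits a record and neither it nor any of its ancestors has a handler attached, so the actual warning is dropped. A minimal sketch of a workaround is to configure the root logger before gridmap is imported. `configure_logging` is a hypothetical helper, and where to call it on the worker side depends on how the jobs are launched, which I have not checked.

```python
import logging


def configure_logging(level=logging.INFO):
    """Attach a stream handler to the root logger if it has none yet."""
    logging.basicConfig(
        level=level,
        format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
    )
    return logging.getLogger()


if __name__ == "__main__":
    configure_logging()
    # With a root handler in place, records from "gridmap.conf" are shown.
    logging.getLogger("gridmap.conf").warning("example warning")
```

Note that basicConfig does nothing if the root logger already has a handler, so this will not override an existing logging setup.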

pygridtools / gridmap

Setting up JobMonitor Hanging #53