Open cancan101 opened 9 years ago
With more logging:
sending function jobs to cluster
2015-07-22 07:16:39,356 - gridmap.job - INFO - Setting up JobMonitor on tcp://10.1.3.165:35470
2015-07-22 07:16:39,771 - gridmap.job - DEBUG - Starting local hearbeat
2015-07-22 07:16:39,773 - gridmap.job - DEBUG - Starting ZMQ event loop
2015-07-22 07:16:39,773 - gridmap.job - DEBUG - 0 out of 4 jobs completed
2015-07-22 07:16:39,773 - gridmap.job - DEBUG - Waiting for message
2015-07-22 07:16:39,776 - gridmap.runner - DEBUG - Connecting to JobMonitor (tcp://10.1.3.165:35470)
2015-07-22 07:16:39,777 - gridmap.runner - DEBUG - Sending message: {u'command': u'heart_beat', u'ip_address': '10.1.3.165', u'job_id': -1, u'data': {}, u'host_name': 'ip-10-1-3-165'}
2015-07-22 07:16:39,777 - gridmap.job - DEBUG - Received message: {u'host_name': 'ip-10-1-3-165', u'ip_address': '10.1.3.165', u'command': u'heart_beat', u'job_id': -1, u'data': {}}
2015-07-22 07:16:39,777 - gridmap.job - DEBUG - Checking if jobs are alive
2015-07-22 07:16:39,777 - gridmap.job - DEBUG - Sending reply:
2015-07-22 07:16:39,778 - gridmap.job - DEBUG - 0 out of 4 jobs completed
2015-07-22 07:16:39,778 - gridmap.job - DEBUG - Waiting for message
It's not hanging then. You should check qstat to make sure that the jobs actually got started. If they start and disappear right away, that means there is probably an unpickling problem with the job (or a firewall issue where the workers can't talk to JobMonitor). You can investigate that by logging into those machines and looking in whatever directory you have specified for temp_dir (it defaults to /scratch/).
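To rule out the firewall case, one option is to test from a worker node whether a TCP connection to the JobMonitor address can be opened. The sketch below uses only the standard library. The helper name can_reach is hypothetical, and the host and port should be copied from the "Setting up JobMonitor on tcp://..." log line of the current run, on the assumption that the port may differ between runs.

```python
import socket


def can_reach(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        conn = socket.create_connection((host, port), timeout=timeout)
    except (socket.error, socket.timeout):
        # Refused, unreachable, or timed out: treat all as "cannot reach".
        return False
    conn.close()
    return True
```

Running can_reach("10.1.3.165", 35470) on a worker while the submitting process is still waiting would return False if a firewall or security group blocks that port. A True result only shows that the port is open, not that the job itself is healthy.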
It looks like the workers are trying to load drmaa and failing to do so:
Traceback (most recent call last):
  File "/home/ec2-user/.pyenv/versions/2.7.10/lib/python2.7/runpy.py", line 151, in _run_module_as_main
    mod_name, loader, code, fname = _get_module_details(mod_name)
  File "/home/ec2-user/.pyenv/versions/2.7.10/lib/python2.7/runpy.py", line 101, in _get_module_details
    loader = get_loader(mod_name)
  File "/home/ec2-user/.pyenv/versions/2.7.10/lib/python2.7/pkgutil.py", line 464, in get_loader
    return find_loader(fullname)
  File "/home/ec2-user/.pyenv/versions/2.7.10/lib/python2.7/pkgutil.py", line 474, in find_loader
    for importer in iter_importers(fullname):
  File "/home/ec2-user/.pyenv/versions/2.7.10/lib/python2.7/pkgutil.py", line 430, in iter_importers
    __import__(pkg)
  File "/home/ec2-user/.pyenv/versions/venv2710/lib/python2.7/site-packages/gridmap/__init__.py", line 69, in <module>
    from gridmap.conf import (CHECK_FREQUENCY, CREATE_PLOTS, DEFAULT_QUEUE,
  File "/home/ec2-user/.pyenv/versions/venv2710/lib/python2.7/site-packages/gridmap/conf.py", line 76, in <module>
    import drmaa
  File "/home/ec2-user/.pyenv/versions/venv2710/lib/python2.7/site-packages/drmaa/__init__.py", line 63, in <module>
    from .session import JobInfo, JobTemplate, Session
  File "/home/ec2-user/.pyenv/versions/venv2710/lib/python2.7/site-packages/drmaa/session.py", line 39, in <module>
    from drmaa.helpers import (adapt_rusage, Attribute, attribute_names_iterator,
  File "/home/ec2-user/.pyenv/versions/venv2710/lib/python2.7/site-packages/drmaa/helpers.py", line 36, in <module>
    from drmaa.wrappers import (drmaa_attr_names_t, drmaa_attr_values_t,
  File "/home/ec2-user/.pyenv/versions/venv2710/lib/python2.7/site-packages/drmaa/wrappers.py", line 56, in <module>
    _lib = CDLL(libpath, mode=RTLD_GLOBAL)
  File "/home/ec2-user/.pyenv/versions/2.7.10/lib/python2.7/ctypes/__init__.py", line 365, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: /opt/sge/lib//libdrmaa.so: cannot open shared object file: No such file or directory
which is odd since the env looks like DRMAA_LIBRARY_PATH=/opt/sge/lib/lx-amd64/libdrmaa.so (and that file is available on the workers).
Why do the workers need to load drmaa?
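The path in the OSError (/opt/sge/lib//libdrmaa.so) lacks the lx-amd64 component of the path seen in the interactive environment, so one possible explanation is that the job's environment on the worker differs from the shell where the variable was inspected. A minimal sketch for checking this, assuming it is run inside a submitted job rather than an SSH session; the function name describe_drmaa_library is hypothetical:

```python
import os


def describe_drmaa_library(environ=None):
    """Report the DRMAA_LIBRARY_PATH a process sees and whether it exists."""
    environ = os.environ if environ is None else environ
    path = environ.get("DRMAA_LIBRARY_PATH")
    return {
        "DRMAA_LIBRARY_PATH": path,
        # False when the variable is unset, empty, or points to a missing file.
        "exists": bool(path) and os.path.isfile(path),
    }
```

Printing describe_drmaa_library() from both the submitting shell and a worker job makes any difference between the two environments visible.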
Weird. I've never seen that raise an OSError when failing to import before. I've updated the code (5883a778d3f620727b8b9a60966d832e010e6be5) so that that won't happen in the future.
As you point out, the workers shouldn't need drmaa.
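For illustration, a guarded import along these lines would keep a worker from crashing when the shared library cannot be opened. This is only a sketch of the general pattern, not the contents of the commit referenced above; optional_import and DRMAA_PRESENT are hypothetical names used here for the example.

```python
import importlib


def optional_import(name):
    """Import a module by name, returning None if it cannot be loaded."""
    try:
        return importlib.import_module(name)
    except (ImportError, OSError, RuntimeError):
        # ImportError: the package is not installed.
        # OSError: a shared library could not be opened, as in the traceback.
        # RuntimeError: assumption, some packages raise it when misconfigured.
        return None


drmaa = optional_import("drmaa")
DRMAA_PRESENT = drmaa is not None
```

Code that only submits jobs can then check DRMAA_PRESENT before touching drmaa, while worker-side code never needs the module at all.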
Okay. That explained the hanging.
That being said, now I am not seeing any warnings in the log due to this:
No handlers could be found for logger "gridmap.conf"
Do I need to make sure to open certain ports?
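On the "No handlers could be found" message: Python 2's logging module prints it when a logger emits a record and no handler has been configured anywhere in its hierarchy, and the record itself is then dropped. A minimal sketch of configuring a root handler in the process where the message appears, before gridmap is imported; the format string is an assumption chosen to resemble the log lines earlier in this thread.

```python
import logging

# Attach a stream handler to the root logger so that records from any
# "gridmap.*" logger propagate to it instead of being dropped.
logging.basicConfig(
    level=logging.DEBUG,
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
)

# Loggers created later (for example "gridmap.conf") inherit this setup.
logger = logging.getLogger("gridmap.conf")
logger.warning("handler check")
```

Note that basicConfig does nothing if the root logger already has handlers, so it needs to run before any other logging configuration.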