Open kbruegge opened 9 years ago
There is substantial overhead in starting jobs up on SGE (about a minute), so even when it says "running", that may not actually be true. GridMap is intended for tasks that take at least a few minutes to run; otherwise the overhead simply isn't worth it. The example is admittedly a bad one: the calculations are so fast that all you'll notice is the overhead.
If you let it run for like 5 minutes and it still doesn't finish, then there's probably a real issue.
As for JobMonitor, if you want more info you can either set the logging level to DEBUG (which will give you a ton of information), or run `gridmap_web`, which will give you a web wrapper around JobMonitor. It isn't very feature-rich yet, so I usually just use JobMonitor with debug logging when things aren't working right.
If you want to know more about how things work, check out this detailed rundown on the wiki.
I'm well aware that the documentation for GridMap could use some work (see #39), but I actually no longer actively use gridmap because I've changed jobs and now work at a company that doesn't use SGE (or any DRMAA-compatible grid). If you want to help out with documentation or by tackling any of the open issues, please make a PR. Thanks for offering!
Thanks for your reply.
My jobs just hit the walltime limit, which was set at 2 hours. So there does seem to be something wrong :)
I also started some jobs with the DEBUG log level. The output looks okay as far as I can tell, though I don't know what `job_id: -1` means. It just repeats the following lines over and over:

```
...
2015-04-03 17:23:26,986 - gridmap.runner - DEBUG - Connecting to JobMonitor (tcp://10.194.168.53:61096)
2015-04-03 17:23:26,986 - gridmap.runner - DEBUG - Sending message: {'command': 'heart_beat', 'data': {}, 'ip_address': '10.194.168.53', 'host_name': 'the_host_name', 'job_id': -1}
2015-04-03 17:23:26,987 - gridmap.job - DEBUG - Received message: {'command': 'heart_beat', 'data': {}, 'ip_address': '10.194.168.53', 'host_name': 'the_host_name', 'job_id': -1}
2015-04-03 17:23:26,987 - gridmap.job - DEBUG - Checking if jobs are alive
2015-04-03 17:23:26,987 - gridmap.job - DEBUG - Sending reply:
2015-04-03 17:23:26,987 - gridmap.job - DEBUG - 0 out of 4 jobs completed
2015-04-03 17:23:26,987 - gridmap.job - DEBUG - Waiting for message
...
```
If I can get this to work on our clusters, I'll gladly contribute to the documentation as I go along and figure things out. If it works for what I'm trying to do, then a bunch of people from my group might use it as well.
The `job_id: -1` means those messages are actually from the JobMonitor itself; that's how it knows to check whether the jobs are alive and whether it has heard from them. If you don't see any messages from jobs with IDs other than -1, it implies that you may have some sort of firewall issue preventing the workers from connecting to the JobMonitor.
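One quick way to test that theory, independently of gridmap, is to try opening a plain TCP connection from a worker node to the host and port shown in the `Connecting to JobMonitor (tcp://...)` log line (a sketch; the commented-out address below is the one from the logs in this thread, and yours will differ):

```python
import socket

def can_reach(host, port, timeout=5.0):
    """Return True if a plain TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Run this on a worker node, with the address your own DEBUG log reports:
# print(can_reach("10.194.168.53", 61096))
```

If this returns False from the worker but True from the submit host, a firewall or routing rule between the nodes is the likely culprit.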
I am hitting the exact same issue. Was this ever fixed? How would I see debug info from the worker jobs, to find out whether these are firewall issues?
Thanks
Found the issue for my case, leaving some traces in case anyone else comes here:
I am using an SGE grid. Checking the job status after the jobs finished showed:

```
$ qacct -j 3555
==============================================================
...
failed       26  : opening input/output file
...
```
It turned out that the default temp_dir (defined as /scratch/ in gridmap.conf) exists but is inaccessible in my case. This error is not caught by `_append_job_to_session` in job.py.
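A cheap sanity check before submitting jobs is to verify that the temp directory is actually usable from your account (a sketch; `/scratch/` in the comment stands in for whatever your temp_dir is set to):

```python
import os
import tempfile

def temp_dir_usable(path):
    """Return True if we can actually create and remove a file in path."""
    if not os.path.isdir(path):
        return False
    try:
        # Creating a real file catches permission problems that a bare
        # os.path.isdir() check would miss.
        fd, name = tempfile.mkstemp(dir=path)
        os.close(fd)
        os.remove(name)
        return True
    except OSError:
        return False

# e.g. temp_dir_usable("/scratch/") before calling process_jobs
```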
The default temp_dir can be overridden by passing `temp_dir` as an argument to `process_jobs`:

```python
job_outputs = process_jobs(
    functionJobs,
    max_processes=4,
    temp_dir='/path/to/tmp/',
)
```
I am not sure what the intended way of overriding gridmap.conf default values is.
I'm running into the same issue as the people above me, with the following code:

```python
import gridmap

def foo(x, y):
    return x * y

if __name__ == "__main__":
    jobs = []
    for i in range(10):
        job = gridmap.Job(foo, [i, i + 1])
        jobs.append(job)
    job_outputs = gridmap.process_jobs(jobs, max_processes=4, quiet=False)
    print(job_outputs)
```

The code never reaches the `print(job_outputs)` line.
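A useful way to isolate a hang like this is gridmap's local execution mode, which runs the same jobs in the current machine's processes without touching the grid at all (a sketch; it assumes your gridmap version's `process_jobs` accepts the `local=True` keyword, and it falls back to plain Python when gridmap isn't installed):

```python
try:
    import gridmap  # only needed for the actual gridmap run
    HAVE_GRIDMAP = True
except ImportError:
    HAVE_GRIDMAP = False

def foo(x, y):
    return x * y

if __name__ == "__main__":
    if HAVE_GRIDMAP:
        jobs = [gridmap.Job(foo, [i, i + 1]) for i in range(10)]
        # local=True bypasses DRMAA/SGE entirely; if this finishes but the
        # grid run hangs, the problem is in submission or networking,
        # not in the job functions themselves.
        print(gridmap.process_jobs(jobs, local=True, quiet=False))
    else:
        # Without gridmap installed, just show what the jobs would compute.
        print([foo(i, i + 1) for i in range(10)])
```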
Hello!
I really like your project, but I'm having trouble running your example code in `examples\manual.py`. When I run it I get the promising output, and the output of qstat also looks fine: the jobs are indicated as (r)unning.

The problem, however, is that the jobs never actually seem to finish. This is odd, since the calculation takes about 10 seconds when run locally, as expected given that the function `sleep_walk(10)` is being called. I then modified your example to skip the sleep function and instead write out a file called `test.txt`, but nothing ever happens.

Which brings me to my second question: how do I use the JobMonitor feature? I didn't gather much information from your documentation, I'm afraid.
Any help is much appreciated. Also, if there is any way I can contribute, please let me know.
Kai