pygridtools / gridmap

Easily map Python functions onto a cluster using a DRMAA-compatible grid engine like Sun Grid Engine (SGE).
GNU General Public License v3.0

Jobs indicated as running never actually start. #42

Open kbruegge opened 9 years ago

kbruegge commented 9 years ago

Hello!

I really like your project, but I'm having trouble running your example code in examples/manual.py. When I run it, I get this promising output:

=====================================
========   Submit and Wait   ========
=====================================

sending function jobs to cluster.
2015-04-03 16:19:05,742 - gridmap.job - INFO - Setting up JobMonitor on tcp://10.194.168.53:52713

The output of qstat also looks fine:

$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
 423383 0.56000 gridmap_jo <my_user_name>     r     04/03/2015 16:19:10 queue_name@somecluster.com     1
 423384 0.56000 gridmap_jo <my_user_name>     r     04/03/2015 16:19:10 queue_name@somecluster.com     1
 423385 0.56000 gridmap_jo <my_user_name>     r     04/03/2015 16:19:10 queue_name@somecluster.com     1
 423386 0.56000 gridmap_jo <my_user_name>     r     04/03/2015 16:19:10 queue_name@somecluster.com     1

As you can see, the jobs are indicated as (r)unning.

The problem, however, is that the jobs never actually seem to finish. That is odd, since the calculation takes only about 10 seconds when run locally, as expected given that the function sleep_walk(10) is being called.

I then modified your example to skip the sleep function and write out a file called test.txt. But nothing ever happens.

Which brings me to my second question: how do I use the JobMonitor feature? I didn't gather much information from your documentation, I'm afraid.

Any help is much appreciated. Also if there is any way I can contribute please let me know.

Kai

dan-blanchard commented 9 years ago

There is substantial overhead in starting jobs up on SGE (about a minute), so even when it says "running", that may not actually be true. GridMap is intended to be used for tasks that will take at least a few minutes to run, because otherwise the overhead is not in any way worth it. The example is kind of a bad one, because the calculations are so fast, so all you'll notice is the overhead.

If you let it run for like 5 minutes and it still doesn't finish, then there's probably a real issue.

As for JobMonitor, if you want more info you can either set the logging level to DEBUG (which will give you a ton of information), or run gridmap_web, which will give you a web wrapper around JobMonitor. It isn't very feature-rich yet, so I usually just use JobMonitor with debug logging when things aren't working right.
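As a rough illustration of the debug-logging route, here is a minimal sketch using only the standard library. It assumes gridmap's loggers live under the `gridmap` namespace, as the `gridmap.job` logger name in the output quoted earlier in this thread suggests; the helper name `enable_gridmap_debug_logging` is hypothetical and not part of gridmap.

```python
import logging


def enable_gridmap_debug_logging():
    """Turn on DEBUG output for loggers under the 'gridmap' namespace.

    Assumption: gridmap names its loggers 'gridmap.<module>'
    (e.g. 'gridmap.job'), so setting the level on the parent
    logger 'gridmap' covers all of them.
    """
    # Attach a handler to the root logger; the format mirrors the
    # "time - name - level - message" layout of the logs in this thread.
    logging.basicConfig(
        level=logging.DEBUG,
        format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
    )
    logger = logging.getLogger("gridmap")
    logger.setLevel(logging.DEBUG)
    return logger


if __name__ == "__main__":
    enable_gridmap_debug_logging()
    # Child loggers inherit the DEBUG level from 'gridmap'.
    logging.getLogger("gridmap.job").debug("debug logging is on")
```

Calling the helper before submitting jobs should be enough for the submitting process; whether the worker processes on the cluster pick up the same level is not something this sketch addresses.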

If you want to know more about how things work, check out this detailed rundown on the wiki.

I'm well aware that the documentation for GridMap could use some work (see #39), but I actually no longer actively use gridmap because I've changed jobs and now work at a company that doesn't use SGE (or any DRMAA-compatible grid). If you want to help out with documentation or by tackling any of the open issues, please make a PR. Thanks for offering!

kbruegge commented 9 years ago

Thanks for your reply. My jobs just hit the walltime limit, which was set to 2 hours, so there seems to be something wrong :) I also started some jobs with the DEBUG log level. The output looks okay as far as I can tell, though I don't know what to make of job_id: -1. It just repeats the following lines over and over:

    .
    .
    .
    2015-04-03 17:23:26,986 - gridmap.runner - DEBUG - Connecting to JobMonitor (tcp://10.194.168.53:61096)
    2015-04-03 17:23:26,986 - gridmap.runner - DEBUG - Sending message: {'command': 'heart_beat', 'data': {}, 'ip_address': '10.194.168.53', 'host_name': 'the_host_name', 'job_id': -1}
    2015-04-03 17:23:26,987 - gridmap.job - DEBUG - Received message: {'command': 'heart_beat', 'data': {}, 'ip_address': '10.194.168.53', 'host_name': 'the_host_name', 'job_id': -1}
    2015-04-03 17:23:26,987 - gridmap.job - DEBUG - Checking if jobs are alive
    2015-04-03 17:23:26,987 - gridmap.job - DEBUG - Sending reply:
    2015-04-03 17:23:26,987 - gridmap.job - DEBUG - 0 out of 4 jobs completed
    2015-04-03 17:23:26,987 - gridmap.job - DEBUG - Waiting for message
    .
    .
    .

If I can get this to work on our clusters, I'll gladly contribute to the documentation as I go along and figure things out. If this works for what I'm trying to do, then a bunch of people from my group might use it as well.

dan-blanchard commented 9 years ago

The job_id: -1 means those messages are actually from the JobMonitor itself. It's how it knows to check whether the jobs are alive and whether it has heard from them. If you don't see any messages from jobs with IDs other than -1, that suggests you may have some sort of firewall issue preventing the workers from connecting to the JobMonitor.
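One way to test the firewall theory is to try opening a plain TCP connection to the JobMonitor address from a compute node. The sketch below uses only the standard library; the function name `can_connect` is hypothetical. It checks TCP reachability only and does not speak the JobMonitor's own message protocol, so a successful connection rules out a blocked port but nothing more.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened.

    This only tests reachability of the port; it sends no data.
    """
    try:
        # create_connection raises OSError (including timeouts and
        # "connection refused") when the port cannot be reached.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

To use it, take the host and port from the "Setting up JobMonitor on tcp://..." log line while the submitting script is still running, then call `can_connect(host, port)` from a shell on one of the worker nodes. A result of `False` there, combined with `True` on the submit host, would be consistent with a firewall between the two.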

djoffe commented 6 years ago

I am hitting the exact same issue. Was this ever fixed? How would I see debug info from the worker jobs, to find out if these are firewall issues?

Thanks

djoffe commented 6 years ago

Found the issue for my case, leaving some traces in case anyone else comes here:

I am using an SGE grid. Checking the status of the jobs after they finished showed:

$ qacct -j 3555
==============================================================
...
failed       26  : opening input/output file
...

It turned out that the default temp_dir (defined as /scratch/ in gridmap.conf) exists but is inaccessible in my case. This error is not caught by _append_job_to_session in job.py. The default temp_dir can be overridden by passing temp_dir as an argument to process_jobs:

    job_outputs = process_jobs(
        functionJobs,
        max_processes=4,
        temp_dir='/path/to/tmp/',
    )

I am not sure what the intended way of overriding gridmap.conf default values is.
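Since the inaccessible directory was not reported at submission time, a small pre-flight check can catch it earlier. This is a minimal standard-library sketch, not something gridmap provides; the helper name `temp_dir_is_usable` is hypothetical. It only tests the machine it runs on, so the directory would still need to be reachable from the compute nodes as well.

```python
import os
import tempfile


def temp_dir_is_usable(path):
    """Return True if `path` is a directory in which a file can be created."""
    if not os.path.isdir(path):
        return False
    try:
        # Creating (and automatically deleting) a scratch file is a more
        # reliable test than inspecting permission bits.
        with tempfile.TemporaryFile(dir=path):
            return True
    except OSError:
        return False
```

A possible use is to call `temp_dir_is_usable('/path/to/tmp/')` before `process_jobs` and stop with a clear error message if it returns `False`.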

kalkairis commented 5 years ago

I'm running into the same issue as the people above, with the following code:

import gridmap

def foo(x, y):
    return x * y

if __name__ == "__main__":
    jobs = []

    for i in range(10):
        job = gridmap.Job(foo, [i, i + 1])
        jobs.append(job)
    job_outputs = gridmap.process_jobs(jobs, max_processes=4, quiet=False)
    print(job_outputs)

The code never reaches the print(job_outputs) line.