hoffmangroup / segway

Application for semi-automated genomic annotation.
http://segway.hoffmanlab.org/
GNU General Public License v2.0

Add Slurm cluster system #4

Closed EricR86 closed 5 years ago

EricR86 commented 10 years ago

Original report (BitBucket issue) by Eric Roberts (Bitbucket: ericr86, GitHub: ericr86).


From Google Code Issue #27

Imported Labels: enhancement, imported, Priority-Medium

From cconn...@fhcrc.org on May 01, 2013 12:51:33

Hi, Would it be possible to add the ability to use SLURM with Segway? SLURM has a DRMAA implementation. I am willing to help.

Sincerely,

Chuck Connolly

Original issue: http://code.google.com/p/segway-genome/issues/detail?id=27

EricR86 commented 10 years ago

Original comment by Eric Roberts (Bitbucket: ericr86, GitHub: ericr86).


From hoffman...@gmail.com on May 05, 2013 19:57:43

Sure! There are basically a couple of things we need to do.

First, what is the value of Session.drmsInfo on SLURM? (On LSF it begins with "Platform LSF", on Grid Engine it is "GE" or "SGE" or "UGE".)

We can switch on that value in segway/cluster/__init__.py to use a different driver file (segway/cluster/slurm.py) to manage the cluster (the current drivers are segway/cluster/sge.py, segway/cluster/lsf.py, and segway/cluster/pbs.py). At a minimum, this file should contain a JobTemplateFactory class, a subclass of segway.cluster.common._JobTemplateFactory, with make_res_req() and make_native_spec() methods.

segway.slurm.JobTemplateFactory.make_res_req(self, mem_usage, tmp_usage) should take two arguments, specifying the amount of memory and temp space required for a task respectively in bytes. It should set self.res_req with some value that will later be used by segway.slurm.JobTemplateFactory.make_native_spec().

segway.slurm.JobTemplateFactory.make_native_spec(self) should take no non-self arguments and use the self.res_req to return a string that will go into the DRMAA job template's nativeSpecification field.
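Under those constraints, a minimal slurm.py factory might look like the sketch below. The base class is stubbed out here to keep the example self-contained, and the --mem/--tmp flags and megabyte rounding are assumptions for illustration, not taken from the Segway source:

```python
# Hypothetical sketch of a segway/cluster/slurm.py driver, assuming the
# _JobTemplateFactory interface described above. In the real driver this
# class would subclass segway.cluster.common._JobTemplateFactory.

class JobTemplateFactory:
    def make_res_req(self, mem_usage, tmp_usage):
        # mem_usage and tmp_usage arrive in bytes; SLURM's --mem and
        # --tmp options take megabytes, so round up with ceiling division.
        mem_mb = -(-mem_usage // 2**20)
        tmp_mb = -(-tmp_usage // 2**20)
        self.res_req = ["--mem=%d" % mem_mb, "--tmp=%d" % tmp_mb]
        return self.res_req

    def make_native_spec(self):
        # The joined string goes into the DRMAA job template's
        # nativeSpecification field.
        return " ".join(self.res_req)
```

For example, requesting 2 GiB of memory and 512 MiB of temp space would yield the native specification "--mem=2048 --tmp=512".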

Do you think you would be able to make a patch for this? I am a little short on cycles right now to actually do this myself (mainly because I don't have access to a SLURM system at the moment to test on).

Summary: Add Slurm cluster system (was: add ability to use slurm)
Status: Accepted

EricR86 commented 10 years ago

Original comment by Eric Roberts (Bitbucket: ericr86, GitHub: ericr86).


From cconn...@fhcrc.org on May 07, 2013 10:10:27

Hi, The Session.drmsInfo value I get is 'SLURM 2.5.4'. I can manage the patch.

Chuck

EricR86 commented 10 years ago

Original comment by Eric Roberts (Bitbucket: ericr86, GitHub: ericr86).


From hoffman...@gmail.com on May 07, 2013 11:42:16

The next release will have a hook for drms_info.startswith("SLURM") in segway.cluster.get_driver_name():

elif drms_info.startswith("SLURM"):
    return "slurm"

So this should work well with a new slurm.py. The attached pbs.py from the next release is the simplest driver and might make a good starting point.
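For illustration, a dispatch function along those lines might look like this sketch, using the drmsInfo prefixes quoted earlier in the thread ("Platform LSF", "GE"/"SGE"/"UGE", "SLURM"); the actual get_driver_name() in Segway may differ:

```python
# Hypothetical sketch of driver-name dispatch on Session.drmsInfo.
# Prefixes are the ones mentioned in this thread; the PBS branch is a
# guess, since the thread does not quote its drmsInfo value.
def get_driver_name(drms_info):
    if drms_info.startswith("Platform LSF"):
        return "lsf"
    elif drms_info.startswith(("GE", "SGE", "UGE")):
        return "sge"
    elif drms_info.startswith("SLURM"):
        return "slurm"
    elif "PBS" in drms_info or "Torque" in drms_info:
        return "pbs"
    raise ValueError("unsupported DRM system: %s" % drms_info)
```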

Attachment: pbs.py

EricR86 commented 10 years ago

Original comment by Eric Roberts (Bitbucket: ericr86, GitHub: ericr86).


From cconn...@fhcrc.org on May 24, 2013 13:06:44

I've attached a slurm.py driver file. It works to process the test.genomedata on a system running slurm 2.5.4. Please let me know if you need me to modify it.

Attachment: slurm.py

EricR86 commented 10 years ago

Original comment by Eric Roberts (Bitbucket: ericr86, GitHub: ericr86).


From hoffman...@gmail.com on May 24, 2013 13:11:30

Thanks so much! I'm glad you were able to make this work. I will add integrating this patch to my to-do list.

EricR86 commented 6 years ago

Original comment by Jay Hesselberth (Bitbucket: jayhesselberth, GitHub: jayhesselberth).


Eric, do you still have the above slurm.py file? It was never added to the segway source. There is code in cluster.__init__.py that returns slurm as a driver, but then it doesn't go anywhere. I have a user interested in using the slurm queuing system.

EricR86 commented 6 years ago

Original comment by Eric Roberts (Bitbucket: ericr86, GitHub: ericr86).


@jayhesselberth I found it! https://storage.googleapis.com/google-code-attachments/segway-genome/issue-27/comment-4/slurm.py

I had to use this as a reference: https://code.google.com/archive/schema

It's fairly barebones though, and I can't verify whether it will work. If anyone manages to get it working, we should probably just add it in.

EricR86 commented 6 years ago

Original comment by Eric Roberts (Bitbucket: ericr86, GitHub: ericr86).


@jayhesselberth I'll actually be addressing this issue soon directly and I was wondering if you remember having any luck with the above slurm.py?

EricR86 commented 5 years ago

Original comment by Kate Cook (Bitbucket: katecook).


I am trying to get segway to work with slurm, but it just silently runs locally. SEGWAY_CLUSTER is set and I've installed drmaa. I copied the slurm.py file referenced above to segway/cluster.

I am not 100% sure that python drmaa is working correctly (advice on how to test this is welcome) but Segway doesn't give me any sort of error (as discussed in #52).

This is probably not relevant, but all of the tests in /test pass except for simplebadinput, which fails with the following error:

Error: DiagGaussian 'mc_asinh_norm_seg0_subseg0_testtrack1' in file '../input.master' line 185 specifies mean name 'mean_seg0_subseg0_testtrack1' that does not exist

EricR86 commented 5 years ago

Original comment by Eric Roberts (Bitbucket: ericr86, GitHub: ericr86).


simplebadinput, as its name implies, is supposed to fail, so all is normal. The test explicitly checks for a failure and returns success (exit 0) if it does.

In terms of debugging job submission and drmaa, I have yet to go into depth implementing and debugging on slurm. Here's the script I've been using to test drmaa on cluster systems; it submits the first command-line argument as a job using DRMAA:

#!/usr/bin/env python
import os
import sys

import drmaa

def main():
    with drmaa.Session() as session:
        print("DRMAA Session started successfully")

        print('Supported contact strings:', session.contact)
        print('Supported DRM systems:', session.drmsInfo)
        print('Supported DRMAA implementations:', session.drmaaImplementation)
        print('Version', session.version)

        job_template = session.createJobTemplate()
        job_template.remoteCommand = os.path.join(
            os.getcwd(),
            sys.argv[1]
        )

        job_template.jobEnvironment = os.environ.copy()

        # There is a better way to do this (probably using jobCategory
        # instead), but this is the way segway does it
        # TODO : Look up better specification
        # job_template.nativeSpecification = "-cwd -N " + \
        # sys.argv[1]

        print("Submitting job:", sys.argv[1], job_template.nativeSpecification)
        job_id = session.runJob(job_template)
        print('Job has been submitted with ID %s' % job_id)

        return_value = session.wait(job_id, drmaa.Session.TIMEOUT_WAIT_FOREVER)
        print('Job: {0} finished (exited normally: {1})'.format(
            return_value.jobId,
            return_value.hasExited)
        )

        session.deleteJobTemplate(job_template)

if __name__ == "__main__":
    main()

As you can see, there are some odd things I still need to add in terms of sensible native slurm specifications for Segway. This script should at least give you some insight into why Segway is not submitting jobs on your behalf.

In the worst case, Segway itself can be submitted as a job and run "locally" (as it is doing right now), ideally with the number of concurrent jobs matched to the cores you've reserved via SEGWAY_NUM_LOCAL_JOBS.
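That worst-case route might look something like the sbatch script below. This is an illustrative sketch only: the resource values and the exact segway invocation are assumptions, though SEGWAY_CLUSTER and SEGWAY_NUM_LOCAL_JOBS are the environment variables discussed in this thread:

```shell
#!/bin/bash
# Illustrative sbatch script: run Segway "locally" inside a single
# SLURM allocation. Resource values and the segway command line are
# examples, not a verified invocation.
#SBATCH --job-name=segway
#SBATCH --cpus-per-task=8
#SBATCH --mem=16G

# Force local execution, with the number of concurrent local jobs
# matched to the cores reserved for this allocation.
export SEGWAY_CLUSTER=local
export SEGWAY_NUM_LOCAL_JOBS="$SLURM_CPUS_PER_TASK"

segway train data.genomedata traindir
```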

EricR86 commented 5 years ago

Original comment by Eric Roberts (Bitbucket: ericr86, GitHub: ericr86).


Fixed in Pull Request #95