thasso / pyjip

JIP Pipeline System
http://pyjip.readthedocs.org

Submitting jobs on a remote cluster #55

Open jdidion opened 10 years ago

jdidion commented 10 years ago

I apologize for asking a question here, but there is no dedicated support forum. The documentation is not clear on how to actually get JIP to submit jobs on a remote cluster. In my environment, and I think that of many others, we ssh into a login node to submit jobs on the cluster. I did not see any option to configure the remote host name or login password in the cluster configuration. Thanks

thasso commented 10 years ago

Hey,

there is currently no way to talk to the remote side directly. The way we use the system on a "remote" cluster is to first ssh to the login node and then install, configure and use jip there. I was in fact thinking about adding a thin "remote" layer on top so you could execute commands directly from your local machine, and we might still implement something like this, but nonetheless jip always needs to be installed and available on the remote cluster/login node. The reason for this is that the jobs interact directly with the job database and do not go through a server. This avoids starting a server on your cluster, and you don't need to enable connections to and from your compute cluster.

I hope it helps!

If you find a need for a thin client that you can use directly from your workstation, feel free to explain your use case in a bit more detail here and we can see if we can do something about it.

jdidion commented 10 years ago

Thanks for the response! The use case I was thinking of is that sometimes I want to run jobs on the cluster and sometimes I want to run them on my local machine. It would be nice to manage both from my local desktop. If you have all of the commands going over ssh, then I don’t think you need to have anything installed on the remote machine except the scripts that actually get executed (and the data, of course, but it’s beyond the scope of JIP to manage that). The scripts could just be copied to the remote server via scp, or kept in sync via rsync. I’d be happy to contribute to development to make this a part of JIP, but first I would want to work with you on a plan for the best way to implement it.

Thanks,

John Didion, PhD Postdoctoral Fellow Collins Lab, NHGRI

thasso commented 10 years ago

Hi John,

okay, I think I understand the use case, and I think it would be a good idea to support a workflow where you can basically migrate jobs between jip instances, say run first locally, then remote. Note that currently there is no easy way to avoid the installation of jip on the remote side. It is needed not only for the actual job execution but also for the creation of the pipeline graph, which imo should be created from within the execution environment (the remote side). Keeping it like this means we can rely on the pipeline graph generation and its checks to ensure validity of the final graph. In addition, this step ensures that mandatory input files exist etc. Imo it should also be no big problem to install jip on the remote cluster, as the installation and execution process happens completely in user space. Please note also that there is no need to start any server on the remote side if you have access to a job scheduler like SGE, slurm, or PBS. Only the jip executable (+ dependencies) needs to be available.

With this in mind, what I see right now is essentially an SSH wrapper that delegates command executions to the remote side and gets back some data that can be processed locally.

What I am not sure about right now is how to deal with input files and whether it should be part of jip to take care of syncing the files automatically. I would suggest starting with the SSH wrapper only. File syncing is for sure nice to have, but as you already mentioned, you could also simply use something like rsync.
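A minimal sketch of what such an SSH wrapper could look like, assuming a plain `ssh` client on the local machine and a `jip` executable on the remote login node's PATH. The function names `build_remote_command` and `run_remote` and the host name in the usage note are hypothetical; nothing like this exists in JIP as discussed in this thread.

```python
import shlex
import subprocess


def build_remote_command(host, jip_args, jip_executable="jip"):
    """Build the argv that runs a jip command on a remote login node via ssh."""
    # Quote every argument so the remote shell sees each one as a single word.
    remote = " ".join(shlex.quote(arg) for arg in [jip_executable, *jip_args])
    return ["ssh", host, remote]


def run_remote(host, jip_args, timeout=60):
    """Run the remote command and return its stdout for local processing."""
    result = subprocess.run(
        build_remote_command(host, jip_args),
        capture_output=True,  # collect stdout/stderr instead of printing them
        text=True,            # decode bytes to str
        timeout=timeout,
        check=True,           # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout
```

For example, `run_remote("login.example.org", ["jobs"])` would run `jip jobs` on the login node and hand the text output back to the local side, which is the "get back some data" part of the idea. Authentication is left to the user's existing ssh configuration.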

Best, -Thasso

jdidion commented 10 years ago

Sure, I think that workflow will definitely appeal to some users. For me, I’m more interested in having a single interface to execute jobs, whether they be local or remote, but the jobs I run locally and remotely are different (the remote jobs are pipelines for processing sequencing data, while the local jobs are typically quick analyses on the processed data).

The ssh wrapper sounds like the best solution. As far as input files, I think JIP should handle syncing any JIP scripts but (at least for now) it’s up to the user to make sure the data files are in place before executing the job. To me it doesn’t seem ideal to have JIP trying to manage keeping dozens or hundreds of huge bam files in sync.
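A rough sketch of the script-only syncing described above, assuming rsync is installed on both sides and that JIP scripts can be recognised by a `.jip` file extension (an assumption made for this example). The helper name `build_sync_command` is hypothetical.

```python
import subprocess


def build_sync_command(local_dir, host, remote_dir, pattern="*.jip"):
    """Build an rsync argv that copies only script files to the remote host."""
    # A trailing slash on the source copies the directory's contents
    # rather than the directory itself.
    source = local_dir.rstrip("/") + "/"
    return [
        "rsync", "-az",
        "--include=*/",          # descend into subdirectories
        f"--include={pattern}",  # copy files matching the script pattern
        "--exclude=*",           # skip everything else, e.g. large data files
        source,
        f"{host}:{remote_dir}",
    ]


def sync_scripts(local_dir, host, remote_dir):
    """Run the sync; raises CalledProcessError if rsync fails."""
    subprocess.run(build_sync_command(local_dir, host, remote_dir), check=True)
```

The include/exclude filters are what keep data such as BAM files out of the transfer, so placing those remains the user's responsibility.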

As far as implementation, can you briefly describe how you would go about doing it? You’re much more familiar with the structure of the code and I want to implement this in the way that makes the most sense.

Thanks

John Didion, PhD Postdoctoral Fellow Collins Lab, NHGRI

jdidion commented 10 years ago

I’ve thought through this some more, and I think I have a good plan. The model I am working with is that there will be two separate JIP databases, one on the local machine and one on the remote cluster. For a job submitted from the local machine to the remote cluster (via the new ‘remote’ command described below), a placeholder will be inserted into the local database marking that job as a remote job. Subsequent calls to ‘jip jobs’ will fetch job information from the remote machine and update the placeholder records in the local database. That way, a user can track all job information in his local database even if some jobs are submitted locally and some are submitted remotely.
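To make the placeholder model concrete, here is a small sketch using Python's built-in `sqlite3`. The table layout, the column names `remote_host` and `remote_id`, and the state strings are assumptions made for illustration only; they are not JIP's actual schema.

```python
import sqlite3


def open_local_db(path=":memory:"):
    """Open the local job database and create the hypothetical jobs table."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        " id INTEGER PRIMARY KEY,"
        " name TEXT NOT NULL,"
        " state TEXT NOT NULL,"
        " remote_host TEXT,"   # NULL for jobs that run locally
        " remote_id INTEGER)"  # id of the job in the remote database
    )
    return db


def add_placeholder(db, name, host, remote_id):
    """Insert a placeholder row marking a job as submitted to a remote host."""
    cur = db.execute(
        "INSERT INTO jobs (name, state, remote_host, remote_id)"
        " VALUES (?, 'Queued', ?, ?)",
        (name, host, remote_id),
    )
    return cur.lastrowid


def refresh_placeholders(db, host, remote_states):
    """Update placeholders from a {remote job id: state} mapping for one host."""
    for remote_id, state in remote_states.items():
        db.execute(
            "UPDATE jobs SET state = ? WHERE remote_host = ? AND remote_id = ?",
            (state, host, remote_id),
        )
```

In this sketch the `remote_states` mapping would come from querying the remote side each time the local job list is displayed, so local and remote jobs can be listed from the same table.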

New commands:

I think these changes should be fairly transparent to current users, i.e. they wouldn’t affect how they currently do things. There would have to be some kind of database migration step to upgrade the database schema for current users who want to take advantage of the new workflow, but I think SQLAlchemy has facilities for that.

Please let me know if you see any problems with this approach, or if you recommend a better way of doing it.

Thanks,

John Didion, PhD Postdoctoral Fellow Collins Group, NHGRI

thasso commented 10 years ago

Sounds good to me. I don't see any obvious flaws at the moment and we can iterate on this. If you want to start implementing this, please note that the current development version of JIP can be found in the "develop" branch. I would suggest you create a pull request against that branch and I can review the changes before merging them in.

jdidion commented 10 years ago

I’ve scaled this back a bit. I decided mucking up the database with lots of pointers to jobs running on remote hosts was probably not worth the cost. Instead I’m implementing the following:

export: export jobs as a json object

submit (modification of existing command): add the ability to specify an export jobs json object to be imported. Any other command line options will override the values loaded from the json object.

migrate: export a job from the local database, copy it to the remote machine, optionally also sync scripts, import the job, and optionally submit the job
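The export and override behaviour described for these commands can be sketched as follows. A job is represented here as a plain dict, and the field names (`name`, `threads`, `queue`) and function names are hypothetical; how JIP would actually serialize its job objects is not specified in this thread.

```python
import json


def export_job(job):
    """Serialize a job (a plain dict in this sketch) to a JSON string."""
    return json.dumps(job, sort_keys=True)


def import_job(exported, overrides=None):
    """Load an exported job; command line options passed as overrides win."""
    job = json.loads(exported)
    # Only options that were actually given (not None) replace exported values.
    job.update({k: v for k, v in (overrides or {}).items() if v is not None})
    return job
```

A migrate step would then chain these pieces: export locally, copy the JSON to the remote machine, and call the import on the remote side with any extra options supplied by the user.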

This should be relatively quick to implement since I will be using an existing ssh library (although this will create an additional dependency).

jdidion commented 10 years ago

Hi Thasso,

I am writing a pipeline where the first step is parallel alignment of multiple sets of fastq files and the second step will be to merge the resulting BAM files. I’m not sure of how to implement this in JIP.

I assume the first step is to iterate over all pairs of fastq files and call run, i.e.:

p = Pipeline()
for f1, f2 in fastqs:
    p.run('align', input=(f1, f2))

But how do I make those run as a single group, and have the merge step depend on the completion of all the jobs in that group?

Thanks,

John

thasso commented 9 years ago

Hi John,

Sorry for the major delay, but now there is some time finally :)

You are on the right track already with the dependencies. You can do it as you tried in your example and simply iterate over your fastq input files. In order to establish dependencies, you can use the dependsOn function exposed on the node objects. But here's the point where the jip dependency resolution and edge multiplicities can really help. I will try to lay out the full example to showcase how you can use job parameters from the pipeline and its nodes.

The assumption here is basically that the align job expects a single fastq file as input while the merge job takes a list of files. This allows you to

  1. Expand the align jobs based on the list of input parameters
  2. Collapse on the merge job based on the list of output alignments created by the align jobs

In pseudo code, this would look something like this:

fastqs = [...]
p = Pipeline()

align = p.run('align', input=fastqs, output='${input|ext}.bam')
merge = p.run('merge', input=align)

# expand the pipeline. Now you'll have n align jobs but still one merge job
p.expand()
# create the list of jobs
jobs = jip.create_jobs(p)

Please note that the expansion and job creation is not necessary if you use the jip command line tools and write the pipeline using a jip pipeline script.

Also note the output parameter of the align job. I assumed that the job allows you to specify the name of the output file. We need something dynamic here because we want to expand on the list of input files.

The final merge job's input is just the align job. This works as long as align only defines a single output. If that's not the case, you'll need to be more specific: p.run('merge', input=align.output).

Using the jip pipeline graph and expanding on the inputs and outputs of the nodes allows you to avoid specifying dependencies explicitly.
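To illustrate the expand and collapse behaviour described above without depending on jip itself, here is a plain-Python sketch. It assumes that the `ext` filter in `${input|ext}.bam` strips the last file extension; the functions and the dict layout are hypothetical stand-ins, not jip's API.

```python
import os


def strip_ext(path):
    """Drop the last file extension, as the `ext` filter is assumed to do."""
    return os.path.splitext(path)[0]


def expand_align(fastqs):
    """Fan out: one align job per input file, each with a derived output name."""
    return [
        {"tool": "align", "input": f, "output": strip_ext(f) + ".bam"}
        for f in fastqs
    ]


def collapse_merge(align_jobs):
    """Fan in: a single merge job that takes every align output as input."""
    return {
        "tool": "merge",
        "input": [job["output"] for job in align_jobs],
        # The merge implicitly depends on every align job that feeds it.
        "depends_on": list(range(len(align_jobs))),
    }
```

The point of the sketch is that the dependencies fall out of the input/output wiring: because the merge input is built from the align outputs, no explicit dependency declaration is needed.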

I hope it was not too late and still helps.

Best, -Thasso