psathyrella / partis

B- and T-cell receptor sequence annotation, simulation, clonal family and germline inference, and affinity prediction
GNU General Public License v3.0

Parallelization on other queueing systems #190

Closed krdav closed 7 years ago

krdav commented 8 years ago

As far as I know there are a lot of bioinformaticians (including myself) using compute clusters that run a PBS scheduler like TORQUE, so it is a shame that partis does not support them. I know that slurm is suggested, but I don't have access to an HPC system running slurm.

The issue is that I am running into slow runtimes on large files, like this run I just did yesterday: 500,000 seqs on 28 cpus, with a total time of 17:41:00 and 78.82% cpu efficiency.

I have been thinking about how to implement it myself but haven't gotten very far yet. Any suggestions or starting points would be welcome.

matsen commented 8 years ago

Thanks for the suggestion, @krdav .

For our part, we are working on making the core of partis faster so that it won't require so many HPC gymnastics, but that's still going to be a while coming. We don't have access to a PBS cluster, so I can't say for sure, but partis' use of slurm just involves tacking srun in front of lots of commands. So you could try forking it and swapping in the PBS equivalent, but if you have a more elegant and general solution that would be great!

Just out of curiosity, if we had a setup on AWS that you could pay for and use without configuration, would that be of interest? We haven't set this up but are curious about demand.

krdav commented 8 years ago

Yeah, I noticed how slurm is used with the srun command by grepping for it in the code. Looking at the way srun is used together with .poll() to determine when the job is done, it looks as if srun is interactive and the output from partis streams to the shell. That way it is also easy to check with .poll() whether the job is done.

On a PBS system a job is submitted with the qsub command, and after a little delay the job is detached from the shell with a message saying which node it is running on. The problem then is that I don't really know how to monitor the job to see when it is done, so that the next process in partis can continue.

Setting up a cloud option is a great idea that would increase accessibility, but for my use it is probably not going to work. The main reason is that some of the data is confidential; a secondary reason is that I like to control things myself: run small sandboxes, make tests and modifications in the source code. The price of using AWS would not be our main issue (we already pay for HPC core hours), but having 100% security is very important and harder to guarantee on AWS. Not that I don't believe an AWS instance can be secure enough, but the people giving me their data would not be so happy about it...

psathyrella commented 8 years ago

jeez, yeah, sorry about that, running on big data sets would suck without batch parallelization.

So as you saw, the interaction with slurm is pretty straightforward, so hopefully it's easy to generalize to torque. Since we don't have a test system (maybe in docker, though), maybe you could suggest a few changes to the relevant lines? I mean, maybe it's just qsub instead of srun, but there's probably a bit more shenaniganery.

There are basically two places in the code where we use slurm: partitiondriver.py and waterer.py. They do basically the same thing: start each process with a function in utils.py that uses Popen with output redirection (cmd_str + ' 1>' + workdir + '/out' + ' 2>' + workdir + '/err'). They then both use poll() to see when each process is done; when it is, they use another function in utils.py to check the output and potentially rerun. Does that look workable?
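
In rough outline, that pattern looks something like the following (a minimal sketch, not the actual partis code; the commands and workdirs are placeholders):

import os
import subprocess
import time

jobs = [('echo chunk 0', '/tmp/partis-demo/0'),
        ('echo chunk 1', '/tmp/partis-demo/1')]

procs = []
for cmd_str, workdir in jobs:
    os.makedirs(workdir, exist_ok=True)
    # with slurm, partis essentially just tacks 'srun ' onto the front of cmd_str
    full_cmd = cmd_str + ' 1>' + workdir + '/out' + ' 2>' + workdir + '/err'
    procs.append(subprocess.Popen(full_cmd, shell=True))

# poll() returns None while a process is still running
while any(p.poll() is None for p in procs):
    time.sleep(1)
# ...then each workdir's out/err gets checked, and failed jobs rerun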

krdav commented 8 years ago

Yes, this all makes sense; I also deduced that from looking at the code. The problem is that qsub is not interactive, so the job does not run in the shell that submitted it. Hence .poll() will probably not work.

Yesterday, after searching around Stack Overflow and thinking about it myself, I came up with a few ideas to start from. When I have implemented and tested them I will let you know.

psathyrella commented 8 years ago

Chaim is on SGE, so it would be useful if that were supported as well.

scharch commented 8 years ago

On SGE/PBS, you can pass -hold_jid to qsub to have one job wait to execute until a previous one is finished, although it doesn't appear that an active job can modify itself in this way. The most direct way to mimic your current setup would be repeated calls to qstat instead of .poll(), though we'd have to parse the output. The "correct" solution is probably to use MPI; in the long run, I'd guess that's the more general solution as well. But I don't have any experience with MPI, and it's probably a lot more work than you are interested in. My own inclination would be to integrate this with issue #196: qsub the n-procs subtasks (likely as an array job), then qsub self.finalize() (in waterer.py) as a separate job with -hold_jid, and exit. A similar approach could be used to implement each round of the full partitioning algorithm. The main need here is a way to re-enter an in-progress run from the command line.
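
A rough sketch of that submission chain, assuming an SGE-style qsub (the script names are hypothetical; plain PBS would use -W depend=afterok:<jobid> instead of -hold_jid, and prints job ids in a different format):

import re
import subprocess

def submit(script, hold_jid=None):
    # -terse makes SGE's qsub print just the job id (array jobs print 'id.range')
    cmd = ['qsub', '-terse']
    if hold_jid is not None:
        cmd += ['-hold_jid', str(hold_jid)]
    out = subprocess.check_output(cmd + [script], universal_newlines=True)
    return re.match(r'\d+', out.strip()).group(0)

# submit the n-procs subtasks as an array job (e.g. '#$ -t 1-50' inside the
# script), then submit a finalize step that waits on the whole array
array_id = submit('run_subtasks.sh')
final_id = submit('finalize.sh', hold_jid=array_id)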

krdav commented 8 years ago

This is also something I have been thinking about, but I concluded that a solution built on -hold_jid or other PBS tricks would not be a good idea, because then we might run into PBS versioning problems and HPC systems whose configurations, for various reasons, do not allow certain PBS flags.

An alternative idea would be to have each qsub'd job create an empty file just after it finishes. Instead of doing .poll(), look for the empty file: if it's there, continue; if not, wait.

How then to merge the STDOUT from each qsub submission? Just redirect all the jobs into the same file. It might not be well ordered, but it's better than nothing.
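
A minimal sketch of the sentinel-file idea (paths and script names are placeholders; the scheduler's own stdout/stderr handling is left alone here): each qsub'd job touches an empty 'done' file when it finishes, and the master process waits for those files instead of calling .poll().

import os
import subprocess
import time

jobs = [('./run_chunk_0.sh', '/scratch/partis/0'),
        ('./run_chunk_1.sh', '/scratch/partis/1')]

done_files = []
for cmd_str, workdir in jobs:
    os.makedirs(workdir, exist_ok=True)
    done = os.path.join(workdir, 'done')
    done_files.append(done)
    script = os.path.join(workdir, 'job.sh')
    with open(script, 'w') as f:  # small wrapper script to hand to qsub
        f.write('#!/bin/bash\n%s\ntouch %s\n' % (cmd_str, done))
    subprocess.check_call(['qsub', script])

# instead of .poll(), wait until every job has written its sentinel file
while not all(os.path.exists(f) for f in done_files):
    time.sleep(10)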

scharch commented 8 years ago

I'm not sure why you consider hold_jid a 'trick' -- it is a standard option and shouldn't be subject to any system-specific implementation weirdness. qstat is a better (and more "official") way to check job completion than creating empty files. It also allows checking the status of all jobs at once (just loop until it returns an empty string), and you could also use qacct to check the exit status of individual jobs. Sending STDOUT from all jobs to the same file seems like a bad idea... Anyway, it looks like partitiondriver.merge_all_hmm_outputs() and waterer.read_output() are already merging output from multiple files, so you would just have to pass in the file name structure used by qsub to maintain compatibility.
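
Something like the following could stand in for the .poll() loop (a sketch only; the parsing assumes the job id is in the first column of qstat's output and may need tweaking per scheduler):

import subprocess
import time

def still_running(job_ids):
    out = subprocess.check_output(['qstat'], universal_newlines=True)
    listed = set()
    for line in out.splitlines():
        fields = line.split()
        if fields:
            listed.add(fields[0].split('.')[0])  # strip any '.server' suffix
    return any(jid in listed for jid in job_ids)

job_ids = ['123456', '123457']  # as reported by qsub at submission time
while still_running(job_ids):
    time.sleep(30)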

krdav commented 8 years ago

Well, maybe they are not "tricks", but then again I know for a fact that there are lots of system-specific implementations out there. E.g. qacct is not available on the cluster I am working on, and neither is the -hold_jid flag for qsub; instead of -hold_jid we have a qhold command. I don't see why qstat should be more official than any other hack, and of course you can also check the status of all jobs by looping over a list of file names. MPI would be the right solution, but that also has its problems (e.g. implementing it, users having trouble with it, etc.)

scharch commented 8 years ago

OK, I think this might be about subtle differences between SGE and PBS, which I hadn't realized. So you are correct that it makes sense to aim for the lowest common denominator...

qhold is not a great option here -- at least in its SGE implementation, you would have to have another process to monitor and send the qrls command, so not much is gained.

qstat is not a hack -- it's the expected way to check job status, which is why I think it is a better option. It can also catch failed processes, which in a naive implementation of your suggestion could otherwise cause infinite waiting (since a failed job will never write its completion file). But it's ultimately a matter of personal preference, I suppose.

Thinking aloud now, resource management is another thing to keep in mind. In many implementations you are required to tell the scheduler how long your process will run, and it will get killed if it goes over (this is part of why I suggested the iterative calls above). So the master process probably has to run locally -- which would be fine for me personally, but could potentially cause issues if libraries are in different places or if network drives are mounted differently.

psathyrella commented 8 years ago

Thinking about this with a little more distance: I'll first reiterate that it really, really sucks that no one except me seems to be able to run on more than one machine yet. That makes the whole thing wildly less useful. Second, my suggested conclusion from the fact that between the three of us we already have three different batch parallelization systems (and we're not sure of the best way, even in principle, to add support for the other two) is that I should focus on the EC2-type solution for the long term (see also #164). It would of course be great if anyone wants to code up support for a non-slurm batch system, but it seems like a poor use of my time to try to guess how to do it, given I don't have one to test on.

I definitely appreciate the fact that at the moment many collaborators aren't willing to have their data "in the cloud", but it's really hard for me to imagine a future in which this objection doesn't fall by the wayside as pretty much everything ends up off-site.

psathyrella commented 7 years ago

@Irrationone did sge for me!

https://github.com/Irrationone/partis

psathyrella commented 7 years ago

ok, I think I've set things up so sge will work. I simplified the subprocess and batch interfaces a bit, so this entailed adding 'sge' to the options like so: https://github.com/psathyrella/partis/blob/dev/bin/partis#L286 and modifying the command string here: https://github.com/psathyrella/partis/blob/dev/python/utils.py#L1608. Any additional batch options can be passed in with --batch-options: https://github.com/psathyrella/partis/blob/dev/bin/partis#L287
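
Not the actual partis code, but the shape of the change is roughly this (the sge flags below are assumptions, not necessarily what partis passes):

def add_batch_prefix(cmd_str, batch_system, batch_options=''):
    # prepend the batch system's run/submit command; --batch-options are passed
    # straight through, so site-specific flags don't need to be hard-coded
    prefixes = {
        'slurm': 'srun',
        'sge': 'qsub -sync y -b y -V -cwd',  # assumed flags: -sync y blocks until the job finishes
    }
    if batch_system not in prefixes:
        raise Exception('unhandled batch system %s' % batch_system)
    return ' '.join(s for s in [prefixes[batch_system], batch_options, cmd_str] if s)

# e.g. add_batch_prefix(cmd, 'sge', '-l h_vmem=24G,mem_token=24G,mem_free=24G')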

I used as a template @Irrationone's commit here: https://github.com/Irrationone/partis/commit/0571ef775966a57609edd8f8a78e046a9a2e80ea on which I have a couple of questions:

  • I couldn't get an sge docker image to work, so I'm not sure whether by default sge spits stdout and stderr from a job to the usual places. If so, then the way it's set up now, using the stdout and stderr options to subprocess.Popen(), should work. Otherwise we can use the -e and -o options.
  • the -l memory reservation options should work when passed in as, e.g., --batch-options="-l h_vmem=24G,mem_token=24G,mem_free=24G"
  • for vsearch partitioning, parallelization within a job entails multithreading... for which with slurm I reserve the proper number of cpus with --cpus-per-task. I didn't see an equivalent sge option, but I could have missed it? Worst case, it just hammers your cluster.

@Irrationone, I wonder if you could test this to see if it works? (Note that it's on the dev branch -- let me know if that's a problem.)

and @scharch and @krdav, could you tell/show me how to add any other batch systems that'd be convenient?

thanks!

scharch commented 7 years ago

My understanding is that docker doesn't play nicely with sge because it wants root permissions, but maybe that's been fixed since I last looked at it.

If you don't supply the -o/-e options, sge redirects the job's stdout/stderr to files named something like <job_name>.[oe]<job_id>. The stdout/stderr of qsub itself, as captured by Popen, won't be very interesting.

As far as I know, SGE allocates resources on a per-CPU basis, so it's pretty much 1 cpu per task. Columbia's HPC had an option to reserve an entire physical machine so you could use all 8 of its cores at once, but I think there were pretty strict limits on how much you could do that.

psathyrella commented 7 years ago

The docker image I found (https://github.com/gawbul/docker-sge) uses the -f option, e.g.

docker run -it --rm gawbul/docker-sge login -f sgeadmin

not sure if that works in practice though. I actually installed partis on a new laptop running ubuntu 16.04 and was shocked at how easy it was, so I added a comment here with some tips: https://github.com/psathyrella/partis/blob/dev/manual.md#installation-from-scratch. It just required a couple of extra apt-get and pip installs. Honestly, for a lot of purposes I think it's just easier not to use docker at this point, depending on your os of course.

ok, thanks. I added the -e and -o options: https://github.com/psathyrella/partis/commit/6abf905513dbc0ac4aee65671a1d2961d692b835

Well, maybe it's ok how it is then. You can tweak anything you need with --batch-options. The important thing is that when you're running vsearch partitioning you can use more than one thread, which you can -- it's also good to notify the batch system that you're doing so, but that's a secondary goal (as long as I'm not your sysadmin...).

scharch commented 7 years ago

Near as I can tell, docker-sge is for running sge from within docker, not for running docker on an sge cluster. Sounds great for creating your own cluster on AWS, though.

krdav commented 7 years ago

As Chaim says, Docker does not play nicely with SGE/PBS at this point, and as Duncan found, it is extremely easy to compile partis yourself. I tried the docker image when I started using partis, but quickly found that compiling it myself takes very little effort, so that is what I have done ever since.

What Chaim says about 1 cpu per task on SGE also applies to PBS, but I think it is not uncommon to allow reserving an entire machine/node. E.g. if you submit to a system composed of multiple 8-core machines and request a reservation of 8 cpus, the job will simply be pushed to a full node, and then you also get the full benefit of that node's physical memory. For me this means I can submit to either a 28-core/128 GB node or a 32-core/1024 GB node, so runtime has never really been a problem for me when it comes to caching. My problem when starting this discussion was that the partitioning was running very slowly, and now, with the tremendous improvements to that procedure, it's not a problem anymore. Right now I can submit caching/partitioning plus my own hacks and have it finish on 8 heavy and 8 light chain samples in less than an hour.
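
As a hypothetical illustration, a whole 28-core node on a TORQUE/PBS system might be requested like this (the resource spec and script name are placeholders; exact syntax varies by site):

import subprocess
# ask for one full node's worth of cores so the job gets the node to itself
subprocess.check_call(['qsub', '-l', 'nodes=1:ppn=28', 'run_partis.sh'])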

In conclusion, I think SGE/PBS parallelization is getting less important now, given the runtime improvements made to the base code.

Irrationone commented 7 years ago

@psathyrella SGE parallelization (at the very least for cache-parameters and the full partition) works for me for a non-Docker build of the dev branch.

If you have the time, it might be worth warning the user not to specify their own -e and -o options within --batch-options, or better yet, checking for their presence.

psathyrella commented 7 years ago

Great, thanks.

I added warnings to the dev branch, and this'll get closed when I merge that, since no one seems to be screaming for another batch system. Feel free to comment/reopen if you do, of course.