katholt / RedDog


SLURM and RedDog #58

Closed · nschiraldi closed this issue 5 years ago

nschiraldi commented 5 years ago

I recently commented on a Rubra issue requesting support for SLURM (https://github.com/bjpop/rubra/issues/17), but I was wondering if anyone here has experience running RedDog via SLURM? I'll close this if the issue is better discussed over at Rubra, but I'm using RedDog primarily.

d-j-e commented 5 years ago

I have read the post on the Rubra tracker and feel this is really more of a reddog issue, so I am answering here...

The simple answer is yes, you can run reddog like this, though I would advise using the SLURM branch of Rubra instead (https://github.com/bjpop/rubra/tree/slurm) and running reddog on the head node (i.e. use the distributed: True flag). You may find your local instance of SLURM does not quite work with it, but we do have some 'solutions' to common problems that may help... (We are currently moving to and adapting for a new SLURM cluster ourselves.)
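To make that concrete, the setting lives in the pipeline config's stageDefaults. A minimal sketch, assuming the usual Rubra/RedDog config layout; the walltime, memory, queue and module names below are placeholders for your own cluster, not RedDog defaults:

```python
# Sketch of the distributed (head-node) setup described above.
# Values are illustrative placeholders, not RedDog defaults.
stageDefaults = {
    'distributed': True,      # submit each pipeline task as its own SLURM job
    'walltime': '01:00:00',   # per-task wall time requested from the scheduler
    'memInGB': 4,             # per-task memory request
    'queue': None,            # or the name of your SLURM partition
    'modules': [              # environment modules loaded for each task
        'bwa',
        'samtools',
    ],
}
```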

There are problems with running reddog as a single job. For one, you don't know how long the job will take overall. For smaller runs that take a day or less this may not matter, but for large jobs that can take weeks the scheduler may terminate the job before completion. Fortunately reddog is written to restart from interruptions, so this may not be a problem as such. But reddog is also designed to run on the head node and send out each task separately; this distributes the load better while also sharing resources. Plus, about half the stages in the pipeline only use or require one processor, so the rest of the processors in a single-job allocation would sit idle. If you do not have to share the cluster, you may not see this as a problem.

If you are still determined to try running reddog as a single job, then you will need to make sure you load all the modules required by the pipeline in your batch script. Essentially, if you set 'distributed' to False, Rubra ignores the rest of the stageDefaults. As such, reddog may simply run each job of each stage sequentially (though having never run reddog on a local server, I am not 100% sure; reddog can surprise even me at times...).
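For contrast with the sketch above, the single-job variant would look something like this (again only an illustration under the same assumed config layout):

```python
# Sketch of the single-job (non-distributed) case. With 'distributed': False,
# Rubra is expected to ignore the remaining stageDefaults, so the sbatch script
# that launches reddog has to make the same tools available itself (e.g. its
# own module loads or PATH setup) rather than relying on the 'modules' list.
stageDefaults = {
    'distributed': False,    # run everything inside the one SLURM allocation
    'walltime': '01:00:00',  # ignored in this mode
    'memInGB': 4,            # ignored in this mode
    'queue': None,           # ignored in this mode
    'modules': [             # ignored in this mode; load these in the batch script
        'bwa',
        'samtools',
    ],
}
```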

Also, I cannot comment on what will happen if reddog encounters a fatal error during the run, though typically we check back on a run to find it has hung on an error but not yet terminated (this 'behaviour' is described in the manual). And reddog is currently designed to be interactive, so you will have to modify the reddog.py script slightly to exclude (comment out) the user-input step...
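If you make that change, the edit is roughly of this shape. This is purely illustrative; the actual prompt in reddog.py is worded and structured differently, so comment out the real user-input line rather than pasting this in:

```python
# Illustrative sketch of commenting out an interactive confirmation step;
# the real prompt in reddog.py will not look exactly like this.
# Original (interactive) form:
#     answer = raw_input('Continue with this run? (y/n): ')
#     if answer.strip().lower() != 'y':
#         sys.exit('Run cancelled by user')
# Non-interactive form for batch submission: skip the prompt entirely.
answer = 'y'  # assume confirmation when running under sbatch
```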

I have probably confused you, but I hope this helps...

nschiraldi commented 5 years ago

Doh -- I didn't see the SLURM branch. Thanks for the detailed response. I had already modified reddog.py so I could submit via sbatch, but we've found it difficult to keep track of what stage the pipeline is at; if the scheduler is submitting new jobs at each stage, that should be much easier to tell. We were also locking up resources that weren't truly being used, as you describe, and we were running into memory partitioning issues for similar reasons.

I may have a few more questions once I start working with the SLURM branch, but for now, thank you very much!