ssadedin / bpipe

Bpipe - a tool for running and managing bioinformatics pipelines
http://docs.bpipe.org/

slurm scheduler not working with slurm-wlm-torque qstat #280

Open nemartins opened 1 year ago

nemartins commented 1 year ago

I'm trying to run bpipe on a slurm cluster. This cluster does not have qstat installed, so the pipeline never progresses to the next step. I've tried to use qstat from slurm-wlm-torque package, but there's no xml output option.

Is it possible to create a SlurmStatusMonitor that alleviates the dependency for the qstat xml output or that uses the native slurm tools (sstat, scontrol)?
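A native monitor would mainly need to translate SLURM state names into the torque-style single-letter codes a qstat-based monitor understands. A minimal sketch of that mapping, assuming a qstat-style code set - the function name, state list, and letter codes here are my assumptions, not bpipe's actual implementation:

```shell
#!/bin/sh
# Hypothetical sketch: map native SLURM state names (as reported by
# squeue/sacct) to torque-style single-letter job state codes.
slurm_to_torque_state() {
  case "$1" in
    PENDING)                            echo Q ;;  # queued
    RUNNING)                            echo R ;;  # running
    SUSPENDED)                          echo S ;;  # suspended
    COMPLETED|FAILED|CANCELLED|TIMEOUT) echo C ;;  # completed/finished
    *)                                  echo U ;;  # unknown
  esac
}

# In a real monitor this would be fed from native tools, e.g.:
#   squeue --noheader --format='%i %T' --jobs="$JOBIDS"
slurm_to_torque_state RUNNING    # prints R
```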

Thanks in advance,

ssadedin commented 1 year ago

Interesting that you are the first to run into this (or at least the first to report it). I guess it must be unusual for SLURM clusters to not have qstat installed.

The implementation that does not require the XML output option is still supported, so I think we should be able to convince the SLURM executor to use it. As you say, it'd be nice if we could make something more efficient, but as a fallback it should work.

I don't have a good test setup for SLURM right now - I will see if I can set something up using AWS ParallelCluster so I can get this working properly.

Sorry for the problem - I'll look into what to do.

ssadedin commented 1 year ago

@nemartins I have just put in a commit that I think should fix the SLURM issue, at least in a basic sense. I was able to create a test cluster on AWS and confirmed that it seemed to work. If you are able to build from master and try it out, that would be great. Otherwise, let me know and I can provide you with a build, or you can test it with the next release. Thanks for reporting this issue!

nemartins commented 1 year ago

Thank you for looking into this! In the meantime I was able to hack together a small script that pipes the output into the XML format the executor expects, and it worked OK.

I will try to build it from master and run it on the cluster early next week.

Thanks again

ssadedin commented 1 year ago

that's a very clever way to solve it - it would be interesting to see it if you're willing to share. It could allow us to use the pooled status monitor with SLURM, which would be a better solution (the current solution causes an individual job status command to be issued for every active job every minute or so - not very scalable, which is why the pooled status monitor, which queries multiple jobs at a time, was introduced).
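For illustration, the pooled idea boils down to batching all active job ids into a single status query instead of issuing one command per job. A rough sketch, assuming standard slurm CLI flags - the squeue invocation is illustrative, not bpipe's actual code:

```shell
#!/bin/sh
# Sketch of a pooled status query: join the ids of all active jobs
# and issue one query covering every job, rather than one per job.
jobs="12345 12346 12347"

# Join the whitespace-separated ids into a comma-separated list.
joblist=$(printf '%s\n' $jobs | paste -sd, -)
echo "$joblist"    # prints 12345,12346,12347

# A single call then covers every job, e.g.:
#   squeue --noheader --format='%i %T' --jobs="$joblist"
```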

Let me know how it goes!

nemartins commented 1 year ago

I've run bpipe from master, and it works well. Thank you for the quick solution!

Here's the script I've come up with. It could probably be much simpler and more elegant, but it was a very rushed job, and the first time I've used jq/yq.

params="${*:2}"                             # everything after the first argument (the -x flag)
qstat $params |                             # torque-compat qstat from slurm-wlm-torque ("qsub" here was a typo)
     sed -e 's|Job id|JobID|g' -e 's|Time Use|TimeUse|g' |   # make the header names single tokens
     csvtk space2tab --comment-char '-' |   # table -> TSV, dropping the ---- separator row
     csvtk csv2json -t |                    # TSV -> JSON array of row objects
     jq -c .[] |                            # one JSON object per line
     jq -n 'reduce inputs as $line ({};. + { ("DataZ"+$line.JobID) : { "Job": {"Job_Id": ($line.JobID),"job_state": ($line.S)}} })' |   # key each job uniquely so one element is emitted per job
     yq -o xml |                            # JSON -> XML
     sed -e 's|DataZ.*>|Data>|g' |          # collapse the unique DataZ<id> keys back to <Data>
     tr -d "\n" | tr -d " " |               # strip newlines and spaces
     awk -v RS='</Data>' -v ORS='</Data>\n' '{print}'   # one <Data> record per line
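For comparison, the core of the same transformation can be sketched with awk alone, without the csvtk/jq/yq dependencies - the sample input below stands in for real qstat output, and the XML the executor actually parses may carry more fields than shown here:

```shell
#!/bin/sh
# Stripped-down sketch: turn torque-style qstat table rows into
# per-job <Data><Job>...</Job></Data> XML records.
# The two-line header and ---- separator row mimic qstat's layout.
printf 'Job id  Name  User  Time Use  S  Queue\n------  ----  ----  --------  -  -----\n101  align  alice  00:01:02  R  batch\n' |
awk 'NR>2 {printf "<Data><Job><Job_Id>%s</Job_Id><job_state>%s</job_state></Job></Data>\n", $1, $5}'
```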

Best