roryk / ipython-cluster-helper

Tool to easily start up an IPython cluster on different schedulers.

SLURM 2.6 not queueing naythang #11

Closed mariogiov closed 10 years ago

mariogiov commented 10 years ago

Hey hey,

We upgraded to SLURM 2.6.2 recently and I haven't been able to get any jobs queued since (working through bcbio_nextgen). It seems to create the batch files but never executes them and then eventually just times out. I'm using IPython version 1.1.0.

Sorry to treat this like a user forum, but I'm having some trouble following the execution flow, as it jumps out of pdb when it makes the actual call to ipcluster. Any thoughts? Has anyone else gotten this to work?

roryk commented 10 years ago

Hi Mario,

Dang, sorry for the trouble -- I thought we had this one licked. Thanks so much for posting about the issue; use this as a user forum all you like, since knowing the places where it doesn't work is super helpful.

We are running SLURM 2.6.3 and it seems to work ok; I'm not sure if there is anything different about 2.6.2 that would be a showstopper but it is possible. If you run the example script that comes with ipython-cluster-helper like this:

python example/example.py --scheduler slurm --num_jobs 3 --queue your_queue

Do you get any more useful feedback? It is super weird that it is not even putting jobs on the queue in the first place.

mariogiov commented 10 years ago

Actually I think I misspoke earlier -- it doesn't seem to be creating the SBATCH files in the first place. I was backing through the commits of ipython_cluster_helper yesterday, so perhaps I was using an older version, although the SBATCH files I mentioned appeared to come from the newer 2.6-based templates (i.e. BcbioSLURM as opposed to BcbioOLDSLURM) in cluster.py.

example.py never hits the queue either; however, the difference between the example.py run and the bcbio_nextgen run is that when I try to distribute tasks with the latter, it fails to create/find any config files (presumably it should be creating these?). Neither one actually queues anything -- I cancel them in these examples, but since there are no jobs in the queue, they time out on their own if I don't intervene.

example.py: http://pastebin.com/2xuY8dpk
bcbio_nextgen.py: http://pastebin.com/zi6YSmQy

What may be a separate issue that I haven't hit yet is that the newer 2.6-based templates for BcbioSLURMEngineSetLauncher and BcbioSLURMControllerLauncher don't include any of the extra "resources" information that is required to start jobs, specifically the -A (account) and -t (timelimit) flags. They also omit the -N (number of machines) flag. Would this information be added in some other way? At least over here these flags are required for job submission. Usually they're specified via e.g. -r "machines=1;account=a2013023;timelimit=2-00:00:00" (or with --resources for example.py).
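For concreteness, a full invocation along those lines might look like the following (queue name is a placeholder, and the --resources spelling is just the semicolon format above, so treat it as a sketch rather than gospel):

python example/example.py --scheduler slurm --num_jobs 3 --queue your_queue --resources "machines=1;account=a2013023;timelimit=2-00:00:00"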

roryk commented 10 years ago

Hi Mario,

Thank you for investigating this. For our setup, we basically just have one generic cluster_users account that jobs without -A set default to, and I think the timeout is set automatically by the queue the job is sent to, which explains why we haven't hit this issue. I can think of a couple of solutions, but I wanted to run them by you. We could possibly set the account automatically based on the queue the job is being submitted to; you can see which accounts can submit to a queue with sinfo and then figure out which accounts you can use with sshare, so theoretically we could pick an account from that intersection that can submit to the queue. If the accounts are set up to track payments or something, that won't work, though. We could also just pass those options on. Would the automatic resolution work for you, or does a specific account need to be used?
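As a rough command-line sketch of what that lookup might involve (not what the launcher currently does; exact flags and field names vary between SLURM versions, and the partition name is a placeholder):

# accounts allowed to submit to a given partition (if restricted at all)
scontrol show partition your_queue | grep -o 'AllowAccounts=[^ ]*'
# accounts the current user belongs to
sshare --noheader --format=Account -u $USER | sort -u

Picking any account that shows up in both lists would be the "automatic" option.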

roryk commented 10 years ago

Also, is setting the machines and the timelimit necessary to submit the job? Here the default timelimit is the maximum time a job can remain in the queue. Does -N default to 1? The man page says that -N 1 gives a minimum of one node.

chapmanb commented 10 years ago

Mario; bcbio-nextgen should be creating slurm_controllerGUID and slurm_engineGUID files and then submitting these with sbatch. That's really all the magic that happens, so you can debug directly by changing the slurm files until they submit cleanly. Just change the actual command into something simple (sleep 5; pwd) and we can debug the submission part separately from starting up engines/controllers.
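As a sketch, a stripped-down stand-in for one of those generated files could be as small as this (partition and account names are placeholders; the point is just to swap the real engine/controller command for something trivial):

#!/bin/sh
#SBATCH -p your_queue
#SBATCH -A your_account
#SBATCH -J slurm-submit-debug
#SBATCH -o slurm-submit-debug.out.%j
#SBATCH -e slurm-submit-debug.err.%j
sleep 5; pwd

If sbatch takes that cleanly, the problem is in how the real command and resource flags get filled into the template rather than in the submission itself.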

It sounds like what we need are ways to ensure resources are correctly set for SLURM. Right now you specify these with -r machines=1 -r account=a2013023 in bcbio_nextgen, but they're likely not being translated correctly into the final script to make your setup happy. If you could post what the original script and a working one look like, we could work on making sure the resources get translated as expected.

As Rory mentioned, the more minimal we can get, the better off we'll be. Stuff like timelimits is difficult to estimate, so if we could avoid them and still make submissions happy, that would be super helpful.

roryk commented 10 years ago

You guys are exactly right-- for SLURM the extra resources were not being passed in like they were in OLDSLURM. I added that in for SLURM and also set a series of defaults if you do not specify resources: -N defaults to 1 if you don't specify it, -t defaults to the maximum time for the queue you are submitting to, and -A defaults to a random account that you have access to and that can submit to the queue you specify.

Here is what the script looks like for the engines with those changes:

#!/bin/sh
#SBATCH -p general
#SBATCH -J bcbio-ipengine[1-3]
#SBATCH -o bcbio-ipengine.out.%j
#SBATCH -e bcbio-ipengine.err.%j
#SBATCH --cpus-per-task=1
#SBATCH --array=1-3
#SBATCH -A cluster_users
#SBATCH -t 7-00:00:00
#SBATCH -N 1

/n/home05/kirchner/anaconda/envs/ipc/bin/python -c 'from IPython.parallel.apps.ipengineapp import launch_new_instance; launch_new_instance()' --timeout=60 --IPEngineApp.wait_for_url_file=960 --EngineFactory.max_heartbeat_misses=100 --profile-dir="/n/home05/kirchner/.ipython/profile_b338bf4c-4646-11e3-ad9e-60eb69edd834" --cluster-id="b43bff65-cbfe-4192-b5f5-91ce2519abb1"

Does that look like it has all the fields you need, Mario?

roryk commented 10 years ago

Hi Mario,

I addressed this here: abc48468d0a703d1e5b532cf71e801aee7b7e3cc and fb3640b47e66018344eb7ced48538f. Let me know if it doesn't work for you; it seems good here. I also added a default memory size of 250M, since some SLURM environments (like ours!) give you almost no memory per core by default, which was causing issues.
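For reference, that default would show up in the generated batch files as a memory directive along these lines (whether it is spelled --mem or --mem-per-cpu here is an assumption on my part):

#SBATCH --mem-per-cpu=250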

mariogiov commented 10 years ago

Hi guys,

First, thanks to you guys for working on this. Rory, that file looks great. You are right that there are defaults for -N and -t, so those are not required; however, we need an -A (account name) or SLURM here won't queue anything, and default-and-unsettable parameters for the other two would not be a great scenario for us. More specifically:

In case you're curious, here's the trouble I was causing for myself yesterday:

So to conclude, Rory, I think your changes look great as long as we can still pass resources in manually (either via the -r "account=b2013064;timelimit=02:00" format or the -r account=b2013064 -r timelimit=02:00 format -- the second looks nicer to me, and I think it's the method used for SGE, so that would be A-OK if you prefer it). If you need to keep it more general and just use the defaults, that's alright and I can just modify a fork for our use here.
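Side by side, the two styles would be something like this (with <usual arguments> as a placeholder for the rest of a normal bcbio_nextgen invocation):

bcbio_nextgen.py <usual arguments> -r "account=b2013064;timelimit=02:00"
bcbio_nextgen.py <usual arguments> -r account=b2013064 -r timelimit=02:00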

Thanks again!

Mario

roryk commented 10 years ago

Hi Mario,

Step two of your troubles is what I spent half an hour this morning figuring out, so at least we're in good company forgetting to pass the account parameter to the Launcher classes. :)

Great, I think things should be all set-- could you test using the example/example.py script and passing the parameters you need to set with --resources to make sure it's all good?

@chapmanb has heard a bit of chatter from people asking for resource estimations for the bcbio-nextgen pipelines, so automatically setting an estimated --timelimit inside bcbio-nextgen might end up happening.

mariogiov commented 10 years ago

Interesting about the timelimit parameters! That's actually been on my to-do list for a month or so. If you guys want to talk about how to work out the estimations for that, we will definitely be producing a lot of data we can analyze as soon as I get the new automation infrastructure running in production.

The bcbio_nextgen pipeline is producing sbatch files and queueing jobs now, no problemo, but I'll run that example script first thing tomorrow morning to make sure everything works alright.

chapmanb commented 10 years ago

Mario; My general thought process was to have a set of test datasets that a user could run on their machines to get timing estimates, which would then be automatically plugged in and scaled depending on the size and number of samples in a run. In general this is going to be hard, though, since there is so much variability between runs due to depth and other tricky-to-estimate parameters.

mariogiov commented 10 years ago

@chapmanb Well, depth should be something we can estimate, but you're right that there are a lot of variables. I'm thinking of revisiting this once we get everything set up, which I'm aiming for by the end of the year, but we'll see how things go.

@roryk The example script is definitely working as well, so you can close this issue out unless there's something else we need to address. Thanks for the quick work!

roryk commented 10 years ago

Great, thank you very much for the bug report and the investigative work about what was wrong!