gvlproject / gvl.ansible.playbook

Playbook for building the Genomics Virtual Laboratory

Slurm not running jobs simultaneously #73

Closed · jessicachung closed this issue 7 years ago

jessicachung commented 7 years ago

EDIT: Issue was previously titled "Change default destination in Galaxy job_conf.xml"

Would anyone object if I changed the default job destination to "slurm_cluster" in the job_conf.xml file? Currently it's set to "default_dynamic_job_wrapper". This means that most jobs use multiple CPUs if available (usually the wrappers specify 4 CPUs with ${GALAXY_SLOTS:-4}). If we change it to default to the "slurm_cluster" destination, jobs will use one CPU unless specified otherwise.
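
For reference, a minimal sketch of what that change might look like in job_conf.xml, assuming the usual Galaxy Slurm runner plugin. Only the default attribute on the destinations element changes; the destination ids are the ones named above, while the plugin line and the dynamic-runner params are assumptions about how the rest of the file is laid out:

<plugins>
    <plugin id="slurm" type="runner" load="galaxy.jobs.runners.slurm:SlurmJobRunner"/>
</plugins>
<!-- switch the default from the dynamic wrapper to the plain Slurm destination -->
<destinations default="slurm_cluster">
    <destination id="slurm_cluster" runner="slurm"/>
    <destination id="default_dynamic_job_wrapper" runner="dynamic">
        <param id="type">python</param>
        <param id="function">default_dynamic_job_wrapper</param>
    </destination>
</destinations>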

The dynamic job destination is useful for personal GVLs, but when starting large VMs for tutorials, having jobs use only one CPU is preferable: tutorial datasets are small and don't benefit from parallelisation, and jobs turn around faster when lots of people on the same machine are waiting for theirs to run.

Thoughts? Does anyone have a preference either way?

jessicachung commented 7 years ago

Actually, the slowness I was experiencing with Galaxy might have been caused by a Slurm bug. In 4.2 beta, I can only get one Slurm job to run at a time, even when jobs request just 1 CPU and there are resources available (tested both through Galaxy and on the command line).
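
For the record, the command-line version of that test looks roughly like this (standard Slurm commands; the sleep jobs are just placeholders):

# submit two single-CPU jobs and see whether they run at the same time
sbatch --ntasks=1 --wrap "sleep 120"
sbatch --ntasks=1 --wrap "sleep 120"
squeue                  # expected: both RUNNING; observed: the second stays pending (PD)
scontrol show node      # CPUAlloc vs CPUTot shows whether one job was handed the whole node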

We probably don't need to change the job_conf.xml. Just remember to manually change the default destination whenever you're starting a machine up for a workshop.

Slugger70 commented 7 years ago

So Slurm might be blocking jobs because, by default, it allocates a whole node per job unless you tell it otherwise. We configured that in 4.1 but may not have in 4.2.

Slugger70 commented 7 years ago

According to the slurm docs, we need to change select/linear to select/cons_res in the slurm config files in /etc... "In the case where select/cons_res is not enabled, the normal Slurm behaviors are not disrupted. The only changes, users see when using the select/cons_res plugin, are that jobs can be co-scheduled on nodes when resources permit it."
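
If it does come to that, the relevant slurm.conf lines would be something like the following; the exact SelectTypeParameters value is a guess and depends on whether we also want memory tracked as a consumable resource:

# allocate individual cores/CPUs instead of whole nodes
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory   # alternatives: CR_Core or CR_CPU_Memory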

Slugger70 commented 7 years ago

The problem is that we already have this setting in our slurm conf... hmmm. Maybe ask Chris for advice, Jess.

jessicachung commented 7 years ago

The slurm conf settings look identical to 4.1. I think I've narrowed it down to a systemctl/systemd issue. Will keep you updated.

nuwang commented 7 years ago

Is the process for starting slurmctld different with systemd? See https://github.com/galaxyproject/cloudman/blob/master/cm/services/apps/jobmanagers/slurmctld.py#L169

jessicachung commented 7 years ago

Changing systemctl didn't fix the single-job-per-node issue.

But yes, the PID file is different for systemctl. I had to change the PID file location in /lib/systemd/system/slurmctld.service and /lib/systemd/system/slurmd.service. Also, starting Slurm from CloudMan results in systemctl thinking Slurm isn't running, even though it is.
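
To sketch the kind of edit meant here (the paths below are assumptions): the point is that PIDFile= in the [Service] section has to match SlurmctldPidFile (and SlurmdPidFile for slurmd) in slurm.conf, otherwise systemctl reports the daemon as dead even while it's running.

# /lib/systemd/system/slurmctld.service (excerpt)
[Service]
Type=forking
ExecStart=/usr/sbin/slurmctld
# must match SlurmctldPidFile in slurm.conf
PIDFile=/var/run/slurmctld.pid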

I'll ask the local slurm experts the next time I see them.

jessicachung commented 7 years ago

Talked to Ben and found the solution :)

The partition needs to be set to shared and jobs need to ask for a limited amount of memory.

In the slurm.conf file, add Shared=YES to the main partition.

PartitionName=main Nodes=master,placeholder Default=YES MaxTime=INFINITE State=UP Shared=YES

We also need to limit the amount of memory requested when submitting jobs, otherwise each job asks for the node's maximum memory. We'll need to find a way to change the default MinMemoryNode.
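
One likely candidate for that default (not yet tested) is DefMemPerCPU / DefMemPerNode in slurm.conf, which gives jobs a sensible memory request when they don't specify one:

# default memory request for jobs that don't ask for one explicitly
DefMemPerCPU=3500   # MB per allocated CPU; the value here is only an example
# (requires memory to be a consumable resource, e.g. SelectTypeParameters=CR_Core_Memory)

Alternatively, the Galaxy destination could probably pass an explicit --mem through the nativeSpecification param in job_conf.xml.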

nuwang commented 7 years ago

Great! We can fix this at the CloudMan level, so we won't need to build a new image or filesystem. I'll change the slurm.conf template.

nuwang commented 7 years ago

I've just updated the buckets. Let me know whether everything is OK.