michellab / Cluster

This repository is used for tracking any issues regarding the cluster

GPU resource allocation failing #7

Closed jmichel80 closed 9 years ago

jmichel80 commented 9 years ago

At least with the script below:

```
julien@node009:~/projects/Thrombin/dataset00001/somd/3RML~3RMM/free/output/lam-0.00$ cat serial-gpu.sh
#!/bin/sh
#SBATCH -o somd-serial-gpu.out
#SBATCH -p GPU
#SBATCH --gres=gpu:tesla:1
#SBATCH --time 24:00:00

source /etc/profile.d/modules.sh
module load cuda
module load openmm

srun --gres=gpu:tesla:1 ~/sire.app/bin/somd-freenrg -C ../../input/freenrg.cfg -l 0.00 -p CUDA

wait
```

I get:

```
julien@node009:~/projects/Thrombin/dataset00001/somd/3RML~3RMM/free/output/lam-0.00$ sbatch serial-gpu.sh
sbatch: error: Batch job submission failed: Requested node configuration is not available
```

But the gpu:tesla resource does seem to exist on the node:

```
julien@node009:~/projects/Thrombin/dataset00001/somd/3RML~3RMM/free/output/lam-0.00$ scontrol show node node006
NodeName=node006 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=0 CPUErr=0 CPUTot=32 CPULoad=1.09
   Features=(null) Gres=gpu:tesla:4
   NodeAddr=node006 NodeHostName=node006 Version=14.11
   OS=Linux RealMemory=1 AllocMem=0 Sockets=32 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1
   BootTime=2015-06-08T16:57:41 SlurmdStartTime=2015-07-02T16:37:21
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
```
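(Aside: for a typed request such as --gres=gpu:tesla:1 to be satisfiable, the type has to be declared consistently on both sides. A minimal sketch of matching gres.conf and slurm.conf entries; the device file paths are assumptions:)

```
# gres.conf on node006 -- device files are assumptions
Name=gpu Type=tesla File=/dev/nvidia0
Name=gpu Type=tesla File=/dev/nvidia1
Name=gpu Type=tesla File=/dev/nvidia2
Name=gpu Type=tesla File=/dev/nvidia3

# slurm.conf -- the node entry must advertise the same typed GRES
GresTypes=gpu
NodeName=node006 CPUs=32 Gres=gpu:tesla:4 State=UNKNOWN
```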

Is this a typo in the submission script? I can't see anything obvious.

ppxasjsm commented 9 years ago

No, I can't see anything obvious either. It might be worthwhile trying to run the unit test jobs associated with generic resources. They can be found here: /manager/slurm-slurm-14-11-7-1/testsuite/expect. The test to run is test1.62, "Test of gres/gpu plugin (if configured)".
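If the suite runs in place, the invocation would presumably be something like the following (the tests in the expect suite are individual executable scripts):

```
cd /manager/slurm-slurm-14-11-7-1/testsuite/expect
./test1.62    # prints a pass/fail summary for the gres/gpu plugin
```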

jmichel80 commented 9 years ago

After extensive tests I think I have got slurm working with GPUs now. I believe slurm on the client nodes was confused by a slurm.conf installed in /usr/local/slurm/. Although I removed this file, and the /etc/init.d/slurm script correctly looks for the config file in /home/common/slurm, jobs would not run correctly until I put a soft link in /usr/local/slurm to /home/common/slurm.
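A sketch of that fix; the exact link target is an assumption, since the comment only names the directories:

```
# remove the stale config that was shadowing the shared one
rm /usr/local/slurm/slurm.conf
# link the old location to the shared config (file name is an assumption)
ln -s /home/common/slurm/slurm.conf /usr/local/slurm/slurm.conf
```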

In the process, I got rid of the GPU types variable in gres.conf and defined node features instead. It may be that types work, but for now we are using features.
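Under that scheme a job requests a GPU by count and selects the node via a feature constraint instead of a typed GRES. A minimal sketch, assuming a feature named tesla:

```
# slurm.conf: advertise the feature and an untyped GRES count
NodeName=node006 CPUs=32 Gres=gpu:4 Feature=tesla State=UNKNOWN

# submission script: request any one GPU, pin the node type by feature
#SBATCH --gres=gpu:1
#SBATCH --constraint=tesla
```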

Sample working slurm scripts are here: /home/julien/slurm-scripts