Closed jmichel80 closed 9 years ago
No, I can't see anything obvious. It might be worthwhile to run the unit test jobs associated with generic resources. They can be found here: /manager/slurm-slurm-14-11-7-1/testsuite/expect and this is the test to run: test1.62 Test of gres/gpu plugin (if configured).
After extensive testing, I think I have slurm working with GPUs now. I believe slurm on the client nodes was confused by a slurm.conf installed in /usr/local/slurm/ . Although I removed this file, and the /etc/init.d/slurm script correctly looks for the config file in /home/common/slurm, the jobs would not run correctly until I put a soft link in /usr/local/slurm to /home/common/slurm .
In the process, I got rid of the gpu types variable in gres.conf and now define node features instead. Types may well work, but for now we are using features.
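For readers following along, a features-based setup might look roughly like the fragment below. This is only a sketch of the idea, not the configuration actually deployed here: the feature name tesla, the device paths /dev/nvidia[0-3], and the use of node006 with four GPUs are assumptions chosen for illustration.

```
# gres.conf on the GPU node: untyped GPUs (device paths are an assumption)
Name=gpu File=/dev/nvidia[0-3]

# slurm.conf: declare the GRES kind and tag the node with a feature
# (the feature name "tesla" is hypothetical)
GresTypes=gpu
NodeName=node006 Gres=gpu:4 Feature=tesla

# job script: request one untyped GPU and constrain by feature
#SBATCH --gres=gpu:1
#SBATCH --constraint=tesla
```

With this layout the job asks for gpu:1 rather than gpu:tesla:1, and the GPU model is selected through --constraint instead of a GRES type.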
Sample working slurm scripts are here
/home/julien/slurm-scripts
at least with the script below

julien@node009:~/projects/Thrombin/dataset00001/somd/3RML~3RMM/free/output/lam-0.00$ cat serial-gpu.sh
#!/bin/sh
#SBATCH -o somd-serial-gpu.out
#SBATCH -p GPU
#SBATCH --gres=gpu:tesla:1
#SBATCH --time 24:00:00
source /etc/profile.d/modules.sh
module load cuda
module load openmm
srun --gres=gpu:tesla:1 ~/sire.app/bin/somd-freenrg -C ../../input/freenrg.cfg -l 0.00 -p CUDA
wait
I get

julien@node009:~/projects/Thrombin/dataset00001/somd/3RML~3RMM/free/output/lam-0.00$ sbatch serial-gpu.sh
sbatch: error: Batch job submission failed: Requested node configuration is not available
But the resources gpu:tesla seem to exist on this node
julien@node009:~/projects/Thrombin/dataset00001/somd/3RML~3RMM/free/output/lam-0.00$ scontrol show node node006
NodeName=node006 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=0 CPUErr=0 CPUTot=32 CPULoad=1.09 Features=(null)
   Gres=gpu:tesla:4
   NodeAddr=node006 NodeHostName=node006 Version=14.11
   OS=Linux RealMemory=1 AllocMem=0 Sockets=32 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1
   BootTime=2015-06-08T16:57:41 SlurmdStartTime=2015-07-02T16:37:21
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Is there a typo in the submission script? I can't see anything obvious.
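When comparing what a node advertises with what a job requests, it can help to pull single fields out of the scontrol output rather than reading the whole record. The snippet below is a minimal sketch of that idea: node_field is a hypothetical helper (not part of slurm), and the sample string is a shortened copy of the node006 record quoted above.

```shell
#!/bin/sh
# node_field KEY: read `scontrol show node` output on stdin and print
# the value of the KEY=value field (hypothetical helper, not a slurm tool).
node_field() {
    # Put each KEY=value token on its own line, then strip the "KEY=" prefix.
    tr ' ' '\n' | sed -n "s/^$1=//p"
}

# Shortened sample of the node006 record from this thread.
sample='NodeName=node006 Arch=x86_64 Features=(null) Gres=gpu:tesla:4 State=IDLE'

printf '%s\n' "$sample" | node_field Gres       # prints gpu:tesla:4
printf '%s\n' "$sample" | node_field Features   # prints (null)

# On a live cluster the same helper could be fed directly:
#   scontrol show node node006 | node_field Gres
```

The Gres value shows which GRES string the controller has registered for the node, and Features shows whether any node features are defined, which is what a --constraint request would be matched against.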