radical-cybertools / radical.saga

A Light-Weight Access Layer for Distributed Computing Infrastructure and Reference Implementation of the SAGA Python Language Bindings.
http://radical-cybertools.github.io/saga-python/

TACC Longhorn: `srun: error: Unable to allocate resources: Requested GRES option unsupported by configured SelectType plugin` #835

Closed lee212 closed 3 years ago

lee212 commented 3 years ago

Starting a single job on TACC Longhorn fails. The error message indicates that the failure is caused by the `--gpus` option:

-----------------------------------------------------------------
           Welcome to the Longhorn Supercomputer
-----------------------------------------------------------------

No reservation for this job
--> Verifying valid submit host (login2)...OK
--> Verifying valid jobname...OK
--> Verifying availability of your home dir (/home/06079/tg853783)...OK
--> Verifying availability of your scratch dir (/scratch/06079/tg853783)...OK
--> Verifying valid ssh keys...OK
--> Verifying access to desired queue (development)...OK
--> Checking available allocation (MCB20024)...OK
sbatch: error: Batch job submission failed: Requested GRES option unsupported by configured SelectType plugin

The failure occurs when the script is prepared like this:

#!/bin/sh

#SBATCH -N 1
#SBATCH -n 40
#SBATCH --ntasks-per-node=40
#SBATCH --gpus=4
#SBATCH -J "pilot.0000"
#SBATCH -D "/scratch/06079/tg853783/radical.pilot.sandbox/re.session.login2.longhorn.tacc.utexas.edu.tg853783.018825.0011/pilot.0000/"
#SBATCH --output "bootstrap_0.out"
#SBATCH --error "bootstrap_0.err"
#SBATCH --partition "development"
#SBATCH --time 00:10:00

SAGA appears to build the script with this option, specifically here:

https://github.com/radical-cybertools/radical.saga/blob/2e88d31b0de2cc54511ec9d59da93b57a33d2ce7/src/radical/saga/adaptors/slurm/slurm_job.py#L655
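
One way a script generator could avoid this kind of failure is to emit the GPU directive only on systems known to accept it. The following is a minimal sketch of that idea, not the code at the link above: the helper `build_sbatch_header` and the flag `gpu_directive_supported` are hypothetical names introduced only for illustration.

```python
from typing import List, Optional


def build_sbatch_header(nodes: int,
                        ntasks: int,
                        gpus: Optional[int],
                        gpu_directive_supported: bool) -> List[str]:
    """Return the header lines of a Slurm batch script.

    Hypothetical helper: `gpu_directive_supported` is an assumed per-resource
    setting saying whether the target cluster accepts `--gpus`.
    """
    lines = [
        "#!/bin/sh",
        "",
        f"#SBATCH -N {nodes}",
        f"#SBATCH -n {ntasks}",
    ]
    # Only request GPUs explicitly when the cluster is known to accept it.
    if gpus and gpu_directive_supported:
        lines.append(f"#SBATCH --gpus={gpus}")
    return lines


if __name__ == "__main__":
    # With the flag off, no GPU directive is written to the script.
    print("\n".join(build_sbatch_header(1, 40, 4, gpu_directive_supported=False)))
```

Whether such a switch belongs in the adaptor itself or in a per-resource configuration is a design choice this sketch leaves open.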

The job is submitted successfully if the GRES option `--gpus` is removed from the Slurm script.
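
The manual workaround can be reproduced with a small filter that drops the `--gpus` directive from a generated script before submission. This is only a sketch under stated assumptions: `strip_gpus_directive` is a hypothetical helper, not part of radical.saga, and it only handles the `#SBATCH --gpus=N` and `#SBATCH --gpus N` forms.

```python
import re

# Matches "#SBATCH --gpus=4" or "#SBATCH --gpus 4", but not other options
# that merely start with "--gpus" (for example "--gpus-per-node").
_GPUS_DIRECTIVE = re.compile(r"^#SBATCH\s+--gpus(=|\s).*$")


def strip_gpus_directive(script: str) -> str:
    """Return the batch script text without any `#SBATCH --gpus` line."""
    kept = [line for line in script.splitlines()
            if not _GPUS_DIRECTIVE.match(line)]
    return "\n".join(kept) + "\n"


if __name__ == "__main__":
    example = (
        "#!/bin/sh\n"
        "#SBATCH -N 1\n"
        "#SBATCH --gpus=4\n"
        '#SBATCH -J "pilot.0000"\n'
    )
    print(strip_gpus_directive(example), end="")
```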

The stack used for this test:

  python               : /scratch/06079/tg853783/conda/ddmd/bin/python3
  pythonpath           :
  version              : 3.7.10
  virtualenv           : /scratch/06079/tg853783/conda/ddmd

  radical.entk         : 1.6.7
  radical.gtod         : 1.6.7
  radical.pilot        : 1.6.7
  radical.saga         : 1.6.10
  radical.utils        : 1.6.7