chnixi closed this issue 1 year ago
This may be an issue with the Slurm configuration. The srun command being generated by the wrapper application looks valid:
$ crc-interactive.py -m -c 28 -n 2 -z
srun -M mpi --export=ALL --time=1:00:00 --mem=1g --ntasks-per-node=28 --nodes=2 --pty bash
Running the srun command gives the error you mentioned above. srun should select the default partition, but for some reason it is selecting the partition opa, which does not exist:
$ sinfo -a -M mpi
CLUSTER: mpi
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
opa-high-mem* up infinite 36 alloc opa-n[96-131]
mpi up infinite 1 down* mpi-n61
mpi up infinite 1 drain mpi-n129
mpi up infinite 22 mix mpi-n[6-7,30-35,46-47,54-55,86-87,90-91,107,113-114,130-132]
mpi up infinite 53 alloc mpi-n[28-29,48-53,56-60,62-85,93-106,115-116]
mpi up infinite 59 idle mpi-n[0-5,8-27,36-45,88-89,92,108-112,117-128,133-135]
scavenger up infinite 36 alloc opa-n[96-131]
@Comeani might have a better idea how the default partition is configured.
The previous default partition for the MPI cluster was opa, but it has now been retired. The new default partition is opa-high-mem, which is set correctly in the MPI head node's Slurm configuration, but opa was still being referenced as the default partition in the job_submit.lua on the MPI head node. This has been updated to point to the new default partition, opa-high-mem:
-- If no partition is specified, fall back to the default
if job_desc.partition == nil then
    job_desc.partition = "opa-high-mem"
end
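Since this snippet hard-codes the partition name, the same breakage would recur if opa-high-mem were ever retired. A sketch of a more defensive variant, written against the standard slurm_job_submit signature of the job_submit/lua plugin (the part_list field layout used below is an assumption and should be checked against the site's Slurm version):

```lua
-- Sketch only: harden the default-partition rule in job_submit.lua.
local DEFAULT_PARTITION = "opa-high-mem"

function slurm_job_submit(job_desc, part_list, submit_uid)
    if job_desc.partition == nil then
        -- Only apply the fallback if the partition actually exists,
        -- so a retired name can never be injected again.
        for _, part in pairs(part_list) do
            if part.name == DEFAULT_PARTITION then
                job_desc.partition = DEFAULT_PARTITION
                break
            end
        end
    end
    return slurm.SUCCESS
end
```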
When running the crc-interactive wrapper to request MPI nodes, the default partition is set to opa whenever no partition is explicitly specified via the -p argument. It would be good to set this default to mpi or another valid partition so users don't need to specify it explicitly every time.
$ crc-interactive.py -m -c 28 -n 2
srun: error: invalid partition specified: opa
srun: error: Unable to allocate resources: Invalid partition name specified
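Until the wrapper is patched, passing -p explicitly works around the stale default. A sketch of how the wrapper could substitute a known-valid fallback itself instead, using srun's real --partition flag (the function name and the DEFAULT_PARTITIONS table are hypothetical; the mpi entry is taken from the sinfo listing in this thread):

```python
# Hypothetical per-cluster fallback table; "opa-high-mem" comes from the
# sinfo output above, where the trailing '*' marks it as the default.
DEFAULT_PARTITIONS = {"mpi": "opa-high-mem"}

def build_srun_command(cluster, nodes, ntasks_per_node, partition=None):
    """Assemble a simplified srun command line, substituting a valid
    default partition rather than relying on a stale configured one."""
    partition = partition or DEFAULT_PARTITIONS.get(cluster)
    cmd = ["srun", "-M", cluster, "--export=ALL",
           f"--ntasks-per-node={ntasks_per_node}", f"--nodes={nodes}"]
    if partition is not None:
        cmd.append(f"--partition={partition}")
    cmd += ["--pty", "bash"]
    return cmd

print(" ".join(build_srun_command("mpi", 2, 28)))
# srun -M mpi --export=ALL --ntasks-per-node=28 --nodes=2 --partition=opa-high-mem --pty bash
```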