uabrc / devops-docs

https://docs.rc.uab.edu/devops-docs/
Apache License 2.0

Slurm reservations #36

Open wwarriner opened 1 year ago

wwarriner commented 1 year ago

Creating a reservation

You should be able to create a reservation; I believe Ops, Dev, and DataSci have this authority. Example: a 30-day reservation on nodes c0220-c0223 for 3 users, starting now. The command shows one user; list the others comma-separated in User=.

scontrol create reservation Reservation=$resv-name starttime=now duration=30-00:00:00 Nodes=c\[0220-0223] User=$user-a
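
The placeholders in this thread (`$resv-name`, `$user-a`) are not usable as literal shell variables, because the shell reads `$resv-name` as `$resv` followed by `-name`. A minimal sketch of the same command built from real variables follows. Every value is a hypothetical placeholder, and the function only prints the command so it can be reviewed before running it.

```shell
#!/usr/bin/env bash
# Hypothetical values; replace with the real reservation details.
resv_name="example-resv"          # reservation name (assumption)
resv_users="usera,userb,userc"    # comma-separated user list (assumption)
resv_nodes="c[0220-0223]"         # node range from the example above
resv_duration="30-00:00:00"       # days-hours:minutes:seconds

# Print the command instead of running it; remove "echo" to execute.
# Quoting each argument keeps the shell from globbing the [..] node range.
create_reservation_cmd() {
    echo scontrol create reservation \
        "Reservation=${resv_name}" \
        "StartTime=now" \
        "Duration=${resv_duration}" \
        "Nodes=${resv_nodes}" \
        "Users=${resv_users}"
}

create_reservation_cmd
```

Keeping the node range in a quoted variable also removes the need for the backslash before `[`.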

Using a reservation (for researchers)

Researchers need to pass --reservation=$resv-name to sbatch and srun for their jobs to run in the reservation.
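
For researchers, a batch script can carry the flag as an `#SBATCH` directive instead of passing it on the command line. The sketch below assumes a reservation named `example-resv`. The job name, task count, and time limit are illustrative and not taken from this thread.

```shell
#!/usr/bin/env bash
# Hypothetical job name, reservation name, and limits; adjust before use.
#SBATCH --job-name=resv-test
#SBATCH --reservation=example-resv
#SBATCH --ntasks=1
#SBATCH --time=01:00:00

# Report where the job landed. Slurm sets SLURM_JOB_ID and SLURM_JOB_NODELIST
# inside a job; the fallbacks only matter when the script runs outside Slurm.
report_job() {
    echo "job ${SLURM_JOB_ID:-unset} on ${SLURM_JOB_NODELIST:-unknown}"
}

report_job
```

The command-line equivalents are `sbatch --reservation=example-resv job.sh` and `srun --reservation=example-resv hostname`.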

Updating a reservation

scontrol update reservations Reservation=$resv-name User+=$user-b

Adding nodes to a reservation

If you add new nodes, you'll have to delete the reservation and recreate it. To keep jobs from landing on the nodes in the short window between delete and create, first drain all of the reservation nodes. Make sure to update the node list in every command: the examples below mix c0220-c0223 and c0232-c0235, and the real lists must match.

for node in c0{232..235}; do scontrol_admin update NodeName="$node" State=drain Reason="RCOPS: Creating $resv-name reservation"; done

scontrol delete reservationname=$resv-name
scontrol create reservation Reservation=$resv-name starttime=now duration=35-00:00:00 Nodes=c\[0220-0223] User=$user-a,$user-b

for node in c0{232..235}; do scontrol_admin update NodeName="$node" State=undrain Reason="RCOPS: Created $resv-name reservation"; done
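
One way to keep the node list consistent across all four steps is to define it once and derive every command from it. The sketch below does that under stated assumptions: the reservation name, users, and node range are hypothetical, plain `scontrol` stands in for the site's `scontrol_admin` wrapper, and `DRY_RUN` (a name invented here) defaults to printing the commands rather than running them.

```shell
#!/usr/bin/env bash
# Hypothetical values; the node list covers the existing and the new nodes.
resv_name="example-resv"
resv_users="usera,userb"
resv_duration="35-00:00:00"
nodes=(c0{220..225})
node_csv="$(IFS=,; echo "${nodes[*]}")"   # c0220,c0221,...,c0225

# Print the command when DRY_RUN=1 (the default); run it otherwise.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then echo "$*"; else "$@"; fi
}

recreate_reservation() {
    local node
    # Drain first so no job starts on the nodes between delete and create.
    for node in "${nodes[@]}"; do
        run scontrol update "NodeName=${node}" State=drain \
            "Reason=RCOPS: Creating ${resv_name} reservation" || return 1
    done
    run scontrol delete "ReservationName=${resv_name}" || return 1
    run scontrol create reservation "Reservation=${resv_name}" StartTime=now \
        "Duration=${resv_duration}" "Nodes=${node_csv}" \
        "Users=${resv_users}" || return 1
    # Return the nodes to service once the reservation protects them again.
    for node in "${nodes[@]}"; do
        run scontrol update "NodeName=${node}" State=undrain || return 1
    done
}

recreate_reservation
```

Review the printed commands, then rerun with `DRY_RUN=0` to execute them. Each step returns early on failure so that a failed delete is not followed by a create.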

Canceling/ending a reservation

$ scontrol show res
ReservationName=$resv-name StartTime=2023-10-19T10:14:33 EndTime=2023-11-18T09:14:33 Duration=30-00:00:00
   Nodes=c[0232-0235] NodeCnt=4 CoreCnt=512 Features=(null) PartitionName=(null) Flags=SPEC_NODES
   TRES=cpu=512
   Users=$user-a,$user-b Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a

$ scontrol delete reservationname=$resv-name

wwarriner commented 1 month ago

If you create a reservation to facilitate a research workflow, jobs in it are still subject to QoS and job time limit restrictions. If these are barriers, a temporary partition will also need to be created. The ops team will need to perform the node- and partition-related steps below.

  1. Create partition
  2. Drain desired nodes
  3. When drained, move desired nodes to new partition
  4. Set up reservation per above comment
  5. Perform work
  6. Drain nodes
  7. When drained, move nodes back to original partition
  8. Delete new partition
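
Steps 3 and 7 both wait on "when drained". A small helper can make that check explicit. The sketch below assumes node states are read from `sinfo -h -N -n <nodes> -o "%N %T"`, which prints one "name state" pair per line. The function names are invented here, and the polling interval is arbitrary.

```shell
#!/usr/bin/env bash
# Return 0 only if stdin has at least one "node state" line and every state
# starts with "drained". A node that is still "draining" has running jobs.
all_drained() {
    awk '{ seen = 1; if ($2 !~ /^drained/) bad = 1 }
         END { exit ((seen && !bad) ? 0 : 1) }'
}

# Poll sinfo until every node in the given node list reports drained.
# Not called here; usage: wait_until_drained "c[0232-0235]"
wait_until_drained() {
    until sinfo -h -N -n "$1" -o "%N %T" | all_drained; do
        sleep 30
    done
}
```

Reading states from stdin keeps `all_drained` checkable without a live cluster, for example with `printf 'c0232 drained\n' | all_drained`.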

wwarriner commented 4 weeks ago

A sample partition definition, for posterity:

PartitionName=$name Default=NO MinNodes=1 MaxNodes=5 MaxTime=6-06:00:00 DefaultTime=01:00:00 AllowGroups=ALL DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO ExclusiveUser=NO PriorityJobFactor=1 PriorityTier=8 OverSubscribe=NO State=UP Nodes=c[0232-0235]

To make the partition accept only jobs that name a reservation, set ReqResv=YES (the sample above uses ReqResv=NO). Access is then controlled dynamically through the reservation's user list rather than through the partition definition.
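
As a sketch of toggling this on a running partition, `scontrol update` accepts `ReqResv` as a partition option. The partition name below is hypothetical, and the function only prints the command.

```shell
#!/usr/bin/env bash
part_name="tmp-research"   # hypothetical partition name

# Print the update command; drop "echo" to run it for real.
# The first argument is YES or NO and defaults to YES.
require_reservation_cmd() {
    echo scontrol update "PartitionName=${part_name}" "ReqResv=${1:-YES}"
}

require_reservation_cmd YES   # only jobs naming a reservation are accepted
require_reservation_cmd NO    # back to normal partition access
```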