icl-utk-edu / cluster


Discussion about Slurm queue policies #4

Open abouteiller opened 3 months ago

abouteiller commented 3 months ago

I envision 3 operation modes

In italic are 'nice to have' features that may or may not be difficult to achieve.

Mode 1: shared usage for debugging (default mode for human users)

  1. resources are allocated in non-exclusive mode by default (that is, running srun or salloc without any other qualifier)
  2. multiple users can coexist on the same node at the same time, especially if they requested the same resource explicitly (e.g., -w leconte, or -N 6 -p bezout)
  3. prefer not sharing when possible: if user A calls srun -N 3 -p bezout and user B then calls srun -N 3 -p bezout, the workload should spread across all 6 Bezout nodes before any node is shared (see the config sketch below)

Not needed: fine-grained allocation of resources

Difficulty: a number of "access tokens" may still be required for load-balancing purposes (point 3), and using core allocation as a substitute is a poor fit, because it affects cgroups and the actual access policy to the hardware resources within the allocation.
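
A minimal slurm.conf sketch of what this mode could look like, assuming the cons_tres select plugin, the bezout partition named above, and an invented bezout[1-6] node range: OverSubscribe=FORCE lets several jobs coexist on a node, and CR_LLN biases placement toward the least-loaded nodes, which roughly approximates the "spread before sharing" preference in point 3. Note that this sketch leans on core-level allocation (CR_Core), which is exactly the concern raised in the Difficulty note; it is untested and only meant to anchor the discussion.

```conf
# slurm.conf (sketch, untested; node and partition names are assumptions)
SelectType=select/cons_tres
SelectTypeParameters=CR_Core,CR_LLN   # CR_LLN: prefer the least-loaded nodes first

# Shared debug partition: several jobs may coexist on a node
PartitionName=bezout Nodes=bezout[1-6] Default=YES OverSubscribe=FORCE:4 MaxTime=08:00:00
```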

Mode 2: exclusive usage for production runs (requested by a human user)

  1. resources are allocated in exclusive mode if the user so specifies (how that gets specified is not completely clear yet; srun --exclusive may or may not do what we want, given the requirements for mode 3: backfill), so maybe srun --reservation=exclusive, srun --reservation=exclusive-nightly, etc. (a submission sketch follows this list)
  2. A single user can use the resource; that is, other sharing-mode srun jobs cannot be scheduled there and ssh logins are refused while the job is active (the Slurm PAM module, pam_slurm_adopt, should do that out-of-the-box)
  3. exclusive jobs during the day have a short time limit (e.g., 1 hour) to prevent resource hoarding; exclusive-nightly jobs have a longer limit (e.g., until 7am the next business day).
  4. The exclusive-nightly mode may terminate existing srun and ssh accesses (the Slurm PAM module should be able to do both prevention and termination for ssh access, but ssh termination may require some customization).
  5. exclusive-nightly jobs are uninterruptible until 7am the next business day, but may overstay until a competing shared or exclusive job that would use these resources is actually submitted to the queue
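
A hedged sketch of what the two exclusive flavors could look like from the command line, assuming a standing reservation named exclusive-nightly that repeats daily; the window, node set, user list, and script names are placeholders, and whether --exclusive or a reservation is the right mechanism is exactly the open question in point 1.

```sh
# Admin side (sketch): a standing reservation repeated every night.
# Start date, window, node set, and user list are placeholders.
scontrol create reservation ReservationName=exclusive-nightly \
    StartTime=2025-01-01T18:00:00 Duration=13:00:00 Nodes=ALL \
    Users=abouteiller,mgates3 Flags=DAILY

# Daytime exclusive run, capped at 1 hour (point 3)
srun --exclusive -p bezout -N 3 -t 01:00:00 ./my_benchmark

# Overnight production run submitted into the nightly reservation
sbatch --exclusive --reservation=exclusive-nightly -N 6 -t 12:00:00 run_nightly.sh
```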

Not needed (actually problematic): fine-grained allocation of resources; I want a guarantee that I have the full node and that nothing else is running at the same time (including Jenkins, GH Actions, ...)

Difficulty: if we have the fine-grained allocation scheduler active, we can simply reserve all resources, but users may still want to execute multiple srun calls inside a given salloc/sbatch and spread the sub-jobs however they want; I think that should work out-of-the-box, but it needs to be verified (see the sketch below).
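
To make the "multiple srun inside one allocation" case concrete, this is the kind of batch script that would need to keep working under mode 2; the node counts and executables are made up for illustration.

```sh
#!/bin/bash
#SBATCH --exclusive
#SBATCH -p bezout
#SBATCH -N 6
#SBATCH -t 01:00:00

# Two concurrent job steps carving up the 6-node exclusive allocation;
# Slurm should place them on disjoint nodes, but this needs to be verified.
srun -N 3 -n 3 ./solver_a &
srun -N 3 -n 3 ./solver_b &
wait
```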

Mode 3: GH Actions / backfill

  1. GitHub Actions, Jenkins, and other automations use a backfill scheduler
  2. The backfill jobs can be interrupted by the arrival of user-created jobs, and that does not cause the CI pipeline to report an error; the pipeline is simply rescheduled for a later time (not sure how difficult that is to actually do)
  3. backfill uses the fine-grained allocation policy, so that we can run more actions at the same time; for example, if we know each action requires only 1 GPU and we have 8, we may run 8 ctest instances simultaneously (see the sketch below)

Difficulty: using fine-grained allocation in one mode forces us to use the fine-grained scheduler in all modes, which we otherwise do not particularly need, and which may complicate how we allocate shared jobs.
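
One way the backfill side could be wired, sketched with made-up partition names: a low-PriorityTier partition for the CI runners whose jobs are requeued (rather than failed) when a higher-priority human job needs the nodes, plus per-GPU requests so several CI jobs can pack onto one node. Whether a REQUEUE preemption maps cleanly onto a rescheduled GitHub Actions run is the open question from point 2; none of this is tested.

```conf
# slurm.conf additions (sketch, untested; assumes gres/gpu is configured on the GPU nodes)
PreemptType=preempt/partition_prio
PreemptMode=REQUEUE                  # preempted CI jobs go back into the queue instead of failing

# Low-priority partition for GitHub Actions / Jenkins runners
# (human-facing partitions such as bezout would carry a higher value, e.g. PriorityTier=10)
PartitionName=backfill Nodes=ALL PriorityTier=1 MaxTime=04:00:00
```

```sh
# CI submission asking for a single GPU, so up to 8 such jobs can share an 8-GPU node
sbatch -p backfill --gres=gpu:1 --requeue ci_ctest.sh
```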

mgates3 commented 3 months ago

I mostly agree with Aurelien's description. Additionally, keeping Slurm would be highly desirable, as people are already familiar with it and it is used on Frontier, Perlmutter, and elsewhere.

For mode 3 (GitHub Actions / backfill), is the suggestion that GitHub Actions would never run (or at least never start) on a node while users are logged into that node (via either mode 1 or 2)? In some ways this is nice, but if the nodes were very busy with users, it could mean that GitHub Actions face starvation. We could see how it works in practice and adjust if there are issues. I have been blocked from merging PRs in the past because someone was using 100% of the GPU memory overnight, so checks could not run (or actually failed), but that was a rare occurrence that was resolved by email.

SLATE is moving towards using 4 GPUs in its CI testing, so that needs to be feasible.
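
For the record, a 4-GPU SLATE CI run would just request the GPUs explicitly; under a fine-grained GPU policy, two such jobs could still share an 8-GPU node. The partition name and script are placeholders carried over from the sketch above.

```sh
# SLATE-style CI test job requesting 4 GPUs on one node (sketch)
sbatch -p backfill -N 1 --gres=gpu:4 --requeue ci_slate_tests.sh
```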