abouteiller opened this issue 3 months ago
I mostly agree with Aurelien's description. Additionally, keeping Slurm would be highly desirable, as people are already familiar with it and it is used on Frontier, Perlmutter, and other places.
For mode 3 (GitHub Actions/backfill): is the suggestion that GitHub Actions would never run (or at least never start) on a node when users are logged into that node (via either mode 1 or 2)? In some ways this is nice, but if nodes were very busy with users, it could mean that GitHub Actions face starvation. We could see how it worked and adjust if there were issues. I have been blocked from merging PRs in the past because someone was using 100% of the GPU memory overnight, so checks could not run (or actually failed) — but that was a rare occurrence that was resolved by email.
SLATE is moving towards using 4 GPUs in its CI testing, so that needs to be feasible.
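For concreteness, a 4-GPU CI job under Slurm could be requested like this (a sketch only; the `bezout` partition name is taken from the discussion below, while the GRES setup and the `run_tests.sh` driver are assumptions):

```shell
#!/bin/bash
#SBATCH --job-name=slate-ci
#SBATCH --partition=bezout    # partition name from this thread
#SBATCH --nodes=1
#SBATCH --gres=gpu:4          # requires gres.conf to expose 4 GPUs per node
#SBATCH --time=01:00:00

srun ./run_tests.sh           # hypothetical CI test driver
```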
I envision 3 operation modes. Features in *italics* are 'nice to have' and may or may not be difficult to achieve.
**Mode 1: shared usage for debugging (default mode for human users)**

- `srun` or `salloc` (without any other qualifier); nodes can be selected with `-w leconte`, `-N6 -p bezout`, etc.
- *load balancing: if user A calls `srun -N3 -p bezout` and user B then calls `srun -N3 -p bezout`, the workload should spread over all 6 Bezout nodes before we reuse a node*
- Not needed: fine-grained allocation of resources
- Difficulty: a number of "access tokens" may still be required for load-balancing purposes (3), and using core allocation as a substitute is poor, because it has an effect on `cgroups` and the other actual access policies to the hardware resources within the allocation.

**Mode 2: exclusive usage for production runs (human user requested)**
- `srun --exclusive` may or may not do what we want (based on the requirements for mode 3: backfill), so maybe `srun --reservation=exclusive`, `srun --reservation=exclusive-nightly`, etc.
- other `srun` and `ssh` cannot log in while the job is active (`slurm_pam` should do that out-of-the-box)
- *the `exclusive-nightly` mode may terminate existing `srun` and `ssh` accesses (the `slurm_pam` module should be able to do both prevention and termination for ssh access, but ssh termination may require some customization)*
- Not needed (actually problematic): fine-grained allocation of resources; I want a guarantee that I have a full node and that nothing else is running at the same time (including Jenkins, GH Actions, ...)
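As a sketch of how the ssh side could be wired up (assuming the `pam_slurm_adopt` module that ships with Slurm; the reservation name matches the one proposed above, and the flags/ACLs are placeholders that would need tuning):

```shell
# /etc/pam.d/sshd — deny ssh to users without a running job on the node,
# and adopt permitted ssh sessions into the job's cgroup
account    required     pam_slurm_adopt.so

# Standing reservation that users can request with --reservation=exclusive
# (Users/Flags here are illustrative only)
scontrol create reservation ReservationName=exclusive \
    StartTime=now Duration=UNLIMITED Nodes=ALL Users=root Flags=FLEX
```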
- Difficulty: if we have the fine-grained allocation scheduler active, we can simply reserve all resources, but users may still want to execute multiple `srun` inside a given `salloc`/`sbatch` and spread sub-jobs however they want; I think that should work out-of-the-box, but it needs to be verified.

**Mode 3: GH Actions/backfill**

- Difficulty: using fine-grained allocation in one mode forces us to use the fine-grained scheduler in all modes, which we don't care much about and which may complicate how we allocate shared jobs.
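One way to realize the backfill mode without fine-grained allocation could be a low-priority partition overlapping the same nodes, which the GH Actions runner submits to. This is a sketch, not a tested configuration; the partition names reuse `bezout` from above and the preemption settings would need verification:

```shell
# slurm.conf (fragment, hypothetical values)
SchedulerType=sched/backfill
PreemptType=preempt/partition_prio
PreemptMode=REQUEUE

# Same nodes in two partitions: human/exclusive use outranks CI jobs,
# so CI only backfills idle nodes and is requeued when users arrive.
PartitionName=bezout Nodes=bezout[1-6] PriorityTier=10 Default=YES
PartitionName=ci     Nodes=bezout[1-6] PriorityTier=1  PreemptMode=REQUEUE
```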