Add stacked cgroup and affinity TaskPlugins for Slurm, as recommended, to restrict jobs to the node resources they requested.
Note this requires cgroup.conf to be defined, which it is by the currently-used openhpc role, see here.
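For context, the stacked plugins correspond to configuration along these lines (a sketch only; the actual files are templated by the openhpc role, and the cgroup.conf values shown are illustrative):

```
# slurm.conf
TaskPlugin=task/cgroup,task/affinity

# cgroup.conf (must exist for task/cgroup to load)
ConstrainCores=yes
```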
Given the non-obvious nature of this change, manual testing was carried out as follows.
Compute nodes were changed to Leafcloud en1.medium to get 2x CPUs.
Note that sbatch can't be used directly to test that the restriction works, as srun rejects oversubscribed steps up front:
$ sbatch --ntasks=1 --wrap "srun --ntasks=2 hostname"
srun: error: Unable to create step for job 2: More processors requested than permitted
So a python multiprocessing program mp.py was created (see below) and run:
$ sbatch -n 1 mp.py
[rocky@debug-login-0 tests]$ cat slurm-4.out slurm-5.out
# without TaskPlugin
n procs: 2
Hello from Process 1! - Running on CPU core(s): {0, 1}
Hello from Process 2! - Running on CPU core(s): {0, 1}
Both processes have finished.
# with TaskPlugin: task/cgroup,task/affinity
n procs: 2
Hello from Process 2! - Running on CPU core(s): {0}
Hello from Process 1! - Running on CPU core(s): {0}
Both processes have finished.
All the above was carried out using (default) RL9. Additional testing was also carried out using RL8:
site: OK
hpctests: OK
mp.py: OK
[rocky@debug-login-0 tests]$ sbatch -n 1 mp.py
Submitted batch job 11
[rocky@debug-login-0 tests]$ cat slurm-11.out
n procs: 2
Hello from Process 1! - Running on CPU core(s): {0}
Hello from Process 2! - Running on CPU core(s): {0}
Both processes have finished.
checked for slurmctld errors: OK
mp.py:
#!/usr/bin/env python3
import multiprocessing, os

print('n procs:', multiprocessing.cpu_count())

def print_message(message):
    core_number = os.sched_getaffinity(0)
    print(f"{message} - Running on CPU core(s): {core_number}")

if __name__ == "__main__":
    # Define messages for each process
    messages = ["Hello from Process 1!", "Hello from Process 2!"]

    # Create two processes
    processes = []
    for msg in messages:
        process = multiprocessing.Process(target=print_message, args=(msg,))
        processes.append(process)

    # Start each process
    for process in processes:
        process.start()

    # Wait for both processes to finish
    for process in processes:
        process.join()

    print("Both processes have finished.")
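As a quicker sanity check than mp.py, the affinity mask of any process can be inspected directly with the standard library; under the stacked task/cgroup,task/affinity plugins the printed set should match the job's allocated cores:

```python
import os

# Show the set of CPU cores this process is permitted to run on.
# Inside a Slurm job with the TaskPlugins active, this is the allocation,
# e.g. {0} for a single-task job on these 2-CPU nodes.
print("allowed cores:", os.sched_getaffinity(0))
```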