psnc-qcg / QCG-PilotJob

The QCG Pilot Job service for execution of many computing tasks inside one allocation
Apache License 2.0

Incorrect core numbers / cpu ids passed to taskset #164


LourensVeen commented 2 years ago

When running under Slurm, QCG-PilotJob gets the assigned resources from Slurm, and then uses taskset to tie jobs to the CPU core ids reported by Slurm. I'm getting errors here, because the Slurm-reported core ids don't necessarily match the kernel scheduler's cpu ids: https://slurm.schedmd.com/cpu_management.html#numbering

I'm running on a machine with Intel Xeon E5-2680V4 CPUs. These have 14 physical cores, 28 threads, and a funny mapping of Linux logical core ids to physical core ids (which the kernel gets via ACPI from the firmware):

processor 0   1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27
core id   0   2   4   6   9   11  13  0   2   4   6   9   11  13  1   3   5   8   10  12  14  1   3   5   8   10  12  14

Note the lack of a physical core 7: the numbers go 0-6, 8-14! Also, /proc/cpuinfo on these things reports two sockets, even though as far as I can tell this model is a single 14-core CPU. Perhaps Intel stuck two 8-core dies, with one core disabled in each, into a single package? Well, however it happens, this is what we get.

The cluster has hyperthreading enabled, but my model isn't helped by it, so I want to have only the 14 physical cores available. To achieve this, I run with sbatch -N <n> --ntasks-per-node 14 <job.sh>. On most machines, that will give you Slurm core ids 0-13, corresponding to Linux core ids 0-13. On this machine, with -N 3 --ntasks-per-node 14, I get this output from scontrol show -o --detail job <job_id> (irrelevant bits cut out):

Nodes=c06s25 CPU_IDs=0-13 Mem=0 GRES=
Nodes=c06s26 CPU_IDs=3-16 Mem=0 GRES=
Nodes=c06s27 CPU_IDs=0-13 Mem=0 GRES=

So Slurm gives us 0-13 on two nodes, but 3-16 on the other. And looking at the mapping above, cpu ids 0-13 map to physical cores 0, 2, 4, 6, 9, 11 and 13, each twice, leaving 1, 3, 5, 8, 10, 12 and 14 idle, which is definitely not what we want.
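To make that double-booking concrete, here is a small worked example in Python, with the processor-to-core-id mapping from the table above hard-coded for illustration:

# Processor -> physical core id mapping, copied from the table above.
proc_to_core = {
    0: 0, 1: 2, 2: 4, 3: 6, 4: 9, 5: 11, 6: 13,
    7: 0, 8: 2, 9: 4, 10: 6, 11: 9, 12: 11, 13: 13,
    14: 1, 15: 3, 16: 5, 17: 8, 18: 10, 19: 12, 20: 14,
    21: 1, 22: 3, 23: 5, 24: 8, 25: 10, 26: 12, 27: 14,
}

# Using logical cpus 0-13 directly would keep only 7 physical cores busy.
used = {proc_to_core[p] for p in range(14)}
idle = set(proc_to_core.values()) - used
print(sorted(used))  # [0, 2, 4, 6, 9, 11, 13]
print(sorted(idle))  # [1, 3, 5, 8, 10, 12, 14]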

So what do we actually get? Slurm seems to set the scheduler affinity of the job script, and in a way that matches the hardware, not the output of scontrol. This program

import os

print(os.sched_getaffinity(0))

prints

{0, 1, 2, 3, 4, 5, 6, 14, 15, 16, 17, 18, 19, 20}

on a node where we have Slurm logical cores 0-13. Looking at the mapping above, that actually makes sense, as these are on 14 different physical cores.

Unfortunately, QCG-PilotJob will pass core ids in the range 0-13 to taskset -c when it launches a job, and taskset will report an error if the number is in the range 7-13, because those aren't actually available in the cgroup it's in.
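To illustrate the failure mode, here is a small, hypothetical reproduction (not QCG-PilotJob code): pick a cpu id outside the current affinity mask and hand it to taskset.

import os
import subprocess

# Find a cpu id that is NOT in this process's affinity mask; passing it
# to taskset should fail, mirroring what QCG-PilotJob runs into.
allowed = os.sched_getaffinity(0)
missing = [c for c in range(os.cpu_count()) if c not in allowed]

if missing:
    result = subprocess.run(['taskset', '-c', str(missing[0]), 'true'],
                            capture_output=True, text=True)
    print(result.returncode, result.stderr.strip())
else:
    print('all cpus are available here; run this inside the Slurm allocation')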

This seems a bit tricky to solve. One way to do it would be to have the manager ask Slurm only for which nodes it has, then launch the agents, and then have each agent inspect the local resources and pass those back to the manager. But that would require quite a few plumbing changes to implement.
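A rough, hypothetical sketch of the agent-side part of that first option (the names here are made up for illustration, not QCG-PilotJob's actual API):

import os
import socket

def inspect_local_resources():
    # Report the cpu ids this agent is actually allowed to use on this
    # node, as seen by the kernel (i.e. the cpuset Slurm created).
    return {
        'node': socket.gethostname(),
        'cpu_ids': sorted(os.sched_getaffinity(0)),
    }

# The agent would send this back to the manager, which then schedules
# against the reported cpu ids instead of the scontrol output.
print(inspect_local_resources())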

Another option would be to have the manager ask Slurm for core ids as it does now, but only take the total number and always remap it to 0-(n-1) (so the 3-16 node above would get 0-13 like the others). These numbers would then be reported by parse_slurm_resources() and used by the Scheduler to allocate resources, then sent to the agents. On start-up, each agent would call os.sched_getaffinity(0), convert the output to a (sorted?) list, and index into it to figure out which cores to pass to taskset. That doesn't require any architectural changes, so it should be easier to implement.
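A minimal sketch of that remapping on the agent side, assuming the manager hands out indices 0-(n-1) as described (illustrative names only, not the actual QCG-PilotJob code):

import os
import subprocess

# On agent start-up: the cpu ids the kernel will actually let us use,
# in a stable (sorted) order we can index into.
available_cpus = sorted(os.sched_getaffinity(0))

def launch_pinned(cmd, core_indices):
    # core_indices are the manager's remapped ids in 0..n-1; translate
    # them to real scheduler cpu ids before handing them to taskset.
    cpu_list = ','.join(str(available_cpus[i]) for i in core_indices)
    return subprocess.Popen(['taskset', '-c', cpu_list] + list(cmd))

# e.g. a task the scheduler assigned to (remapped) cores 7 and 8:
# launch_pinned(['./my_model'], [7, 8])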

pkopta commented 2 years ago

I will try to implement the second option, which sounds very reasonable to me.

pkopta commented 2 years ago

A proposed solution can be found in #169

LourensVeen commented 2 years ago

Thanks! Looks like you went for option 1 after all? I'll try to test this on the machine with the funny CPUs and see if it works; I'll let you know.

pkopta commented 1 year ago

Hi @LourensVeen, did you manage to test this fix?

LourensVeen commented 1 year ago

Ah, not yet, and I have a workshop next week, but I'll try ASAP.