simonsobs-uk / data-centre

This tracks the issues in the baseline design of the SO:UK Data Centre at Blackett
https://souk-data-centre.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Jobs can over-subscribe into each other when sharing the same worker node #10

Open ickc opened 1 year ago

ickc commented 1 year ago

@rwf14f, I am encountering some strange behavior that looks like an interactive job and a batch job over-subscribing the same worker node.

This concerns the following 2 jobs:

OWNER    BATCH_NAME    SUBMITTED   DONE   RUN    IDLE  TOTAL JOB_IDS
khcheung ID: 309      7/19 13:55      _      1      _      1 309.0
khcheung ID: 331      7/22 00:49      _      1      _      1 331.0

Job 309 is an interactive job that landed on . It was started with the following config:

RequestMemory=16384
RequestCpus=8
queue

This node has 8 physical CPUs (2 sockets of Intel(R) Xeon(R) Gold 5122, each with 4 cores), so this reservation should have made the node exclusive to the job.

Strangely, job 331 apparently landed on the same node, according to this line of the log:

001 (331.000.000) 2023-07-22 00:50:11 Job executing on host: <195.194.108.112:9618?addrs=195.194.108.112-9618+[2001-630-22-d0ff-b226-28ff-fe53-755c]-9618&alias=wn1906370.in.tier2.hep.manchester.ac.uk&noUDP&sock=startd_2449_6386>

This was submitted to the vanilla universe with this bit of relevant config:

request_cpus            = 8
request_memory          = 32G
request_disk            = 32G

So these 2 jobs are apparently over-subscribed onto the same node.

rwf14f commented 1 year ago

They use SMT:

lscpu | grep "^CPU(s)"
CPU(s):                16

Two 8-core jobs fit on there just fine.

ickc commented 1 year ago

@rwf14f, that's what I feared. Usually when people request CPUs, they are requesting cores, not threads. E.g. you can see from the environment variables that the various ..._NUM_THREADS are set to 8 when 8 CPUs are requested. With SMT threads counted as CPUs, this over-subscribes the node, and most CPU-intensive workloads suffer from it. The default should be binding to cores, not threads; only in cases where multithreading is shown to be beneficial after profiling should one bind to threads instead.
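
For illustration, a quick diagnostic along these lines (a sketch only, assuming Python 3 and lscpu are available inside the job; the exact set of *_NUM_THREADS variables exported for a job is an assumption here) shows how the requested CPUs relate to SMT threads:

#!/usr/bin/env python3
"""Print the thread-count variables a job sees next to the node's SMT layout."""
import os
import subprocess

# Which *_NUM_THREADS variables are exported is an assumption; adjust as needed.
for var in ("OMP_NUM_THREADS", "MKL_NUM_THREADS",
            "OPENBLAS_NUM_THREADS", "NUMEXPR_NUM_THREADS"):
    print(f"{var} = {os.environ.get(var, '<unset>')}")

print("logical CPUs on this node:", os.cpu_count())

# Thread(s) per core > 1 means the logical CPUs counted above are SMT threads, not cores.
out = subprocess.run(["lscpu"], capture_output=True, text=True, check=True).stdout
for line in out.splitlines():
    if line.startswith(("Thread(s) per core", "Core(s) per socket", "Socket(s)")):
        print(line)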

rwf14f commented 1 year ago

Remember that this is an HTC cluster for HEP. Using SMT works for us and our workloads don't have a problem with it. When you provide your own hardware, switching off SMT on those nodes is an option if you wish to do so.

ickc commented 1 year ago

@rwf14f, I think the more important question is whether any constraint exists to avoid over-subscription. I.e. if 2 different jobs (Job A and Job B) have their HTCondor processes (Job A process 1, Job B process 1) assigned to the same physical node, multithreading or not, would CPUs be bound to the individual processes, such that Job A process 1 has exclusive access to, say, the first 8 logical CPUs and Job B process 1 has exclusive access to the other 8?

Thanks.

Edit: to clarify the question, my worry is the case where Job A is submitted by us and tries to bind to CPU cores, while Job B is submitted by the HEP people and binds to CPU threads. Would they then over-subscribe each other? E.g. if each has exclusive access to its own pool of (logical) CPU cores then it should be fine.
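
One way to check this empirically (a minimal sketch, assuming Python 3 is available inside both jobs) is to print the CPU affinity mask seen by each job's process and compare the two sets:

#!/usr/bin/env python3
"""Print which logical CPUs this process is allowed to run on.
Run the same script from Job A and Job B on the shared node: if the two
printed sets overlap (or both cover all CPUs), there is no exclusive pinning."""
import os

allowed = sorted(os.sched_getaffinity(0))  # affinity mask of the current process
print(f"PID {os.getpid()} may run on {len(allowed)} logical CPUs: {allowed}")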

ickc commented 1 year ago

Assigned to me just now, to see if this can be configured from our side first.

rwf14f commented 1 year ago

We're not using CPU pinning or similar; it's the OS that decides which process runs on which (SMT) core. Are you trying to use pinning with MPI options? Are you also worried about NUMA nodes?

ickc commented 1 year ago

Hi @rwf14f, I'm not worrying about NUMA yet. I'm trying to understand the implications of having 2 different HTCondor processes on the same physical node and whether each of them has "exclusive" access to its own pool of CPUs (i.e. CPU affinity / pinning here). I.e. I'm trying to understand whether an over-subscribed process will interfere with another, non-over-subscribed process.

I'll run some tests today and figure out the implications first. Thanks!

ickc commented 9 months ago

Our documentation now mentions this, and the wrapper script I provided to launch MPI jobs has "fixed" this from our side. It still doesn't prevent other processes on the same physical node from over-subscribing into our requested CPUs, but that is a "won't fix" issue.

The solution would be for us to always request a whole physical node, which is easier said than done as this is not the kind of constraint HTCondor allows. Constraints on the kind of nodes to launch on would be needed too.

And for those situations where we have to share the same physical node with other processes, the remedy, strangely enough, is to over-subscribe from our side too (which is the default if not using my wrapper scripts). I will document this in more detail.
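
For reference, the idea behind the wrapper is roughly as follows (a minimal sketch only, not the actual wrapper script; it assumes lscpu is available and that the libraries in use honour these environment variables):

#!/usr/bin/env python3
"""Launch a command with *_NUM_THREADS pinned to the number of physical cores
allotted to the job, rather than the logical (SMT) CPU count."""
import os
import subprocess
import sys

def threads_per_core() -> int:
    out = subprocess.run(["lscpu"], capture_output=True, text=True, check=True).stdout
    for line in out.splitlines():
        if line.startswith("Thread(s) per core"):
            return int(line.split(":")[1])
    return 1

# Logical CPUs available to this job, divided by the SMT factor = physical cores.
logical = len(os.sched_getaffinity(0))
physical = max(1, logical // threads_per_core())

env = dict(os.environ)
for var in ("OMP_NUM_THREADS", "MKL_NUM_THREADS",
            "OPENBLAS_NUM_THREADS", "NUMEXPR_NUM_THREADS"):
    env[var] = str(physical)

# Hypothetical usage: python3 wrapper.py mpirun -n 4 ./my_program
sys.exit(subprocess.call(sys.argv[1:], env=env))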

ickc commented 7 months ago

Reopened in light of https://github.com/simonsobs-uk/data-centre/issues/35#issuecomment-1857536965

Copied from a presentation on 2023-08-02 from an internal DC meeting:

Over-subscription

Clearing up some confusion on the matter of oversubscription:

  • not about whether SMT should be enabled in the BIOS (enabling it is good)
  • not about whether request_cpus is a factor of 8.
  • it is not even about whether request_cpus means logical cores.
  • it is a combination of
    1. having SMT enabled such that a CPU core is by default a logical core (again, a good choice),
    2. request_cpus=N from HTCondor will then assign N logical cores to the job (which is normal), and
    3. Blackett's configuration by default sets OMP_NUM_THREADS=N. Together with the above, this means over-subscription. See Process and Thread Affinity - NERSC Documentation for example (this does not explain why, but tells you the best practice; the why is related to how SMT works, and to the fact that we are focusing on scientific computing, e.g. when people parallelize a region using OpenMP, in a typical situation all cores are doing the same thing on the same kind of data, which makes SMT useless in this case.)
  • it is also partially related to the fact that we cannot have exclusive nodes. In HTCondor, nodes are always shared (heterogeneous nodes are a contributing factor: it is hard to get N processes with M CPUs each to land on N whole physical nodes).

Obviously one can override OMP_NUM_THREADS within their own job, so the main question is whether another HTCondor process can interfere with your process when over-subscription is occurring.

This is also related to having our own "partition" on Blackett vs. sharing the whole machine with others. This does not imply DIY with "significant overhead".

Over-subscription test

  • simple mat-mul, $n = 20,000$ (see the sketch below)
  • 2 HTCondor processes on the same physical node
  • not over-subscribed: 0:21.90 (min:sec)
  • over-subscribed: 1:01.63 (min:sec)
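
The exact test script is not reproduced here, but a minimal sketch of this kind of test (assuming NumPy linked against a multi-threaded BLAS) looks like:

#!/usr/bin/env python3
"""Minimal matrix-multiplication timing test, in the spirit of the
over-subscription test above (a sketch, not the exact script used).
Run one copy per job on the shared node and compare wall-clock times."""
import time

import numpy as np

n = 20_000  # matches $n$ above; needs roughly 10 GB of memory in float64
rng = np.random.default_rng(0)
a = rng.standard_normal((n, n))
b = rng.standard_normal((n, n))

t0 = time.perf_counter()
c = a @ b  # BLAS uses *_NUM_THREADS threads; over-subscription shows up as a slowdown here
print(f"{n}x{n} matmul took {time.perf_counter() - t0:.2f} s, checksum {c[0, 0]:.6g}")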

The information is brief as it was an internal meeting, but the key point is that I requested 2 separate jobs landing on the same physical node (each occupying half of it) and I was able to watch both of them over-subscribe into the CPU usage of the whole node, which resulted in much slower performance compared to not over-subscribing. I should emphasize that the over-subscription from the 1st job is affecting the 2nd job, so it most probably means cgroups are not working/configured as expected, which seems to contradict what is said in https://github.com/simonsobs-uk/data-centre/issues/35#issuecomment-1857536965:

If a job requests 1 CPU but attempts to use 2 then cgroups will ensure that it's only using 1, but only on a fully loaded worker node where all resources are claimed and used by jobs.
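
A rough way to check whether any cgroup CPU quota is actually being applied to a job (a sketch only; the paths assume a standard Linux cgroup v1 or v2 layout and may differ on the worker nodes, and a limit may also sit on an ancestor cgroup):

#!/usr/bin/env python3
"""Report the CPU quota, if any, on the current process's cgroup."""
from pathlib import Path

def cgroup_rel_path() -> str:
    # /proc/self/cgroup lines look like "0::/some/path" (v2)
    # or "N:cpu,cpuacct:/some/path" (v1).
    for line in Path("/proc/self/cgroup").read_text().splitlines():
        hier, controllers, path = line.split(":", 2)
        if hier == "0" or "cpu" in controllers.split(","):
            return path.lstrip("/")
    return ""

rel = cgroup_rel_path()
v2_max = Path("/sys/fs/cgroup") / rel / "cpu.max"
v1_quota = Path("/sys/fs/cgroup/cpu") / rel / "cpu.cfs_quota_us"

if v2_max.exists():
    # "max 100000" means no hard quota; "800000 100000" would cap the cgroup at 8 CPUs.
    print("cgroup v2 cpu.max:", v2_max.read_text().strip())
elif v1_quota.exists():
    # -1 means no hard quota is enforced on this cgroup.
    print("cgroup v1 cpu.cfs_quota_us:", v1_quota.read_text().strip())
else:
    print("no cpu quota file found for cgroup:", rel or "/")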

ickc commented 7 months ago

More notes:

It could be partly due to a bug in HTCondor; see point (3) in https://github.com/simonsobs-uk/data-centre/issues/35#issuecomment-1857918798.

The recommendation that we can implement immediately is to set *_NUM_THREADS equal to the number of physical cores instead of logical cores. See the notes above and point (1) in https://github.com/simonsobs-uk/data-centre/issues/35#issuecomment-1857918798.

ickc commented 7 months ago

@rwf14f, I just found out that some nodes seem to have multithreading disabled, such as

wn3805340.tier2.hep.manchester.ac.uk:

$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                16
On-line CPU(s) list:   0-15
Thread(s) per core:    1
Core(s) per socket:    8
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Model name:            Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
Stepping:              1
CPU MHz:               1497.711
CPU max MHz:           3000.0000
CPU min MHz:           1200.0000
BogoMIPS:              4190.53
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              20480K
NUMA node0 CPU(s):     0,2,4,6,8,10,12,14
NUMA node1 CPU(s):     1,3,5,7,9,11,13,15
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 invpcid_single ssbd rsb_ctxsw ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a rdseed adx smap intel_pt xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts md_clear spec_ctrl intel_stibp flush_l1d