siderolabs / talos

Talos Linux is a modern Linux distribution built for Kubernetes.
https://www.talos.dev

NUMA not fully functional when running NUMA aware workloads #9121

Closed · dle-hpe closed this issue 2 months ago

dle-hpe commented 2 months ago

Bug Report

NUMA not fully functional when running NUMA aware workloads

Description

When enabling the following kubelet feature flags: CPUManager, MemoryManager, and TopologyManager, setting topology-manager-policy: restricted causes the deployment to fail with the error Resources cannot be allocated with Topology locality.

If we change to topology-manager-policy: best-effort, the deployment works and NUMA appears enabled, but NUMA functionality does not actually work, which blocks us from getting RDMA working with our GPUs and Mellanox cards. We expected to see some PIX connections instead of all PHB connections.
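
For reference, the rejection can also be seen from the Kubernetes side with generic kubectl commands (pod and namespace names below are placeholders); the failed pod's status/events should carry the same Resources cannot be allocated with Topology locality message from the kubelet's Topology Manager admission check:

  kubectl get pods -n <namespace>
  kubectl describe pod <failing-pod> -n <namespace>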


# dmesg |grep -i NUMA
[    0.003675] NUMA: Node 0 [mem 0x00000000-0x0009ffff] + [mem 0x00100000-0x7fffffff] -> [mem 0x00000000-0x7fffffff]
[    0.003677] NUMA: Node 0 [mem 0x00000000-0x7fffffff] + [mem 0x100000000-0xf43fffffff] -> [mem 0x00000000-0xf43fffffff]
 [    7.955899] mempolicy: Enabling automatic NUMA balancing. Configure with numa_balancing= or the kernel.numa_balancing sysctl
[    8.247382] pci_bus 0000:00: Unknown NUMA node; performance will be reduced

# cat /var/log/kern.log |grep -i numa
Jun 28 19:01:58 packer kernel: [    0.000000] No NUMA configuration found
Aug  6 14:57:17 gh-02 kernel: [    0.003675] NUMA: Node 0 [mem 0x00000000-0x0009ffff] + [mem 0x00100000-0x7fffffff] -> [mem 0x00000000-0x7fffffff]
Aug  6 14:57:17 gh-02 kernel: [    0.003677] NUMA: Node 0 [mem 0x00000000-0x7fffffff] + [mem 0x100000000-0xf43fffffff] -> [mem 0x00000000-0xf43fffffff]
Aug  6 14:57:17 gh-02 kernel: [    7.955899] mempolicy: Enabling automatic NUMA balancing. Configure with numa_balancing= or the kernel.numa_balancing sysctl
Aug  6 14:57:17 gh-02 kernel: [    8.247382] pci_bus 0000:00: Unknown NUMA node; performance will be reduced
 # lspci
 00:00.0 Host bridge: Intel Corporation 82G33/G31/P35/P31 Express DRAM Controller
 00:01.0 PCI bridge: Red Hat, Inc. QEMU PCIe Root port
 00:01.1 PCI bridge: Red Hat, Inc. QEMU PCIe Root port
 00:01.2 PCI bridge: Red Hat, Inc. QEMU PCIe Root port
 00:01.3 PCI bridge: Red Hat, Inc. QEMU PCIe Root port
 00:01.4 PCI bridge: Red Hat, Inc. QEMU PCIe Root port
 00:01.5 PCI bridge: Red Hat, Inc. QEMU PCIe Root port
 00:01.6 PCI bridge: Red Hat, Inc. QEMU PCIe Root port
 00:01.7 PCI bridge: Red Hat, Inc. QEMU PCIe Root port
 00:02.0 PCI bridge: Red Hat, Inc. QEMU PCIe Root port
 00:02.1 PCI bridge: Red Hat, Inc. QEMU PCIe Root port
 00:02.2 PCI bridge: Red Hat, Inc. QEMU PCIe Root port
 00:02.3 PCI bridge: Red Hat, Inc. QEMU PCIe Root port
 00:02.4 PCI bridge: Red Hat, Inc. QEMU PCIe Root port
 00:02.5 PCI bridge: Red Hat, Inc. QEMU PCIe Root port
 00:02.6 PCI bridge: Red Hat, Inc. QEMU PCIe Root port
 00:02.7 PCI bridge: Red Hat, Inc. QEMU PCIe Root port
 00:03.0 PCI bridge: Red Hat, Inc. QEMU PCIe Root port
 00:03.1 PCI bridge: Red Hat, Inc. QEMU PCIe Root port
 00:03.2 PCI bridge: Red Hat, Inc. QEMU PCIe Root port
 00:03.3 PCI bridge: Red Hat, Inc. QEMU PCIe Root port
 00:03.4 PCI bridge: Red Hat, Inc. QEMU PCIe Root port
 00:03.5 PCI bridge: Red Hat, Inc. QEMU PCIe Root port
 00:03.6 PCI bridge: Red Hat, Inc. QEMU PCIe Root port
 00:03.7 PCI bridge: Red Hat, Inc. QEMU PCIe Root port
 00:04.0 PCI bridge: Red Hat, Inc. QEMU PCIe Root port
 00:04.1 PCI bridge: Red Hat, Inc. QEMU PCIe Root port
 00:04.2 PCI bridge: Red Hat, Inc. QEMU PCIe Root port
 00:04.3 PCI bridge: Red Hat, Inc. QEMU PCIe Root port
 00:04.4 PCI bridge: Red Hat, Inc. QEMU PCIe Root port
 00:04.5 PCI bridge: Red Hat, Inc. QEMU PCIe Root port
 00:04.6 PCI bridge: Red Hat, Inc. QEMU PCIe Root port
 00:04.7 PCI bridge: Red Hat, Inc. QEMU PCIe Root port
 00:05.0 PCI bridge: Red Hat, Inc. QEMU PCIe Root port
 00:05.1 PCI bridge: Red Hat, Inc. QEMU PCIe Root port
 00:05.2 PCI bridge: Red Hat, Inc. QEMU PCIe Root port
 00:05.3 PCI bridge: Red Hat, Inc. QEMU PCIe Root port
 00:05.4 PCI bridge: Red Hat, Inc. QEMU PCIe Root port
 00:05.5 PCI bridge: Red Hat, Inc. QEMU PCIe Root port
 00:05.6 PCI bridge: Red Hat, Inc. QEMU PCIe Root port
 00:1f.0 ISA bridge: Intel Corporation 82801IB (ICH9) LPC Interface Controller (rev 02)
 00:1f.2 SATA controller: Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 6 port SATA Controller [AHCI mode] (rev 02)
 00:1f.3 SMBus: Intel Corporation 82801I (ICH9 Family) SMBus Controller (rev 02)
 01:00.0 Ethernet controller: Red Hat, Inc. Virtio network device (rev 01)
 05:00.0 SCSI storage controller: Red Hat, Inc. Virtio SCSI (rev 01)
 06:00.0 Communication controller: Red Hat, Inc. Virtio console (rev 01)
 07:00.0 SCSI storage controller: Red Hat, Inc. Virtio block device (rev 01)
 08:00.0 SCSI storage controller: Red Hat, Inc. Virtio block device (rev 01)
 09:00.0 SCSI storage controller: Red Hat, Inc. Virtio block device (rev 01)
 0a:00.0 Infiniband controller: Mellanox Technologies MT2910 Family [ConnectX-7]
 0b:00.0 Infiniband controller: Mellanox Technologies MT2910 Family [ConnectX-7]
 0c:00.0 Infiniband controller: Mellanox Technologies MT2910 Family [ConnectX-7]
 0d:00.0 Infiniband controller: Mellanox Technologies MT2910 Family [ConnectX-7]
 0e:00.0 Infiniband controller: Mellanox Technologies MT2910 Family [ConnectX-7]
 0f:00.0 Infiniband controller: Mellanox Technologies MT2910 Family [ConnectX-7]
 10:00.0 Infiniband controller: Mellanox Technologies MT2910 Family [ConnectX-7]
 11:00.0 Infiniband controller: Mellanox Technologies MT2910 Family [ConnectX-7]
 12:00.0 Bridge: NVIDIA Corporation Device 22a3 (rev a1)
 13:00.0 Bridge: NVIDIA Corporation Device 22a3 (rev a1)
 14:00.0 Bridge: NVIDIA Corporation Device 22a3 (rev a1)
 15:00.0 Bridge: NVIDIA Corporation Device 22a3 (rev a1)
 16:00.0 3D controller: NVIDIA Corporation Device 2330 (rev a1)
 17:00.0 3D controller: NVIDIA Corporation Device 2330 (rev a1)
 18:00.0 3D controller: NVIDIA Corporation Device 2330 (rev a1)
 19:00.0 3D controller: NVIDIA Corporation Device 2330 (rev a1)
 1a:00.0 3D controller: NVIDIA Corporation Device 2330 (rev a1)
 1b:00.0 3D controller: NVIDIA Corporation Device 2330 (rev a1)
 1c:00.0 3D controller: NVIDIA Corporation Device 2330 (rev a1)
 1d:00.0 3D controller: NVIDIA Corporation Device 2330 (rev a1)
 1e:00.0 System peripheral: Broadcom / LSI Device 02b2 (rev b0)
 1f:00.0 System peripheral: Broadcom / LSI Device 02b2 (rev b0)
 20:00.0 System peripheral: Broadcom / LSI Device 02b2 (rev b0)
 21:00.0 System peripheral: Broadcom / LSI Device 02b2 (rev b0)
 22:00.0 Host bridge: Intel Corporation Device 1bfe (rev 11)
 23:00.0 Host bridge: Intel Corporation Device 0998
 24:00.0 Host bridge: Intel Corporation Device 0998
 25:00.0 System peripheral: Intel Corporation Device 09a4 (rev 20)
 26:00.0 Memory controller: PMC-Sierra Inc. Device 4128
 # lscpu
 Architecture:            x86_64
   CPU op-mode(s):        32-bit, 64-bit
   Address sizes:         46 bits physical, 57 bits virtual
   Byte Order:            Little Endian
 CPU(s):                  112
   On-line CPU(s) list:   0-111
 Vendor ID:               GenuineIntel
   Model name:            Intel(R) Xeon(R) Platinum 8462Y+
     CPU family:          6
     Model:               143
     Thread(s) per core:  2
     Core(s) per socket:  28
     Socket(s):           2
     Stepping:            8
     BogoMIPS:            5600.00
     Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology cpuid tsc_known_freq pni pcl
                          mulqdq dtes64 vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr
                          _shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx_
                          vnni avx512_bf16 wbnoinvd arat avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid bus_lock_detect cldemote movdiri movdir64b fsrm md_clear serialize tsxl
                          dtrk amx_bf16 avx512_fp16 amx_tile amx_int8 arch_capabilities
 Virtualization features: 
   Virtualization:        VT-x
   Hypervisor vendor:     KVM
   Virtualization type:   full
 Caches (sum of all):     
   L1d:                   3.5 MiB (112 instances)
   L1i:                   3.5 MiB (112 instances)
   L2:                    224 MiB (56 instances)
   L3:                    32 MiB (2 instances)
 NUMA:                    
   NUMA node(s):          2
   NUMA node0 CPU(s):     0-47
   NUMA node1 CPU(s):     48-111
 Vulnerabilities:         
   Gather data sampling:  Not affected
   Itlb multihit:         Not affected
   L1tf:                  Not affected
   Mds:                   Not affected
   Meltdown:              Not affected
   Mmio stale data:       Unknown: No mitigations
   Retbleed:              Not affected
   Spec rstack overflow:  Not affected
   Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl and seccomp
   Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
   Spectre v2:            Mitigation; Enhanced IBRS; IBPB conditional; RSB filling; PBRSB-eIBRS SW sequence; BHI Syscall hardening, KVM SW loop
   Srbds:                 Not affected
   Tsx async abort:       Mitigation; TSX disabled
 # nvidia-smi topo -m
 ****************************************************************************
 * hwloc 2.9.2 received invalid information from the operating system.
 *
 * Failed with: intersection without inclusion
 * while inserting Group0 (cpuset 0x0000ffff,0xffffffff,0xffff0000,0x0) at Package (P#0 cpuset 0x00ffffff,0xffffffff)
 * coming from: linux:sysfs:numa
 *
 * The following FAQ entry in the hwloc documentation may help:
 *   What should I do when hwloc reports "operating system" warnings?
 * Otherwise please report this error message to the hwloc user's mailing list,
 * along with the files generated by the hwloc-gather-topology script.
 * 
 * hwloc will now ignore this invalid topology information and continue.
 ****************************************************************************
    GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    NIC0    NIC1    NIC2    NIC3    NIC4    NIC5    NIC6    NIC7    CPU Affinity    NUMA Affinity   GPU NUMA ID
 ****************************************************************************
 * hwloc 2.9.2 received invalid information from the operating system.
 *
 * Failed with: intersection without inclusion
 * while inserting Group0 (cpuset 0x0000ffff,0xffffffff,0xffff0000,0x0) at Package (P#0 cpuset 0x00ffffff,0xffffffff)
 * coming from: linux:sysfs:numa
 *
 * The following FAQ entry in the hwloc documentation may help:
 *   What should I do when hwloc reports "operating system" warnings?
 * Otherwise please report this error message to the hwloc user's mailing list,
 * along with the files generated by the hwloc-gather-topology script.
 * 
 * hwloc will now ignore this invalid topology information and continue.
 ****************************************************************************
 GPU0    X  NV18    NV18    NV18    NV18    NV18    NV18    NV18    PHB PHB PHB PHB PHB PHB PHB PHB 0-111   0-1     N/A
 GPU1   NV18     X  NV18    NV18    NV18    NV18    NV18    NV18    PHB PHB PHB PHB PHB PHB PHB PHB 0-111   0-1     N/A
 GPU2   NV18    NV18     X  NV18    NV18    NV18    NV18    NV18    PHB PHB PHB PHB PHB PHB PHB PHB 0-111   0-1     N/A
 GPU3   NV18    NV18    NV18     X  NV18    NV18    NV18    NV18    PHB PHB PHB PHB PHB PHB PHB PHB 0-111   0-1     N/A
 GPU4   NV18    NV18    NV18    NV18     X  NV18    NV18    NV18    PHB PHB PHB PHB PHB PHB PHB PHB 0-111   0-1     N/A
 GPU5   NV18    NV18    NV18    NV18    NV18     X  NV18    NV18    PHB PHB PHB PHB PHB PHB PHB PHB 0-111   0-1     N/A
 GPU6   NV18    NV18    NV18    NV18    NV18    NV18     X  NV18    PHB PHB PHB PHB PHB PHB PHB PHB 0-111   0-1     N/A
 GPU7   NV18    NV18    NV18    NV18    NV18    NV18    NV18     X  PHB PHB PHB PHB PHB PHB PHB PHB 0-111   0-1     N/A
 NIC0   PHB PHB PHB PHB PHB PHB PHB PHB  X  PHB PHB PHB PHB PHB PHB PHB             
 NIC1   PHB PHB PHB PHB PHB PHB PHB PHB PHB  X  PHB PHB PHB PHB PHB PHB             
 NIC2   PHB PHB PHB PHB PHB PHB PHB PHB PHB PHB  X  PHB PHB PHB PHB PHB             
 NIC3   PHB PHB PHB PHB PHB PHB PHB PHB PHB PHB PHB  X  PHB PHB PHB PHB             
 NIC4   PHB PHB PHB PHB PHB PHB PHB PHB PHB PHB PHB PHB  X  PHB PHB PHB             
 NIC5   PHB PHB PHB PHB PHB PHB PHB PHB PHB PHB PHB PHB PHB  X  PHB PHB             
 NIC6   PHB PHB PHB PHB PHB PHB PHB PHB PHB PHB PHB PHB PHB PHB  X  PHB             
 NIC7   PHB PHB PHB PHB PHB PHB PHB PHB PHB PHB PHB PHB PHB PHB PHB  X              

 Legend:

   X    = Self
   SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
   NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
   PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
   PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
   PIX  = Connection traversing at most a single PCIe bridge
   NV#  = Connection traversing a bonded set of # NVLinks

 NIC Legend:

   NIC0: mlx5_0
   NIC1: mlx5_1
   NIC2: mlx5_2
   NIC3: mlx5_3
   NIC4: mlx5_4
   NIC5: mlx5_5
   NIC6: mlx5_6
   NIC7: mlx5_7

 # numactl --show
 policy: default
 preferred node: current
 physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 
 cpubind: 0 1 
 nodebind: 0 1 
 membind: 0 1 
 #

Logs

support.zip

Environment

Machine config with the install and kubelet details:

install:
    bootloader: true
    diskSelector:
      serial: xxxxxxxxxx
    extensions:
    - image: ghcr.io/siderolabs/nvidia-open-gpu-kernel-modules:535.129.03-v1.7.2
    extraKernelArgs:
    - default_hugepagesz=1G
    - hugepagesz=1G
    - hugepages=1950
    - iommu=pt
    - intel_iommu=on
    - amd_iommu=on
    - vfio_iommu_type1.allow_unsafe_interrupts=1
    image: ghcr.io/siderolabs/installer:v1.7.2
    wipe: false
  kernel:
    modules:
    - name: vfio
    - name: vfio-pci
    - name: vfio_iommu_type1
  kubelet:
    defaultRuntimeSeccompProfileEnabled: true
    disableManifestsDirectory: true
    extraArgs:
      cpu-manager-policy: static
      cpu-manager-reconcile-period: 5s
      topology-manager-policy: restricted
      topology-manager-scope: container
      memory-manager-policy: Static
      reserved-memory: '0:memory=3Gi;1:memory=2148Mi'
      kube-reserved: "cpu=4,memory=4Gi"
      system-reserved: "cpu=1,memory=1Gi"
      feature-gates: "TopologyManager=true,CPUManager=true,MemoryManager=true"
      node-labels: metal.sidero.dev/uuid=xxxxxxxxxxxx
    extraConfig:
      registerWithTaints:
      - effect: NoSchedule
        key: xxxxxxxx
        value: "true"
    image: ghcr.io/siderolabs/kubelet:v1.28.6
Talos version (output of talosctl version --nodes <problematic nodes>):
Client:
    Tag:         v1.6.1
    SHA:         0af17af3
    Built:       
    Go version:  go1.21.5 X:loopvar
    OS/Arch:     darwin/amd64
Server:
    NODE:        10.5.19.250
    Tag:         v1.7.2
    SHA:         f876025b
    Built:       
    Go version:  go1.22.3
    OS/Arch:     linux/amd64
    Enabled:     RBAC
smira commented 2 months ago

I'm pretty sure that if you change topology managers, you have to wipe the kubelet state? (@TimJones, is that true?)

So it might not work on the fly, but the easiest clean test is to have those kubelet extraArgs set at initial machine creation.
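
A minimal sketch of that clean test (the node IP and config file name below are placeholders, and it assumes the node can be wiped): reset the node so the kubelet starts from an empty /var/lib/kubelet, then apply a machine config that already contains the topology-related kubelet settings:

  talosctl -n <node-ip> reset --graceful=false --reboot
  # node reboots into maintenance mode with its state (including /var/lib/kubelet) wiped
  talosctl -n <node-ip> apply-config --insecure -f machineconfig.yaml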

frezbo commented 2 months ago

I'm pretty sure that if you change topology managers, you have to wipe the kubelet state?

I believe that's the case.

smira commented 2 months ago

Talos handles some of that, but not all of it, and it tracks the kubelet configuration (extraConfig in machine configuration terms), not extraArgs.

The CPU manager should be handled, but I'm not sure about the other managers.

smira commented 2 months ago

I mean that changes to the topology managers are handled by wiping the kubelet state, but nevertheless the kubelet doesn't recommend changing them on the fly, as it won't affect already-running pods correctly.

dle-hpe commented 2 months ago

We do not change the topology configs on the fly; it's a wipe and redeploy.

dle-hpe commented 2 months ago

Talos handles some of that, but not all of it, and it tracks the kubelet configuration (extraConfig in machine configuration terms), not extraArgs.

The CPU manager should be handled, but I'm not sure about the other managers.

Do you mean adding both the CPUManager feature flag and CPUManager policy configs to extraConfig?

smira commented 2 months ago

Do you mean adding both the CPUManager feature flag and CPUManager policy configs to extraConfig?

This should not make a difference, but in general the kubelet deprecates flags and prefers configuration.
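
For reference, a rough sketch of the same settings expressed via extraConfig (KubeletConfiguration field names) instead of extraArgs; the values are copied from the machine config above, but the exact mapping is an assumption on my part rather than a verified configuration, and feature gates / node labels are left out:

  kubelet:
    extraConfig:
      cpuManagerPolicy: static
      cpuManagerReconcilePeriod: 5s
      memoryManagerPolicy: Static
      topologyManagerPolicy: restricted
      topologyManagerScope: container
      reservedMemory:              # per-NUMA-node reservations, mirroring reserved-memory above
      - numaNode: 0
        limits:
          memory: 3Gi
      - numaNode: 1
        limits:
          memory: 2148Mi
      kubeReserved:
        cpu: "4"
        memory: 4Gi
      systemReserved:
        cpu: "1"
        memory: 1Gi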

But getting back to the issue, what makes you think this is a Talos Linux bug rather than a kubelet bug/misconfiguration?

dle-hpe commented 2 months ago

Do you mean adding both the CPUManager feature flag and CPUManager policy configs to extraConfig?

This should not make a difference, but in general the kubelet deprecates flags and prefers configuration.

But getting back to the issue, what makes you think this is a Talos Linux bug rather than a kubelet bug/misconfiguration?

The kubelet sees the available resources, matches them against the CPU and memory requests, and tries to reserve the resources. It is when the resources get reserved that the error Resources cannot be allocated with Topology locality comes up.
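
For context, with topology-manager-policy: restricted the kubelet rejects at admission any pod whose topology hints cannot be aligned to a NUMA node; for exclusive CPUs and pinned memory that in practice means Guaranteed QoS containers with integer CPU requests. A hypothetical resources snippet of that shape (not the actual workload spec):

  resources:
    requests:
      cpu: "16"              # integer CPU count, eligible for exclusive CPUs under the static CPU manager policy
      memory: 64Gi
      nvidia.com/gpu: "1"    # device-plugin resource; the device manager also contributes topology hints
    limits:
      cpu: "16"
      memory: 64Gi
      nvidia.com/gpu: "1"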

smira commented 2 months ago

This seems to be a pretty "normal" error documented in the Kubernetes docs.

dle-hpe commented 2 months ago

We will remove Talos from the stack, use Ubuntu, and replicate the NUMA-aware deployment.

Should I close this issue?

smira commented 2 months ago

No, let's keep the issue open, but my point is that we can't really help with just the issue, as it's not fully reproducible (it depends on the hardware and workloads). You can use the steps described in the Kubernetes docs above to troubleshoot it a bit further down the stack.

I think the issue is still valid, though, for supporting changes to the other topology managers' state without wiping the node completely.

smira commented 2 months ago

So whether something does or does not work with Ubuntu might give us some ideas, but it would probably be easier to figure out why, in your case, the kubelet can't satisfy the constraints you're specifying.

E.g. you could run talosctl read /var/lib/kubelet/memory_manager_state to see the memory manager state.
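
A companion check along the same lines (node IP is a placeholder, and the JSON shape mentioned in the comment is from memory, so it may differ between kubelet versions):

  talosctl -n <node-ip> read /var/lib/kubelet/cpu_manager_state
  # roughly {"policyName":"static","defaultCpuSet":"...","entries":{...},"checksum":...};
  # the checksum is what gets invalidated when the policy changes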

dle-hpe commented 2 months ago

OK, cool. We will keep the issue open, remove Talos and use Ubuntu, replicate everything else, and report back.

In regard to changing the topology manager settings: the host gets wiped because there is a state file with a checksum. Any topology change will invalidate the checksum and the kubelet will not start.

smira commented 2 months ago

Any topology change will invalidate the checksum and the kubelet will not start.

That's what we have a workaround for, specifically for the CPU manager (but not the others).

smira commented 2 months ago

@dle-hpe any updates on this one?

dle-hpe commented 2 months ago

Still working on getting a test environment set up.

dle-hpe commented 2 months ago

We spent time carefully verifying in a few different scenarios and can confirm that the limitation is being hit at the KubeVirt layer: the emulated NUMA domains are causing issues with our workloads.