vCPUs exposed as threads pinned to non-thread-sibling pCPUs on hosts without SMT when using `dedicatedCpuPlacement` #11749

Open · lyarwood opened this issue 7 months ago

lyarwood commented 7 months ago

/cc @vladikr

What happened:

This is likely open to interpretation around the goals of `dedicatedCpuPlacement`, but at present, on hosts without SMT enabled, vCPUs exposed as threads to the guest OS are pinned to non-sibling pCPUs. Users might be surprised by the performance of workloads running across such threads given what they requested.

What you expected to happen:

The VirtualMachineInstance should not be scheduled until an SMT-enabled compute node is present in the environment.

How to reproduce it (as minimally and precisely as possible):

This reproducer uses https://github.com/kubevirt/kubevirtci/pull/1171:

$ export \
 KUBEVIRT_PROVIDER=k8s-1.28 \
 KUBEVIRT_MEMORY_SIZE=$((16 * 1024))M \
 KUBEVIRT_HUGEPAGES_2M=$((4 * 1024)) \
 KUBEVIRT_DEPLOY_CDI=true \
 KUBEVIRT_CPU_MANAGER_POLICY=static \
 KUBEVIRT_NUM_NUMA_NODES=2 \
 KUBEVIRT_NUM_VCPU=16 \
 KUBEVIRTCI_CONTAINER_SUFFIX=latest \
 KUBEVIRTCI_GOCLI_CONTAINER=quay.io/kubevirtci/gocli:latest
$ cd kubevirtci
$ make cluster-up
$ cd ../kubevirt
$ rsync -av ../kubevirtci/cluster-up/ ./cluster-up/ && rsync -av ../kubevirtci/_ci-configs/ ./_ci-configs/
$ make cluster-sync
[..]
$ ./cluster-up/kubectl.sh patch kv/kubevirt -n kubevirt --type merge -p '{"spec":{"configuration":{"developerConfiguration":{"featureGates": ["CPUManager","NUMA"]}}}}'
[..]
$ ./cluster-up/kubectl.sh apply -f -<<EOF
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  name: dedicated-threads
spec:
  domain:
    cpu:
      threads: 2
      dedicatedCpuPlacement: true
    devices:
      disks:
        - disk:
            bus: virtio
          name: containerdisk
        - disk:
            bus: virtio
          name: cloudinitdisk
    resources:
      requests:
        memory: 2Gi
  volumes:
    - containerDisk:
        image: quay.io/containerdisks/fedora:39
      name: containerdisk
    - cloudInitNoCloud:
        userData: |
          #!/bin/sh
          mkdir -p  /home/fedora/.ssh
          curl https://github.com/lyarwood.keys > /home/fedora/.ssh/authorized_keys
          chown -R fedora: /home/fedora/.ssh
      name: cloudinitdisk
EOF

$ ./cluster-up/virtctl.sh ssh -lfedora dedicated-threads lscpu
[..]
Thread(s) per core:  2
Core(s) per socket:  1
Socket(s):           1
[..]
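
The per-CPU view from inside the guest can also confirm that the two vCPUs are presented as siblings of the same core, using lscpu's extended output (the invocation mirrors the ssh command above):

$ ./cluster-up/virtctl.sh ssh -lfedora dedicated-threads "lscpu -e=CPU,CORE,SOCKET"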

$ ./cluster-up/kubectl.sh exec pods/virt-launcher-dedicated-threads-6lwcm -- virsh vcpuinfo 1
selecting podman as container runtime
VCPU:           0
CPU:            1
State:          running
CPU time:       8.0s
CPU Affinity:   -y--------------

VCPU:           1
CPU:            2
State:          running
CPU time:       4.3s
CPU Affinity:   --y-------------
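
For reference, one way to confirm that the two pinned pCPUs are not thread siblings is to read the host's sysfs topology directly; on a host without SMT each CPU should list only itself:

$ ./cluster-up/ssh.sh node01 cat /sys/devices/system/cpu/cpu1/topology/thread_siblings_list
$ ./cluster-up/ssh.sh node01 cat /sys/devices/system/cpu/cpu2/topology/thread_siblings_list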

$ ./cluster-up/ssh.sh node01 lscpu
[..]
Thread(s) per core:  1
Core(s) per socket:  16
Socket(s):           1
[..]
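
The host's SMT state is also exposed directly through sysfs; the file reads 1 when SMT is active and 0 otherwise:

$ ./cluster-up/ssh.sh node01 cat /sys/devices/system/cpu/smt/active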

Additional context: N/A

Environment:

lyarwood commented 7 months ago

Just for context, guestMappingPassthrough doesn't check for this case either:

$ ./cluster-up/kubectl.sh apply -f -<<EOF
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  name: dedicated-threads-with-numa
spec:
  domain:
    cpu:
      threads: 2
      dedicatedCpuPlacement: true
      numa:
        guestMappingPassthrough: {}
    memory:
      guest: 1Gi
      hugepages:
        pageSize: "2Mi"
    devices:
      disks:
        - disk:
            bus: virtio
          name: containerdisk
        - disk:
            bus: virtio
          name: cloudinitdisk
    resources:
      requests:
        memory: 2Gi
  volumes:
    - containerDisk:
        image: quay.io/containerdisks/fedora:39
      name: containerdisk
    - cloudInitNoCloud:
        userData: |
          #!/bin/sh
          mkdir -p  /home/fedora/.ssh
          curl https://github.com/lyarwood.keys > /home/fedora/.ssh/authorized_keys
          chown -R fedora: /home/fedora/.ssh
      name: cloudinitdisk
EOF
[..]
$ ./cluster-up/kubectl.sh get vmis
NAME                          AGE   PHASE     IP               NODENAME   READY
dedicated-threads-with-numa   12s   Running   10.244.196.149   node01     True
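
For completeness, the same vcpuinfo check as above can be repeated against this VMI's launcher pod; the commands below assume the usual kubevirt.io/domain label on the launcher pod, and the actual pod name should be substituted in the second command:

$ ./cluster-up/kubectl.sh get pods -l kubevirt.io/domain=dedicated-threads-with-numa
$ ./cluster-up/kubectl.sh exec pods/virt-launcher-dedicated-threads-with-numa-xxxxx -- virsh vcpuinfo 1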

vladikr commented 7 months ago

Thanks @lyarwood.

Indeed, this is open to interpretation. So far, the general intent has been to provide dedicated CPUs to the guest; however, the topology assignment was done on a best-effort basis. The only exception is guestMappingPassthrough, where we are strict about NUMA node assignment.

We can of course follow up and enforce the correct behavior by further enhancing the dedicated CPUs API. I would start by reporting whether SMT is enabled on the nodes.

lyarwood commented 7 months ago

> We can of course follow up and enforce the correct behavior by further enhancing the dedicated CPUs API. I would start by reporting whether SMT is enabled on the nodes.

ACK, thanks for confirming this is a valid thing to fix. It should be easy enough to label a node based on the value of /sys/devices/system/cpu/smt/active and then use that label at scheduling time whenever threads is greater than 1. I'll try to find some time to work on this in the coming weeks.
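
A rough sketch of that idea, purely illustrative (the kubevirt.io/smt-enabled label name and the selector wiring below are hypothetical, not an existing KubeVirt API):

# Something like virt-handler could read the SMT state on each node
# and publish it as a node label.
$ [ "$(cat /sys/devices/system/cpu/smt/active)" = "1" ] && smt=true || smt=false
$ kubectl label node node01 kubevirt.io/smt-enabled="$smt" --overwrite

# virt-controller could then add a matching node selector to the
# virt-launcher pod whenever spec.domain.cpu.threads > 1, e.g.:
#   nodeSelector:
#     kubevirt.io/smt-enabled: "true"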

/assign

kubevirt-bot commented 4 months ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

/lifecycle stale

lyarwood commented 4 months ago

/remove-lifecycle stale

kubevirt-bot commented 1 month ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

/lifecycle stale

kubevirt-bot commented 1 week ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle rotten