seapath / ansible

This repo contains all the ansible playbooks used to deploy or manage a cluster, as well as inventories examples
https://lfenergy.org/projects/seapath/
Apache License 2.0

VM endless boot when pinning to the first allowed CPU of the machine-rt slice. #438

Open eroussy opened 3 months ago

eroussy commented 3 months ago

Describe the bug When deploying an RT and isolated VM, if the core chosen to isolate the VM is the first of the machine-rt slice, the VM never boots. The associated qemu-system-x86 thread takes 100% of one CPU forever.

To Reproduce

Allowed CPUs in my Ansible inventory:

isolcpus: "2-7" # CPUs to isolate (isolcpus, irqbalance on debian 12)
workqueuemask: "0003" # workqueue mask; here it means CPUs 0 and 1 are the only allowed CPUs
cpusystem: "0-1" # CPUs reserved for the system
cpuuser: "0-1" # CPUs reserved for user applications
cpumachines: "2-7" # CPUs reserved for VMs
cpumachinesrt: "4-7" # CPUs reserved for realtime VMs
cpumachinesnort: "2-3" # CPUs reserved for non-realtime VMs
cpuovs: "0-1" # CPUs reserved for OVS
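As a side note, the consistency rules implied by these variables (the RT and non-RT machine sets partition `cpumachines`, machine CPUs do not overlap system CPUs, and the workqueue mask covers only the system CPUs) can be checked with a small sketch. This is a hypothetical helper, not part of the repo; `parse_cpulist` mimics the kernel's cpulist syntax:

```python
def parse_cpulist(spec):
    """Parse a kernel-style cpulist like '2-7' or '0-1,4' into a set of ints."""
    cpus = set()
    for part in spec.split(","):
        if "-" in part:
            lo, hi = map(int, part.split("-"))
            cpus.update(range(lo, hi + 1))
        else:
            cpus.add(int(part))
    return cpus

# Values from the inventory above
cpusystem       = parse_cpulist("0-1")
cpumachines     = parse_cpulist("2-7")
cpumachinesrt   = parse_cpulist("4-7")
cpumachinesnort = parse_cpulist("2-3")
workqueuemask   = 0x0003  # bitmask of CPUs allowed to run workqueues

# RT and non-RT machine CPUs must partition the machine CPUs
assert cpumachinesrt | cpumachinesnort == cpumachines
assert cpumachinesrt & cpumachinesnort == set()
# System CPUs must not overlap machine CPUs
assert cpusystem & cpumachines == set()
# The workqueue mask should cover exactly the system CPUs
assert {c for c in range(16) if workqueuemask >> c & 1} == cpusystem
```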

My RT VM inventory

all:
  children:
    VMs:
      hosts:
        rtVM:
          ansible_host: 192.168.216.24
          vm_template: "../templates/vm/guest.xml.j2"
          vm_disk: "../vm_images/guest.qcow2"
          vm_features: ["rt", "isolated"]
          cpuset: [4, 5]
          bridges:
            - name: "br0"
              mac_address: "52:54:00:e4:ff:03"

Expected behavior The VM should boot. qemu-system-x86 will take 100% of one CPU, but only for a few seconds.

Additional context

On the hypervisor:

root@seapath:/home/virtu# ps -eTo comm,tid,pid,cls,pri,psr | grep -iE "qemu|kvm"
qemu-event       159384  158373  TS  19   0
qemu-system-x86  158603  158603  TS  19   4
CPU 0/KVM        158651  158603  FF  41   4
CPU 1/KVM        158652  158603  FF  41   5
kvm              158626  158626  TS  39   4
kvm-nx-lpage-re  158627  158627  TS  19   4
kvm-pit/158603   158654  158654  TS  19   4

The qemu-system-x86 thread responsible for managing the VM always runs on the first allowed CPU (here, CPU 4). The VM's vCPU 0 is also pinned to this CPU. I think the two threads interrupt each other and prevent the VM from booting.
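The conflict above can be spotted mechanically from the `ps -eTo comm,tid,pid,cls,pri,psr` output: a SCHED_FIFO (`FF`) vCPU thread sharing a CPU with a timesharing (`TS`) thread of the same qemu process. A minimal sketch (hypothetical helper, using a subset of the output shown above as sample data):

```python
# Detect TS threads sharing a CPU with a SCHED_FIFO vCPU thread of the
# same qemu process, from `ps -eTo comm,tid,pid,cls,pri,psr` lines.
PS_OUTPUT = """\
qemu-system-x86  158603  158603  TS  19   4
CPU 0/KVM        158651  158603  FF  41   4
CPU 1/KVM        158652  158603  FF  41   5
"""

def find_conflicts(ps_text):
    rows = []
    for line in ps_text.splitlines():
        # comm may contain spaces ("CPU 0/KVM"), so split from the right
        comm, tid, pid, cls, pri, psr = line.rsplit(None, 5)
        rows.append((comm, int(pid), cls, int(psr)))
    conflicts = []
    for comm, pid, cls, psr in rows:
        if cls != "FF":
            continue  # only FIFO vCPU threads can starve others
        for comm2, pid2, cls2, psr2 in rows:
            if pid2 == pid and cls2 == "TS" and psr2 == psr:
                conflicts.append((comm2, comm, psr))
    return conflicts

print(find_conflicts(PS_OUTPUT))
# → [('qemu-system-x86', 'CPU 0/KVM', 4)]: the emulator thread shares CPU 4
#   with the FIFO vCPU thread, which never yields.
```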

Also, the first lines of the top command on the hypervisor:

top - 15:17:15 up 17 min,  2 users,  load average: 10.32, 7.45, 4.92
Tasks: 542 total,   2 running, 540 sleeping,   0 stopped,   0 zombie
%Cpu(s):  2.6 us,  5.3 sy,  0.0 ni, 89.5 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :  63624.4 total,  59392.4 free,   9856.9 used,   1700.3 buff/cache
MiB Swap:   2048.0 total,   2048.0 free,      0.0 used.  53767.5 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  28143 libvirt+  20   0 3703876 441004  41664 S 100.0   0.7   1:27.42 qemu-system-x86
    154 root     -11   0       0      0      0 S   6.2   0.0   0:00.69 rcuc/13
   1763 ceph      20   0 1217808 300224  36096 S   6.2   0.5   0:04.02 ceph-mgr
   3160 haclust+  20   0   81552  25628  15644 S   6.2   0.0   0:00.60 pacemaker-based
  31122 root      20   0   11640   5376   3264 R   6.2   0.0   0:00.02 top
      1 root      20   0  169984  13788   8796 S   0.0   0.0   0:05.89 systemd

The qemu-system-x86 thread is taking 100% of the CPU.

eroussy commented 3 months ago

Here are my investigations so far:

Management threads affinity

rtVM's vCPUs run with RT priority on cores 4 and 5. Looking at what else is running on these cores, I find:

root@seapath:~# ps -eTo comm,tid,pid,cls,pri,%cpu,psr  | grep "[4,5]$"
[..] (Linux core management threads)
kworker/3:3-eve  195349  195349  TS  19   0.0   5
qemu-system-x86  197242  197242  TS  19  87.3   4
call_rcu         197265  197242  TS  19   0.0   4
worker           197266  197242  TS  19   0.0   4
vhost-197242     197268  197242  TS  19   0.0   4
IO mon_iothread  197269  197242  TS  19   0.2   4
CPU 0/KVM        197270  197242  FF  41  87.1   4
CPU 1/KVM        197271  197242  FF  41   0.0   5
worker           197398  197242  FF  41   0.0   4
kvm-nx-lpage-re  197267  197267  TS  19   0.0   4

The qemu-system-x86 thread is taking too much CPU, so why does it not move to another core?

root@seapath:~# taskset -cp 197242 #qemu-system
pid 197242's current affinity list: 4-7

The affinity list allows it to move, so why doesn't the scheduler put it on another core? I don't know whether this is a libvirt bug or a SEAPATH configuration problem.
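The same affinity check can also be done programmatically; on Linux, `os.sched_getaffinity` returns the allowed-CPU set for a pid/tid (0 means the calling process). A sketch, assuming you run it on the hypervisor and substitute the qemu thread's tid:

```python
import os

# Affinity of the current process (pid 0 = self); for the qemu thread you
# would pass its tid instead, e.g. os.sched_getaffinity(197242) as above.
allowed = os.sched_getaffinity(0)
print(sorted(allowed))
```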

Workaround

We can control the management thread of the VM with emulatorpin in libvirt. This can be done either in the XML:

<emulatorpin cpuset='6,7'/>

Or directly on the target with the command virsh emulatorpin rtVM 6,7.

Both of these commands technically solve the problem.

But it shouldn't be mandatory to specify this.
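A sketch of what the deployment could compute automatically instead (a hypothetical helper, not part of the playbooks): pick the emulatorpin CPUs as the machine-rt CPUs not used by the VM's vCPUs, so the emulator thread never lands on a core owned by a FIFO vCPU.

```python
def emulator_cpus(machine_rt, vcpu_pins):
    """Pick CPUs for the emulator threads outside the vCPU pinning.

    machine_rt: set of CPUs in the machine-rt slice (cpumachinesrt)
    vcpu_pins:  set of CPUs the VM's vCPUs are pinned to (cpuset)
    Returns a cpuset string suitable for <emulatorpin cpuset='...'/>.
    """
    free = sorted(machine_rt - vcpu_pins)
    if not free:
        raise ValueError("no CPU left in machine-rt for the emulator threads")
    return ",".join(map(str, free))

# With cpumachinesrt = 4-7 and cpuset = [4, 5], emulatorpin would be 6,7
print(emulator_cpus({4, 5, 6, 7}, {4, 5}))  # → 6,7
```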