oVirt / vdsm

The Virtual Desktop Server Manager
GNU General Public License v2.0

VM stuck in unresponsive state and prohibits listing processes on host #389

Open ddrazyk opened 1 year ago

ddrazyk commented 1 year ago

We had an issue on 3 out of 4 hosts in an oVirt cluster (4.5.4-1.el8) where one VM gets stuck in an unresponsive state. It can be neither powered down nor restarted, and as long as its qemu process is running I can't list processes on that host. The VM is unreachable over the network and through oVirt's VNC console. The only way to resolve the issue is to restart the host from the oVirt webUI (or to kill the qemu process).
I can see in vdsm logs such entries:

2023-05-05 21:27:52,848+0200 ERROR (qgapoller/1) [virt.periodic.Operation] <bound method QemuGuestAgentPoller._poller of <vdsm.virt.qemuguestagent.QemuGuestAgentPoller object at 0x7fe08c0d9630>> operation failed (periodic:187)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/virt/periodic.py", line 185, in __call__
    self._func()
  File "/usr/lib/python3.6/site-packages/vdsm/virt/qemuguestagent.py", line 476, in _poller
    vm_id, self._qga_call_get_vcpus(vm_obj))
  File "/usr/lib/python3.6/site-packages/vdsm/virt/qemuguestagent.py", line 797, in _qga_call_get_vcpus
    if 'online' in vcpus:
TypeError: argument of type 'NoneType' is not iterable

And then eventually leads to:

2023-05-05 21:45:17,709+0200 ERROR (vm/220746d4) [virt.vm] (vmId='220746d4-56a5-40cc-8633-1285c167c4fe') Failed to update CPU set of the VM to match shared pool (cpumanagement:121)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/virt/virdomain.py", line 104, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/common/libvirtconnection.py", line 114, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/common/function.py", line 78, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python3.6/site-packages/libvirt.py", line 2303, in pinVcpu
    raise libvirtError('virDomainPinVcpu() failed')
libvirt.libvirtError: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchConnectGetAllDomainStats)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/virt/cpumanagement.py", line 108, in _assign_shared
    vm.pin_vcpu(vcpu, cpuset)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 6306, in pin_vcpu
    self._dom.pinVcpu(vcpu, cpuset)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/virdomain.py", line 112, in f
    raise toe
vdsm.virt.virdomain.TimeoutError: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchConnectGetAllDomainStats)
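
When triaging logs like the ones above, it can help to count how often the state change lock timeout appears and which libvirt job is reported as holding the lock. The following is only a rough sketch: the helper name `count_lock_holders` and the sample lines are illustrative assumptions, not part of vdsm or libvirt, and the regular expression is based solely on the message text quoted in this report.

```python
import re
from collections import Counter

# Pattern derived from the error text quoted in this report; other libvirt
# versions may word the message differently (assumption).
LOCK_RE = re.compile(
    r"cannot acquire state change lock \(held by monitor=(?P<holder>\w+)\)"
)


def count_lock_holders(lines):
    """Hypothetical helper: count lock-timeout messages per reported holder."""
    holders = Counter()
    for line in lines:
        for match in LOCK_RE.finditer(line):
            holders[match.group("holder")] += 1
    return holders


# Sample input built from the two error lines in the tracebacks above.
sample = [
    "libvirt.libvirtError: Timed out during operation: cannot acquire state "
    "change lock (held by monitor=remoteDispatchConnectGetAllDomainStats)",
    "vdsm.virt.virdomain.TimeoutError: Timed out during operation: cannot "
    "acquire state change lock (held by monitor=remoteDispatchConnectGetAllDomainStats)",
    "an unrelated log line",
]

for holder, count in count_lock_holders(sample).items():
    print(holder, count)
```

In practice the lines would come from iterating over an open log file instead of the `sample` list.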

This causes the CPU to get stuck on the qemu process. If I forcibly kill the process, everything gets back to normal, but oVirt reports the VM's state as "unresponsive", or as "powering down" if I try to shut it down from the webUI. The hosts are connected via glusterfs FUSE to storage that runs on separate hosts (3 hosts with replica 3 and a JBOD setup with 6 NVMe disks). All hosts (hypervisors and gluster) use CentOS 8 Stream.

Version-Release number of selected component: 4.50.3.4-1.el8.x86_64

mz-pdm commented 1 year ago

As for the first traceback, the issue is fixed in Vdsm 4.50.5. It may be worth upgrading Vdsm to see whether that fixes the problem.
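
For readers unfamiliar with that first traceback: it comes from a membership test (`'online' in vcpus`) being run while `vcpus` is `None`. The sketch below only illustrates that failure mode and one defensive pattern. It is not the actual change made in Vdsm 4.50.5, and the function names `unguarded` and `guarded` are hypothetical.

```python
def unguarded(vcpus):
    # Same shape as the failing check in the traceback: raises TypeError
    # when vcpus is None.
    return 'online' in vcpus


def guarded(vcpus):
    # Treat a missing reply (None) as "no data" instead of raising.
    return vcpus is not None and 'online' in vcpus


try:
    unguarded(None)
except TypeError as exc:
    print("unguarded check failed:", exc)

print(guarded(None))
print(guarded({'online': 4}))
```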

ddrazyk commented 1 year ago

Hi @mz-pdm, I will update to Vdsm 4.50.5 during the next update window and see if the error message goes away. As for the crashes, they seem unrelated to vdsm: after migrating all hypervisor hosts to Rocky8, the issue did not occur for 4 consecutive days.