harvester / harvester

Open source hyperconverged infrastructure (HCI) software
https://harvesterhci.io/
Apache License 2.0

[BUG] VMs stuck Stopping or not starting after reboot or power outage #5788

Closed: DovydasNavickas closed this issue 6 months ago

DovydasNavickas commented 6 months ago

Describe the bug

I have a 3-node cluster (one of them is a witness node). After a reboot, the VMs that were running on the rebooted node no longer start, and when I try deleting the Pod, it goes into Stopping status and stays there forever (screenshot attached).

The events from the VM:

![image](https://github.com/harvester/harvester/assets/7989797/7517a2a6-fa6b-499c-9da2-1e3e2bf35bb3)
![image](https://github.com/harvester/harvester/assets/7989797/79e167ff-497c-4ea2-8a83-983398235994)
![image](https://github.com/harvester/harvester/assets/7989797/c1979157-5cc1-48ec-9bbb-509edd34b453)

To Reproduce

Steps to reproduce the behavior:

  1. Create several VMs.
  2. Reboot one, several, or all nodes; the result seems to be the same.

Expected behavior

VMs start normally after a reboot.

Support bundle

Environment

Additional context

I had multiple workloads running in the cluster, and to change the DNS server I needed to reboot the nodes. After the reboot, at least half of the VMs struggled to start. I waited for a while to see if Harvester would recover, then tried deleting pods, but they were stuck in Terminating status. Attempts to delete the VirtualMachineInstance objects, which were in the Failed phase, also failed. After another reboot and another wait, I tried deleting the VirtualMachine objects that were stuck in CrashLoopBackOff, but that changed nothing. Finally, after yet another reboot, the deleted VMs disappeared, leaving the cluster in a state that was no longer useful, so I deleted the remaining VMs and restarted the nodes once more. I could delete all VMs because it was a pre-production cluster with no critical data.

After removing all the VMs and recreating the Rancher ones, I rebooted the nodes and encountered the issues described above.

In short, the cluster worked well until it was rebooted. I saw other issues mentioning that 1.3.0 was supposed to fix problems with reboots and power outages, but it seems that Harvester still can't recover from a reboot on its own, at least not for all VMs.

irishgordo commented 6 months ago

@DovydasNavickas are you noticing any CreateContainerError on any of the VirtLauncher pods? Thinking :thinking: there may be some similarities to: https://docs.harvesterhci.io/v1.3/troubleshooting/vm/#vm-stuck-in-starting-state-with-error-messsage-not-a-device-node
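For anyone checking for the same symptom, a quick way to spot that state from the CLI (a sketch, assuming kubectl access to the Harvester cluster; the namespace and pod names will differ per setup):

```shell
# List all virt-launcher pods; a pod stuck in CreateContainerError shows it in the STATUS column.
kubectl get pods -A -o wide | grep virt-launcher

# Inspect the events of a suspect pod to see the underlying error
# (substitute the namespace and pod name from the listing above).
kubectl describe pod -n <namespace> <virt-launcher-pod-name>
```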

DovydasNavickas commented 6 months ago

Before trying to delete the Pod, I think I saw CreateContainerError. I will try to follow the instructions in the link. Thank you 👍

DovydasNavickas commented 6 months ago

After force deleting the pod, I see CreateContainerError in k9s, though in the Harvester UI the status is Starting with a note saying "Guest VM is not reported as running" (screenshot attached).
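For reference, the equivalent kubectl command for force deleting a stuck pod looks roughly like this (a sketch, with placeholders for the actual namespace and pod name):

```shell
# Force-delete the stuck virt-launcher pod without waiting for graceful termination.
# In this case a replacement pod was scheduled afterwards (see the events below).
kubectl delete pod <virt-launcher-pod-name> -n <namespace> --grace-period=0 --force
```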

New `Pod` events:

```logs
Events:
  Type     Reason                 Age                  From               Message
  ----     ------                 ----                 ----               -------
  Normal   Scheduled              112s                 default-scheduler  Successfully assigned reactway-infrastructure/virt-launcher-rancher-01-wb58h to compute-01
  Normal   SuccessfulMountVolume  112s                 kubelet            MapVolume.MapPodDevice succeeded for volume "pvc-7e6d34c5-72cd-43cb-bc03-4c386c9f0df3" globalMapPath "/var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices/pvc-7e6d34c5-72cd-43cb-bc03-4c386c9f0df3/dev"
  Normal   SuccessfulMountVolume  112s                 kubelet            MapVolume.MapPodDevice succeeded for volume "pvc-7e6d34c5-72cd-43cb-bc03-4c386c9f0df3" volumeMapPath "/var/lib/kubelet/pods/5aadbe84-00a7-48c9-9bf2-8ab031a9ad08/volumeDevices/kubernetes.io~csi"
  Normal   Created                111s                 kubelet            Created container guest-console-log
  Normal   Started                111s                 kubelet            Started container guest-console-log
  Normal   AddedInterface         111s                 multus             Add pod37a8eec1ce1 [] from infrastructure/80-dc
  Warning  Failed                 111s                 kubelet            Error: failed to generate container "a9ae83af2d3dffaf82e3b630e4c61b1e38b54f2f8e1883ae99ea8d5e2de1ad77" spec: failed to generate spec: not a device node
  Normal   Pulled                 111s                 kubelet            Container image "registry.suse.com/suse/sles/15.5/virt-launcher:1.1.0-150500.8.6.1" already present on machine
  Normal   AddedInterface         111s                 multus             Add eth0 [10.52.0.31/32] from k8s-pod-network
  Warning  Failed                 110s                 kubelet            Error: failed to generate container "6a47d473c61e7223fd576cc8488c5011e721d712ffa2199cdb5bd781a7ac8421" spec: failed to generate spec: not a device node
  Warning  Failed                 109s                 kubelet            Error: failed to generate container "be9c3ca0cb6745ce5262fc19d31d34231968e14ceb724fc7716b93ed1a408105" spec: failed to generate spec: not a device node
  Warning  Failed                 104s                 kubelet            Error: failed to generate container "8e689be2796efd86575530836987cef60ac40b8c29887471a1c37eebd85ec40d" spec: failed to generate spec: not a device node
  Warning  Failed                 90s                  kubelet            Error: failed to generate container "09d62ef031c8bcf478e2b81062e00168d1d596fac010fc2fbb20d7595457c4a3" spec: failed to generate spec: not a device node
  Warning  Failed                 76s                  kubelet            Error: failed to generate container "862d32482ebbb0a3d33fd221a3c761450be1368f341e758f2eb6d42184c819c6" spec: failed to generate spec: not a device node
  Warning  Failed                 61s                  kubelet            Error: failed to generate container "8713b46054083cd92dcccb2dacb91aef9b7236cba419ffb50234f7c43cc22023" spec: failed to generate spec: not a device node
  Warning  Failed                 50s                  kubelet            Error: failed to generate container "d2bbf1b0a63529849097030d699885ae9072c2ecbdab503ac2895b3dd93ba3d4" spec: failed to generate spec: not a device node
  Warning  Failed                 37s                  kubelet            Error: failed to generate container "8f9945f6ca8a142272f1cf64221652c98a72e368f55270de93817661428563d8" spec: failed to generate spec: not a device node
  Normal   Pulled                 22s (x10 over 111s)  kubelet            Container image "registry.suse.com/suse/sles/15.5/virt-launcher:1.1.0-150500.8.6.1" already present on machine
  Warning  Failed                 22s                  kubelet            (combined from similar events): Error: failed to generate container "71a82627737f1e9e095da5a47d4b7f5572b76b9a5a22112dbabc410d8b5f4729" spec: failed to generate spec: not a device node
```
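A side note on verifying the "not a device node" condition on the node itself: the entries under the PVC's volumeDevices path should be block device nodes, and in this case the directory still had stale mounts under it (which is what the unmounting below cleans up). A rough check, reusing the PVC path from the events above (run on the node the pod was scheduled to):

```shell
# On the affected node: entries under the globalMapPath should be block special files.
ls -la /var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices/pvc-7e6d34c5-72cd-43cb-bc03-4c386c9f0df3/dev/
stat -c '%n: %F' /var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices/pvc-7e6d34c5-72cd-43cb-bc03-4c386c9f0df3/dev/*

# Any lingering mounts under that path show up here.
mount | grep pvc-7e6d34c5-72cd-43cb-bc03-4c386c9f0df3
```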

The docs helped, though the wildcard at the end of the path didn't work for me:

```shell
$ sudo umount /var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices/pvc-7e6d34c5-72cd-43cb-bc03-4c386c9f0df3/dev/*
umount: /var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices/pvc-7e6d34c5-72cd-43cb-bc03-4c386c9f0df3/dev/*: no mount point specified.
```

Thus, I manually unmounted all paths within that directory.
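For completeness, one way to enumerate and unmount those paths in a loop rather than by hand (a sketch using the same PVC path as above; double-check the mount list before unmounting anything on a node you care about):

```shell
# Directory kubelet uses as the globalMapPath for this PVC (taken from the events above).
PVC_DEV_DIR=/var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices/pvc-7e6d34c5-72cd-43cb-bc03-4c386c9f0df3/dev

# List every mount target under that directory and unmount each one.
for target in $(findmnt -rn -o TARGET | grep "^${PVC_DEV_DIR}/"); do
  echo "unmounting ${target}"
  sudo umount "${target}"
done
```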

After unmounting the paths and uncordoning the node, the VM started normally.
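For reference, the uncordon step is the standard kubectl command (node name taken from the scheduling event above):

```shell
# Allow the scheduler to place pods on this node again after maintenance.
kubectl uncordon compute-01
```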

I guess the bug is known and well described in this issue, so I can close the current one: https://github.com/harvester/harvester/issues/5109

Thank you @irishgordo! I hadn't found those docs while searching for information earlier.

irishgordo commented 6 months ago

@DovydasNavickas - glad that was able to help :smile: :+1: