@DovydasNavickas are you noticing any `CreateContainerError` on any of the VirtLauncher pods?
Thinking :thinking: there may be some similarities to:
https://docs.harvesterhci.io/v1.3/troubleshooting/vm/#vm-stuck-in-starting-state-with-error-messsage-not-a-device-node
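Something along these lines should show whether any of them are hitting it (the namespace is just a placeholder; virt-launcher pods normally carry the `kubevirt.io=virt-launcher` label):

```bash
# list the virt-launcher pods in the VM's namespace and check their container state
kubectl get pods -n default -l kubevirt.io=virt-launcher
kubectl describe pod <virt-launcher-pod-name> -n default | grep -i -A2 "CreateContainerError"
```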
Before trying to delete the Pod, I think I saw `CreateContainerError`. I will try to follow the instructions in the link. Thank you 👍
After force deleting the pod I see `CreateContainerError` in k9s, though in the Harvester UI the status is `Starting` with a note saying `Guest VM is not reported as running`:
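(For reference, the force delete was roughly the following; the pod name and namespace here are placeholders, not the actual ones from my cluster.)

```bash
# force delete the stuck virt-launcher pod
kubectl delete pod <virt-launcher-pod-name> -n <vm-namespace> --grace-period=0 --force
```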
The docs helped, though the wildcard at the end of the path didn't work for me:
$ sudo umount /var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices/pvc-7e6d34c5-72cd-43cb-bc03-4c386c9f0df3/dev/*
umount: /var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices/pvc-7e6d34c5-72cd-43cb-bc03-4c386c9f0df3/dev/*: no mount point specified.
Thus, I manually unmounted all paths within that directory.
After unmounting the paths and uncordoning the node, the VM started normally.
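For reference, what I ran was roughly this (the PVC ID is the one from the error above; the node name is a placeholder):

```bash
# unmount every mount under the PVC's volumeDevices directory, one by one
mount | grep pvc-7e6d34c5-72cd-43cb-bc03-4c386c9f0df3/dev | awk '{print $3}' | while read -r mp; do
  sudo umount "$mp"
done

# then bring the node back into scheduling
kubectl uncordon <node-name>
```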
I guess the bug is known and well described in this issue, so I can close the current one: https://github.com/harvester/harvester/issues/5109
Thank you @irishgordo! I hadn't found those docs when searching for information earlier.
@DovydasNavickas - glad that was able to help :smile: :+1:
Describe the bug
I have a 3-node cluster (one of them is a witness node). After a reboot, the VMs that were running on the rebooted node don't start anymore, and when I try deleting the Pod, it goes into `Stopping` status and stays there forever.

The events from the VM:

![image](https://github.com/harvester/harvester/assets/7989797/7517a2a6-fa6b-499c-9da2-1e3e2bf35bb3) ![image](https://github.com/harvester/harvester/assets/7989797/79e167ff-497c-4ea2-8a83-983398235994) ![image](https://github.com/harvester/harvester/assets/7989797/c1979157-5cc1-48ec-9bbb-509edd34b453)

To Reproduce
Steps to reproduce the behavior:
Expected behavior
VMs start normally after a reboot.
Support bundle
Environment
Additional context
I had multiple workloads running in the cluster, and to change the DNS server, I needed to reboot the nodes. After the reboot, at least half of the VMs struggled to start. I waited for a while to see if Harvester would recover, then tried deleting pods, but they were stuck in `Terminating` status. Attempts to delete `VirtualMachineInstance` objects, which were in the `Failed` phase, also failed. After another reboot and a wait, I tried deleting `VirtualMachine` objects as they were stuck in `CrashLoopBackOff`, but that changed nothing. Finally, after yet another reboot, the deleted VMs disappeared, leaving the cluster in a state that was no longer useful, so I deleted the remaining VMs and restarted the nodes once more. I could delete all VMs because it was a pre-production cluster with no critical data.

After removing all the VMs and recreating the Rancher ones, I rebooted the nodes and encountered the issues described above.
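For reference, the deletion attempts were roughly along these lines (names and namespaces are placeholders):

```bash
# inspect the stuck objects
kubectl get vmi -A
kubectl get pods -A -l kubevirt.io=virt-launcher

# deletion attempts that hung or had no effect at the time
kubectl delete vmi <vm-name> -n <namespace>
kubectl delete vm <vm-name> -n <namespace>
kubectl delete pod <virt-launcher-pod-name> -n <namespace> --grace-period=0 --force
```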
In short, the cluster worked well until it was rebooted. I saw other issues mentioning that 1.3.0 should have solved the problems with reboots and power outages, but it seems that Harvester still can't fully recover from a reboot on its own, at least not for all VMs.