KVM HassOS Freezes roughly every 24 hours

justynbell commented 3 years ago

Hardware Environment

Hass OVA KVM (.qcow2)

Home Assistant OS release:

System Health

version	core-2021.4.6
installation_type	Home Assistant OS
dev	false
hassio	true
docker	true
virtualenv	false
python_version	3.8.7
os_name	Linux
os_version	5.4.109
arch	x86_64
timezone	America/Los_Angeles

Home Assistant Cloud

Home Assistant Supervisor

Lovelace

Supervisor logs:

Journal logs:

Kernel logs:

Description of problem:

For the last few months, my Home Assistant installation has frozen once every 24 hours, causing Virt-Manager to lock up completely until I perform a virsh destroy && virsh start on the HassOS KVM.
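For readers unfamiliar with the recovery step mentioned above, here is a minimal sketch of it as a shell helper. The domain name `hassos` is a hypothetical placeholder (the post does not give the real one), and the `VIRSH` override exists only so the helper can be dry-run on a machine without libvirt.

```shell
#!/bin/sh
# Assumption: "hassos" is a placeholder; find the real name with `virsh list --all`.
DOMAIN="${DOMAIN:-hassos}"

# Hard power-off followed by a cold boot, the VM equivalent of pulling the plug.
# Set VIRSH=echo to print the commands instead of running them.
force_restart() {
    ${VIRSH:-virsh} destroy "$1" && ${VIRSH:-virsh} start "$1"
}

# Usage: force_restart "$DOMAIN"
```

Because `virsh destroy` does not shut the guest down cleanly, this is only a last resort for a VM that no longer responds.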

When I initially started playing with Home Assistant in February, I started with the HassOS .qcow2 image and had this problem. Being new to HA but not Linux, I deleted that VM, and spun up a Debian VM and installed HA following these directions: https://community.home-assistant.io/t/installing-home-assistant-supervised-on-debian-10/200253

The same problem popped up: the server would crash at some point, I'd have to destroy and start the VM (which is the equivalent of pulling the plug on a Pi or dedicated machine), and then it would crash again roughly 24 hours later.

My last attempt to rule out anything related to plugins/automations, etc, was to again go back to installing HassOS KVM on another VM. This time I didn't touch anything on it, and just saw if it could run for more than 24 hours. It turns out it couldn't.

My HA installations are all on a VLAN (5) on the 192.168.5.0/24 network. My firewall/DNS/DHCP server is a pfSense box at 192.168.5.1, and the host's bridge IP is 192.168.5.201. The HassOS VM gets a static IP from the router. The Debian VM is outside of the DHCP pool, and has a static IP configured in the node. Everything is static in my IoT VLAN except 5 Wemo Smart Dimmers that cannot be set to have static IPs.

Last night I actually caught it in the act of crashing, while I had Wireshark up on the virbr5 bridge interface on the host OS. Doesn't look like any traffic spikes to me, does it? The large amounts of network traffic at the beginning of the graph are from when my CCTV VLAN could talk to my IoT VLAN (the one HA is on). I firewalled off the communication between the two VLANs and just watched the traffic die down to nothing, until HA died for good. In the window at the back right you can see Virt-Manager completely frozen. So unfortunately, I can't "plug in a monitor and see what it says when it crashes"; it just freezes.

There have been other reports of this happening on the Pis, but because this is a VM, I was instructed to open a new issue. So here I am.

Regardless of what is happening on the network, I don't think network traffic issues should ever freeze a VM, right? I think this is a HA issue that needs to be addressed with some amount of priority. HA is awesome other than this freezing issue, but this issue is a big one.

There are a few posts about this exact same thing in the Git issue tracker, but you'd think if this was so common that even a fresh install would crash every 24 hours, the community forums would be overwhelmed with this issue, right?

justynbell commented 3 years ago

These are screenshots from the Debian-hosted HA VM that crashes every 24 hours: CPU (image: haCPU), memory (image: haMemory), and swap (image: haSwap).

justynbell commented 3 years ago

Actually, as I mentioned before, since this is an issue that happens on both a HassOS install as well as a Supervised install on top of Debian, would this issue be related to the Home Assistant Core instead? Perhaps a Docker bug or something?

justynbell commented 3 years ago

Small update: I received a 4GB Raspberry Pi 4 to try this out on, so now I'm running 3 instances of Home Assistant at home: the HassOS version in a VM, the Debian Managed version in a VM, and a version on the Pi 4. They're all on VLAN 5, all configured as static IPs and on the same network.

The Debian and HassOS versions crash every 24 hours; the Pi version survived past the normal 24-hour mark.

So it seems agners is correct in the other thread: whatever is plaguing those guys on the Raspberry Pi doesn't seem to be my issue with the VMs here.

I guess my next step is to spin a KVM on another Linux machine to rule out my box being the issue.

It would be nice for someone somewhere else to spin up a KVM instance to verify grabbing the HassOS release and hosting it unmodified doesn't freeze after 24 hours.

bschatzow commented 3 years ago

@justynbell I read your information with great interest, as I and many others on the Pi4 have reported this very problem since the OS 5.4 update. I think that the Pi freeze is OS related, as it doesn't freeze with any of the other updates, only the OS one. I am also running a test using RPi OS with Home Assistant as a supervisor. So far, it has been up for almost a week with no issues. HA OS above 5.4 crashes in less than 24 hours, usually in less than 5.

teamsuperpanda commented 3 years ago

Personally we are still using 5.3 until this gets solved

agners commented 3 years ago

@teamsuperpanda are you using KVM and experiencing the same problem the original poster reports? What host system are you using?

jspanitz commented 3 years ago

Have same issue on fairly new install (<3 months) on PI 4. Just upgraded to OS 6 today, will see if issue is resolved.

teamsuperpanda commented 3 years ago

> Have same issue on fairly new install (<3 months) on PI 4. Just upgraded to OS 6 today, will see if issue is resolved.

You are the brave soul I am looking for. God speed!

justynbell commented 3 years ago

@teamsuperpanda @jspanitz

You guys need to post in here.

This thread is for HA issues in a KVM, not the Raspberry Pis.

agners commented 3 years ago

@justynbell I am guessing you can still reproduce this?

When you write "Virt-Manager to completely lock up", does that mean that the UI of the Virt-Manager also locks up? If so, then this seems more like a host issue. Normally Virt-Manager is a separate process from qemu, which runs KVM... Do the kernel logs or host logs in general maybe have hints right when the freeze happens?

Maybe before spending too much time, it is worth testing OS 6.0. It comes with a new Linux kernel which might fix whatever issue this is you are seeing.

justynbell commented 3 years ago

Hey @agners, thanks for following up. I added a cronjob that reboots the VM every 12 hours, and that's been working for a long time, so yesterday I turned off the cronjob to see if HA still freezes after 24 hours. Unfortunately, it does.

> When you write "Virt-Manager to completely lock up", does that mean that the UI of the Virt-Manager also locks up? If so, then this seems more like a host issue.

The UI doesn't lock up, as in freeze (goes gray, you have to force close it). Rather, it just displays a few VMs, but not the Home Assistant VM. It's as if under the hood, it iterates over all of my VMs, then sits there and blocks trying to get the HA VM. The rest don't load (presumably the ones virt-manager would read and display that come after the HA VM).

> Do the kernel logs or host logs in general maybe have hints right when the freeze happens?

This is going to be a long back-and-forth because each time I have to wait for 24 hours, but here's the kernel output on the host after the crash last night (June 18th after midnight):

sudo dmesg -T
[Tue Jun 15 20:51:57 2021] br-6f18bc7c486d: port 10(veth33f6e7e) entered forwarding state
[Tue Jun 15 20:51:58 2021] br-a43001fe8bf2: port 4(veth5cad88f) entered forwarding state
[Tue Jun 15 20:51:59 2021] br-6f18bc7c486d: port 2(veth21af380) entered forwarding state
[Tue Jun 15 20:51:59 2021] br-6f18bc7c486d: port 4(veth1224953) entered forwarding state
[Tue Jun 15 20:52:00 2021] br-6f18bc7c486d: port 12(veth628ce5d) entered forwarding state
[Tue Jun 15 20:52:06 2021] br-a43001fe8bf2: port 9(veth39122f1) entered forwarding state
[Tue Jun 15 20:52:06 2021] br-6f18bc7c486d: port 6(veth4d85b8f) entered forwarding state
[Wed Jun 16 00:00:44 2021] kvm [3107]: vcpu0 unhandled rdmsr: 0xc0011020
[Wed Jun 16 00:00:44 2021] kvm [3107]: vcpu0 unhandled rdmsr: 0xc0000408
[Wed Jun 16 00:00:44 2021] kvm [3107]: vcpu1 unhandled rdmsr: 0xc0000408
[Wed Jun 16 00:01:01 2021] perf interrupt took too long (2686 > 2500), lowering kernel.perf_event_max_sample_rate to 50000
[Wed Jun 16 03:31:40 2021] hrtimer: interrupt took 90731 ns
[Wed Jun 16 12:00:29 2021] kvm [3107]: vcpu0 unhandled rdmsr: 0xc0011020
[Wed Jun 16 12:00:29 2021] kvm [3107]: vcpu0 unhandled rdmsr: 0xc0000408
[Wed Jun 16 12:00:29 2021] kvm [3107]: vcpu1 unhandled rdmsr: 0xc0000408
[Thu Jun 17 00:00:38 2021] kvm [3107]: vcpu0 unhandled rdmsr: 0xc0011020
[Thu Jun 17 00:00:39 2021] kvm [3107]: vcpu0 unhandled rdmsr: 0xc0000408
[Thu Jun 17 00:00:39 2021] kvm [3107]: vcpu1 unhandled rdmsr: 0xc0000408
[Fri Jun 18 07:15:13 2021] usb 3-3: USB disconnect, device number 2
[Fri Jun 18 07:15:16 2021] usb 3-3: new low-speed USB device number 3 using ohci-pci
[Fri Jun 18 07:15:16 2021] usb 3-3: New USB device found, idVendor=1d57, idProduct=32da
[Fri Jun 18 07:15:16 2021] usb 3-3: New USB device strings: Mfr=0, Product=2, SerialNumber=0
[Fri Jun 18 07:15:16 2021] usb 3-3: Product: 2.4G Receiver
[Fri Jun 18 07:15:16 2021] input: 2.4G Receiver as /devices/pci0000:00/0000:00:12.0/usb3/3-3/3-3:1.0/0003:1D57:32DA.0005/input/input17

I'm giving a little context here in the logs: from the 15th to the 18th, it's fine (I assume those rdmsr messages are benign). This morning I replugged a USB dongle in the server just to see it in the kernel logs. Long story short, I don't see any kernel messages when the VM freezes. Again, the crash should have happened late last night (June 17th late at night, or the 18th early in the morning).
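One way to check a single night in a long log is to filter the `dmesg -T` output by its date prefix. This is a minimal sketch, not part of the original report: `log_lines_for` is a hypothetical helper name, and it assumes the bracketed timestamp format shown in the excerpt above.

```shell
#!/bin/sh
# log_lines_for "Thu Jun 17": keep only stdin lines that contain a bracketed
# timestamp for the given weekday, month, and day.
log_lines_for() {
    grep -F "[$1 "
}

# Usage on the host (root is usually needed to read the kernel ring buffer):
#   sudo dmesg -T | log_lines_for "Thu Jun 17"
```

Run against the excerpt above, "Thu Jun 17" returns only the three rdmsr lines from just after midnight, which is consistent with nothing being logged around the time of the freeze.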

I wouldn't doubt that my host is the issue. It's an 8-year-old box with an FX-6100 in it, and it's in desperate need of an upgrade. But the lack of logs on the host, coupled with these vague explanations I ran into while googling extensively, leads me to think it may be something on HA's side related to memory issues caused by abnormal network traffic or something.

> Maybe before spending too much time, it is worth testing OS 6.0. It comes with a new Linux kernel which might fix whatever issue this is you are seeing.

That's fine, I can spin a VM using the qcow2 image with the latest HA OS, but right now the HA instance I actively use is a Debian managed one. It's not just the Home Assistant Operating System that does this. The Debian kernel version is Linux debian-home-assistant 4.19.0-16-amd64 #1 SMP Debian 4.19.181-1 (2021-03-19) x86_64 GNU/Linux

justynbell commented 3 years ago

Something in 2021.8.x fixed this issue.

I completely forgot about this because I simply added a cronjob that restarts HA twice a day, at noon and midnight, and it would run "indefinitely". I just noticed that I turned this cronjob off on August 10th to test whether this issue was still happening in the amazing 2021.8.x release (I can't remember which minor version I upgraded to initially), and then promptly forgot about it. It was only today, when I was checking the cronjobs of all my VMs, that I saw the restart job had been disabled.

Keep up the fantastic work.

agners commented 3 years ago

Cool, thanks for the update!

tom-winkler commented 1 year ago

Thanks for posting. I think I ended up in a similar place: my VM no longer responds even though virsh happily states that it is up and running. No network address is listed, and the entire system is obviously frozen without a good chance of self-recovery. All I'm interested in now is a detection scenario: how can I detect this state so that I can reboot? All my automations come to a halt. Does anyone have a good idea?
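One possible detection approach, sketched under assumptions: treat the VM as frozen when libvirt still reports it as running but the guest no longer answers on the network, then power-cycle it with the same virsh destroy and start used earlier in this thread. `DOMAIN` and `GUEST_IP` are hypothetical placeholders, `ping -W` assumes the Linux iputils ping, and this has not been tested against the freeze described here.

```shell
#!/bin/sh
# Watchdog sketch. DOMAIN and GUEST_IP are hypothetical placeholders.
DOMAIN="${DOMAIN:-hassos}"
GUEST_IP="${GUEST_IP:-192.0.2.10}"

# Succeeds (exit 0) only when libvirt reports the domain as running
# but the guest did not answer on the network.
looks_frozen() {
    state="$1"      # output of `virsh domstate`
    reachable="$2"  # "yes" or "no"
    [ "$state" = "running" ] && [ "$reachable" = "no" ]
}

# One probe: query libvirt, ping the guest, power-cycle if it looks frozen.
check_once() {
    # `timeout` guards against virsh itself blocking on the stuck domain.
    state="$(timeout 10 virsh domstate "$DOMAIN" 2>/dev/null)"
    if ping -c 3 -W 2 "$GUEST_IP" >/dev/null 2>&1; then
        reachable=yes
    else
        reachable=no
    fi
    if looks_frozen "$state" "$reachable"; then
        virsh destroy "$DOMAIN" && virsh start "$DOMAIN"
    fi
}
```

Calling `check_once` from a host cron job every few minutes would turn this into a simple watchdog. If virsh itself hangs and the timeout fires, the state is empty and the sketch does nothing, so that case would need separate handling.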

"home_assistant": { "installation_type": "Home Assistant OS", "version": "2022.11.5", "dev": false, "hassio": true, "virtualenv": false, "python_version": "3.10.7", "docker": true, "arch": "x86_64", "timezone": "Europe/Berlin", "os_name": "Linux", "os_version": "5.15.74", "supervisor": "2022.11.2", "host_os": "Home Assistant OS 9.3", "docker_version": "20.10.18", "chassis": "vm", "run_as_root": true },

home-assistant / operating-system

KVM HassOS Freezes roughly every 24 hours #1338
