home-assistant / operating-system

rcu_sched detected stalls on VirtualBox #1705

Closed pacoso81 closed 10 months ago

pacoso81 commented 2 years ago

Describe the issue you are experiencing

I had a problem after I installed a second Docker (a miner) on my PC. After installing it, my HASS didn't run any more. I tried to restore my VM image to an earlier point, but that did not help, so I updated VirtualBox to the latest version and HASS was working. I then restored HASS from my latest backup and updated it to 12.6/12.1/7.0 (the version before the latest one). After everything started working again, I noticed that after a day or two I can't log into HASS via the web. In the console I found that it is frozen at "ha >", and I have to issue a shutdown command, which then gives me "rcu: INFO: rcu_sched detected stalls on CPU/tasks: " and a lot of other info; then I have to issue another shutdown command to actually shut down the docker.

The strange thing is that after I start it again and go to the log, I can see that all the sensors and switches were working while I couldn't access the frontend.

I'm running HASS on VirtualBox 6.1.30 on Windows 10: core-2021.12.7, supervisor-2021.12.2, Home Assistant OS 7.1.

So now I can't tell whether it is a HASS or a VirtualBox problem. Can you help, please?

What operating system image do you use?

ova (for Virtual Machines)

What version of Home Assistant Operating System is installed?

7.1

Did you upgrade the Operating System.

Yes

Steps to reproduce the issue

  1. start the HASS
  2. Works for 1-2 days
  3. Freeze ...

Anything in the Supervisor logs that might be useful for us?

Can't see older logs.
After I restart HASS, everything in the log is green except for 2 warnings from Samba and Mosquitto about unsupported commands.

Anything in the Host logs that might be useful for us?

can't see older logs.

System Health information

No response

Additional information

No response

agners commented 2 years ago

> "rcu: INFO: rcu_sched detected stalls on CPU/tasks: " and a lot of other info,

Can you post a screenshot of this info? Can you also post the system, storage and network settings?

It seems that the Linux kernel hangs on something (either a disk read or the network). In those situations it's common that some parts of the system still work while others don't.
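One way to collect the requested system settings as text is `VBoxManage showvminfo <vm> --machinereadable`, which prints `key=value` lines. The following is a minimal Python sketch for parsing that output, not an official tool. The VM name `haos` and the key names in `KEYS_OF_INTEREST` are assumptions for illustration and may differ between VirtualBox versions.

```python
from typing import Dict, List

# Keys of interest. The exact key names printed by VBoxManage are an
# assumption here and may differ between VirtualBox versions.
KEYS_OF_INTEREST = ("cpus", "memory", "chipset", "paravirtprovider", "hpet", "audio")


def showvminfo_command(vm_name: str) -> List[str]:
    """Return the argv for dumping a VM's settings in key=value form."""
    return ["VBoxManage", "showvminfo", vm_name, "--machinereadable"]


def parse_machinereadable(text: str) -> Dict[str, str]:
    """Parse key=value lines, stripping the optional double quotes."""
    settings: Dict[str, str] = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip lines that contain no '='
            settings[key.strip().strip('"')] = value.strip().strip('"')
    return settings


def summarize(settings: Dict[str, str]) -> Dict[str, str]:
    """Keep only the settings that are discussed in this thread."""
    return {k: settings[k] for k in KEYS_OF_INTEREST if k in settings}
```

On the host, the output of `subprocess.run(showvminfo_command("haos"), capture_output=True, text=True).stdout` could be passed to `parse_machinereadable` and the summary pasted into a comment.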

Driekes commented 2 years ago

I have the exact same thing

[Screenshot 2021-12-30 at 17.06.06]

Today it also happened at start, during 'A start job is running for Docker Application Container Engine'. Now trying to get it working again. I don't know what happened, I tried to increase CPU cores, but that didn't help.

pacoso81 commented 2 years ago

I fixed my problem by reinstalling an earlier version of VirtualBox, restoring a snapshot to an earlier backup from 12.2021, and then updating to the newest versions of VirtualBox and HA. Just reinstalling the same version of VirtualBox didn't help. It is obvious that this is a VM problem; in my case it happened just after installing another VM/Docker program that interfered with VirtualBox files or settings. I suggest doing what I did. Just make sure you back up your last few snapshots and copy them to a local drive. You can find more info in my reply in the HA community post.

agners commented 2 years ago

> It is obvious that this is a VM problem

What do you mean exactly by that, a problem of VirtualBox or the virtual machine image (HAOS ova)?

> You can find more info in my reply in the HA community post.

Which community post? Can you add a link to that post?

pacoso81 commented 2 years ago

It happened again. I disconnected the audio device in VirtualBox and changed the chipset in the System tab to ICH9. After that I had to select the boot device in the VM BIOS. 4-5 days have passed and I have had no problems.

PowZone commented 2 years ago

Same issue here with latest versions of Home Assistant and VirtualBox :/

https://imgur.com/a/MEjxM01

Rebooting the VM sometimes fixes this, sometimes not.

Driekes commented 2 years ago

I have not seen it in the last few weeks, and I'm not sure what I did. I have updated everything to the latest versions. I did try some things with the USB versions and disabled all items that have no relevance (e.g. like above, I think I removed the audio device). I also tested with the number of CPUs, but I don't know what exactly the cause is that it runs stably now.

PowZone commented 2 years ago

Solved for me by editing the VM and removing useless devices: floppy, optical, sound, USB.
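For anyone who prefers the command line over the VirtualBox GUI, here is a minimal Python sketch that builds the `VBoxManage modifyvm` commands for turning off the audio and USB controllers. It assumes the flag spellings of VirtualBox 6.1 (`--audio none`, `--usb off`); newer releases may spell them differently. The VM name `haos` is hypothetical, and the VM should be powered off before the commands are run. Note that turning USB off also removes any USB stick passed through to the VM.

```python
from typing import List


def disable_unused_devices(vm_name: str) -> List[List[str]]:
    """Build VBoxManage commands that turn off audio and USB for a VM.

    Flag names follow VirtualBox 6.1 and are an assumption for other versions.
    """
    return [
        ["VBoxManage", "modifyvm", vm_name, "--audio", "none"],
        ["VBoxManage", "modifyvm", vm_name, "--usb", "off"],
    ]


if __name__ == "__main__":
    # Dry run: only print the commands instead of executing them.
    for cmd in disable_unused_devices("haos"):  # "haos" is a placeholder name
        print(" ".join(cmd))
```

Floppy and optical drives are attached through storage controllers, so removing them depends on the controller names of the specific VM and is left out of this sketch.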

ExceedingLife commented 2 years ago

I have been getting this issue lately as well, plus some others. I was thinking my hard drive is going bad. I can sometimes get my HASSIO to boot and keep it up for most of the day, but then this happens at random, or one of many other issues; I have a big list of screenshots of the errors. Right now I have tried resetting my VM about 7 times and still have had no success turning it on. [image]

I might want to try an fsck on my different partitions, but I'm afraid it'll freeze up in the middle of running fsck. I'll look at removing useless drives and audio devices. I am also thinking about trying an older version of VirtualBox, or installing VirtualBox on a different SSD. I moved my VM from an old HDD to a different SSD and I have been getting the same issues, so I'm confused about that. Here are some other errors I received:

blk_update_request: I/O error, dev sda, sector 3423235...
Buffer I/O error on device sda8, logical block 1084234
systemd-coredump: failed to get COMM no such process
CIFS: VFS: No username specified
rcu: INFO: rcu_sched self-detected stall on CPU
systemd-resolved.service: watchdog timeout
rcu_sched kthread starved for 348343 jiffies!
Failed to start Network Time Sync
SQUASHFS error: Unable to read page, block 343423 size 7c6
EXT4-fs error (device sda8): ext4_journal_check_start:83 Detected aborted journal

And there are more than these. Hopefully someone with more knowledge can point me in a direction to go. Thank you.
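When a console shows a mix of messages like the ones above, it can help to separate storage errors from RCU stall reports, since they point in different directions. The following Python sketch groups lines with two regular expressions. The patterns are derived only from the messages quoted in this thread, and the category names are made up for illustration, so treat it as a starting point rather than a complete classifier.

```python
import re
from typing import Dict, List

# Patterns based on the messages quoted in this thread (not exhaustive).
# The category names are invented for this sketch.
PATTERNS = {
    "rcu_stall": re.compile(r"rcu.*(detected stalls?|kthread starved)"),
    "disk_io": re.compile(r"I/O error|SQUASHFS error|EXT4-fs error"),
}


def classify(lines: List[str]) -> Dict[str, List[str]]:
    """Group console lines into rcu_stall, disk_io and other."""
    result: Dict[str, List[str]] = {name: [] for name in PATTERNS}
    result["other"] = []
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                result[name].append(line)
                break
        else:  # no pattern matched this line
            result["other"].append(line)
    return result
```

Many `disk_io` hits alongside the stalls would suggest looking at the virtual disk and the host storage as well as at the CPU settings.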

github-actions[bot] commented 2 years ago

There hasn't been any activity on this issue recently. To keep our backlog manageable we have to clean old issues, as many of them have already been resolved with the latest updates. Please make sure to update to the latest Home Assistant OS version and check if that solves the issue. Let us know if that works for you by adding a comment 👍 This issue has now been marked as stale and will be closed if no further activity occurs. Thank you for your contributions.

Driekes commented 2 years ago

I have been running for weeks without issue. I also reduced the number of CPUs to just one. I did notice some speed issues, so I tried to increase to two CPUs, but this creates a more unstable environment. I assigned 2.5 GB of memory and 2 CPUs to VirtualBox, but it has a harder time starting than with one CPU and less memory.

In Windows I see high memory use (but not sure if this is just Virtualbox reservation), and I see continuous CPU usage. The CLI shows that it is stuck at starting hypervisor. After a reboot it works, but I do get more 'rcu_sched' notifications and got an unresponsive HA, which I haven't had for weeks.

Reverting back to 1 CPU solves the issues.

pacoso81 commented 2 years ago

> I have been running for weeks without issue. I also reduced the number of CPUs to just one. I did notice some speed issues, so I tried to increase to two CPUs, but this creates a more unstable environment. I assigned 2.5 GB of memory and 2 CPUs to VirtualBox, but it has a harder time starting than with one CPU and less memory.
>
> In Windows I see high memory use (but not sure if this is just Virtualbox reservation), and I see continuous CPU usage. The CLI shows that it is stuck at starting hypervisor. After a reboot it works, but I do get more 'rcu_sched' notifications and got an unresponsive HA, which I haven't had for weeks.
>
> Reverting back to 1 CPU solves the issues.

Same thing. I have currently been running one CPU for a week with no stalls; if I assign 2 CPUs, I get a stall within 5-10 minutes of starting. This began after the update to 5.5; before that I got stalls with 2 CPUs about once a week. If you get a stall, most of the time you don't need to restart: just do PAUSE and then UNPAUSE (or press CTRL+P twice) and it will continue, but it might stall again if the process is not finished.
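The pause and unpause steps can also be issued from the host command line with `VBoxManage controlvm <vm> pause` and `VBoxManage controlvm <vm> resume`. Below is a small Python sketch of that sequence. The VM name `haos` is hypothetical, and the `runner` parameter exists only so the function can be tried without VirtualBox installed.

```python
import subprocess
from typing import Callable, List


def pause_then_resume(vm_name: str, runner: Callable = subprocess.run) -> None:
    """Pause a running VM and resume it right away via VBoxManage controlvm."""
    for action in ("pause", "resume"):
        cmd: List[str] = ["VBoxManage", "controlvm", vm_name, action]
        # check=True makes subprocess.run raise if VBoxManage reports an error.
        runner(cmd, check=True)
```

Calling `pause_then_resume("haos")` on the host would be the scripted equivalent of the manual workaround; as noted above, the VM may stall again afterwards.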

I have read that in Linux it is possible for the 2 CPUs to have different timings (not synced to each other), and that this can cause big problems and stalls.

agners commented 2 years ago

Interesting findings! I wonder if that is a known issue in upstream VirtualBox?

pacoso81 commented 2 years ago

Since the problem got much worse after upgrading HA, I'm sure it is an HA problem and not VB.

Driekes commented 2 years ago

Yesterday I tried again with the latest and greatest versions of VirtualBox, HA, OS and all. I set it to two CPUs and I think within 30 minutes HA stalled, I got the rcu_sched notifications etc.

I'm not sure if the host hardware could be impacting this, or what the cause is, but it just doesn't work. Or maybe it is impossible to switch between CPUs and I should start fresh with 2 CPUs, but I haven't tested that yet.

pacoso81 commented 2 years ago

> Yesterday I tried again with the latest and greatest versions of VirtualBox, HA, OS and all. I set it to two CPUs and I think within 30 minutes HA stalled, I got the rcu_sched notifications etc.
>
> I'm not sure if the host hardware could be impacting this, or what the cause is, but it just doesn't work. Or maybe it is impossible to switch between CPUs and I should start fresh with 2 CPUs, but I haven't tested that yet.

Maybe it is a CPU-related problem. Mine is an AMD V1605B; what CPU do you have?

Driekes commented 2 years ago

Intel Core i5-4690.

pacoso81 commented 2 years ago

Well, with one CPU there are no problems, but when I opened Task Manager I noticed that only CPU 1 and 2 are used; the others are parked. I hate loading only one CPU, so I unparked the rest of the CPUs, and now all of them are utilized by a small amount. We will see how it goes, but so far the results are very good. I used the "unpark cpu" program.

chaospheremk commented 2 years ago

I'm getting this same error: rcu_sched self-detected stall on cpu

VirtualBox: 6.1.34 r150636 (Qt5.6.2)
Home Assistant OS: 8.1
Home Assistant Core: 2022.6.1
CPU: Intel Core i7-5930K

I'm using latest version of virtualbox and HA OS 8.1. I had 2 CPUs dedicated to the VM. This would allow HA to run for up to a couple days before freezing. I upped it to 4 CPUs and that made the problem happen much quicker, at least from what I experienced.

After switching the VM to 1 CPU, I'm seeing no issues. This seems to be reproducible and consistent when using anything other than 1 CPU.
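The single-CPU workaround can be applied from the host with `VBoxManage modifyvm <vm> --cpus 1` while the VM is powered off. A minimal Python sketch that builds and prints that command follows; the VM name `haos` is a placeholder, not something taken from this thread.

```python
import shlex
from typing import List


def set_cpu_count_command(vm_name: str, cpus: int = 1) -> List[str]:
    """Return the VBoxManage argv that sets the number of virtual CPUs."""
    if cpus < 1:
        raise ValueError("a VM needs at least one virtual CPU")
    return ["VBoxManage", "modifyvm", vm_name, "--cpus", str(cpus)]


if __name__ == "__main__":
    # Dry run with a placeholder VM name; prints the command to execute.
    print(shlex.join(set_cpu_count_command("haos")))
```

The printed line can be pasted into a terminal on the host, or the list can be passed to `subprocess.run(..., check=True)`.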

agners commented 2 years ago

Maybe this tip works? https://www.virtualbox.org/ticket/20131#comment:2

> I found that "perf top" was good at stalling it out a bit, and doing a "vboxmanage modifyvm foo --hpet on" on the host made the problem occur virtually never or not at all for that VM, even while every other VM without that change was stalling.

pacoso81 commented 2 years ago

> Maybe this tip works? https://www.virtualbox.org/ticket/20131#comment:2
>
> I found that "perf top" was good at stalling it out a bit, and doing a "vboxmanage modifyvm foo --hpet on" on the host made the problem occur virtually never or not at all for that VM, even while every other VM without that change was stalling.

Did that. It doesn't help.

Driekes commented 2 years ago

Tried as well, also didn't work.

pacoso81 commented 2 years ago

Here is a video with explanations regarding CPU stalls. I think what we are dealing with is explained at 16:11.

https://www.youtube.com/watch?v=23_GOr8Sz-E

agners commented 2 years ago

It essentially means the kernel doesn't get to run on a particular CPU within a certain time limit. That can have different causes:

IMHO, this is a VirtualBox bug. Maybe VirtualBox needs a particular kernel config to run fine, but if that is the case, it should be documented somewhere. Also, HAOS kernel configuration is not really special, it mainly enables a lot of virtualization drivers. It works fine on other Virtual Machines as well, so :man_shrugging:

Driekes commented 2 years ago

I saw in another post https://github.com/home-assistant/operating-system/issues/1737#issuecomment-1108843382 that you can change the paravirtualization settings and that some of them worked. I am now running on 'minimal', and so far it has been running for a few hours, longer than in yesterday's test. However, I also increased the memory, so it could still be that this has some impact. I'll keep you updated.
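The paravirtualization interface can also be changed from the host command line while the VM is powered off. The Python sketch below builds the command, assuming the VirtualBox 6.1 flag `--paravirtprovider` and its documented values; newer releases may use a different spelling for the flag. The VM name `haos` is hypothetical.

```python
from typing import List

# Values accepted by VirtualBox 6.1 for --paravirtprovider
# (an assumption for other versions).
PROVIDERS = ("none", "default", "legacy", "minimal", "hyperv", "kvm")


def set_paravirt_command(vm_name: str, provider: str = "minimal") -> List[str]:
    """Return the VBoxManage argv that selects a paravirtualization provider."""
    if provider not in PROVIDERS:
        raise ValueError(f"unknown provider {provider!r}, expected one of {PROVIDERS}")
    return ["VBoxManage", "modifyvm", vm_name, "--paravirtprovider", provider]
```

For example, `set_paravirt_command("haos")` yields the command for the 'minimal' setting mentioned above, and `set_paravirt_command("haos", "kvm")` the one for KVM.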

Driekes commented 2 years ago

So far up and running without issue since my previous post! Looks promising.

curtgrimes commented 2 years ago

I noticed this week that I have been experiencing the same issue with a Home Assistant install I have running in VirtualBox on Windows 10. I may have been experiencing the issue for several weeks/months before this week, but a lot of variables on my end changed about a month ago (installed new networking hardware on computer, caught up on Windows 10 upgrades, reinstalled the latest version of Home Assistant VDI image), so I can't be certain before that point.

I'll try changing System > Acceleration > Paravirtualization interface to KVM per https://github.com/home-assistant/operating-system/issues/1737#issuecomment-1106855304 and report back.

Specs

PC:
Edition: Windows 10 Pro
Version: 21H2
OS build: 19044.1706
Experience: Windows Feature Experience Pack 120.2212.4170.0
Processor: Intel(R) Core(TM) i7-9700K CPU @ 3.60GHz 3.60 GHz
Installed RAM: 32.0 GB
System type: 64-bit operating system, x64-based processor
Pen and touch: No pen or touch input is available for this display

VirtualBox 6.1.34 r150636 (Qt5.6.2)

Home Assistant VM

Driekes commented 2 years ago

@curtgrimes the comment below it mentions that KVM didn't work, and I also believe that it is equal to the default. However, try it!

Mine is now stable with 2 cpus with the minimal setting!

pacoso81 commented 2 years ago

[image] One CPU, but with all cores unparked. VB is sharing all CPUs with Windows. At the time of the screenshot I was updating HA and nothing else was using the CPU. I have been using this method for more than a week with no issues so far.

mig8447 commented 2 years ago

I tested changing the paravirtualization setting to minimal, and although it lasted longer before freezing, it froze anyway. One curious thing I discovered is that if you enable the virtual keyboard from VirtualBox (Input > Keyboard) and type something, the machine unfreezes, but that's not good either way.

I also noticed that before, I saw the rcu errors on the VM screen, but now it freezes without telling me anything. I'm quite sure the issue is still the rcu_sched warnings, though.

I reduced the number of CPUs to 1 to see if that helps.

mig8447 commented 2 years ago

My setup has been running smoothly for 3 days after just reducing the number of CPUs to 1, even when leaving the default paravirtualization parameters.

jjcf89 commented 2 years ago

Same issue, I will set the CPUs to 1 to see if that helps.

OGTK423 commented 1 year ago

I'm currently experiencing this issue, but I have also never been able to launch HA in a VM when using 1 CPU; it always results in a boot loop and won't launch successfully unless 2 CPUs are selected.

jjcf89 commented 1 year ago

> Same issue, I will set the CPUs to 1 to see if that helps.

So far 1 CPU has been stable after 2 days, fingers crossed.

pacoso81 commented 1 year ago

Mine has been running happily with one CPU and unparked cores for months now, without any stall so far. I will continue to run on one CPU until this problem is fixed.

yzlnew commented 1 year ago

I'm also experiencing this issue on Proxmox VM.

lbouriez commented 1 year ago

Hello, I have also been experiencing this issue since yesterday on VirtualBox, without any changes to my setup. Did anyone find a solution other than reducing the number of CPUs?

Driekes commented 1 year ago

> I saw in another post #1737 (comment) that you can change the paravirtualization settings and that some of them worked. I am now running on 'minimal', and so far it has been running for a few hours, longer than in yesterday's test. However, I also increased the memory, so it could still be that this has some impact. I'll keep you updated.

@lbouriez Yes, see my comment.

jjcf89 commented 1 year ago

I have a couple of feedback points:

I think you've posted in the wrong place

polter05 commented 1 year ago

Same issue on Proxmox :( [image]

yzlnew commented 1 year ago

> Same issue on Proxmox :( [image]

Updating the kernel fixed the problem for me.

polter05 commented 1 year ago

Which version? Mine is Linux 5.15.

lbouriez commented 1 year ago

FYI, what I noticed in my case is that it was happening when the VM was shut down hard, I mean like in a power outage.

yzlnew commented 1 year ago

@polter05 Update to an Edge Kernel using scripts from https://tteck.github.io/Proxmox/

hprotzek commented 1 year ago

I got the same issue on Proxmox. [image]

Not sure if it works, but here is a suggestion for a workaround: https://bugzilla.kernel.org/show_bug.cgi?id=199727#c18

Setting VirtIO SCSI Single / iothread=1 / aio=threads on all our KVM guests.
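On Proxmox, these settings map to options of `qm set`. The following Python sketch builds the two commands under stated assumptions: the VM ID `100`, the volume name `local-lvm:vm-100-disk-0` and the bus slot `scsi0` are placeholders that must be replaced with the values of the actual guest. It is not a tested fix for the stalls.

```python
from typing import List


def proxmox_disk_commands(vmid: int, volume: str, bus: str = "scsi0") -> List[List[str]]:
    """Build qm commands for VirtIO SCSI single with iothread=1 and aio=threads.

    vmid, volume and bus are placeholders supplied by the caller.
    """
    return [
        # Select the "VirtIO SCSI single" controller type.
        ["qm", "set", str(vmid), "--scsihw", "virtio-scsi-single"],
        # Re-declare the disk with an I/O thread and thread-based async I/O.
        ["qm", "set", str(vmid), f"--{bus}", f"{volume},iothread=1,aio=threads"],
    ]


if __name__ == "__main__":
    # Dry run with hypothetical values; prints the commands only.
    for cmd in proxmox_disk_commands(100, "local-lvm:vm-100-disk-0"):
        print(" ".join(cmd))
```

The existing disk line of the guest can be read with `qm config <vmid>` before changing it, so that other disk options are not lost.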

pacoso81 commented 1 year ago

I think it is time for an update. I have been running on 1 CPU with unparked cores (basically I'm running on all of the CPUs) since June 2022, with no stall whatsoever. For me this is the best solution; even if I didn't have stalls, I would run unparked CPUs, because that way I can utilize all of my CPU's cores at once.

agners commented 1 year ago

This issue is about rcu stalls on VirtualBox. I opened a new issue for rcu stalls on Proxmox, see #2342.

lobhater commented 1 year ago

So I am running HA in VirtualBox on Windows 11. Everything has been running great and I have been doing monthly updates. This week I installed the add-on RTSPtoWeb - WebRTC and the system started freezing seemingly at random, with no logs except [image]. I uninstalled WebRTC and everything is perfect again. Someone on FB, of all places, linked me here, and I read about the workaround of changing to 1 CPU, which also fixed the issue. Seems this has been an ongoing issue with no solution, right?

pacoso81 commented 1 year ago

> Seems this has been an ongoing issue with no solution, right?

So far yes, the only permanent solution is running a single core. For me it works great with unparked cores; I even have 3 cameras running on MotionEye. Unparking the CPU enables you to utilize all of the CPU cores at the same time. I have tested without unparking, and the same core is utilized all the time, meaning that you stress that same core all the time and shorten the CPU's life.

madduck commented 1 year ago

Just had this problem while running 2023.5.3 on libvirt/kvm/plain Debian.