home-assistant / operating-system

:beginner: Home Assistant Operating System
Apache License 2.0

rcu: INFO: rcu_sched self-detected stall on CPUs/tasks on Proxmox #2342

Closed agners closed 1 year ago

agners commented 1 year ago
Same issue on proxmox :(


Originally posted by @polter05 in https://github.com/home-assistant/operating-system/issues/1705#issuecomment-1410238093

Home Assistant OS 9.5

Note: This is an issue which tracks CPU stalls on Proxmox. It is similar to #1705 which tracks such issues on VirtualBox, but likely a different culprit (since Proxmox and VirtualBox are different Hypervisors!). Please post to the appropriate issue.

agners commented 1 year ago

Updating the Proxmox kernel can solve the issue, see https://github.com/home-assistant/operating-system/issues/1705#issuecomment-1411406865

Also setting VirtIO SCSI Single / iothread=1 / aio=threads on all our KVM guests, see https://github.com/home-assistant/operating-system/issues/1705#issuecomment-1418808236.
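
For reference, these settings can be applied from the Proxmox host shell with `qm set`. This is a minimal sketch and is not taken from the linked comment: the VM ID `102` and the volume name `local-lvm:vm-102-disk-1` are placeholders (assumptions), so check `qm config <vmid>` for the real values first. The helper `print_virtio_cmds` is hypothetical and only prints the commands so they can be reviewed before running them.

```shell
#!/bin/bash
# Print the qm commands that switch a VM to VirtIO SCSI Single
# with iothread=1 and aio=threads on its scsi0 disk.
# Arguments: <vmid> <existing scsi0 volume> (both placeholders here).
print_virtio_cmds() {
  local vmid=$1 volume=$2
  echo "qm set $vmid --scsihw virtio-scsi-single"
  echo "qm set $vmid --scsi0 $volume,iothread=1,aio=threads"
}

# Placeholder values; look up the real volume with: qm config 102
print_virtio_cmds 102 local-lvm:vm-102-disk-1
```

Any other options already on the disk line (for example `discard=on`) most likely need to be repeated, since the whole `scsi0` entry is rewritten, and the VM probably needs a full stop and start before the change takes effect.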

hprotzek commented 1 year ago

Updating the Proxmox kernel can solve the issue, see #1705 (comment)

Also setting VirtIO SCSI Single / iothread=1 / aio=threads on all our KVM guests, see #1705 (comment).

I did both, issue is gone.

tteck commented 1 year ago

Proxmox has an opt-in Linux 6.1 kernel; development of the edge kernels will temporarily pause with the 6.0 kernel.

https://forum.proxmox.com/threads/opt-in-linux-6-1-kernel-for-proxmox-ve-7-x-available.119483/
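
To confirm which kernel series the Proxmox host is actually running after switching kernels, something like the following can help. It is only a sketch: `kernel_series` is a hypothetical helper, and the package name `pve-kernel-6.1` in the comment is assumed from the linked forum announcement, so verify it there before installing.

```shell
#!/bin/bash
# Opt-in kernel install on the Proxmox host (assumed package name, run as root):
#   apt update && apt install pve-kernel-6.1 && reboot

# Extract "major.minor" from a kernel release string such as 6.1.10-1-pve.
# Falls back to the running kernel when no argument is given.
kernel_series() {
  local release=${1:-$(uname -r)}
  echo "$release" | cut -d. -f1,2
}

echo "Running kernel: $(uname -r) (series $(kernel_series))"
```

Recording the full `uname -r` output alongside the HAOS version makes reports in this thread easier to compare.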

hprotzek commented 1 year ago

Switching to the pve 6.1 kernel reintroduced the issue. So far it seems that only the 6.0.15-edge kernel fixed the problem, not the VirtIO SCSI Single / iothread=1 / aio=threads settings.

hprotzek commented 1 year ago

Switching to the pve 6.1 kernel reintroduced the issue. So far it seems that only the 6.0.15-edge kernel fixed the problem, not the VirtIO SCSI Single / iothread=1 / aio=threads settings.

I was wrong: no matter which kernel or VirtIO settings I used, the hassos VM crashed soon after start with version 9.5.

I downgraded to 9.4, and it has been stable for a while now, running with the edge kernel.

heikkis commented 1 year ago

I'm facing the same problem (latest HAOS). I opened a thread on the Proxmox forum: https://forum.proxmox.com/threads/host-hangs-up-after-12-24-hours-unless-rcu_sched-kthread-gets-sufficient-cpu-time-oom-is-now-expected-behavior.122347/

Started the VM 14 hours ago with Linux intelnuc 6.1.10-1-pve and VirtIO SCSI Single / iothread=1 / aio=threads settings. Let's see...

agners commented 1 year ago

HAOS 9.5 mainly comes with a new stable kernel release, 5.15.90 (HAOS 9.4 was using 5.15.80). The current development build contains a newer kernel, 5.15.93, which might be worth a try. It is typically fairly safe to upgrade to development builds and downgrade back to stable builds, but I still recommend taking a snapshot 😄

ha supervisor options --channel dev
ha supervisor reload
ha supervisor update
ha os update

And to downgrade:

ha su options --channel stable
ha supervisor reload
ha os update --version 9.5
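
On Proxmox, the snapshot recommended above can be taken from the host shell before switching channels. A minimal sketch, assuming VM ID `102` (a placeholder) and a hypothetical `snapshot_name` helper; the command is printed for review rather than executed.

```shell
#!/bin/bash
# Build a dated snapshot name such as pre_haos_update_20230215
# (letters, digits and underscores only).
snapshot_name() {
  echo "pre_haos_update_$(date +%Y%m%d)"
}

VMID=102   # placeholder VM ID, replace with your own
# Printed for review; run the resulting line on the Proxmox host.
echo "qm snapshot $VMID $(snapshot_name)"
```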

michaeldwilliams commented 1 year ago

I'm seeing this same issue in a VM on an M2 Mac using UTM. It happens every couple of days and throws off the VM's time/date (Sept 2059).

ha > [30157.023357] rcu: INFO: rcu_preempt self-detected stall on CPU
[30157.028355] rcu: 2-....: (2 ticks this GP) idle=361/1/0x4000000000000002 softirq=283006/283006 fqs=1
[30157.030223] rcu: rcu_preempt kthread starved for 288245252820 jiffies! g745309 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=3
[30157.030886] rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
[30157.031037] rcu: RCU grace-period kthread stack dump:
[30153.036108] rcu: Stack dump where RCU GP kthread last ran:

hprotzek commented 1 year ago

I tried updating to the latest dev OS, which was 10.0, but HA didn't start with that version. All I got were OutOfMemory exceptions.

agners commented 1 year ago

@hprotzek how much memory do you allocate to the system?

hprotzek commented 1 year ago

@hprotzek how much memory do you allocate to the system?

4 GB, ballooning disabled.

This issue is very strange. I have two identical hardware setups (HP t630), both running the same Proxmox version with Home Assistant. One installation runs fine; the other has this issue with 9.5. All other VMs and containers are working fine. On the faulty one I also sometimes get these errors, but HA runs on 9.4 stable even with them:

[ 752.692587] xhci_hcd 0000:02:1b.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 4

jellevervloessem commented 1 year ago

@agners, I also updated to the 10.0dev version, but no IP address gets assigned. I had to set the IP address via the CLI in the Proxmox console.

I now also have an issue in 10.0dev where my Mosquitto broker won't start anymore (because of 'Error: Unable to create websockets listener on port 1884'). Since this is crucial for me, I will downgrade back to 9.5.

heikkis commented 1 year ago

I got the same error in HA OS 9.4 as well, for the first time. Uptime was 42 days. I previously tested HA OS 10dev; it worked, but quickly crashed with the same problem.

Strontvlieg commented 1 year ago

I also have the same problems on version 9.5, after 2 hours on a fresh install. I have now updated to OS version 11.0.dev20230328 and am waiting for trouble ;-)

ariekraakjr commented 1 year ago

Same problem here. I'm using version 9.5.


c00ldude1oo commented 1 year ago

Getting this too. I'm on HAOS 9.5, Proxmox 7.4-3.

jellevervloessem commented 1 year ago

I also have the same problems on version 9.5, after 2 hours on a fresh install. I have now updated to OS version 11.0.dev20230328 and am waiting for trouble ;-)

Same here... the stable 10.0 came through and gave me the same issues. Now I'm trying 11.0.dev20230420. Fingers crossed!

Update: 2 days later, and the 11.0 dev version has the same issue. 🥴

@agners do you have any idea in which version this will be solved? Or do I need to update Proxmox?

ariekraakjr commented 1 year ago

Edit: Found a real solution by installing intel-microcode. See my later post.

Because coming home in the dark drove me crazy, I made a script that resets the HAOS VM if it misses a certain number of pings. I have it running on the Proxmox host that HAOS runs on.

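#!/bin/bash
# Watchdog: reset the HAOS VM after FAILLEVEL consecutive missed pings.

FILE=errors.txt      # log file
TARGET=10.20.0.11    # IP address of the HAOS VM
VMID=102             # Proxmox VM ID
FAILLEVEL=20         # consecutive missed pings before a reset
ERRORCOUNTER=0
pinginterval=1       # seconds between pings

touch "$FILE"
while true; do
  DATE=$(date '+%d/%m/%Y %H:%M:%S')
  if ! ping -c 1 "$TARGET" &> /dev/null; then
    # Log only the first missed ping of an outage.
    if [[ $ERRORCOUNTER -eq 0 ]]; then
      echo "$DATE $TARGET down" >> "$FILE"
    fi
    ERRORCOUNTER=$((ERRORCOUNTER + 1))
    # Reset once, exactly when the threshold is reached.
    if [[ $ERRORCOUNTER -eq $FAILLEVEL ]]; then
      echo "$DATE $TARGET - Reset $VMID" >> "$FILE"
      qm reset "$VMID"
    fi
  else
    # Numeric comparison (-gt); ">" inside [[ ]] compares strings.
    if [[ $ERRORCOUNTER -gt 0 ]]; then
      echo "$DATE $TARGET up again ($ERRORCOUNTER missed pings)" >> "$FILE"
    fi
    ERRORCOUNTER=0
  fi
  sleep "$pinginterval"
done

jellevervloessem commented 1 year ago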

My VM was still on 2 cores. I just changed it to 1 core. "Start the clock" 🤞

Michaelcombs commented 1 year ago

I'm having the same issue on proxmox 7.4-3 and HA OS 10.3.

Any update on this issue?

ariekraakjr commented 1 year ago

I solved the problems on my Proxmox cluster. In my test environment I ran into similar problems with OPNsense. In the end, installing the Intel microcode on the Proxmox host proved to be the solution. I have an Intel N5105 processor, which caused the problems. See below for more information.

https://forum.opnsense.org/index.php?topic=32406.msg156769#msg156769
https://forum.proxmox.com/threads/vm-freezes-irregularly.111494/page-30

agners commented 1 year ago

Interesting, a microcode update. It also seems related to cpuidle issues; I guess that can indeed influence timers/timing and cause such RCU issues 🤔

@polter05 can you try this fix on your end?

ariekraakjr commented 1 year ago

It's been four days now and still no freezing VMs. Before updating the microcode I had to reset the HAOS VM 2+ times a day.

Strontvlieg commented 1 year ago

Fixed by installing the microcode on my Topton N6005

Step 1: Add the following to the file /etc/apt/sources.list:

deb http://ftp.se.debian.org/debian bullseye main contrib non-free
deb http://ftp.se.debian.org/debian bullseye-updates main contrib non-free

Step 2: apt update

Step 3: apt-get install intel-microcode

Step 4: Reboot the Proxmox system
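
After the reboot it can be worth checking that a microcode revision is actually reported. A small sketch, assuming an x86 Linux host where `/proc/cpuinfo` has a `microcode` field; `microcode_rev` is a hypothetical helper name, and the revision shown depends on the CPU.

```shell
#!/bin/bash
# Print the microcode revision of the first CPU listed in a cpuinfo file.
# Defaults to /proc/cpuinfo; another path can be passed for testing.
microcode_rev() {
  awk -F': *' '/^microcode/ { print $2; exit }' "${1:-/proc/cpuinfo}"
}

echo "Microcode revision: $(microcode_rev)"
# The kernel log usually also records microcode loading:
#   dmesg | grep -i microcode
```

Comparing the revision before and after installing the package shows whether the update was applied.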

agners commented 1 year ago

Ok thanks for the information.

So I assume this can also be resolved that way for @polter05. Since no change in the OS was needed, I'm marking it as won't fix.