home-assistant / operating-system

:beginner: Home Assistant Operating System
Apache License 2.0

HA OS 13.0 task kthreadd blocked for more than 120 seconds #3534

Open rrooggiieerr opened 1 month ago

rrooggiieerr commented 1 month ago

Describe the issue you are experiencing

Since I upgraded my Raspberry Pi 4 to the latest HA OS 13.0, the system freezes regularly with the message:

task kthreadd:2 blocked for more than 120 seconds.

[Image: IMG_20240815_135926]

Once frozen I can still ping my HA server, but can't SSH into it anymore

What operating system image do you use?

rpi4-64 (Raspberry Pi 4/400 64-bit OS)

What version of Home Assistant Operating System is installed?

13.0

Did the problem occur after upgrading the Operating System?

Yes

Hardware details

Raspberry Pi 4 with 1 GB memory and WD Purple SC QD101 microSDXC 64 GB storage, original power supply and powered USB hub, Phoscon RaspBee II Zigbee module

Steps to reproduce the issue

  1. Upgrade to latest HA OS 13.0
  2. After 5 to 10 minutes the system freezes
  3. After about 20 minutes the system restarts and it starts over again

About 1 in 4 times the system boots up correctly and I can access HA, but after some hours it crashes again.

Anything in the Supervisor logs that might be useful for us?

Can't access my system

Anything in the Host logs that might be useful for us?

2024-08-15 12:04:15.867 homeassistant kernel: audit: type=1325 audit(1723723455.862:191): table=filter:88 family=2 entries=1 op=nft_register_chain pid=2558 subj=docker-default comm="iptables-nft"
2024-08-15 12:04:15.868 homeassistant kernel: audit: type=1300 audit(1723723455.862:191): arch=c00000b7 syscall=211 success=yes exit=160 a0=3 a1=7febafef68 a2=0 a3=0 items=0 ppid=2405 pid=2558 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="iptables-nft" exe="/sbin/xtables-nft-multi" subj=docker-default key=(null)
2024-08-15 12:04:15.868 homeassistant kernel: audit: type=1327 audit(1723723455.862:191): proctitle=69707461626C65732D6E6674002D5000464F525741524400414343455054
2024-08-15 12:04:15.899 homeassistant kernel: audit: type=1325 audit(1723723455.894:192): table=filter:89 family=2 entries=10 op=nft_unregister_rule pid=2559 subj=docker-default comm="iptables-nft"
2024-08-15 12:04:15.900 homeassistant kernel: audit: type=1300 audit(1723723455.894:192): arch=c00000b7 syscall=211 success=yes exit=92 a0=3 a1=7ff3c1b308 a2=0 a3=0 items=0 ppid=2405 pid=2559 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="iptables-nft" exe="/sbin/xtables-nft-multi" subj=docker-default key=(null)
2024-08-15 12:04:15.900 homeassistant kernel: audit: type=1327 audit(1723723455.894:192): proctitle=69707461626C65732D6E6674002D4600464F5257415244
2024-08-15 12:04:16.091 homeassistant kernel: 8021q: 802.1Q VLAN Support v1.8
2024-08-15 12:04:18.647 homeassistant systemd[1]: var-lib-docker-overlay2-4391ae713f43a23c228ec31a12c35b07791c34a1e1e367885619dd2869276147\x2dinit-merged.mount: Deactivated successfully.
2024-08-15 12:04:18.648 homeassistant systemd[1]: mnt-data-docker-overlay2-4391ae713f43a23c228ec31a12c35b07791c34a1e1e367885619dd2869276147\x2dinit-merged.mount: Deactivated successfully.
2024-08-15 12:04:19.139 homeassistant kernel: hassio: port 6(veth33aed7c) entered blocking state
2024-08-15 12:04:19.140 homeassistant kernel: hassio: port 6(veth33aed7c) entered disabled state
2024-08-15 12:04:19.140 homeassistant kernel: veth33aed7c: entered allmulticast mode
2024-08-15 12:04:19.140 homeassistant kernel: veth33aed7c: entered promiscuous mode
2024-08-15 12:04:19.140 homeassistant kernel: audit: type=1700 audit(1723723459.134:193): dev=veth33aed7c prom=256 old_prom=0 auid=4294967295 uid=0 gid=0 ses=4294967295
2024-08-15 12:04:19.144 homeassistant NetworkManager[515]: <info>  [1723723459.1436] manager: (veth589f35d): new Veth device (/org/freedesktop/NetworkManager/Devices/19)
2024-08-15 12:04:19.150 homeassistant NetworkManager[515]: <info>  [1723723459.1485] manager: (veth33aed7c): new Veth device (/org/freedesktop/NetworkManager/Devices/20)
2024-08-15 12:04:19.452 homeassistant systemd[1]: Started libcontainer container 84ec30f4d55409589db1f07f2275b6f853ddf7aa4cb9801cd6b998f540b48932.
2024-08-15 12:04:19.711 homeassistant kernel: eth0: renamed from veth589f35d
2024-08-15 12:04:19.756 homeassistant kernel: hassio: port 6(veth33aed7c) entered blocking state
2024-08-15 12:04:19.756 homeassistant kernel: hassio: port 6(veth33aed7c) entered forwarding state
2024-08-15 12:04:19.758 homeassistant NetworkManager[515]: <info>  [1723723459.7583] device (veth33aed7c): carrier: link connected
2024-08-15 12:04:24.910 homeassistant systemd[1]: systemd-hostnamed.service: Deactivated successfully.
2024-08-15 12:04:25.011 homeassistant kernel: kauditd_printk_skb: 51 callbacks suppressed
2024-08-15 12:04:25.012 homeassistant kernel: audit: type=1334 audit(1723723465.006:211): prog-id=14 op=UNLOAD
2024-08-15 12:04:25.012 homeassistant kernel: audit: type=1334 audit(1723723465.006:212): prog-id=13 op=UNLOAD
2024-08-15 12:04:25.012 homeassistant kernel: audit: type=1334 audit(1723723465.006:213): prog-id=12 op=UNLOAD
2024-08-15 12:04:25.148 homeassistant systemd[1]: systemd-timedated.service: Deactivated successfully.
2024-08-15 12:04:25.179 homeassistant kernel: audit: type=1334 audit(1723723465.174:214): prog-id=25 op=UNLOAD
2024-08-15 12:04:25.180 homeassistant kernel: audit: type=1334 audit(1723723465.174:215): prog-id=24 op=UNLOAD
2024-08-15 12:04:25.180 homeassistant kernel: audit: type=1334 audit(1723723465.174:216): prog-id=23 op=UNLOAD
2024-08-15 12:04:25.236 homeassistant systemd[1]: Started libcontainer container 155206ea45e51114c2b733b399e916f16caafc36333c0a3b286bae800fb0dbcd.
2024-08-15 12:04:25.295 homeassistant kernel: audit: type=1334 audit(1723723465.290:217): prog-id=50 op=LOAD
2024-08-15 12:04:25.296 homeassistant kernel: audit: type=1300 audit(1723723465.290:217): arch=c00000b7 syscall=280 success=yes exit=15 a0=5 a1=4000195840 a2=78 a3=0 items=0 ppid=2867 pid=2880 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="runc" exe="/usr/bin/runc" subj=unconfined key=(null)
2024-08-15 12:04:25.296 homeassistant kernel: audit: type=1327 audit(1723723465.290:217): proctitle=72756E63002D2D726F6F74002F7661722F72756E2F646F636B65722F72756E74696D652D72756E632F6D6F6279002D2D6C6F67002F72756E2F636F6E7461696E6572642F696F2E636F6E7461696E6572642E72756E74696D652E76322E7461736B2F6D6F62792F31353532303665613435653531313134633262373333623339
2024-08-15 12:04:25.296 homeassistant kernel: audit: type=1334 audit(1723723465.290:218): prog-id=51 op=LOAD
2024-08-15 12:06:00.427 homeassistant kernel: uart-pl011 fe201000.serial: no DMA platform data
2024-08-15 12:06:15.381 homeassistant kernel: kauditd_printk_skb: 11 callbacks suppressed
2024-08-15 12:06:15.495 homeassistant kernel: audit: type=1334 audit(1723723575.215:222): prog-id=53 op=LOAD
2024-08-15 12:06:15.495 homeassistant kernel: audit: type=1334 audit(1723723575.215:223): prog-id=54 op=LOAD
2024-08-15 12:06:15.520 homeassistant kernel: audit: type=1334 audit(1723723575.215:224): prog-id=55 op=LOAD
2024-08-15 12:06:15.560 homeassistant systemd[1]: Starting Hostname Service...
2024-08-15 12:06:16.315 homeassistant systemd[1]: Started Hostname Service.
2024-08-15 12:06:16.591 homeassistant kernel: audit: type=1334 audit(1723723576.587:225): prog-id=56 op=LOAD
2024-08-15 12:06:16.612 homeassistant kernel: audit: type=1334 audit(1723723576.587:226): prog-id=57 op=LOAD
2024-08-15 12:06:16.612 homeassistant kernel: audit: type=1334 audit(1723723576.587:227): prog-id=58 op=LOAD
2024-08-15 12:06:16.621 homeassistant systemd[1]: Starting Time & Date Service...
2024-08-15 12:06:17.057 homeassistant systemd[1]: Started Time & Date Service.
2024-08-15 12:06:47.251 homeassistant systemd[1]: systemd-hostnamed.service: Deactivated successfully.
2024-08-15 12:06:47.287 homeassistant systemd[1]: systemd-timedated.service: Deactivated successfully.
2024-08-15 12:06:47.669 homeassistant kernel: audit: type=1334 audit(1723723607.279:228): prog-id=58 op=UNLOAD
2024-08-15 12:06:47.669 homeassistant kernel: audit: type=1334 audit(1723723607.279:229): prog-id=57 op=UNLOAD
2024-08-15 12:06:47.669 homeassistant kernel: audit: type=1334 audit(1723723607.279:230): prog-id=56 op=UNLOAD
2024-08-15 12:06:47.669 homeassistant kernel: audit: type=1334 audit(1723723607.379:231): prog-id=55 op=UNLOAD
2024-08-15 12:06:47.670 homeassistant kernel: audit: type=1334 audit(1723723607.379:232): prog-id=54 op=UNLOAD
2024-08-15 12:06:47.670 homeassistant kernel: audit: type=1334 audit(1723723607.379:233): prog-id=53 op=UNLOAD
2024-08-15 12:09:12.642 homeassistant systemd[1]: var-lib-docker-overlay2-e787cc7fceda8f103aebabf4570728705be15a59c036dfad2478ccaf5c243f8f\x2dinit-merged.mount: Deactivated successfully.
2024-08-15 12:09:12.714 homeassistant systemd[1]: mnt-data-docker-overlay2-e787cc7fceda8f103aebabf4570728705be15a59c036dfad2478ccaf5c243f8f\x2dinit-merged.mount: Deactivated successfully.
2024-08-15 12:09:13.080 homeassistant kernel: hassio: port 7(vethfba167f) entered blocking state
2024-08-15 12:09:13.080 homeassistant kernel: hassio: port 7(vethfba167f) entered disabled state
2024-08-15 12:09:13.080 homeassistant kernel: vethfba167f: entered allmulticast mode
2024-08-15 12:09:13.091 homeassistant kernel: vethfba167f: entered promiscuous mode
2024-08-15 12:09:13.091 homeassistant kernel: audit: type=1700 audit(1723723753.060:234): dev=vethfba167f prom=256 old_prom=0 auid=4294967295 uid=0 gid=0 ses=4294967295
2024-08-15 12:09:13.091 homeassistant kernel: audit: type=1300 audit(1723723753.060:234): arch=c00000b7 syscall=206 success=yes exit=40 a0=d a1=4001e71830 a2=28 a3=0 items=0 ppid=1 pid=613 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="dockerd" exe="/usr/bin/dockerd" subj=unconfined key=(null)
2024-08-15 12:09:13.092 homeassistant kernel: audit: type=1327 audit(1723723753.060:234): proctitle=2F7573722F62696E2F646F636B657264002D480066643A2F2F002D2D636F6E7461696E6572643D2F72756E2F636F6E7461696E6572642F636F6E7461696E6572642E736F636B
2024-08-15 12:09:13.314 homeassistant NetworkManager[515]: <info>  [1723723753.2661] manager: (veth984781a): new Veth device (/org/freedesktop/NetworkManager/Devices/21)
2024-08-15 12:09:13.364 homeassistant NetworkManager[515]: <info>  [1723723753.3640] manager: (vethfba167f): new Veth device (/org/freedesktop/NetworkManager/Devices/22)
2024-08-15 12:09:13.453 homeassistant dockerd[613]: time="2024-08-15T12:09:13.452252945Z" level=warning msg="Failed to allocate and map port 443-443: Bind for 0.0.0.0:443 failed: port is already allocated"
2024-08-15 12:09:13.488 homeassistant kernel: hassio: port 7(vethfba167f) entered disabled state
2024-08-15 12:09:13.546 homeassistant kernel: vethfba167f (unregistering): left allmulticast mode
2024-08-15 12:09:13.551 homeassistant kernel: vethfba167f (unregistering): left promiscuous mode
2024-08-15 12:09:13.552 homeassistant kernel: hassio: port 7(vethfba167f) entered disabled state
2024-08-15 12:09:13.552 homeassistant kernel: audit: type=1700 audit(1723723753.480:235): dev=vethfba167f prom=0 old_prom=256 auid=4294967295 uid=0 gid=0 ses=4294967295
2024-08-15 12:09:13.583 homeassistant kernel: audit: type=1300 audit(1723723753.480:235): arch=c00000b7 syscall=206 success=yes exit=32 a0=d a1=4001ad2800 a2=20 a3=0 items=0 ppid=1 pid=613 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="dockerd" exe="/usr/bin/dockerd" subj=unconfined key=(null)
2024-08-15 12:09:13.584 homeassistant kernel: audit: type=1327 audit(1723723753.480:235): proctitle=2F7573722F62696E2F646F636B657264002D480066643A2F2F002D2D636F6E7461696E6572643D2F72756E2F636F6E7461696E6572642F636F6E7461696E6572642E736F636B
2024-08-15 12:09:13.641 homeassistant systemd[1]: var-lib-docker-overlay2-e787cc7fceda8f103aebabf4570728705be15a59c036dfad2478ccaf5c243f8f-merged.mount: Deactivated successfully.
2024-08-15 12:09:13.642 homeassistant systemd[1]: mnt-data-docker-overlay2-e787cc7fceda8f103aebabf4570728705be15a59c036dfad2478ccaf5c243f8f-merged.mount: Deactivated successfully.
2024-08-15 12:09:13.787 homeassistant dockerd[613]: time="2024-08-15T12:09:13.787270443Z" level=error msg="Handler for POST /v1.45/containers/77723e5f0e954941ff63a966cdf5ea19d0f2822af93f5a45296b50a9fd36dd25/start returned error: driver failed programming external connectivity on endpoint addon_core_nginx_proxy (7fed2bd72200cb354cc69be05973f0f46bf58c829c063888f32128da7ec8aa9f): Bind for 0.0.0.0:443 failed: port is already allocated" spanID=fec057372fd04fda traceID=e6acf3861d1db164c4a0c3f75b088501
2024-08-15 12:16:40.824 homeassistant dropbear[3315]: [3315] Aug 15 12:16:40 Child connection from fe80::cc1:d4ef:da1e:a2bf%end0:51348
2024-08-15 12:16:41.046 homeassistant dropbear[3315]: [3315] Aug 15 12:16:41 Pubkey auth succeeded for 'root' with ssh-rsa key SHA256:HdSjbsv1W4rF17b8pQxlXWLOxW/624uKFklM71lA7Eg from fe80::cc1:d4ef:da1e:a2bf%end0:51348
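The proctitle= fields in the audit lines above are hex-encoded command lines whose arguments are separated by NUL bytes. A minimal Python sketch for decoding them follows, assuming Python 3 is available on the machine where the log is read. The function name decode_proctitle is made up for illustration and is not part of any Home Assistant tooling.

```python
def decode_proctitle(hex_value):
    """Decode an audit `proctitle=` hex string into a readable command line.

    The kernel records argv as raw bytes with NUL separators; long command
    lines may be truncated by the audit subsystem, so the result can be cut off.
    """
    raw = bytes.fromhex(hex_value)
    # Split on NUL bytes and drop empty pieces (e.g. a trailing NUL).
    args = [part.decode("utf-8", errors="replace") for part in raw.split(b"\x00") if part]
    return " ".join(args)


# Value copied from the first type=1327 line in the log above.
print(decode_proctitle("69707461626C65732D6E6674002D5000464F525741524400414343455054"))
# prints: iptables-nft -P FORWARD ACCEPT
```

Decoded this way, the audit entries in this excerpt turn out to be routine firewall and container start-up calls rather than anything pointing at the freeze itself.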

System information

No response

Additional information

No response

dpgh947 commented 1 month ago

I just hit this. Pi4 with SSD. I updated from 12.4 to 13.0 and HA never came up. I had to retrieve the pi from where it lives on top of a cupboard and plug it in to a screen and keyboard to see what it was doing. It was booting to the CLI, then after a couple of minutes I was getting the same messages as shown above, and then it rebooted. Left alone, it just kept doing this, HA never came up.

I got into the CLI and then discovered via the command "ha os info" that there are 2 boot slots that it flip-flops between when an update is applied. If it fails to boot 3 times it reverts to the other slot, but this problem is obviously letting the OS come up far enough to prevent that from triggering. I then used the command "ha os boot-slot other" to force it to the other slot, which had the previous 12.4 in it. The system then came up OK.

EastArctica commented 1 month ago

I'm dealing with this too! Crashing every few minutes. Occasionally I'll also see it throw up a stack trace too but I don't know of any way to have those traces be saved. I'll try the boot slot change now...

Before:

# ha os info
board: rpi4-64
boot: B
boot_slots:
  A:
    state: inactive
    status: good
    version: null
  B:
    state: booted
    status: good
    version: "13.0"
data_disk: SN128-0x7e54b116
update_available: false
version: "13.0"
version_latest: "13.0"

After:

# ha os info
board: rpi4-64
boot: A
boot_slots:
  A:
    state: booted
    status: good
    version: null
  B:
    state: inactive
    status: good
    version: "13.0"
data_disk: SN128-0x7e54b116
update_available: true
version: "12.4"
version_latest: "13.0"

Looks like it's back on 12.4 even though the version is null under the boot slot version. Hopefully this fixes it for now...
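To check which slot "ha os boot-slot other" would switch to, the pasted output can be read programmatically. The following is only a sketch: it assumes the output has exactly the indented key: value layout shown above, and parse_os_info and fallback_slot are hypothetical helper names, not part of the ha CLI.

```python
SAMPLE = """\
board: rpi4-64
boot: B
boot_slots:
  A:
    state: inactive
    status: good
    version: null
  B:
    state: booted
    status: good
    version: "13.0"
data_disk: SN128-0x7e54b116
update_available: false
version: "13.0"
version_latest: "13.0"
"""


def parse_os_info(text):
    """Parse indented `key: value` lines into nested dicts (not a full YAML parser)."""
    root = {}
    stack = [(-1, root)]  # pairs of (indent level, dict at that level)
    for raw in text.splitlines():
        if not raw.strip():
            continue
        indent = len(raw) - len(raw.lstrip())
        key, _, value = raw.strip().partition(":")
        value = value.strip().strip('"')
        # Climb back up until the top of the stack is a real parent of this line.
        while stack[-1][0] >= indent:
            stack.pop()
        parent = stack[-1][1]
        if value == "":
            parent[key] = {}
            stack.append((indent, parent[key]))
        else:
            parent[key] = None if value == "null" else value
    return root


def fallback_slot(info):
    """Return (slot name, recorded version) of the slot that is not currently booted."""
    for name, slot in info["boot_slots"].items():
        if name != info["boot"]:
            return name, slot["version"]
    return None


print(fallback_slot(parse_os_info(SAMPLE)))  # prints: ('A', None)
```

A recorded version of None matches the "version: null" seen above: the slot can still hold a bootable older release even though no version is listed for it.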

Edit: Seems to have been reliable back on 12.4 for me. I was experiencing some crashes but I think that's because of another bug in HA Core to do with ESPHome running out of memory and crashing the host. I increased swap and it seems OK.

rrooggiieerr commented 1 month ago

@dpgh947 Thanks for looking into this, I have just rolled back to 12.4 using ha os update --version 12.4

Using your proposed ha os info I also see 2 boot slots:

board: rpi4-64
boot: B
boot_slots:
  A:
    state: inactive
    status: good
    version: "13.0"
  B:
    state: booted
    status: good
    version: "12.4"
data_disk: WX64G-0x14ec9103
update_available: true
version: "12.4"
version_latest: "13.0"

DeXter3306 commented 1 month ago

Thanks for the tip about reverting to 12.4! I have the same issue on an RPi 3B+ with a USB SSD (also tried with a USB HDD). When the Pi hangs like that, is it possible to extract some logs? I really want to help with this case.

sairon commented 1 month ago

The error messages basically mean the system is too busy to handle kernel tasks. Can you watch memory and CPU usage at the Hardware page to check whether it's not hitting limits before it becomes unresponsive? The common denominator for this report and @dipseth's in the other issue is RPi with 1GB RAM, which is sufficient only for very simple HA setup.

dpgh947 commented 1 month ago

Mine is a pi4 with 2gb, never had a problem before.

EastArctica commented 1 month ago

The error messages basically mean the system is too busy to handle kernel tasks. Can you watch memory and CPU usage at the Hardware page to check whether it's not hitting limits before it becomes unresponsive? The common denominator for this report and @dipseth's in the other issue is RPi with 1GB RAM, which is sufficient only for very simple HA setup.

I have pi4 with 4GB and was experiencing this. Although, when recording both top from the os and the info from the hardware tab I was unable to get the device to crash after 1 hour and 20 minutes. Perhaps I was only encountering this when I was doing something on home assistant?

DeXter3306 commented 1 month ago

The error messages basically mean the system is too busy to handle kernel tasks. Can you watch memory and CPU usage at the Hardware page to check whether it's not hitting limits before it becomes unresponsive? The common denominator for this report and @dipseth's in the other issue is RPi with 1GB RAM, which is sufficient only for very simple HA setup.

I understand that the RPi 3B+ is quite old and that after 5 years of loyal duty I should consider replacing it, but I'm still trying to understand why this happened after the 13.0 update and doesn't occur when I run 12.4.

The ram shortage was quite mitigated by my bigger SWAP config.

In order to downgrade to 12.4, I had to stop every add-on and avoid connecting to the UI (otherwise the crash would occur). After downgrading to 12.4, I re-enabled the add-ons and have had zero issues since.

EastArctica commented 1 month ago

This reminds me that in between when it was crashing and now, where it is no longer crashing, I increased my swap to prevent ESPHome crashes.

dipseth commented 1 month ago

The ram shortage was quite mitigated by my bigger SWAP config

I had done something similar, adding a 600 MB swap file. This is still required in 12.4, but it runs without any issues now.

rrooggiieerr commented 1 month ago

@sairon Are you saying this as a HA OS developer, or just making a statement? What are your sources that 1 GB is not enough? My current memory usage on HA OS 12.4 is 676.3 MiB/75 % with 53 integrations active, 155 devices and 3 add ons active

rrooggiieerr commented 1 month ago

@EastArctica you can run ESPHome on your laptop or local PC; this will definitely take some load off your HA device.

sairon commented 1 month ago

@rrooggiieerr

Are you saying this as a HA OS developer, or just making a statement?

I don't know what you mean by either of those, so I will elaborate a bit. On one hand, 1 GB RAM can be fine: of those who opted in to analytics, 7% are running on RPi 3, which only has 1 GB of RAM. However, your mileage may vary. 3 add-ons and 53 integrations is still something I'd call conservative usage, as the number of add-ons in particular makes a bigger difference. Compared to you, dipseth has 18 add-ons, and many people are using HA as a self-hosting platform, where RAM becomes scarce very soon. For other platforms a minimum of 2 GB RAM is generally recommended in the docs, and it's recommended for the CM4 in Yellow as well. From my purely personal experience, only one of the three real HA deployments I manage sits just slightly below 1 GB after a while of usage.

In your case, the remaining 25% of RAM can become insufficient quite quickly. Also consider that the system usually performs well only if it has some RAM available for page caches; if not, this can lead to higher I/O, which combined with swapping on a (rather) slow medium can be very detrimental to performance. Which leads me to another thought: the SD card you are using, while a good choice in terms of endurance, might not be the best overall for this usage. Per the description it's optimized for use in security cameras and has no Application Performance Class, so there's no IOPS guarantee. For HAOS it's recommended to use A2-class cards, which perform better in scenarios like this.

In summary, the issue definitely looks performance related. To see what's going on, connecting an HDMI display, typing login in the HA CLI and checking free and uptime (or just watching top) when the system becomes unstable could help confirm whether the theory is right.
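The memory check suggested here can also be reduced to a single number by reading /proc/meminfo, which is where free gets its figures on Linux. The sketch below assumes Python 3 is available, which may not be true in the HAOS host shell; it could instead be run inside a container or applied to text copied from the device. The helper names and the sample numbers are invented for illustration.

```python
# Made-up sample in /proc/meminfo format, roughly a 1 GB device at 75 % usage.
SAMPLE_MEMINFO = """\
MemTotal:        1000000 kB
MemFree:          100000 kB
MemAvailable:     250000 kB
SwapTotal:        500000 kB
SwapFree:         400000 kB
"""


def parse_meminfo(text):
    """Return /proc/meminfo fields as a dict of integer kB values."""
    fields = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def memory_summary(text):
    """Summarise RAM and swap usage from /proc/meminfo text."""
    info = parse_meminfo(text)
    total = info["MemTotal"]
    # MemAvailable estimates what can be used without swapping,
    # so it already accounts for reclaimable page cache.
    available = info["MemAvailable"]
    return {
        "used_pct": round(100.0 * (total - available) / total, 1),
        "available_kb": available,
        "swap_used_kb": info.get("SwapTotal", 0) - info.get("SwapFree", 0),
    }


print(memory_summary(SAMPLE_MEMINFO))
```

On a live system the same function could be fed open("/proc/meminfo").read() in a loop and the result appended to a file, so that the last readings before a freeze survive the reboot.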

DeXter3306 commented 1 month ago

Thanks for the insight. But, if I may, I still don't understand why it is working very well on 12.4 and not on 13.0, like my workload doesn't change between both. That's puzzling me right now :)

sairon commented 1 month ago

@DeXter3306 It's hard to tell without any details about your deployment. Doing what I suggested in the last paragraph of the previous post could help. Also, the kernel doesn't log all information by default, there might be more in dmesg or host logs (i.e. ha host logs or journalctl -e).
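Once host logs or dmesg output have been saved to a file, the hung-task warnings can be picked out automatically. This is a rough sketch, assuming the message format reported at the top of this issue ("task kthreadd:2 blocked for more than 120 seconds."); find_hung_tasks is a hypothetical helper, and the sample log lines are constructed for the example rather than taken from a real capture.

```python
import re

# Matches the kernel hung-task warning, e.g.
# "task kthreadd:2 blocked for more than 120 seconds."
HUNG_TASK = re.compile(
    r"task (?P<name>.+?):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)


def find_hung_tasks(lines):
    """Return (task name, pid, seconds) for every hung-task warning found."""
    hits = []
    for line in lines:
        match = HUNG_TASK.search(line)
        if match:
            hits.append((match["name"], int(match["pid"]), int(match["secs"])))
    return hits


# Constructed sample lines; only the warning text follows the reported format.
sample_log = [
    "homeassistant kernel: INFO: task kthreadd:2 blocked for more than 120 seconds.",
    "homeassistant systemd[1]: Started Hostname Service.",
]
print(find_hung_tasks(sample_log))  # prints: [('kthreadd', 2, 120)]
```

Seeing which tasks are blocked, and whether it is always kthreadd, would help tell a general resource shortage from a problem in one specific driver.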

EastArctica commented 1 month ago

In summary, the issue definitely looks performance related. To see what's going on, connecting an HDMI display, typing login in the HA CLI and checking free and uptime (or just watching top) when the system becomes unstable could help confirm whether the theory is right.

I made a backup of my current system and reverted back to one from when I was experiencing issues before and alas! I am experiencing them once again (didn't think I'd be happy to say that). So far I've been doing most of my monitoring and diag through ssh to the OS, but when the host seems to "die" it tends to kill the network too. Through the terminal on the OS itself (via keyboard + HDMI), how can I run shell commands? I'm sorta just stuck in the home assistant CLI and don't know how to break out of it...

I think I can confirm that the issue is HA Core related, as the backup I restored was ONLY HA Core and not a full backup. (Remember how I said that I couldn't get it to crash before and now I can? I did have the swap increase, but that shouldn't have persisted.)

Edit: Just saw memory spike to 3.6 GB right before the entire system froze (unable to type in the console, SSH won't connect, web server won't respond).

Same thing again: https://i.imgur.com/ljideho.png

Got a picture of the stack trace I was talking about before! (I apologize in advance for the quality; my system is entirely frozen after all, so I can't do much.) https://i.imgur.com/LrIZ4kN.png

The stack trace seems to show the system failing to exit cleanly because it's out of memory, and it's trying to exit in the first place because it ran out of memory...

After an hour of testing, it's the whisper model. I don't even know why I had it installed but the whisper addon is 100% what was causing my crashes at least. Currently restoring to my pre-testing backup and I'll reinstall the whisper addon to see if it causes crashes there too, then do the same test with a swap increase.

dpgh947 commented 1 month ago

I tried 13.1 this morning, still broken. HA was just starting to come up, then it rebooted, over and over. This time I managed to get in to ssh and issue the boot-slot other command to get back to 12.4 rather than getting the pi out again and plugging in a screen, so I can't confirm the same messages that I saw before, but whatever it is, I still cannot upgrade my 2gb pi4.

dpgh947 commented 1 month ago

Well, on a hunch, I have been playing around turning off "start at boot" for some addons. I turned off esphome, plex server, wireguard, samba and chose to install 13.1 again. It booted ok. I started esphome and samba manually, all ok. I turned on start at boot for esphome and samba, and rebooted, again came up ok. I started wireguard manually, it started ok. I started plex manually, system immediately stopped responding and then rebooted. Came up OK as I hadn't set those to start. I started plex manually first this time, started ok. Started wireguard manually, came up ok. Go figure...

I don't use plex or wireguard at the moment, so leaving start at boot for those turned off seems to have alleviated the problem. Whether this is due to an actual problem in one of those, or it's just some sort of resource issue during a full boot (maybe the order things happen has changed?) - I have no idea.

EDIT - it rebooted after about half an hour, back to 12.4 again.

rrooggiieerr commented 1 week ago

Is anyone else using the onboard serial port of the Raspberry Pi? In my case it's used by the Phoscon RaspBee II. I'm starting to think this could be one of the causes.