home-assistant / operating-system

:beginner: Home Assistant Operating System
Apache License 2.0
4.86k stars 965 forks source link

13.1 - System crashing probably due to USB driver #3575

Open martymarty004 opened 1 month ago

martymarty004 commented 1 month ago

Describe the issue you are experiencing

In the latest months, my HA instance running on a RPi 4 Model B (4GB RAM, using USB SSD) was crashing at random times, while having some strange behavior. Initially I blamed the PSU, but after replacing it, the instance crashed just a few hours later.

Every time the server crashes, there is a small chance that the watchdog forces a reboot, but most of the time I come home to a dead RPi with the activity LED (both the one on the board and the one on the SSD) flashing with constant timing.

It was only after a few months that I noticed that the instance would start to slowly break way before it crashed completely: the homepage was fine, but maybe some pages were broken, and trying to do something different would often cause a reboot.

I finally had a revelation a week ago, when I configured the automation for when my Zigbee devices go offline. I got notified that all my devices were offline, like if coordinator was dead. After checking, I noticed that the coordinator had become unavailable, and it was the same for the UPS connected via USB. When I checked the RPi, the SSD activity light was flashing constantly. A few minutes later, the server managed to force a reboot.

From what I saw, I believe that the entire USB stack is crashing somehow, causing ZHA and NUT to fail, while removing access to the main disk, probably causing all the unexpected behavior.

At this point, I don't know how to further analyze the issue, because the system is way too broken in that state: SSH connections are refused and nothing happens when I plug in a monitor to see what's going on, and of course, the logs don't provide any useful information, since they can't be written to disk after the issue start.

The only thing I thought about in order to help you diagnose the problem was to maybe write the logs on the unused MicroSD so we are able to retrieve them, but I don't know how I should do that.

What operating system image do you use?

rpi4-64 (Raspberry Pi 4/400 64-bit OS)

What version of Home Assistant Operating System is installed?

13.1

Did the problem occur after upgrading the Operating System?

No

Hardware details

Main board: Raspberry Pi 4 Model B, 4GB RAM

USB devices:

ID: /dev/disk/by-id/ata-SATA_SSD_MX_00000000000000019

DEVLINKS: >- /dev/disk/by-diskseq/28 /dev/disk/by-id/ata-SATA_SSD_MX_00000000000000019 /dev/disk/by-id/usb-SSK_SSK_Storage_DD564198838FE-0:0 /dev/disk/by-id/wwn-0x50000000000028b9 /dev/disk/by-path/platform-fd500000.pcie-pci-0000:01:00.0-usb-0:1:1.0-scsi-0:0:0:0 /dev/disk/by-path/platform-fd500000.pcie-pci-0000:01:00.0-usbv3-0:1:1.0-scsi-0:0:0:0 DEVNAME: /dev/sda DEVPATH: >- /devices/platform/scb/fd500000.pcie/pci0000:00/0000:00:00.0/0000:01:00.0/usb2/2-1/2-1:1.0/host0/target0:0:0/0:0:0:0/block/sda DEVTYPE: disk DISKSEQ: '28' ID_ATA: '1' ID_ATA_DOWNLOAD_MICROCODE: '1' ID_ATA_FEATURE_SET_PM: '1' ID_ATA_FEATURE_SET_PM_ENABLED: '1' ID_ATA_FEATURE_SET_SECURITY: '1' ID_ATA_FEATURE_SET_SECURITY_ENABLED: '0' ID_ATA_FEATURE_SET_SECURITY_ENHANCED_ERASE_UNIT_MIN: '2' ID_ATA_FEATURE_SET_SECURITY_ERASE_UNIT_MIN: '2' ID_ATA_FEATURE_SET_SMART: '1' ID_ATA_FEATURE_SET_SMART_ENABLED: '1' ID_ATA_PERIPHERAL_DEVICE_TYPE: '0' ID_ATA_ROTATION_RATE_RPM: '0' ID_ATA_SATA: '1' ID_ATA_SATA_SIGNAL_RATE_GEN1: '1' ID_ATA_SATA_SIGNAL_RATE_GEN2: '1' ID_ATA_WRITE_CACHE: '1' ID_ATA_WRITE_CACHE_ENABLED: '1' ID_BUS: ata ID_MODEL: SATA_SSD ID_MODEL_ENC: >- SATA\x20SSD\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20 ID_PART_TABLE_TYPE: gpt ID_PART_TABLE_UUID: 78dfb754-d33b-4312-8b24-f9e9c1a5b65c ID_PATH: platform-fd500000.pcie-pci-0000:01:00.0-usb-0:1:1.0-scsi-0:0:0:0 ID_PATH_TAG: platform-fd500000_pcie-pci-0000_01_00_0-usb-0_1_1_0-scsi-0_0_0_0 ID_PATH_WITH_USB_REVISION: platform-fd500000.pcie-pci-0000:01:00.0-usbv3-0:1:1.0-scsi-0:0:0:0 ID_REVISION: H220614a ID_SERIAL: SATA_SSD_MX_00000000000000019 ID_SERIAL_SHORT: MX_00000000000000019 ID_TYPE: disk ID_USB_DRIVER: uas ID_USB_INSTANCE: '0:0' ID_USB_INTERFACES: ':080650:080662:' ID_USB_INTERFACE_NUM: '00' ID_USB_MODEL: SSK_Storage ID_USB_MODEL_ENC: SSK\x20Storage ID_USB_MODEL_ID: '0580' ID_USB_REVISION: '7601' ID_USB_SERIAL: SSK_SSK_Storage_DD564198838FE-0:0 ID_USB_SERIAL_SHORT: DD564198838FE ID_USB_TYPE: disk ID_USB_VENDOR: SSK ID_USB_VENDOR_ENC: SSK\x20\x20\x20\x20\x20 ID_USB_VENDOR_ID: 152d ID_WWN: '0x50000000000028b9' ID_WWN_WITH_EXTENSION: '0x50000000000028b9' MAJOR: '8' MINOR: '0' SUBSYSTEM: block TAGS: ':systemd:' USEC_INITIALIZED: '2857679'


- ConBee II from Dresden Elektronik
- UPS ERA PLUS 1100 from Tecnoware

### Steps to reproduce the issue

Nothing specific, the error occurs randomly. It could happen after a few hours or even weeks.

### Anything in the Supervisor logs that might be useful for us?

```txt
Logs are not written to disk, there's the usual warning about the unfinished session.

2024-09-02 02:15:38.702 WARNING (Recorder) [homeassistant.components.recorder.util] The system could not validate that the sqlite3 database at //config/home-assistant_v2.db was shutdown cleanly
2024-09-02 02:15:38.793 WARNING (Recorder) [homeassistant.components.recorder.util] Ended unfinished session (id=676 from 2024-08-26 19:04:14.190714)

(HA UI shows local time (UTC+2) instead of UTC)

Anything in the Host logs that might be useful for us?

Logs are not written to disk. The system looked fine, but the disk had been offline for hours before the crash. Here's what journaltcl recorded:

Sep 02 00:14:18 homeassistant systemd-timesyncd[594]: Initial clock synchronization to Mon 2024-09-02 00:14:18.305382 UTC.
Sep 02 00:14:18 homeassistant systemd-timesyncd[594]: Contacted time server [2606:4700:f1::1]:123 (time.cloudflare.com).
Sep 02 00:14:18 homeassistant systemd-time-wait-sync[593]: adjtime state 0 status 2000 time Mon 2024-09-02 00:14:18.305499 UTC
Sep 02 00:14:18 homeassistant systemd-resolved[485]: Clock change detected. Flushing caches.
Sep 02 00:12:04 homeassistant systemd[1]: Reached target System Time Set.
Sep 02 00:12:04 homeassistant systemd[1]: Started Network Time Synchronization.
Sep 02 00:12:04 homeassistant systemd-time-wait-sync[593]: adjtime state 5 status 40 time Mon 2024-09-02 00:12:04.799924 UTC
Sep 02 00:12:04 homeassistant systemd-timesyncd[594]: System clock time unset or jumped backwards, restored from recorded timestamp: Mon 2024-09-02 00:12:04 UTC
Sep 02 00:12:04 homeassistant systemd-resolved[485]: Clock change detected. Flushing caches.
-- Boot 634cf3d30c344ba19b0f66394ac5ade8 --
Sep 01 16:25:26 homeassistant addon_core_configurator[608]: INFO:2024-09-01 18:25:26,771:hass_configurator.configurator:127.0.0.1 - "GET / HTTP/1.1" 200 -
Sep 01 16:25:14 homeassistant homeassistant[608]: 2024-09-01 18:25:14.727 INFO (MainThread) [homeassistant.components.automation.nuova_automazione_2] TRADFRI Soggiorno 1: Executing step setting variables
Sep 01 16:25:14 homeassistant homeassistant[608]: 2024-09-01 18:25:14.727 INFO (MainThread) [homeassistant.components.automation.nuova_automazione_2] TRADFRI Soggiorno 1: Running automation actions
Sep 01 16:25:06 homeassistant addon_a0d7b954_wireguard[608]:   persistent keepalive: every 25 seconds
Sep 01 16:25:06 homeassistant addon_a0d7b954_wireguard[608]:   allowed ips: *REDACTED*
Sep 01 16:25:06 homeassistant addon_a0d7b954_wireguard[608]: peer: *REDACTED*
Sep 01 16:25:06 homeassistant addon_a0d7b954_wireguard[608]:
Sep 01 16:25:06 homeassistant addon_a0d7b954_wireguard[608]:   persistent keepalive: every 25 seconds
Sep 01 16:25:06 homeassistant addon_a0d7b954_wireguard[608]:   allowed ips: *REDACTED*
Sep 01 16:25:06 homeassistant addon_a0d7b954_wireguard[608]: peer: *REDACTED*
Sep 01 16:25:06 homeassistant addon_a0d7b954_wireguard[608]:
Sep 01 16:25:06 homeassistant addon_a0d7b954_wireguard[608]:   persistent keepalive: every 25 seconds
Sep 01 16:25:06 homeassistant addon_a0d7b954_wireguard[608]:   allowed ips: *REDACTED*
Sep 01 16:25:06 homeassistant addon_a0d7b954_wireguard[608]: peer: *REDACTED*
Sep 01 16:25:06 homeassistant addon_a0d7b954_wireguard[608]:
Sep 01 16:25:06 homeassistant addon_a0d7b954_wireguard[608]:   listening port: *REDACTED*
Sep 01 16:25:06 homeassistant addon_a0d7b954_wireguard[608]:   private key: (hidden)
Sep 01 16:25:06 homeassistant addon_a0d7b954_wireguard[608]:   public key: *REDACTED*
Sep 01 16:25:06 homeassistant addon_a0d7b954_wireguard[608]: interface: wg0
Sep 01 16:25:06 homeassistant addon_a0d7b954_wireguard[608]: [18:25:06] INFO: Requesting current status from WireGuard...
Sep 01 16:25:04 homeassistant hassio_dns[608]: [INFO] 172.30.32.1:34154 - 25880 "A IN wlan0.local.hass.io. udp 37 false 512" NXDOMAIN qr,aa,rd 37 0.00030805s
Sep 01 16:24:56 homeassistant addon_core_configurator[608]: INFO:2024-09-01 18:24:56,594:hass_configurator.configurator:127.0.0.1 - "GET / HTTP/1.1" 200 -

System information

System Information

version core-2024.8.3
installation_type Home Assistant OS
dev false
hassio true
docker true
user root
virtualenv false
python_version 3.12.4
os_name Linux
os_version 6.6.31-haos-raspi
arch aarch64
timezone Europe/Rome
config_dir /config
Home Assistant Community Store GitHub API | ok -- | -- GitHub Content | ok GitHub Web | ok HACS Data | ok GitHub API Calls Remaining | 5000 Installed Version | 2.0.0 Stage | running Available Repositories | 1388 Downloaded Repositories | 13
Home Assistant Cloud logged_in | false -- | -- can_reach_cert_server | ok can_reach_cloud_auth | ok can_reach_cloud | ok
Home Assistant Supervisor host_os | Home Assistant OS 13.1 -- | -- update_channel | stable supervisor_version | supervisor-2024.08.0 agent_version | 1.6.0 docker_version | 26.1.4 disk_total | 116.7 GB disk_used | 12.4 GB healthy | true supported | true host_connectivity | true supervisor_connectivity | true ntp_synchronized | true virtualization | board | rpi4-64 supervisor_api | ok version_api | ok installed_addons | File editor (5.8.0), chrony (3.0.1), Terminal & SSH (9.14.0), Mosquitto broker (6.4.1), Duck DNS (1.18.0), AdGuard Home (5.1.1), Home Assistant Google Drive Backup (0.112.1), Vaultwarden (Bitwarden) (0.22.0), Log Viewer (0.17.0), Samba share (12.3.2), Network UPS Tools (0.13.1), WireGuard (0.10.2), Glances (0.21.1), UniFi Network Application (3.2.0), Nginx Proxy Manager (1.0.1), MariaDB (2.7.1), Node-RED (18.0.5), Vaultwarden (Bitwarden) (0.3.1), DHCP server (1.4.0), ESPHome (2024.8.1)
Dashboards dashboards | 2 -- | -- resources | 11 views | 5 mode | storage
Recorder oldest_recorder_run | 26 agosto 2024 alle ore 12:23 -- | -- current_recorder_run | 2 settembre 2024 alle ore 02:15 estimated_db_size | 333.61 MiB database_engine | sqlite database_version | 3.45.3

Additional information

No response

sairon commented 2 weeks ago

I don't think it needs to be "the entire USB stack" to fail, all it needs is the USB drive to fail, then the behavior would be similar as you describe. Also, if there was a systematic problem of the USB stack, I'm pretty sure there would be more reports of this happening. Relocating the logs to a different media isn't an easy task, generally, on RPi it's better to use the data disk setup which would prevent this from happening - even if the data partition became inaccessible, the system would be still running and you could SSH into it (at least using the developer SSH on port 22222).

martymarty004 commented 2 weeks ago

I know that having the system running on the internal MicroSD would be better, but I had too many cases of dead cards in my life, so I decided to move everything to the external drive.

The thing that seems strange to me is how all usb devices seem to stop working at the same time (I stop getting data from the UPS and zigbee devices go offline). HomeKit is maybe the service that lasts the longer (and the reason why nobody notices anything is wrong at first).

The web UI is one of the first things that stops working, followed by the debug ssh, but without access to the disk, there is plenty of space for bizarre behavior.

Still, I'd like to save the logs to another drive in order to see what's going on.

brotwi commented 2 weeks ago

I have the same problem on a rpi 3B+, I disconnect the power and start again, the boot seems normal in sequence without errors, when the system finishes loading the web interface loses the connection, restarts again but fails to boot. I changed the power supply, tried with another ssd, changed the usb ssd adapter, changed the network cable. without result. I migrated my server to a virtual machine but that is not the idea.

bcutter commented 1 week ago

Did you have a look at https://github.com/home-assistant/operating-system/issues/3362?

There's a workaround for USB related issues for Raspberry Pi boards, the dtoverlay=dwc2 hack (https://github.com/home-assistant/operating-system/issues/3362#issuecomment-2113334968). Maybe give it a try?