Open martymarty004 opened 3 months ago
I don't think it needs to be "the entire USB stack" to fail, all it needs is the USB drive to fail, then the behavior would be similar as you describe. Also, if there was a systematic problem of the USB stack, I'm pretty sure there would be more reports of this happening. Relocating the logs to a different media isn't an easy task, generally, on RPi it's better to use the data disk setup which would prevent this from happening - even if the data partition became inaccessible, the system would be still running and you could SSH into it (at least using the developer SSH on port 22222).
I know that having the system running on the internal MicroSD would be better, but I had too many cases of dead cards in my life, so I decided to move everything to the external drive.
The thing that seems strange to me is how all usb devices seem to stop working at the same time (I stop getting data from the UPS and zigbee devices go offline). HomeKit is maybe the service that lasts the longer (and the reason why nobody notices anything is wrong at first).
The web UI is one of the first things that stops working, followed by the debug ssh, but without access to the disk, there is plenty of space for bizarre behavior.
Still, I'd like to save the logs to another drive in order to see what's going on.
I have the same problem on a rpi 3B+, I disconnect the power and start again, the boot seems normal in sequence without errors, when the system finishes loading the web interface loses the connection, restarts again but fails to boot. I changed the power supply, tried with another ssd, changed the usb ssd adapter, changed the network cable. without result. I migrated my server to a virtual machine but that is not the idea.
Did you have a look at https://github.com/home-assistant/operating-system/issues/3362?
There's a workaround for USB related issues for Raspberry Pi boards, the dtoverlay=dwc2
hack (https://github.com/home-assistant/operating-system/issues/3362#issuecomment-2113334968). Maybe give it a try?
Describe the issue you are experiencing
In the latest months, my HA instance running on a RPi 4 Model B (4GB RAM, using USB SSD) was crashing at random times, while having some strange behavior. Initially I blamed the PSU, but after replacing it, the instance crashed just a few hours later.
Every time the server crashes, there is a small chance that the watchdog forces a reboot, but most of the time I come home to a dead RPi with the activity LED (both the one on the board and the one on the SSD) flashing with constant timing.
It was only after a few months that I noticed that the instance would start to slowly break way before it crashed completely: the homepage was fine, but maybe some pages were broken, and trying to do something different would often cause a reboot.
I finally had a revelation a week ago, when I configured the automation for when my Zigbee devices go offline. I got notified that all my devices were offline, like if coordinator was dead. After checking, I noticed that the coordinator had become unavailable, and it was the same for the UPS connected via USB. When I checked the RPi, the SSD activity light was flashing constantly. A few minutes later, the server managed to force a reboot.
From what I saw, I believe that the entire USB stack is crashing somehow, causing ZHA and NUT to fail, while removing access to the main disk, probably causing all the unexpected behavior.
At this point, I don't know how to further analyze the issue, because the system is way too broken in that state: SSH connections are refused and nothing happens when I plug in a monitor to see what's going on, and of course, the logs don't provide any useful information, since they can't be written to disk after the issue start.
The only thing I thought about in order to help you diagnose the problem was to maybe write the logs on the unused MicroSD so we are able to retrieve them, but I don't know how I should do that.
What operating system image do you use?
rpi4-64 (Raspberry Pi 4/400 64-bit OS)
What version of Home Assistant Operating System is installed?
13.1
Did the problem occur after upgrading the Operating System?
No
Hardware details
Main board: Raspberry Pi 4 Model B, 4GB RAM
USB devices:
ID: /dev/disk/by-id/ata-SATA_SSD_MX_00000000000000019
DEVLINKS: >- /dev/disk/by-diskseq/28 /dev/disk/by-id/ata-SATA_SSD_MX_00000000000000019 /dev/disk/by-id/usb-SSK_SSK_Storage_DD564198838FE-0:0 /dev/disk/by-id/wwn-0x50000000000028b9 /dev/disk/by-path/platform-fd500000.pcie-pci-0000:01:00.0-usb-0:1:1.0-scsi-0:0:0:0 /dev/disk/by-path/platform-fd500000.pcie-pci-0000:01:00.0-usbv3-0:1:1.0-scsi-0:0:0:0 DEVNAME: /dev/sda DEVPATH: >- /devices/platform/scb/fd500000.pcie/pci0000:00/0000:00:00.0/0000:01:00.0/usb2/2-1/2-1:1.0/host0/target0:0:0/0:0:0:0/block/sda DEVTYPE: disk DISKSEQ: '28' ID_ATA: '1' ID_ATA_DOWNLOAD_MICROCODE: '1' ID_ATA_FEATURE_SET_PM: '1' ID_ATA_FEATURE_SET_PM_ENABLED: '1' ID_ATA_FEATURE_SET_SECURITY: '1' ID_ATA_FEATURE_SET_SECURITY_ENABLED: '0' ID_ATA_FEATURE_SET_SECURITY_ENHANCED_ERASE_UNIT_MIN: '2' ID_ATA_FEATURE_SET_SECURITY_ERASE_UNIT_MIN: '2' ID_ATA_FEATURE_SET_SMART: '1' ID_ATA_FEATURE_SET_SMART_ENABLED: '1' ID_ATA_PERIPHERAL_DEVICE_TYPE: '0' ID_ATA_ROTATION_RATE_RPM: '0' ID_ATA_SATA: '1' ID_ATA_SATA_SIGNAL_RATE_GEN1: '1' ID_ATA_SATA_SIGNAL_RATE_GEN2: '1' ID_ATA_WRITE_CACHE: '1' ID_ATA_WRITE_CACHE_ENABLED: '1' ID_BUS: ata ID_MODEL: SATA_SSD ID_MODEL_ENC: >- SATA\x20SSD\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20 ID_PART_TABLE_TYPE: gpt ID_PART_TABLE_UUID: 78dfb754-d33b-4312-8b24-f9e9c1a5b65c ID_PATH: platform-fd500000.pcie-pci-0000:01:00.0-usb-0:1:1.0-scsi-0:0:0:0 ID_PATH_TAG: platform-fd500000_pcie-pci-0000_01_00_0-usb-0_1_1_0-scsi-0_0_0_0 ID_PATH_WITH_USB_REVISION: platform-fd500000.pcie-pci-0000:01:00.0-usbv3-0:1:1.0-scsi-0:0:0:0 ID_REVISION: H220614a ID_SERIAL: SATA_SSD_MX_00000000000000019 ID_SERIAL_SHORT: MX_00000000000000019 ID_TYPE: disk ID_USB_DRIVER: uas ID_USB_INSTANCE: '0:0' ID_USB_INTERFACES: ':080650:080662:' ID_USB_INTERFACE_NUM: '00' ID_USB_MODEL: SSK_Storage ID_USB_MODEL_ENC: SSK\x20Storage ID_USB_MODEL_ID: '0580' ID_USB_REVISION: '7601' ID_USB_SERIAL: SSK_SSK_Storage_DD564198838FE-0:0 ID_USB_SERIAL_SHORT: DD564198838FE ID_USB_TYPE: disk ID_USB_VENDOR: SSK ID_USB_VENDOR_ENC: SSK\x20\x20\x20\x20\x20 ID_USB_VENDOR_ID: 152d ID_WWN: '0x50000000000028b9' ID_WWN_WITH_EXTENSION: '0x50000000000028b9' MAJOR: '8' MINOR: '0' SUBSYSTEM: block TAGS: ':systemd:' USEC_INITIALIZED: '2857679'
Anything in the Host logs that might be useful for us?
System information
System Information
Home Assistant Community Store
GitHub API | ok -- | -- GitHub Content | ok GitHub Web | ok HACS Data | ok GitHub API Calls Remaining | 5000 Installed Version | 2.0.0 Stage | running Available Repositories | 1388 Downloaded Repositories | 13Home Assistant Cloud
logged_in | false -- | -- can_reach_cert_server | ok can_reach_cloud_auth | ok can_reach_cloud | okHome Assistant Supervisor
host_os | Home Assistant OS 13.1 -- | -- update_channel | stable supervisor_version | supervisor-2024.08.0 agent_version | 1.6.0 docker_version | 26.1.4 disk_total | 116.7 GB disk_used | 12.4 GB healthy | true supported | true host_connectivity | true supervisor_connectivity | true ntp_synchronized | true virtualization | board | rpi4-64 supervisor_api | ok version_api | ok installed_addons | File editor (5.8.0), chrony (3.0.1), Terminal & SSH (9.14.0), Mosquitto broker (6.4.1), Duck DNS (1.18.0), AdGuard Home (5.1.1), Home Assistant Google Drive Backup (0.112.1), Vaultwarden (Bitwarden) (0.22.0), Log Viewer (0.17.0), Samba share (12.3.2), Network UPS Tools (0.13.1), WireGuard (0.10.2), Glances (0.21.1), UniFi Network Application (3.2.0), Nginx Proxy Manager (1.0.1), MariaDB (2.7.1), Node-RED (18.0.5), Vaultwarden (Bitwarden) (0.3.1), DHCP server (1.4.0), ESPHome (2024.8.1)Dashboards
dashboards | 2 -- | -- resources | 11 views | 5 mode | storageRecorder
oldest_recorder_run | 26 agosto 2024 alle ore 12:23 -- | -- current_recorder_run | 2 settembre 2024 alle ore 02:15 estimated_db_size | 333.61 MiB database_engine | sqlite database_version | 3.45.3Additional information
No response