home-assistant / core

:house_with_garden: Open source home automation that puts local control and privacy first.
https://www.home-assistant.io
Apache License 2.0
73.81k stars 30.9k forks source link

Home Assistant boot process gets stuck while loading Rainforest RAVEn integration. #113473

Open rehanson opened 8 months ago

rehanson commented 8 months ago

The problem

Home Assistant got stuck in the startup process while loading the Rainforest RAVEn integration. While it was stuck, I wasn't able to access the Overview dashboard because Home Assistant said it was still booting up (I could still navigate to menus and custom dashboards).

Sequence of events:

What version of Home Assistant Core has the issue?

core-2024.3.1

What was the last working version of Home Assistant Core?

core-2024.3.0

What type of installation are you running?

Home Assistant OS

Integration causing the issue

rainforest_raven

Link to integration documentation on our website

https://www.home-assistant.io/integrations/rainforest_raven/

Diagnostics information

config_entry-rainforest_raven-2a01a9ec98f5f6849b766186260f7c24(1).json

Example YAML snippet

No response

Anything in the logs that might be useful for us?

While bootup was "stuck" there were many repeats of this message:
...
2024-03-14 16:55:11.210 WARNING (MainThread) [homeassistant.bootstrap] Waiting on integrations to complete setup: rainforest_raven
...

------log below occurs around the time I unplugged the EMU-2 USB cable and bootup became "unstuck"------

2024-03-14 16:58:11.355 WARNING (MainThread) [homeassistant.bootstrap] Waiting on integrations to complete setup: rainforest_raven
2024-03-14 16:58:18.859 WARNING (MainThread) [homeassistant.components.androidtv_remote] Disconnected from Living Room TV at 192.168.1.45
2024-03-14 16:58:18.993 ERROR (MainThread) [androidtvremote2] Device doesn't support sending keys. Try clearing the storage of the Android TV Remove Service system app. On the Android TV device, go to Settings > Apps > See all apps > Show system apps. Then, select Android TV Remote Service > Storage > Clear data/storage.
2024-03-14 16:58:22.254 ERROR (MainThread) [homeassistant] Error doing job: Fatal write error on serial transport
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/serial/serialposix.py", line 621, in write
    n = os.write(self.fd, d)
        ^^^^^^^^^^^^^^^^^^^^
OSError: [Errno 5] I/O error

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/serial_asyncio/__init__.py", line 261, in _write_ready
    n = self._serial.write(data)
        ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/serial/serialposix.py", line 655, in write
    raise SerialException('write failed: {}'.format(e))
serial.serialutil.SerialException: write failed: [Errno 5] I/O error
2024-03-14 16:58:22.292 WARNING (MainThread) [homeassistant.helpers.system_info] No Home Assistant Supervisor info available
2024-03-14 16:58:34.998 WARNING (MainThread) [homeassistant.components.androidtv_remote] Disconnected from Living Room TV at 192.168.1.45
2024-03-14 16:58:35.134 ERROR (MainThread) [androidtvremote2] Device doesn't support sending keys. Try clearing the storage of the Android TV Remove Service system app. On the Android TV device, go to Settings > Apps > See all apps > Show system apps. Then, select Android TV Remote Service > Storage > Clear data/storage.
2024-03-14 16:58:51.141 WARNING (MainThread) [homeassistant.components.androidtv_remote] Disconnected from Living Room TV at 192.168.1.45
2024-03-14 16:58:51.277 ERROR (MainThread) [androidtvremote2] Device doesn't support sending keys. Try clearing the storage of the Android TV Remove Service system app. On the Android TV device, go to Settings > Apps > See all apps > Show system apps. Then, select Android TV Remote Service > Storage > Clear data/storage.
2024-03-14 16:59:07.285 WARNING (MainThread) [homeassistant.components.androidtv_remote] Disconnected from Living Room TV at 192.168.1.45
2024-03-14 16:59:07.415 ERROR (MainThread) [androidtvremote2] Device doesn't support sending keys. Try clearing the storage of the Android TV Remove Service system app. On the Android TV device, go to Settings > Apps > See all apps > Show system apps. Then, select Android TV Remote Service > Storage > Clear data/storage.
2024-03-14 16:59:23.420 WARNING (MainThread) [homeassistant.components.androidtv_remote] Disconnected from Living Room TV at 192.168.1.45
2024-03-14 16:59:23.553 ERROR (MainThread) [androidtvremote2] Device doesn't support sending keys. Try clearing the storage of the Android TV Remove Service system app. On the Android TV device, go to Settings > Apps > See all apps > Show system apps. Then, select Android TV Remote Service > Storage > Clear data/storage.

Additional information

Even while the Rainforest RAVEn integration is running normally, there are frequent occurrences of "Unavailable" states. This causes the history graph to have many gaps, and dashboards showing the EMU-2 status will often show "Unavailable" instead of the last known value. image

home-assistant[bot] commented 8 months ago

Hey there @cottsay, mind taking a look at this issue as it has been labeled with an integration (rainforest_raven) you are listed as a code owner for? Thanks!

Code owner commands Code owners of `rainforest_raven` can trigger bot actions by commenting: - `@home-assistant close` Closes the issue. - `@home-assistant rename Awesome new title` Renames the issue. - `@home-assistant reopen` Reopen the issue. - `@home-assistant unassign rainforest_raven` Removes the current integration label and assignees on the issue, add the integration domain after the command. - `@home-assistant add-label needs-more-information` Add a label (needs-more-information, problem in dependency, problem in custom component) to the issue. - `@home-assistant remove-label needs-more-information` Remove a label (needs-more-information, problem in dependency, problem in custom component) on the issue.

(message by CodeOwnersMention)


rainforest_raven documentation rainforest_raven source (message by IssueLinks)

cottsay commented 8 months ago

Thanks for the report, @rehanson. You mentioned that you're using a VM. How is the EMU-2 forwarded from the host to the VM? As a USB device or a serial device?

cottsay commented 8 months ago

Ah, I can see from the diagnostics that it appears in the VM as a USB device.

There were some handle leaks in 2024.2.0-2024.3.0 that I believe were causing a multitude of problems, including the gaps in the history. From what I can tell, the leaked handles would accumulate to the point where Home Assistant was no longer able to maintain a connection at all, resulting in bugs like #111886.

I also observed some problems in my VM after letting 2024.3.0 run to the point where the handle leaks saturated and needed to disconnect and reconnect the device, though I haven't been able to reproduce the problem now that the handle leaks are fixed. Please keep me updated with any changes in behavior.

rehanson commented 7 months ago

@cottsay I just updated to Home Assistant Core 2024.3.3 (from 2024.3.1). This update included the aioraven bump to 0.5.2. I had no issues getting stuck during the boot process after the update. And I haven't had any gaps in the history like I used to get. I've been running for ~3hrs with no issues yet. Before the update, I'd have about one "Unavailable" gap per minute.

cottsay commented 7 months ago

Thank you for reporting back. Glad to hear things are looking better.

We can keep this open for another update cycle or so, but if nobody is seeing it any longer then I'm willing to bet that my initial instinct was right and that qemu was misbehaving after the guest leaked all those unclosed handles. Seems a full VM shutdown or device disconnect are the only ways to get back to a good state when that happens.

It's been a day now. Any errors in the logs? Any chance #113539 stopped occurring too?

rehanson commented 7 months ago

In almost two days of running the latest update, I haven't had any gaps in my history! https://github.com/home-assistant/core/issues/113539 is still occurring however. I will add more comments over on that issue.

EDIT: There were actually four gaps. See a couple of comments down.

rehanson commented 7 months ago

Oh, and I haven't really rebooted enough times to see if the boot process still gets stuck, but based on your explanation, I doubt I will encounter it again.

rehanson commented 7 months ago

Now that I have looked closer at the logs, it did timeout four times (much less of an issue than before!) in the last two days. And I can see 30 second gaps in the history graph at those four points in time.

Logger: homeassistant.components.rainforest_raven.coordinator
Source: helpers/update_coordinator.py:324
integration: Rainforest RAVEn (documentation, issues)
First occurred: March 26, 2024 at 6:48:18 AM (4 occurrences)
Last logged: 2:07:33 AM

Timeout fetching rainforest_raven data

Here are those four occurrences:

2024-03-26 06:48:18.388 ERROR (MainThread) [homeassistant.components.rainforest_raven.coordinator] Timeout fetching rainforest_raven data
2024-03-26 09:59:53.361 ERROR (MainThread) [homeassistant.components.rainforest_raven.coordinator] Timeout fetching rainforest_raven data
2024-03-27 02:05:28.408 ERROR (MainThread) [homeassistant.components.rainforest_raven.coordinator] Timeout fetching rainforest_raven data
2024-03-27 02:07:33.359 ERROR (MainThread) [homeassistant.components.rainforest_raven.coordinator] Timeout fetching rainforest_raven data
cottsay commented 7 months ago

Those are probably legitimate timeouts now. There could be dropped or corrupted messages over the serial link or the EMU-2's wireless link to the meter. There's a lot that can go wrong here.

It's too bad that the data reflects the timeout with a gap like that. We don't want to present stale data, so the only thing I can think to do is to retry the query once or twice. Would it be worth it to add that logic? How much of a problem are the gaps in the data?

rehanson commented 7 months ago

With how rare the timeouts are now, it's not really a problem for me.

rehanson commented 7 months ago

I just ran into the stuck bootup issue again today during a reboot of my Proxmox server (Home Assistant runs as a VM on the server). I am running Core 2024.3.3

Logger: homeassistant.bootstrap
Source: bootstrap.py:601
First occurred: 1:59:34 PM (3 occurrences)
Last logged: 2:01:34 PM

Waiting on integrations to complete setup: rainforest_raven

I was only able to complete HA startup by unplugging the Rainforest EMU-2 meter.

cottsay commented 7 months ago

I just ran into the stuck bootup issue again today...

Rats, thanks for the feedback. I assume this was an update from 2024.3.3 to 2024.4.0?

rehanson commented 7 months ago

Nope, I hadn't yet updated to 2024.4.0 when it happened. (I am currently updating now)

issue-triage-workflows[bot] commented 4 months ago

There hasn't been any activity on this issue recently. Due to the high number of incoming GitHub notifications, we have to clean some of the old issues, as many of them have already been resolved with the latest updates. Please make sure to update to the latest Home Assistant version and check if that solves the issue. Let us know if that works for you by adding a comment 👍 This issue has now been marked as stale and will be closed if no further activity occurs. Thank you for your contributions.

rehanson commented 4 months ago

My system still hangs up occasionally when trying to boot. It shows a message about waiting for the RAVEn integration to start.

It most recently happened on 2024.6.4

cottsay commented 3 months ago

FYI @rehanson, 2024.7.3 included changes to the way rainforest_raven handles initialization errors, and the underlying serial library was updated with changes to device closure that are also in this space.

Since I still can't reproduce these problems locally I can't confirm that they will solve or mitigate this problem, but either of those changes could have changed the behavior. Let me know if you see anything different.

rehanson commented 3 months ago

Thanks @cottsay, I'll report back if I see the issue again. (And I'll try to report back in a month or so if I don't see it again)

tsaro commented 3 months ago

I saw this issue again today after upgrading to hassos 13.0, I had reverted the previous update (I think to 12.3 from 12.4) due to this integration not starting. The integration ran fine with the older hassos. After I saw the hang on 13.0 I tried unplugging the rainforest device and re-plugging and that seemed to work, startup completed quickly and I was getting power data. I did a second full reboot to test and had the same symptoms. Restarting just HA after this (no reboot) and everything came up normally so this seems to only occur on full OS reboots for me.

hassos: "13.0" homeassistant: 2024.8.1 machine: raspberrypi4-64

tsaro commented 2 months ago

I'm still seeing the same behavior after upgrading hassos 13.1, unplugging the rainforest device after a full OS reboot restored data, no issues with subquent HA restarts

rehanson commented 2 months ago

I just ran into this again after upgrading to HA OS 13.1. The initial reboot after upgrading stalled without me being able to get to the Web UI, so I rebooted again, was able to get to the Web UI, but with the dreaded message again: Starting Rainforest RAVEn, not everything will be available until it is finished.

It finished booting when I unplugged the EMU-2.

Core: 2024.8.2

rossica commented 1 month ago

I have also hit this, on Home Assistant 2024.10.1, running in docker, with the EMU-2 exposed into the docker container as /dev/ttyACM0. The EMU-2 is actually connected to another raspberry pi, forwarding it over the network via usbip. I went to test the power loss recovery on the remote device, and when it came back online, understandably, RAVEn wasn't connected, so I restarted the HASS container and hit this issue on startup.

tsaro commented 1 month ago

just upgraded to hassos 13.2 and same issue/workaround, hung at the Rainforest integration trying to start (waited 10 minutes before unplugging and re-plugging USB, then data started flowing again)

mdeneen commented 1 month ago

I started running into this as well. I had to remove the device from the VM, add it again, and restart the RAVEn integration.