home-assistant / core

:house_with_garden: Open source home automation that puts local control and privacy first.
https://www.home-assistant.io
Apache License 2.0
69.91k stars 29k forks source link

Home Assistant boot process gets stuck while loading Rainforest RAVEn integration. #113473

Open rehanson opened 3 months ago

rehanson commented 3 months ago

The problem

Home Assistant got stuck in the startup process while loading the Rainforest RAVEn integration. While it was stuck, I wasn't able to access the Overview dashboard because Home Assistant said it was still booting up (I could still navigate to menus and custom dashboards).

Sequence of events:

What version of Home Assistant Core has the issue?

core-2024.3.1

What was the last working version of Home Assistant Core?

core-2024.3.0

What type of installation are you running?

Home Assistant OS

Integration causing the issue

rainforest_raven

Link to integration documentation on our website

https://www.home-assistant.io/integrations/rainforest_raven/

Diagnostics information

config_entry-rainforest_raven-2a01a9ec98f5f6849b766186260f7c24(1).json

Example YAML snippet

No response

Anything in the logs that might be useful for us?

While bootup was "stuck" there were many repeats of this message:
...
2024-03-14 16:55:11.210 WARNING (MainThread) [homeassistant.bootstrap] Waiting on integrations to complete setup: rainforest_raven
...

------log below occurs around the time I unplugged the EMU-2 USB cable and bootup became "unstuck"------

2024-03-14 16:58:11.355 WARNING (MainThread) [homeassistant.bootstrap] Waiting on integrations to complete setup: rainforest_raven
2024-03-14 16:58:18.859 WARNING (MainThread) [homeassistant.components.androidtv_remote] Disconnected from Living Room TV at 192.168.1.45
2024-03-14 16:58:18.993 ERROR (MainThread) [androidtvremote2] Device doesn't support sending keys. Try clearing the storage of the Android TV Remove Service system app. On the Android TV device, go to Settings > Apps > See all apps > Show system apps. Then, select Android TV Remote Service > Storage > Clear data/storage.
2024-03-14 16:58:22.254 ERROR (MainThread) [homeassistant] Error doing job: Fatal write error on serial transport
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/serial/serialposix.py", line 621, in write
    n = os.write(self.fd, d)
        ^^^^^^^^^^^^^^^^^^^^
OSError: [Errno 5] I/O error

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/serial_asyncio/__init__.py", line 261, in _write_ready
    n = self._serial.write(data)
        ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/serial/serialposix.py", line 655, in write
    raise SerialException('write failed: {}'.format(e))
serial.serialutil.SerialException: write failed: [Errno 5] I/O error
2024-03-14 16:58:22.292 WARNING (MainThread) [homeassistant.helpers.system_info] No Home Assistant Supervisor info available
2024-03-14 16:58:34.998 WARNING (MainThread) [homeassistant.components.androidtv_remote] Disconnected from Living Room TV at 192.168.1.45
2024-03-14 16:58:35.134 ERROR (MainThread) [androidtvremote2] Device doesn't support sending keys. Try clearing the storage of the Android TV Remove Service system app. On the Android TV device, go to Settings > Apps > See all apps > Show system apps. Then, select Android TV Remote Service > Storage > Clear data/storage.
2024-03-14 16:58:51.141 WARNING (MainThread) [homeassistant.components.androidtv_remote] Disconnected from Living Room TV at 192.168.1.45
2024-03-14 16:58:51.277 ERROR (MainThread) [androidtvremote2] Device doesn't support sending keys. Try clearing the storage of the Android TV Remove Service system app. On the Android TV device, go to Settings > Apps > See all apps > Show system apps. Then, select Android TV Remote Service > Storage > Clear data/storage.
2024-03-14 16:59:07.285 WARNING (MainThread) [homeassistant.components.androidtv_remote] Disconnected from Living Room TV at 192.168.1.45
2024-03-14 16:59:07.415 ERROR (MainThread) [androidtvremote2] Device doesn't support sending keys. Try clearing the storage of the Android TV Remove Service system app. On the Android TV device, go to Settings > Apps > See all apps > Show system apps. Then, select Android TV Remote Service > Storage > Clear data/storage.
2024-03-14 16:59:23.420 WARNING (MainThread) [homeassistant.components.androidtv_remote] Disconnected from Living Room TV at 192.168.1.45
2024-03-14 16:59:23.553 ERROR (MainThread) [androidtvremote2] Device doesn't support sending keys. Try clearing the storage of the Android TV Remove Service system app. On the Android TV device, go to Settings > Apps > See all apps > Show system apps. Then, select Android TV Remote Service > Storage > Clear data/storage.

Additional information

Even while the Rainforest RAVEn integration is running normally, there are frequent occurrences of "Unavailable" states. This causes the history graph to have many gaps, and dashboards showing the EMU-2 status will often show "Unavailable" instead of the last known value. image

home-assistant[bot] commented 3 months ago

Hey there @cottsay, mind taking a look at this issue as it has been labeled with an integration (rainforest_raven) you are listed as a code owner for? Thanks!

Code owner commands Code owners of `rainforest_raven` can trigger bot actions by commenting: - `@home-assistant close` Closes the issue. - `@home-assistant rename Awesome new title` Renames the issue. - `@home-assistant reopen` Reopen the issue. - `@home-assistant unassign rainforest_raven` Removes the current integration label and assignees on the issue, add the integration domain after the command. - `@home-assistant add-label needs-more-information` Add a label (needs-more-information, problem in dependency, problem in custom component) to the issue. - `@home-assistant remove-label needs-more-information` Remove a label (needs-more-information, problem in dependency, problem in custom component) on the issue.

(message by CodeOwnersMention)


rainforest_raven documentation rainforest_raven source (message by IssueLinks)

cottsay commented 3 months ago

Thanks for the report, @rehanson. You mentioned that you're using a VM. How is the EMU-2 forwarded from the host to the VM? As a USB device or a serial device?

cottsay commented 3 months ago

Ah, I can see from the diagnostics that it appears in the VM as a USB device.

There were some handle leaks in 2024.2.0-2024.3.0 that I believe were causing a multitude of problems, including the gaps in the history. From what I can tell, the leaked handles would accumulate to the point where Home Assistant was no longer able to maintain a connection at all, resulting in bugs like #111886.

I also observed some problems in my VM after letting 2024.3.0 run to the point where the handle leaks saturated and needed to disconnect and reconnect the device, though I haven't been able to reproduce the problem now that the handle leaks are fixed. Please keep me updated with any changes in behavior.

rehanson commented 3 months ago

@cottsay I just updated to Home Assistant Core 2024.3.3 (from 2024.3.1). This update included the aioraven bump to 0.5.2. I had no issues getting stuck during the boot process after the update. And I haven't had any gaps in the history like I used to get. I've been running for ~3hrs with no issues yet. Before the update, I'd have about one "Unavailable" gap per minute.

cottsay commented 3 months ago

Thank you for reporting back. Glad to hear things are looking better.

We can keep this open for another update cycle or so, but if nobody is seeing it any longer then I'm willing to bet that my initial instinct was right and that qemu was misbehaving after the guest leaked all those unclosed handles. Seems a full VM shutdown or device disconnect are the only ways to get back to a good state when that happens.

It's been a day now. Any errors in the logs? Any chance #113539 stopped occurring too?

rehanson commented 3 months ago

In almost two days of running the latest update, I haven't had any gaps in my history! https://github.com/home-assistant/core/issues/113539 is still occurring however. I will add more comments over on that issue.

EDIT: There were actually four gaps. See a couple of comments down.

rehanson commented 3 months ago

Oh, and I haven't really rebooted enough times to see if the boot process still gets stuck, but based on your explanation, I doubt I will encounter it again.

rehanson commented 3 months ago

Now that I have looked closer at the logs, it did timeout four times (much less of an issue than before!) in the last two days. And I can see 30 second gaps in the history graph at those four points in time.

Logger: homeassistant.components.rainforest_raven.coordinator
Source: helpers/update_coordinator.py:324
integration: Rainforest RAVEn (documentation, issues)
First occurred: March 26, 2024 at 6:48:18 AM (4 occurrences)
Last logged: 2:07:33 AM

Timeout fetching rainforest_raven data

Here are those four occurrences:

2024-03-26 06:48:18.388 ERROR (MainThread) [homeassistant.components.rainforest_raven.coordinator] Timeout fetching rainforest_raven data
2024-03-26 09:59:53.361 ERROR (MainThread) [homeassistant.components.rainforest_raven.coordinator] Timeout fetching rainforest_raven data
2024-03-27 02:05:28.408 ERROR (MainThread) [homeassistant.components.rainforest_raven.coordinator] Timeout fetching rainforest_raven data
2024-03-27 02:07:33.359 ERROR (MainThread) [homeassistant.components.rainforest_raven.coordinator] Timeout fetching rainforest_raven data
cottsay commented 3 months ago

Those are probably legitimate timeouts now. There could be dropped or corrupted messages over the serial link or the EMU-2's wireless link to the meter. There's a lot that can go wrong here.

It's too bad that the data reflects the timeout with a gap like that. We don't want to present stale data, so the only thing I can think to do is to retry the query once or twice. Would it be worth it to add that logic? How much of a problem are the gaps in the data?

rehanson commented 3 months ago

With how rare the timeouts are now, it's not really a problem for me.

rehanson commented 3 months ago

I just ran into the stuck bootup issue again today during a reboot of my Proxmox server (Home Assistant runs as a VM on the server). I am running Core 2024.3.3

Logger: homeassistant.bootstrap
Source: bootstrap.py:601
First occurred: 1:59:34 PM (3 occurrences)
Last logged: 2:01:34 PM

Waiting on integrations to complete setup: rainforest_raven

I was only able to complete HA startup by unplugging the Rainforest EMU-2 meter.

cottsay commented 3 months ago

I just ran into the stuck bootup issue again today...

Rats, thanks for the feedback. I assume this was an update from 2024.3.3 to 2024.4.0?

rehanson commented 3 months ago

Nope, I hadn't yet updated to 2024.4.0 when it happened. (I am currently updating now)

issue-triage-workflows[bot] commented 1 week ago

There hasn't been any activity on this issue recently. Due to the high number of incoming GitHub notifications, we have to clean some of the old issues, as many of them have already been resolved with the latest updates. Please make sure to update to the latest Home Assistant version and check if that solves the issue. Let us know if that works for you by adding a comment 👍 This issue has now been marked as stale and will be closed if no further activity occurs. Thank you for your contributions.

rehanson commented 1 week ago

My system still hangs up occasionally when trying to boot. It shows a message about waiting for the RAVEn integration to start.

It most recently happened on 2024.6.4