home-assistant / core

:house_with_garden: Open source home automation that puts local control and privacy first.
https://www.home-assistant.io
Apache License 2.0
71.14k stars 29.81k forks source link

lutron_caseta BRIDGE_TIMEOUT too short #110067

Open jonoberheide opened 7 months ago

jonoberheide commented 7 months ago

The problem

Hi! I have a large Lutron QSX system. The default BRIDGE_TIMEOUT of 35 seconds in components/lutron_caseta/const.py is not sufficient to perform all the required LEAP queries to connect to the bridge and parse all the response data.

A couple ideas for resolution:

(1) Increase the timeout.

(2) Make the timeout configurable during integration setup, so that it's easier to tweak for large systems.

(3) Separate the connect timeout from a retrieval/parsing timeout to help debug initial connection attempt quickly (network-level connectivity, proper Lutron creds, etc) while providing a more generous timeout for the second part.

Naively editing and increasing the const.py timeout on the filesystem allowed the integration to load up successfully!

Also, a PR for API paging support in the upstream pylutron_caseta package is also needed to integrate: https://github.com/gurumitts/pylutron-caseta/pull/161

What version of Home Assistant Core has the issue?

2024.2.0

What was the last working version of Home Assistant Core?

No response

What type of installation are you running?

Home Assistant OS

Integration causing the issue

lutron_caseta

Link to integration documentation on our website

https://www.home-assistant.io/integrations/lutron_caseta/

Diagnostics information

No response

Example YAML snippet

No response

Anything in the logs that might be useful for us?

No response

Additional information

No response

home-assistant[bot] commented 7 months ago

Hey there @swails, @bdraco, @danaues, mind taking a look at this issue as it has been labeled with an integration (lutron_caseta) you are listed as a code owner for? Thanks!

Code owner commands Code owners of `lutron_caseta` can trigger bot actions by commenting: - `@home-assistant close` Closes the issue. - `@home-assistant rename Awesome new title` Renames the issue. - `@home-assistant reopen` Reopen the issue. - `@home-assistant unassign lutron_caseta` Removes the current integration label and assignees on the issue, add the integration domain after the command. - `@home-assistant add-label needs-more-information` Add a label (needs-more-information, problem in dependency, problem in custom component) to the issue. - `@home-assistant remove-label needs-more-information` Remove a label (needs-more-information, problem in dependency, problem in custom component) on the issue.

(message by CodeOwnersMention)


lutron_caseta documentation lutron_caseta source (message by IssueLinks)

bdraco commented 7 months ago

(3) Separate the connect timeout from a retrieval/parsing timeout to help debug initial connection attempt quickly (network-level connectivity, proper Lutron creds, etc) while providing a more generous timeout for the second part.

I think this makes the most sense. I added the timeout in #45769 because it would block startup if the bridge was offline. Its actually quite a bit higher than we would like as HA will throw a warning if it takes more than 10s to startup the integration.

We probably only need to wait ~5-6s (maybe 8-9 even) for the connection to establish, and than it would be fine to wait ~50s for the communication (I hope it wouldn't ever take longer than that) as I believe there is another warning that kicks in at 60s.

That's going to require a change to pylutron_caseta as well. Consider this comment a ๐Ÿ‘ for accepting a PR in HA to that effect.

bdraco commented 6 months ago

There is a chance that the cause was the event loop being blocked if the system was I/O or CPU bound loading python code. We load more code in the executor in 2024.3.x so there is a chance it won't be needed to increase the timeout anymore.

jonoberheide commented 6 months ago

@bdraco Sounds good. I will try with 2024.3 when it's out.

jonoberheide commented 6 months ago

@bdraco FWIW, 2024.3 unfortunately does not resolve the issue.

bdraco commented 6 months ago

We found a lot more places where integrations were doing blocking I/O in the event loop in the last few days. 2024.4 will fix a lot more of those places so if its an event loop being blocked problem, it will probably be solved. If the device actually does take that long to give back the data, it will need a code change.

jonoberheide commented 6 months ago

I'm assuming the latter, as there are loads of LEAP calls being made to parse the whole QSX home. There's not much else going on in my HA and it's on pretty beefy hardware.

Just pulling a quick debug log while "reloading" the integration, it looks like the initialization takes about 42 seconds and is making ~800 LEAP calls to the QSX processor.

Maybe bumping the timeout to 60 is naive, but might make this go away? ๐Ÿ˜†

bdraco commented 6 months ago

Another thing is that the pylutroncaseta library uses the built-in, json parser, which is very very slow. We could switch it to use orjson like everything in Home Assistant does. But I'm speculating on what the problem is.

jonoberheide commented 6 months ago

Oh I really like that idea. I'll try that and see what impact it has.

jonoberheide commented 6 months ago

Eh, looks like little to no impact. I suppose I shouldn't be surprised as it's far from a CPU-bound task with all that network activity.

bdraco commented 6 months ago

I haven't verified this but it looks like the design of the library currently isn't very asynchronous, and it could probably be refactored to use its own asyncio protocol instead of using streams and be so much faster

bdraco commented 6 months ago

The streams do add a lot of overhead but I don't think thats the problem

bdraco commented 6 months ago

The whole login process has quite a sync design. I think we could send all the requests and wait for them to come back instead

bdraco commented 6 months ago

https://github.com/gurumitts/pylutron-caseta/pull/163 might solve the issue but I don't have any of the newer Lutron stuff. It works great with my Caseta and RA2 select systems

bdraco commented 6 months ago

All the tests are likely failing on that PR as they assume everything comes in synchronously

issue-triage-workflows[bot] commented 3 months ago

There hasn't been any activity on this issue recently. Due to the high number of incoming GitHub notifications, we have to clean some of the old issues, as many of them have already been resolved with the latest updates. Please make sure to update to the latest Home Assistant version and check if that solves the issue. Let us know if that works for you by adding a comment ๐Ÿ‘ This issue has now been marked as stale and will be closed if no further activity occurs. Thank you for your contributions.

jonoberheide commented 3 months ago

Still an active issue