Open jonoberheide opened 7 months ago
Hey there @swails, @bdraco, @danaues, mind taking a look at this issue as it has been labeled with an integration (lutron_caseta
) you are listed as a code owner for? Thanks!
(message by CodeOwnersMention)
lutron_caseta documentation lutron_caseta source (message by IssueLinks)
(3) Separate the connect timeout from a retrieval/parsing timeout to help debug initial connection attempt quickly (network-level connectivity, proper Lutron creds, etc) while providing a more generous timeout for the second part.
I think this makes the most sense. I added the timeout in #45769 because it would block startup if the bridge was offline. Its actually quite a bit higher than we would like as HA will throw a warning if it takes more than 10s to startup the integration.
We probably only need to wait ~5-6s (maybe 8-9 even) for the connection to establish, and than it would be fine to wait ~50s for the communication (I hope it wouldn't ever take longer than that) as I believe there is another warning that kicks in at 60s.
That's going to require a change to pylutron_caseta
as well. Consider this comment a ๐ for accepting a PR in HA to that effect.
There is a chance that the cause was the event loop being blocked if the system was I/O or CPU bound loading python code. We load more code in the executor in 2024.3.x so there is a chance it won't be needed to increase the timeout anymore.
@bdraco Sounds good. I will try with 2024.3 when it's out.
@bdraco FWIW, 2024.3 unfortunately does not resolve the issue.
We found a lot more places where integrations were doing blocking I/O in the event loop in the last few days. 2024.4 will fix a lot more of those places so if its an event loop being blocked problem, it will probably be solved. If the device actually does take that long to give back the data, it will need a code change.
I'm assuming the latter, as there are loads of LEAP calls being made to parse the whole QSX home. There's not much else going on in my HA and it's on pretty beefy hardware.
Just pulling a quick debug log while "reloading" the integration, it looks like the initialization takes about 42 seconds and is making ~800 LEAP calls to the QSX processor.
Maybe bumping the timeout to 60 is naive, but might make this go away? ๐
Another thing is that the pylutroncaseta library uses the built-in, json parser, which is very very slow. We could switch it to use orjson like everything in Home Assistant does. But I'm speculating on what the problem is.
Oh I really like that idea. I'll try that and see what impact it has.
Eh, looks like little to no impact. I suppose I shouldn't be surprised as it's far from a CPU-bound task with all that network activity.
I haven't verified this but it looks like the design of the library currently isn't very asynchronous, and it could probably be refactored to use its own asyncio protocol instead of using streams and be so much faster
The streams do add a lot of overhead but I don't think thats the problem
The whole login process has quite a sync design. I think we could send all the requests and wait for them to come back instead
https://github.com/gurumitts/pylutron-caseta/pull/163 might solve the issue but I don't have any of the newer Lutron stuff. It works great with my Caseta and RA2 select systems
All the tests are likely failing on that PR as they assume everything comes in synchronously
There hasn't been any activity on this issue recently. Due to the high number of incoming GitHub notifications, we have to clean some of the old issues, as many of them have already been resolved with the latest updates. Please make sure to update to the latest Home Assistant version and check if that solves the issue. Let us know if that works for you by adding a comment ๐ This issue has now been marked as stale and will be closed if no further activity occurs. Thank you for your contributions.
Still an active issue
The problem
Hi! I have a large Lutron QSX system. The default BRIDGE_TIMEOUT of 35 seconds in components/lutron_caseta/const.py is not sufficient to perform all the required LEAP queries to connect to the bridge and parse all the response data.
A couple ideas for resolution:
(1) Increase the timeout.
(2) Make the timeout configurable during integration setup, so that it's easier to tweak for large systems.
(3) Separate the connect timeout from a retrieval/parsing timeout to help debug initial connection attempt quickly (network-level connectivity, proper Lutron creds, etc) while providing a more generous timeout for the second part.
Naively editing and increasing the const.py timeout on the filesystem allowed the integration to load up successfully!
Also, a PR for API paging support in the upstream pylutron_caseta package is also needed to integrate: https://github.com/gurumitts/pylutron-caseta/pull/161
What version of Home Assistant Core has the issue?
2024.2.0
What was the last working version of Home Assistant Core?
No response
What type of installation are you running?
Home Assistant OS
Integration causing the issue
lutron_caseta
Link to integration documentation on our website
https://www.home-assistant.io/integrations/lutron_caseta/
Diagnostics information
No response
Example YAML snippet
No response
Anything in the logs that might be useful for us?
No response
Additional information
No response