msvisser / remeha_home

Remeha Home integration for Home Assistant

Flaky API Responses #44

Closed justanotherariel closed 9 months ago

justanotherariel commented 9 months ago

First of all, thank you for the integration and for going through all this effort to sniff out the API calls.

Setup

Issue

I have an issue regarding unavailability. Every once in a while, I notice that my Remeha climate entity becomes unavailable. I recently set up a template sensor that tracks when the boiler is heating (to ensure that the schedule and set temperature are working correctly), which made this problem even more apparent. Looking at the logs, approximately once every 5 minutes, my Remeha entity becomes unavailable for about a minute. This doesn't make the integration unusable, but it is annoying.

Log

This is the log output with debug mode enabled:

2023-11-29 21:47:19.575 DEBUG (MainThread) [custom_components.remeha_home.coordinator] Requested technical information for appliance 0d83566f-3851-4451-b269-1148ccbd69cc: {'applianceName': 'Calenta Ace', 'internetConnectedGateways': [{'name': 'RU2', 'serialNumber': '204304B8', 'softwareVersion': '2.23.5', 'hardwareVersion': '7', 'controlledZoneName': None}]}
2023-11-29 21:47:19.575 DEBUG (MainThread) [custom_components.remeha_home.coordinator] Finished fetching remeha_home data in 2.034 seconds (success: True)
2023-11-29 21:47:19.576 INFO (MainThread) [homeassistant.components.binary_sensor] Setting up binary_sensor.remeha_home
2023-11-29 21:47:19.577 INFO (MainThread) [homeassistant.components.climate] Setting up climate.remeha_home
2023-11-29 21:47:19.577 INFO (MainThread) [homeassistant.components.sensor] Setting up sensor.remeha_home
2023-11-29 21:47:19.578 INFO (MainThread) [homeassistant.components.switch] Setting up switch.remeha_home
2023-11-29 21:48:19.556 DEBUG (MainThread) [custom_components.remeha_home.coordinator] Finished fetching remeha_home data in 0.407 seconds (success: True)
2023-11-29 21:49:19.639 DEBUG (MainThread) [custom_components.remeha_home.coordinator] Finished fetching remeha_home data in 0.490 seconds (success: True)
2023-11-29 21:50:19.440 DEBUG (MainThread) [custom_components.remeha_home.coordinator] Finished fetching remeha_home data in 0.291 seconds (success: True)
2023-11-29 21:51:19.622 DEBUG (MainThread) [custom_components.remeha_home.coordinator] Finished fetching remeha_home data in 0.473 seconds (success: True)
2023-11-29 21:52:20.071 DEBUG (MainThread) [custom_components.remeha_home.coordinator] Finished fetching remeha_home data in 0.922 seconds (success: True)
2023-11-29 21:53:20.671 DEBUG (MainThread) [custom_components.remeha_home.coordinator] Finished fetching remeha_home data in 0.522 seconds (success: True)
2023-11-29 21:54:38.166 ERROR (MainThread) [custom_components.remeha_home.coordinator] Error fetching remeha_home data: 
2023-11-29 21:54:38.166 DEBUG (MainThread) [custom_components.remeha_home.coordinator] Finished fetching remeha_home data in 18.017 seconds (success: False)
2023-11-29 21:56:00.727 INFO (MainThread) [custom_components.remeha_home.coordinator] Fetching remeha_home data recovered
2023-11-29 21:56:00.727 DEBUG (MainThread) [custom_components.remeha_home.coordinator] Finished fetching remeha_home data in 22.578 seconds (success: True)
2023-11-29 21:57:16.619 ERROR (MainThread) [custom_components.remeha_home.coordinator] Error fetching remeha_home data: 
2023-11-29 21:57:16.619 DEBUG (MainThread) [custom_components.remeha_home.coordinator] Finished fetching remeha_home data in 16.471 seconds (success: False)
2023-11-29 21:58:46.150 DEBUG (MainThread) [custom_components.remeha_home.coordinator] Finished fetching remeha_home data in 30.002 seconds (success: False)
2023-11-29 21:59:46.481 INFO (MainThread) [custom_components.remeha_home.coordinator] Fetching remeha_home data recovered
2023-11-29 21:59:46.481 DEBUG (MainThread) [custom_components.remeha_home.coordinator] Finished fetching remeha_home data in 0.333 seconds (success: True)
2023-11-29 22:01:03.196 ERROR (MainThread) [custom_components.remeha_home.coordinator] Error fetching remeha_home data: 
2023-11-29 22:01:03.196 DEBUG (MainThread) [custom_components.remeha_home.coordinator] Finished fetching remeha_home data in 17.047 seconds (success: False)
2023-11-29 22:02:05.257 INFO (MainThread) [custom_components.remeha_home.coordinator] Fetching remeha_home data recovered
2023-11-29 22:02:05.257 DEBUG (MainThread) [custom_components.remeha_home.coordinator] Finished fetching remeha_home data in 2.109 seconds (success: True)
2023-11-29 22:03:05.669 DEBUG (MainThread) [custom_components.remeha_home.coordinator] Finished fetching remeha_home data in 0.520 seconds (success: True)
2023-11-29 22:03:40.963 DEBUG (MainThread) [custom_components.remeha_home.climate] Setting temperature to 14.500000
2023-11-29 22:04:05.397 DEBUG (MainThread) [custom_components.remeha_home.coordinator] Finished fetching remeha_home data in 23.678 seconds (success: True)
2023-11-29 22:05:21.300 ERROR (MainThread) [custom_components.remeha_home.coordinator] Error fetching remeha_home data: 
2023-11-29 22:05:21.301 DEBUG (MainThread) [custom_components.remeha_home.coordinator] Finished fetching remeha_home data in 16.153 seconds (success: False)
2023-11-29 22:06:22.381 INFO (MainThread) [custom_components.remeha_home.coordinator] Fetching remeha_home data recovered
2023-11-29 22:06:22.382 DEBUG (MainThread) [custom_components.remeha_home.coordinator] Finished fetching remeha_home data in 1.233 seconds (success: True)

Conclusion

Sadly, the log isn't verbose enough to tell me exactly what went wrong. I assume some kind of timeout, since the fetch only fails after quite a long time (15 s or more), although the timeout for self.api.async_get_dashboard() seems to be set to 30 s, so I'm not sure.

Please let me know how I can debug this problem further. Perhaps you already know why this is happening to me.

kayaozalp commented 9 months ago

I have been having similar issues for over a week now. Yesterday it got to the point that the integration was reset and asked me to log in again.

From this point onwards, it is my speculation:

Given the cold weather and the need for reliability, and while I appreciate this integration a lot, I will go a different route: an [OTGW] OpenTherm Gateway. I have purchased and set it up, but haven't installed it yet; I'm waiting for a break from the cold, or a full break of the integration, to make the switch.

Hope we get to the bottom of this issue.

Stay warm!

justanotherariel commented 9 months ago

That would certainly be the ideal solution, but I don't want to spend the time and money on a setup I'd have to leave behind in a year (I'm only renting this apartment). But it's good to know! Who knows, my next apartment may have a similar setup.

Regarding your speculation: yeah, that would make sense. I don't mind it being slow, to be honest, but I would like it to be robust. Maybe the backend could try to contact the API server and, if it doesn't receive anything within a certain time span, fall back to the last known values? For transparency, we could add a sensor showing the last time a connection was successful, and only mark the entity as unavailable after a significant amount of time has passed (say, 10 minutes). That would pretty much solve my issue.
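The "keep the last values for a grace period" idea could look roughly like this. This is only a sketch of the concept, not code from the integration; the class name `StaleDataCache` and the injectable clock are my own inventions, and the 10-minute grace period is the value suggested above:

```python
import time


class StaleDataCache:
    """Hold the last successful API result and only report
    'unavailable' once a grace period without success has passed."""

    def __init__(self, grace_seconds=600, clock=time.monotonic):
        self._grace = grace_seconds
        self._clock = clock  # injectable so the behavior is testable
        self._data = None
        self._last_success = None

    def update_success(self, data):
        """Record a successful fetch and its timestamp."""
        self._data = data
        self._last_success = self._clock()

    @property
    def data(self):
        """Last known good data, even if it is stale."""
        return self._data

    @property
    def available(self):
        """True while the last success is within the grace period."""
        if self._last_success is None:
            return False
        return self._clock() - self._last_success < self._grace
```

An entity could then keep serving `cache.data` during short API hiccups and only flip to unavailable when `cache.available` goes false.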

justanotherariel commented 9 months ago

So I looked into it and I'm getting Rate Limited.

Status: 429, Reason: Too Many Requests
Content: {'statusCode': 429, 'message': 'Rate limit is exceeded. Try again in 1 seconds.'}

I solved it for now by increasing the update interval to 3 minutes. This isn't a permanent solution, though: the developers could tighten the rate limiting again, so ideally api.py should recognize the rate limiting by itself and adapt the interval dynamically.
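Adapting the interval dynamically could be as simple as backing off whenever a 429 is seen and creeping back toward the normal rate once requests succeed again. A minimal sketch; the function name and the second values are illustrative defaults, not anything from the integration:

```python
def next_update_interval(current, got_rate_limited,
                         minimum=60, maximum=600):
    """Return the next polling interval in seconds.

    Doubles the interval after a 429 (capped at `maximum`),
    and slowly shortens it again after successful requests.
    """
    if got_rate_limited:
        return min(current * 2, maximum)   # exponential backoff
    return max(current - 30, minimum)      # gentle recovery
```

The coordinator would feed the result back into its `update_interval` after every fetch.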

Also, if api.py is being rate limited and one tries to set the temperature on the climate entity, it fails badly. First, the handling crashes (sadly I forgot to save the log), spamming the main.log with crash traces. Secondly, it retries after only a couple of milliseconds instead of waiting for a longer period.

As you can also see in the log above, the API returns the time until we can try again. I'm not sure how accurate it is, but it would make sense to wait at least as long as it suggests.
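Extracting that hint could be done by parsing the 429 response body. Note that the message format ("Try again in 1 seconds.") is only an observation from my log, not a documented contract, so this sketch falls back to a safe default if it ever changes; the function name and fallback value are my own:

```python
import re


def parse_retry_after(body, fallback=60.0):
    """Pull the suggested wait time (in seconds) out of a 429 body like
    {'statusCode': 429, 'message': 'Rate limit is exceeded. Try again in 1 seconds.'}

    Returns `fallback` when the message doesn't match the observed format.
    """
    match = re.search(r"Try again in (\d+) seconds?", body.get("message", ""))
    return float(match.group(1)) if match else fallback
```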

I have looked into the code, and honestly I don't know how I would implement this with the DataUpdateCoordinator. It doesn't seem possible to easily "skip" an update (which we would need to do when dealing with rate limits), as we always need to return an object with the relevant updated information; not doing so would update all entities with empty data. Fixing this issue would therefore require a different approach, probably self-managed updates instead of Home Assistant polling.
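One possible workaround that stays within the coordinator pattern: on a rate-limit error, return the previous data instead of raising, which effectively skips the update without pushing empty data to the entities. This is a deliberately simplified, synchronous sketch (the real `_async_update_data` is async and lives inside Home Assistant); `RateLimited` and `fetch_with_skip` are stand-in names of my own:

```python
class RateLimited(Exception):
    """Stand-in for a 429 response from the API."""


def fetch_with_skip(fetch, previous):
    """Simplified version of what an update method could do:
    on a rate limit, hand back the previous snapshot so entities
    keep their last values instead of receiving empty data."""
    try:
        return fetch()
    except RateLimited:
        if previous is None:
            raise  # nothing cached yet; let the failure propagate
        return previous
```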

msvisser commented 9 months ago

Unfortunately, this seems to be the reality when dealing with an API that was never really meant for this use. I'm personally also experiencing timeouts now and then, but I also see it work for days without any errors. So I don't think we are consistently exceeding the rate limit, but it's still not great. As for dealing with the timeouts, it becomes quite complex, as Home Assistant doesn't really facilitate this, like you mentioned.

I have not seen any crashes like you mentioned, but if you can collect a trace it would be very helpful in solving those problems.

justanotherariel commented 9 months ago

After debugging a bit more, I found out that an integration I was using (Better Thermostat) made a bunch of unnecessary requests, which drove up the API usage. I have disabled it now, which seems to have fixed the issue, so I will close this issue. Nevertheless, the robustness of this integration is something to keep an eye on.

kayaozalp commented 9 months ago

For what it's worth, I don't think it was your other app. Probably when winter hit, they weren't really scaled to handle the load, so the first thing they did was implement a rate limit. I suspect they have since beefed up their infrastructure and, to reduce the number of rejected calls, increased the timeout as well.