home-assistant / core

:house_with_garden: Open source home automation that puts local control and privacy first.
https://www.home-assistant.io
Apache License 2.0

Homekit devices (Meross Thermostat MTS200) not responding anymore #116143

Open LeoCal opened 1 week ago

LeoCal commented 1 week ago

The problem

After upgrading to 2024.4.4, all my Meross Thermostat MTS200 integrated into HA via Homekit Device integration stopped working. From the logs, it seems they are not responding anymore. I've contacted Meross assistance to find ways to recover them and I executed a reboot of the devices as they suggested, but that did not work. I'm now considering using the Meross LAN integration from HACS, but I think it would be great to fix this Homekit Device problem if it's an issue inside the HA integration.

What version of Home Assistant Core has the issue?

core-2024.4.4

What was the last working version of Home Assistant Core?

core-2024.4.3

What type of installation are you running?

Home Assistant OS

Integration causing the issue

HomeKit Device

Link to integration documentation on our website

https://www.home-assistant.io/integrations/homekit_controller

Diagnostics information

home-assistant_homekit_controller_2024-04-25T05-50-45.170Z.log

Example YAML snippet

N/A, provisioned via UI.

Anything in the logs that might be useful for us?

2024-04-25 00:32:43.360 ERROR (MainThread) [aiohomekit.controller.ip.connection] MTS200-f759 [['192.168.0.90']:52432] (id=3B:D2:43:2D:67:FD): Unexpected error whilst trying to connect to accessory. Will retry.
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/aiohomekit/controller/ip/connection.py", line 639, in _reconnect
    return await self._connect_once()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/aiohomekit/controller/ip/connection.py", line 735, in _connect_once
    await super()._connect_once()
  File "/usr/local/lib/python3.12/site-packages/aiohomekit/controller/ip/connection.py", line 603, in _connect_once
    connected_host = sock.getpeername()[0]
                     ^^^^^^^^^^^^^^^^^^
OSError: [Errno 107] Socket not connected
2024-04-25 00:32:43.377 ERROR (MainThread) [aiohomekit.controller.ip.connection] MTS200-fb00 [['192.168.0.91']:52432] (id=3E:B6:5D:25:83:DF): Unexpected error whilst trying to connect to accessory. Will retry.
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/aiohomekit/controller/ip/connection.py", line 639, in _reconnect
    return await self._connect_once()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/aiohomekit/controller/ip/connection.py", line 735, in _connect_once
    await super()._connect_once()
  File "/usr/local/lib/python3.12/site-packages/aiohomekit/controller/ip/connection.py", line 603, in _connect_once
    connected_host = sock.getpeername()[0]
                     ^^^^^^^^^^^^^^^^^^
OSError: [Errno 107] Socket not connected
2024-04-25 00:32:59.860 WARNING (MainThread) [homeassistant.helpers.service] Referenced entities climate.termostato_soggiorno_thermostat are missing or not currently available
2024-04-25 00:33:01.078 WARNING (MainThread) [homeassistant.helpers.service] Referenced entities climate.termostato_soggiorno_thermostat are missing or not currently available
2024-04-25 00:33:01.589 WARNING (MainThread) [homeassistant.helpers.service] Referenced entities climate.termostato_soggiorno_thermostat are missing or not currently available
2024-04-25 00:33:01.807 WARNING (MainThread) [homeassistant.helpers.service] Referenced entities climate.termostato_soggiorno_thermostat are missing or not currently available
2024-04-25 00:33:01.993 WARNING (MainThread) [homeassistant.helpers.service] Referenced entities climate.termostato_soggiorno_thermostat are missing or not currently available

Additional information

No response

home-assistant[bot] commented 1 week ago

Hey there @jc2k, @bdraco, mind taking a look at this issue as it has been labeled with an integration (homekit_controller) you are listed as a code owner for? Thanks!

Code owner commands

Code owners of `homekit_controller` can trigger bot actions by commenting:

- `@home-assistant close` Closes the issue.
- `@home-assistant rename Awesome new title` Renames the issue.
- `@home-assistant reopen` Reopen the issue.
- `@home-assistant unassign homekit_controller` Removes the current integration label and assignees on the issue, add the integration domain after the command.
- `@home-assistant add-label needs-more-information` Add a label (needs-more-information, problem in dependency, problem in custom component) to the issue.
- `@home-assistant remove-label needs-more-information` Remove a label (needs-more-information, problem in dependency, problem in custom component) on the issue.

(message by CodeOwnersMention)


homekit_controller documentation homekit_controller source (message by IssueLinks)

Jc2k commented 1 week ago

There were no HomeKit changes in that tag.

Can you downgrade and verify it starts working again?

LeoCal commented 1 week ago

Right, I’ve seen that there have been no changes in HomeKit recently.

However, could it be a latent issue in the code that manifested just recently?

Any clue from the logs I’ve gathered?

Jc2k commented 1 week ago

That's why I want you to downgrade, to make sure I don't waste time bisecting if it has actually been wonky for a while and you just got lucky until now.

At a glance (won't be near a computer to look properly for a week) the errors look like it's a problem with the device. The entire TCP connection was dropped shortly after the connection started, before we even tried to talk to the device.
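For reference, the `OSError` in the tracebacks above is what Python raises when `getpeername()` is called on a socket that has no connected peer. The snippet below is only a minimal sketch of that behaviour, not aiohomekit code; the helper name `peer_or_none` is made up for illustration.

```python
import errno
import socket
from typing import Optional


def peer_or_none(sock: socket.socket) -> Optional[str]:
    """Return the peer's IP address, or None if the socket has no peer.

    Hypothetical helper for illustration; aiohomekit does not define it.
    """
    try:
        return sock.getpeername()[0]
    except OSError as err:
        # ENOTCONN is errno 107 on Linux, the same error seen in the logs.
        if err.errno == errno.ENOTCONN:
            return None
        raise


# A socket that was never connected has no peer, much like one whose
# connection was dropped before the caller asked for the peer address.
unconnected = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
result = peer_or_none(unconnected)
unconnected.close()
```

Here `result` ends up as `None` instead of the call raising, which is one way a caller could treat "connection already gone" as a normal retry case.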

LeoCal commented 1 week ago

Ok, I just got home and I did the downgrade to 2024.4.3 as you suggested. I can confirm that my devices integrated into HA via "Homekit Devices" integration still have the same issue connecting to HA; so I guess it might be a latent problem that just decided to show up in 2024.4.4. Connection logs were exactly the same as the one I posted earlier, so I didn't bother gathering new ones before I upgraded back to 2024.4.4. Please let me know if you manage to understand from the logs what's going on here.

bdraco commented 1 week ago

I have this same thermostat. I have to power cycle it about every 6 months to keep it working.

LeoCal commented 1 week ago

Oh, what a pain... do you think that's a problem with the device's implementation of Homekit or rather an issue with HA "Homekit Devices" integration?

For now, I've just disabled them all from "Homekit Devices" and I've moved to use the "Meross LAN" integration from HACS - let's see if this works any better.

bdraco commented 1 week ago

It did the same thing when I had it paired using iOS directly

LeoCal commented 1 week ago

Thanks for your feedback on this, I smell an issue with the Homekit implementation inside the device itself then.

FWIW, in the meantime everything seems to be working smoothly with "Meross LAN" integration, which is using local HTTP communication to the thermostat (as a backup strategy, they can also communicate via MQTT, either in Meross Cloud or local). It would be great to have this component integrated in HA itself if it proves to be stable enough in the long term.

bdraco commented 1 week ago

My experience is that the HomeKit implementation on the device crashes but the Meross api keeps working

r1si commented 1 week ago

I have the same problem with bticino gateway

r1si commented 1 week ago

OK, fixed. In my case it was a problem with my Amazon eero update: while an update is pending, it closes all non-standard ports. In general, I think you can try rebooting your router, device and hub.

One important thing: do not remove the HomeKit connection if it doesn't work. I spent the last 3 hours restoring from a backup.

bdraco commented 1 week ago

So it seems likely that the device leaks memory every time we poll it.

I think we can skip polling in accessory mode if all characteristics are watchable and the device is reachable via zeroconf.
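The condition described above can be sketched roughly as follows. This is an illustration only, not the code from the integration: the `Characteristic` class, its `supports_events` field, and the `should_poll` function are all hypothetical names.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Characteristic:
    """Hypothetical stand-in for a HomeKit characteristic."""

    name: str
    supports_events: bool  # True if the accessory can push changes ("watchable")


def should_poll(
    characteristics: Iterable[Characteristic], reachable_via_zeroconf: bool
) -> bool:
    """Poll unless every characteristic is watchable and the device is reachable."""
    all_watchable = all(c.supports_events for c in characteristics)
    # If anything cannot push updates, or the device is not seen on the
    # network, polling is still the only way to learn the current state.
    return not (all_watchable and reachable_via_zeroconf)


thermostat = [
    Characteristic("current-temperature", supports_events=True),
    Characteristic("target-temperature", supports_events=True),
]
skip = not should_poll(thermostat, reachable_via_zeroconf=True)
```

With both characteristics watchable and the device reachable, `skip` is `True`, so no poll request would be sent to the accessory.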

LeoCal commented 1 week ago

Very interesting, as I just got a response from Meross support saying they think it's an issue with Apple's HomeKit. I'm pretty sure it's the homekit implementation on the MTS200 instead, but I'd need to back this hypothesis with data.

How did you check the device is leaking memory? Did you find a way to SSH into the device? Would you mind sharing how I can access the device? (or any link, wiki, etc)

My idea is to respond to the Meross support asking to involve the engineering team so that they can fix the actual code on their product.

bdraco commented 1 week ago

> I'm pretty sure it's the homekit implementation on the MTS200 instead, but I'd need to back this hypothesis with data.
>
> How did you check the device is leaking memory?

It's behaving like a classic memory leak, but that's just an assumption that I can't verify, since there doesn't appear to be a way to access the device stack without hacking the firmware.

Mine runs for about 2-3 months, then the device will still ping, but the homekit webserver gives connection refused and stops responding until I power cycle it. I can still control it in the meross app, but the homekit functionality is dead until I flip the breaker and restart the device.
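Anyone who wants to check for the same symptom (device answers ping, HomeKit port refuses connections) could use a plain TCP probe like the sketch below. It uses only the standard library; the function name `homekit_port_state` and the three return labels are made up for this example.

```python
import socket


def homekit_port_state(host: str, port: int, timeout: float = 3.0) -> str:
    """Classify a TCP port as "open", "refused", or "unreachable".

    Hypothetical helper: "refused" means the host answered but nothing is
    listening on the port, which matches the symptom described above.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except OSError:
        # Covers timeouts, no route to host, and similar network errors.
        return "unreachable"
```

For example, `homekit_port_state("192.168.0.90", 52432)` would probe the address and port that appear in the logs from the first post. A result of `"refused"` while the device still pings would point at the HomeKit server on the device rather than at the network.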

bdraco commented 1 week ago

I expect each poll request that Home Assistant does gets it closer to crashing and it takes about 3 months to run out of memory (or whatever other resource is leaking). https://github.com/home-assistant/core/pull/116200 will reduce the polling, so it will probably take a lot longer for the HomeKit stack on the device to crash, but it won't fix the underlying problem in the device firmware.

bdraco commented 1 week ago

Sorry, it looks like we need to do some significant refactoring to have a way to check if the A/AAAA records are still alive before we can come up with a solution here.

I closed https://github.com/home-assistant/core/pull/116200 because it was discovered that our current implementation of async_find caches the discovery forever. So while the PR would probably solve the issue here, it would cause a regression where we never see the device as offline until you try to interact with it.
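To illustrate why caching a discovery forever hides an offline device, here is a generic sketch. It is not how `async_find` or zeroconf is implemented; the `DiscoveryCache` class, its `ttl` parameter, and the injectable `clock` are all assumptions made for the example.

```python
import time
from typing import Callable, Dict, Optional


class DiscoveryCache:
    """Hypothetical cache of "last seen" times for discovered devices."""

    def __init__(
        self, ttl: Optional[float], clock: Callable[[], float] = time.monotonic
    ) -> None:
        self._ttl = ttl  # None means "cache forever"
        self._clock = clock
        self._seen: Dict[str, float] = {}

    def record(self, device_id: str) -> None:
        """Remember that the device was just seen on the network."""
        self._seen[device_id] = self._clock()

    def is_reachable(self, device_id: str) -> bool:
        seen_at = self._seen.get(device_id)
        if seen_at is None:
            return False
        if self._ttl is None:
            # Without expiry, a device seen once looks reachable forever.
            return True
        return self._clock() - seen_at <= self._ttl
```

With `ttl=None`, a device that was discovered once and then lost power would still be reported as reachable, so polling would be skipped and it would never be marked offline. With a finite `ttl`, the entry ages out unless the device keeps being rediscovered.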

bdraco commented 1 week ago

I started working on a new approach in https://github.com/Jc2k/aiohomekit/pull/370

bdraco commented 1 week ago
Apr 26 16:57:50 homeassistant homeassistant[557]: 2024-04-26 11:57:50.534 DEBUG (MainThread) [homeassistant.components.homekit_controller.connection] Accessory is reachable, skip polling: 59:10:FD:E7:E1:0C
Apr 26 16:57:50 homeassistant homeassistant[557]: 2024-04-26 11:57:50.592 DEBUG (MainThread) [homeassistant.components.homekit_controller.connection] Accessory is reachable, skip polling: D7:F7:1B:0B:54:F0
Apr 26 16:57:50 homeassistant homeassistant[557]: 2024-04-26 11:57:50.680 DEBUG (MainThread) [homeassistant.components.homekit_controller.connection] Accessory is reachable, skip polling: 52:70:89:CC:35:98
Apr 26 16:57:51 homeassistant homeassistant[557]: 2024-04-26 11:57:51.029 DEBUG (MainThread) [homeassistant.components.homekit_controller.connection] Accessory is reachable, skip polling: 6F:C4:03:33:72:37

The new version still fixes the issue.