krahabb / meross_lan

Home Assistant integration for Meross devices
MIT License

Meross Devices Constantly Becoming Unavailable, Setup With MQTT via Custom Pairer App, Hacks Device Key #206

Open timnolte opened 2 years ago

timnolte commented 2 years ago

Version of the custom_component

v2.6.1

Configuration

I am using the Mosquitto MQTT Broker, which I use along with Zigbee2MQTT as well. I have a unique HA user setup based on the MAC address of each device and I've configured this with each device when using the Android Custom Pairer app.

Describe the bug

As visible in the History/Logbook, I can see the devices constantly going unavailable, then coming back online. This seems to have gotten worse with each new release of HA since the 2022.8 releases. While trying to debug this I can't even get/download the diagnostics on any of the Meross devices.

My devices that were successfully set up via MQTT do not seem to also be connected via HTTP. I am unable to change the Connection Protocol to anything other than MQTT for some of the devices, as there is no Host Address and also no Device Key (since they were configured with the Hacks Mode), and there seems to be no way to correct this either. I have a few devices that are connected only via HTTP, which are HomeKit compatible devices; those apparently can't communicate via MQTT, and they also become unavailable.

I didn't want any of my devices registered through the Meross Cloud, which is why I went the route of configuring them through the Custom Pairer App, but I still can't seem to find the right setup that gets my devices connected via both HTTP & MQTT.

I'm really at a loss as to what to do other than remove all of my Meross devices, and the integration, and start all over again with no choice but to have my devices all connected to the Meross Cloud.

Debug log

Trying to provide relevant log entries with some of the data masked. I have the full raw logs that I'd be happy to provide in a more secure way.

2022-08-17 20:24:09.830 WARNING (MainThread) [homeassistant.config_entries] Config entry 'Living Room Lights (mss510x)' for meross_lan integration not ready yet: MQTT unavailable; Retrying in background
2022-08-17 20:24:09.835 WARNING (MainThread) [homeassistant.config_entries] Config entry 'MQTT Hub' for meross_lan integration not ready yet: MQTT unavailable; Retrying in background
2022-08-17 20:24:09.839 WARNING (MainThread) [homeassistant.config_entries] Config entry 'Bathroom Fan (mss510x)' for meross_lan integration not ready yet: MQTT unavailable; Retrying in background
2022-08-17 20:24:11.696 DEBUG (MainThread) [custom_components.meross_lan] MerossHttpClient(192.168.12.71): HTTP Response ({"header":{"messageId":"ab0247af137e4fef8b9c667e68c14b6a","namespace":"Appliance.System.All","method":"GETACK","payloadVersion":1,"from":"/appliance/XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/publish","timestamp":1660782251,"timestampMs":989,"sign":"********************************"},"payload":{"all":{"system":{"hardware":{"type":"mss550x","subType":"us","version":"4.0.0","chipType":"MT7686","uuid":"XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX","macAddress":"xx:xx:xx:xx:xx:xx"},"firmware":{"version":"4.2.2","homekitVersion":"2.0.1","compileTime":"Sep 23 2021 17:21:34","encrypt":1,"wifiMac":"xx:xx:xx:xx:xx:xx","innerIp":"192.168.12.71","server":"192.168.12.6","port":8883,"userId":0},"time":{"timestamp":1660782251,"timezone":"","timeRule":[]},"online":{"status":0,"bindId":"","who":0}},"digest":{"togglex":[{"channel":0,"onoff":0,"lmTime":1660708156}],"triggerx":[],"timerx":[]}}}}
)
2022-08-17 20:24:11.697 DEBUG (MainThread) [custom_components.meross_lan] MerossDevice(XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX) back online!
2022-08-17 20:24:16.303 DEBUG (MainThread) [custom_components.meross_lan] MerossApi: MQTT RECV device_id:(XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX) method:(GETACK) namespace:(Appliance.System.All)
2022-08-17 20:24:16.304 DEBUG (MainThread) [custom_components.meross_lan] MerossDevice(XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX) back online!
2022-08-17 20:24:16.305 DEBUG (MainThread) [custom_components.meross_lan] MerossApi: MQTT SEND device_id:(XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX) method:(GET) namespace:(Appliance.System.Runtime)
bradleysimard commented 2 years ago

I've noticed this too.

My Meross devices are reporting "unavailable" back and forth all the time.

None of my other devices that use WiFi (located close to the same locations) are having these issues.

krahabb commented 2 years ago

Hello, this reminds me of a particular MQTT timeout/disconnect behaviour which was discussed in #192

If this is the case and meross_lan was configured through MQTT (you can tell because the configuration panel does not allow you to enter/modify the device address), then once the MQTT connection stalls, meross_lan should automatically switch to the last known IP address and continue over HTTP (seamlessly; you just see a log entry stating a protocol switch).

If this doesn't work by itself, it could be that the device, once disconnected by mosquitto (the MQTT specification states the broker should disconnect clients after a certain timeout, but the implementation, either in mosquitto or in the Meross device, could be 'funny', so to say), enters a reboot state and is therefore offline for something like 30-60 seconds; you'd see this in the history/log for the device/entities.
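The fallback behaviour described above can be sketched as a small decision helper. This is purely illustrative: `choose_transport` is a hypothetical name, not meross_lan's real API, and the actual integration logic is far more involved.

```python
# Hedged sketch of the MQTT -> HTTP fallback described above.
# choose_transport() is a hypothetical helper, not the integration's real API.

def choose_transport(mqtt_alive, last_known_ip):
    """Prefer MQTT; fall back to HTTP at the last known IP when MQTT stalls."""
    if mqtt_alive:
        return "mqtt"
    if last_known_ip:
        # Seamless switch: a log entry would note the protocol change here.
        return "http://" + last_known_ip
    # No address known (device was configured via MQTT only): stay offline.
    return None

print(choose_transport(False, "192.168.12.71"))  # http://192.168.12.71
```

Note that a device configured through MQTT with no stored host address (as in the original report) lands in the last branch, which would explain being unable to switch protocols in the UI.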

timnolte commented 2 years ago

OK, I think this brings me to another aspect: I feel like my devices that are "primarily" set up via MQTT are not properly being seen as available via HTTP/IP. I do have all of my Meross devices set up in DHCP with fixed IP addresses.

tunisiano187 commented 2 years ago

Same here, everything is OK on the meross app

Kilberz commented 2 years ago

Same - OK in the app but keeps dropping using this integration.

krahabb commented 2 years ago

For @tunisiano187 and @Kilberz: this issue is hardly related to your cases, since your devices are paired to the app and the Meross cloud MQTT servers, so meross_lan is not using MQTT to communicate with them.

When your devices are paired with the official Meross app, meross_lan can only use HTTP to communicate with them. In this scenario, it is better to fix their IP addresses (usually achieved by configuring your own router/DHCP settings) so they're always reachable at the same address. meross_lan is not able to detect an address change of the device (which can happen when the device reboots, or even on timeout, depending on how the router is configured/behaving).

In my experience, I also see some random disconnects, which usually recover in a few seconds (provided the device has a fixed IP). This issue (HTTP disconnection) appears unresolvable to me since it is caused by the device itself, which sometimes rejects meross_lan requests for no apparent reason: it just doesn't reply. It shouldn't happen often enough to make the devices unusable, however.

If the issue is very persistent, the reason might be the configuration of the device key in meross_lan. If you're unsure, enter the configuration panel, delete the content of the key field, and select the 'cloud retrieve' option for the key mode: this will prompt you for your Meross account login information in order to retrieve the correct key for the device.

tunisiano187 commented 2 years ago

The IPs are fixed; the main problem is the connection loss, since it connects and disconnects in less than 30 seconds. So the IPs do not change in that time.

jragarw commented 2 years ago

Chiming in: after updating from HAS 10.something to the latest, this is also occurring on just one of my devices. I have 14 Meross plugs; all work OK but one. All are set up in the app, all have the correct IP address listed in the device, and all are on static IPs.

The behavior is that the entities become available for a second, then unavailable for a few seconds.

Living Room - DNDMode became unavailable 11:03:29 - 29 seconds ago
Living Room - DNDMode turned on 11:03:29 - 30 seconds ago
Living Room - outlet turned on 11:03:28 - 30 seconds ago
Living Room - outlet became unavailable 11:03:19 - 39 seconds ago
Living Room - DNDMode became unavailable 11:03:19 - 39 seconds ago
Living Room - DNDMode turned on 11:03:19 - 42 seconds ago
Living Room - outlet turned on 11:03:18 - 42 seconds ago
Living Room - outlet became unavailable 11:03:14 - 1 minute ago
Living Room - DNDMode became unavailable 11:03:14 - 1 minute ago
Living Room - DNDMode turned on 11:03:14 - 1 minute ago
Living Room - outlet turned on 11:03:13 - 1 minute ago
Living Room - outlet became unavailable 11:03:09 - 1 minute ago

krahabb commented 2 years ago

This is very interesting: 1 device out of 14 showing this behaviour is likely related to a very specific firmware. If you could inspect this a bit and see whether the fw version differs from the others (devices of the same type), that could provide some hints. Also, the really 'fast' disconnection and reconnection is very strange, since devices should be polled every 30 seconds. The delay can vary anyway, especially when devices don't respond, since meross_lan tries a little harder to silently re-establish the connection over a few attempts and then, when the last attempt also fails, reports the disconnection to HA.

jragarw commented 2 years ago

https://imgur.com/a/GtXNlwt

Interestingly, 10 of my plugs are hardware revision 2, with a 2.X firmware, and the newest two are revision 6 with a 6.X firmware

The 6.X are the ones which are reading as unavailable with the quick disconnect reconnect.

enomam commented 1 year ago

I'm getting the same problem with my newly purchased meross switches. They have FW 6.3.6, and after about 12-16 hours - become unavailable. The only way to bring them back on is to physically turn the power off at the wall socket.

Devices with FW 2.1.16 are solid.

For all these devices, I set them up to use an mqtt broker using the instructions detailed here: https://github.com/bytespider/Meross/wiki/MQTT

DominikGebhart commented 1 year ago

Needs more info. Places to check:

[Plug] --- A --- [Wifi] --- B --- [MQTT Broker] --- C --- [HA meross lan integration]

A: Is the device successfully connected to WiFi? Check via the router or ping the IP address. If not, make sure the WiFi signal is strong enough and try to use a channel that has low noise (this can usually be checked and configured in the router settings).
B: Is the device connected to the MQTT broker and successfully publishing data? Check the MQTT logs and observe what is sent and received, e.g. with MQTT Explorer.
C: Check the HA logs, enable debug logging for the meross_lan integration, and look for hints of what's going wrong.
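The layered checklist above can be folded into a tiny triage helper. This is purely illustrative (the function name is made up); the three boolean inputs would come from your router, broker logs and HA logs respectively.

```python
# Illustrative triage helper for the [Plug]--A--[WiFi]--B--[Broker]--C--[HA]
# chain described above. The check results must be gathered manually.

def first_failing_layer(wifi_ok, broker_ok, ha_ok):
    """Return the first link of the chain that needs inspection."""
    checks = [
        ("A: WiFi association", wifi_ok),
        ("B: MQTT broker session", broker_ok),
        ("C: HA / meross_lan", ha_ok),
    ]
    for name, ok in checks:
        if not ok:
            return name
    return "all layers look healthy"

print(first_failing_layer(True, False, True))  # B: MQTT broker session
```

The point of checking in this order is that a failure at an earlier link (e.g. weak WiFi) will also manifest at every later one, so starting at A avoids chasing phantom broker or integration bugs.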

bradleysimard commented 1 year ago

I just want to point out that my issues were related to my Zigbee network causing micro-outages due to competing 2.4 GHz channels.

I guess most of my other devices can just re-establish a connection quickly, while the Meross devices were requiring manual reconnections.

Just in case anyone else has a similar setup.

krahabb commented 1 year ago

@enomam, as @DominikGebhart pointed out, you should enable debug logging for meross_lan (you can now enable debug logging for an integration from the '...' menu on the integration panel; just enable it for any configuration entry you have and it will apply globally to the meross_lan integration).

When a device works on a private MQTT broker, meross_lan never tries to contact the device (no polling in general) unless it actually needs to send a command message. This might lead to a state where the device is reported offline (because of any transient disconnection) and you can't send any command (since the UI prevents you from interacting with a device that is unavailable). In general, on MQTT, the device should push some messages every now and then, and this is recognized by meross_lan as the device being online. The only 'safety' measure in meross_lan is a heartbeat (roughly on a 5-minute timeout) where meross_lan tries to 'ping' the device over MQTT to see if it's there or not, but this is actually only meant to prevent meross_lan from thinking the device is disconnected when it's not. So, in the end, if you don't see your devices coming online at all, they're likely effectively offline (with respect to HA).

In order to see what's going on, you should check the mosquitto connect/disconnect log to see what is happening to the paired devices. The issue in #192 could be a hint for this too.
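The push-based liveness tracking described above (device considered online while it keeps publishing, with a roughly 5-minute heartbeat as a safety net) can be sketched like this. The helper is hypothetical, not meross_lan's actual code:

```python
# Sketch of the MQTT liveness check described above (hypothetical helper;
# meross_lan's real logic lives in its device/MQTT handling code).

HEARTBEAT_TIMEOUT = 300.0  # seconds: the rough 5-minute heartbeat window

def should_ping(last_rx, now):
    """True when no MQTT message has been seen within the heartbeat window,
    so a 'ping' over MQTT is due before declaring the device offline."""
    return (now - last_rx) >= HEARTBEAT_TIMEOUT

print(should_ping(last_rx=0.0, now=120.0))  # False: traffic seen recently
print(should_ping(last_rx=0.0, now=360.0))  # True: silent past the window
```

This also illustrates why a device on private MQTT can look offline for several minutes: nothing actively polls it, so availability only flips back when the device pushes a message or the heartbeat ping succeeds.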

wsw70 commented 1 year ago

I have the same problem (I believe). I discovered it when the light in a room switched on by itself several times during the night; the wall switch had been rather unstable before, but I hadn't paid much attention so far. I ended up disabling the automation that switches the lamps when the wall switch turns 'on'.

The switch (michael main) is a Meross mss510x (wall switch powered by the mains), connected to a UniFi AP; MQTT is a standalone mosquitto; HA is 2023.10.4 and meross_lan is Cloudy.3 (4.3.0).

This is what I see in the device log:

(screenshot)

The WiFi is stable, so the next check is mosquitto. Around the 10:35:30 timestamp from the meross_lan log, I have this in mosquitto (I removed the in-memory db saving lines that clutter the log):

domotique-mqtt-1  | 2023-10-22T10:34:09: Client fmware:2102089305516425584748e1e94bd8ba_NVZuM0oHVNYMgCeS has exceeded timeout, disconnecting.
domotique-mqtt-1  | 2023-10-22T10:34:17: New connection from 192.168.10.51:52603 on port 8883.
domotique-mqtt-1  | 2023-10-22T10:35:00: New connection from 192.168.10.51:52605 on port 8883.
domotique-mqtt-1  | 2023-10-22T10:35:02: New client connected from 192.168.10.51:52605 as fmware:2102089305516425584748e1e94bd8ba_NVZuM0oHVNYMgCeS (p1, c1, k30, u'48:e1:e9:4b:d8:ba').
domotique-mqtt-1  | 2023-10-22T10:35:02: OpenSSL Error[0]: error:140370E5:SSL routines:ACCEPT_SR_KEY_EXCH:ssl handshake failure
domotique-mqtt-1  | 2023-10-22T10:35:02: Client <unknown> disconnected: Protocol error.
domotique-mqtt-1  | 2023-10-22T10:35:51: Client fmware:2102089305516425584748e1e94bd8ba_NVZuM0oHVNYMgCeS has exceeded timeout, disconnecting.
domotique-mqtt-1  | 2023-10-22T10:36:29: New connection from 192.168.10.51:52606 on port 8883.
domotique-mqtt-1  | 2023-10-22T10:36:29: New client connected from 192.168.10.51:52606 as fmware:2102089305516425584748e1e94bd8ba_NVZuM0oHVNYMgCeS (p1, c1, k30, u'48:e1:e9:4b:d8:ba').
(no more relevant logs, the device has reconnected fine)

192.168.10.51 is indeed the IP of the switch. All this happened without any manual interaction with the switch.

I saw that many of you in this thread had similar issues - is there a consensus on where to go next?

When searching for this issue, the usual culprit is a "bad certificate", but that is not my case: the handshake failure is intermittent and eventually everything is fine.
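For context on the "exceeded timeout" lines in the mosquitto log above: the device connects with `k30`, i.e. a 30-second MQTT keepalive, and the MQTT 3.1.1 specification allows the broker to drop a client that stays silent for 1.5 times its keepalive interval. A quick check of the numbers (illustrative only):

```python
# The 'k30' in the mosquitto connect line is the client's keepalive interval.
KEEPALIVE = 30  # seconds

# MQTT 3.1.1 spec: the server may disconnect a client that sends no packet
# for 1.5 times the keepalive interval.
grace = 1.5 * KEEPALIVE
print(grace)  # 45.0
```

So a device that stops sending PINGREQs is dropped within about 45 seconds, which roughly matches the connect/timeout/reconnect cadence visible in the log (e.g. connected at 10:35:02, timed out at 10:35:51). That points at the device firmware failing to honor its own keepalive rather than at a broker misconfiguration.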

krahabb commented 1 year ago

OpenSSL Error[0]: error:140370E5:SSL routines:ACCEPT_SR_KEY_EXCH:ssl handshake failure

@wsw70, could it be that we have 'another' mosquitto bug? In my experience it has happened (not really lately, more like a couple of years ago) that some mosquitto releases were inconsistent, to say the least.

I'm actually using mosquitto 2.0.12 and it reports some disconnections too (but I don't have detailed logs for that). These in turn don't affect (at least not that I'm aware of) the overall device availability in meross_lan

My devices are anyway configured for automatic protocol switching (in meross_lan) and are therefore usually accessed or accessible via HTTP so I guess I cannot detect MQTT unavailability in every-day life....

wsw70 commented 1 year ago

I'm actually using mosquitto 2.0.12 and it reports some disconnections too (but I don't have detailed logs for that). These in turn don't affect (at least not that I'm aware of) the overall device availability in meross_lan

I have 2.0.18 and several devices that connect to MQTT. Somehow only the Meross devices disconnect and reconnect (through a timeout).

My devices are anyway configured for automatic protocol switching (in meross_lan) and are therefore usually accessed or accessible via HTTP so I guess I cannot detect MQTT unavailability in every-day life....

Mine too (I think: nothing is chosen between auto, mqtt and http). I will try to force one device to use only HTTP to see how it goes.

One question: my firmware is 3.1.5, I see mentions of 4.x versions - have you upgraded? if so - how?

krahabb commented 1 year ago

My firmware for mss310(s) is 2.1.4 and it is the latest available since one of them is Meross-binded and doesn't notify any update...all of my devices are really 'almost legacy'

wsw70 commented 1 year ago

My firmware for mss310(s) is 2.1.4 and it is the latest available since one of them is Meross-binded and doesn't notify any update...all of my devices are really 'almost legacy'

Ahhh, you mean that upgrading to the latest version means that it is not possible anymore to use a local MQTT because of some binding made between the device and Meross cloud? Or that they do not use MQTT anymore (and rely on HTTP)? Or something else?

I upgraded one device to 4.something and the pairer would not work anymore (and the latest version could not be installed on my phone). I had to go to bed so I did not investigate further but I will do so because I need to understand which strategy to take for the new devices I will buy :)

krahabb commented 1 year ago

I really don't know... but I hardly think the whole MQTT protocol has been dropped. Newer devices/firmwares have always added protocols/features (like HomeKit and now Matter) rather than removing the original MQTT/HTTP interfaces.

DominikGebhart commented 1 year ago

I upgraded one device to 4.something and the pairer would not work anymore (and the latest version could not be installed on my phone).

Newer version might need to use the newer WifiX pairing stuff, see https://github.com/bytespider/Meross/pull/60

wsw70 commented 1 year ago

Newer version might need to use the newer WifiX pairing stuff, see bytespider/Meross#60

Thank you! I managed to pair it without problems.

fuomag9 commented 9 months ago

The only way to bring them back on is to physically turn the power off at the wall socket.

There is actually a way to do this, at least on my Meross 315. I've discovered that they drop the MQTT connection but stay connected to the WiFi. If I disconnect them from the WiFi via my UniFi control panel, they reconnect to MQTT without my needing to physically interact with them or turn off the device! (screenshot)

gary-sargent commented 8 months ago

I'm also getting frequent disconnects from my power sockets. I'm on the latest version of MerossLan 5.0.2

I have quite a few devices, all still connected to Meross Cloud, all with reserved DHCP entries, all connected in the integration via http method.

(screenshot)

Can nothing be done about this?

gary-sargent commented 8 months ago

Also just to add I have ping monitoring of all the devices, and they are being pinged every 20 seconds. Every single ping is coming back with a reply - suggesting no network issues.

fuomag9 commented 8 months ago

Also just to add I have ping monitoring of all the devices, and they are being pinged every 20 seconds. Every single ping is coming back with a reply - suggesting no network issues.

Yeah, I'm 100% sure it's a software issue, maybe in Meross-lan? The disconnect method seems to have broken recently (I.e. the plug does not recover) so I'm stuck with non functional ones until I replug them 😭

timnolte commented 8 months ago

For those having disconnect issues, I'd make certain you aren't using the blank device key hack. Since I've gone through and properly set up all of my devices with a device key, including creating local-only users for every device in Home Assistant with the proper password corresponding to the device key and MAC address, I've no longer experienced any disconnects. My devices have been rock solid, working over both IP & MQTT automatically.

timnolte commented 8 months ago

Oh, I would also say: if at all possible, set up your local IP addressing so that your devices always get assigned the same IP address. I've configured my local LAN DHCP server to statically assign fixed IP addresses to all of my Meross devices. This ensures the IP addresses don't change, which can otherwise make devices more susceptible to connectivity issues.

gary-sargent commented 8 months ago

@timnolte what do you mean by "blank device key hack"? If I click configure integration in HA, then the "device key" field is filled in for devices that are going offline. My devices already always get the same static IP.

I'm not using MQTT; I'm using the HTTP method.

timnolte commented 8 months ago

@gary-sargent well, I will point out that this issue was targeted around using MQTT, so I'm not certain it's appropriate to continue having folks post other issues to this thread that aren't specific to the original one. Using the Custom Pairer App is specifically for use with MQTT. It sounds like you perhaps also have your devices connected to the Meross Cloud and are using local IP control? Do you have your devices set to HTTP only, and not auto?

gary-sargent commented 8 months ago

@timnolte yes they are set to http only, not auto.

timnolte commented 8 months ago

@gary-sargent are your devices HomeKit compatible? I've mostly chosen devices that are not HomeKit compatible, because I had issues with those early on. I do have two 3-way switches and my garage door opener that are HomeKit compatible, but they have been solid recently after reconnecting them via MQTT.

gary-sargent commented 8 months ago

I think one of them is HomeKit and it's the only stable one ironically! I normally get non-HomeKit as they are cheaper.

gary-sargent commented 8 months ago

Here is an example from one device:

(screenshot)

gary-sargent commented 8 months ago

Captured debug log when it happens:

2024-03-14 18:38:36.465 DEBUG (MainThread) [custom_components.meross_lan.mss110_##############################14] Polling begin
2024-03-14 18:38:36.466 DEBUG (MainThread) [custom_components.meross_lan.mss110_##############################14] TX(http) GET Appliance.System.All (messageId:22528753f9e044758e609863e01b2307)
2024-03-14 18:38:36.564 DEBUG (MainThread) [custom_components.meross_lan.mss110_##############################14] RX(http) GETACK Appliance.System.All (messageId:22528753f9e044758e609863e01b2307)
2024-03-14 18:38:36.564 DEBUG (MainThread) [custom_components.meross_lan.mss110_##############################14] Polling end
2024-03-14 18:39:06.565 DEBUG (MainThread) [custom_components.meross_lan.mss110_##############################14] Polling begin
2024-03-14 18:39:06.566 DEBUG (MainThread) [custom_components.meross_lan.mss110_##############################14] TX(http) GET Appliance.System.All (messageId:80f59188631f42a4a3edcd67f5f89bda)
2024-03-14 18:39:08.328 DEBUG (MainThread) [custom_components.meross_lan.mss110_##############################14] HTTP ERROR GET Appliance.System.All (messageId:80f59188631f42a4a3edcd67f5f89bda ServerDisconnectedError:Server disconnected attempt:0)
2024-03-14 18:39:08.329 DEBUG (MainThread) [custom_components.meross_lan.mss110_##############################14] TX(http) GET Appliance.System.All (messageId:80f59188631f42a4a3edcd67f5f89bda)
2024-03-14 18:39:08.387 DEBUG (MainThread) [custom_components.meross_lan.mss110_##############################14] HTTP ERROR GET Appliance.System.All (messageId:80f59188631f42a4a3edcd67f5f89bda ServerDisconnectedError:Server disconnected attempt:1)
2024-03-14 18:39:08.387 DEBUG (MainThread) [custom_components.meross_lan.mss110_##############################14] TX(http) GET Appliance.System.All (messageId:80f59188631f42a4a3edcd67f5f89bda)
2024-03-14 18:39:08.432 DEBUG (MainThread) [custom_components.meross_lan.mss110_##############################14] HTTP ERROR GET Appliance.System.All (messageId:80f59188631f42a4a3edcd67f5f89bda ServerDisconnectedError:Server disconnected attempt:2)
2024-03-14 18:39:08.432 DEBUG (MainThread) [custom_components.meross_lan.mss110_##############################14] Polling end
2024-03-14 18:39:38.433 DEBUG (MainThread) [custom_components.meross_lan.mss110_##############################14] Polling begin
2024-03-14 18:39:38.433 DEBUG (MainThread) [custom_components.meross_lan.mss110_##############################14] Going offline!
2024-03-14 18:39:38.434 DEBUG (MainThread) [custom_components.meross_lan.mss110_##############################14] Polling end
2024-03-14 18:40:08.434 DEBUG (MainThread) [custom_components.meross_lan.mss110_##############################14] Polling begin
2024-03-14 18:40:08.435 DEBUG (MainThread) [custom_components.meross_lan.mss110_##############################14] TX(http) GET Appliance.System.All (messageId:94668f155d02450e9bf3b88311a61641)
2024-03-14 18:40:08.517 DEBUG (MainThread) [custom_components.meross_lan.mss110_##############################14] RX(http) GETACK Appliance.System.All (messageId:94668f155d02450e9bf3b88311a61641)
2024-03-14 18:40:08.517 DEBUG (MainThread) [custom_components.meross_lan.mss110_##############################14] Back online!
2024-03-14 18:40:08.518 DEBUG (MainThread) [custom_components.meross_lan.mss110_##############################14] TX(http) GET Appliance.System.Runtime (messageId:eb7117327753488dba8f4781c383a15b)
2024-03-14 18:40:08.572 DEBUG (MainThread) [custom_components.meross_lan.mss110_##############################14] RX(http) GETACK Appliance.System.Runtime (messageId:eb7117327753488dba8f4781c383a15b)
2024-03-14 18:40:08.573 DEBUG (MainThread) [custom_components.meross_lan.mss110_##############################14] Polling end
krahabb commented 8 months ago

Hello @gary-sargent, this behavior has been reported from time to time and I've never found a reason for it. My devices too have sometimes shown this (not lately, though), but it always popped up as a 'glitch' here and there. That's why there's an algorithm to quickly reattempt the connection whenever it fails (as your log shows).

My actual guess is that the device goes into a sort of internal failure and just reboots (at least that's what was happening on mine), so it loses reachability for a minute or so. When a device falls offline, meross_lan uses a 'relaxation' algorithm: it tries to reconnect at the next polling loop (i.e. after 30 seconds) and then increases the period between attempts in order not to 'waste' system resources on a device that is not going to return (this period is capped at 5 minutes, if I remember correctly).

Since your devices are still cloud connected, you could see whether using the 'Auto' protocol option (and configuring the Meross cloud profile) in meross_lan mitigates these disconnections (so that when HTTP fails, the cloud MQTT works as a backup), but if the devices are really rebooting, this will likely expose the disconnection too.

By the look of your device list/statistics, it seems to me that suspicion should fall on the fw/hw combo, and this would confirm the issue is an internal (likely firmware) one.

You can anyway check whether my reasoning about reboots is correct by trying to 'catch' your devices while offline and seeing if they're entering a reboot loop (in which case you should see some blinking on the device).

I could also throw in another guess: since Meross looks very sensitive to, and prone to limiting, cloud traffic, your relatively large device count might hit their rate limiting, and some devices might be suffering this more than others. You could try unbinding one of your often-failing mss110 units from the cloud and pairing it to a local MQTT broker to see if it behaves differently.
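The 'relaxation' retry schedule described above (first retry at the next 30-second poll, then growing intervals capped at 5 minutes) could look roughly like this. It's a sketch under those stated assumptions, not the integration's actual code, and the doubling progression is an illustrative choice:

```python
# Sketch of a relaxed reconnect schedule: retry at the next poll, then back
# off, capped at 5 minutes. The doubling factor is an assumption for
# illustration; meross_lan's real progression may differ.

POLL_PERIOD = 30   # seconds between normal polling loops
MAX_BACKOFF = 300  # cap at 5 minutes

def retry_delay(failed_attempts):
    """Seconds to wait before the next reconnect attempt."""
    return min(POLL_PERIOD * (2 ** failed_attempts), MAX_BACKOFF)

print([retry_delay(n) for n in range(5)])  # [30, 60, 120, 240, 300]
```

The cap matters for the symptoms reported in this thread: a device that recovers right after a few failed attempts can still show as "unavailable" for up to 5 minutes, because the next probe simply hasn't been scheduled yet.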

garysargentpersonal commented 8 months ago

I don't think they are rebooting. I have had no lost pings (pinging them every 20 seconds), the WiFi connection stats on my router don't show a connection break, and the firewall shows a successful connection to the cloud servers every 2 minutes without gaps.

gary-sargent commented 8 months ago

@krahabb I have some further information on this. It looks like the HTTP endpoint on these devices is single threaded.

If I perform the following on the command line:

curl -X POST http://192.168.123.149/config (Replace IP address with the IP of a device)

Then the above command hangs indefinitely. From this point on, the integration is completely unable to communicate via HTTP with the device. The moment you kill the above curl, everything starts working again.

Do you have logic in place to ensure only one connection is made to the endpoint at any one time, with a suitable timeout so that no single connection can hog it for long?
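One common guard against a single hung connection like the `curl` above is to bound every request with a timeout, so a stalled request can never tie up the device's single-threaded endpoint indefinitely. This is a generic asyncio sketch (the names are made up; it is not meross_lan's actual implementation):

```python
import asyncio

async def guarded_request(coro_factory, timeout=5.0):
    """Run one device request, but never let it hang past `timeout` seconds."""
    try:
        return await asyncio.wait_for(coro_factory(), timeout)
    except asyncio.TimeoutError:
        return None  # treat it as a failed poll and move on

async def hung_request():
    # Simulates a request to the device that never returns
    # (like the hanging curl to /config described above).
    await asyncio.sleep(3600)

async def main():
    result = await guarded_request(hung_request, timeout=0.1)
    print(result)  # None: the hung request is abandoned after 0.1 s

asyncio.run(main())
```

With such a bound in place, a connection that hogs the endpoint costs at most one timeout interval per attempt instead of blocking all HTTP communication until it is manually killed.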

krahabb commented 8 months ago

Hello @gary-sargent, thank you for this precious info. Actually, there's no 'strong synchronization' in the code that queries the device. It is 'reasonably' serialized by the fact that the polling is sequential (i.e. every polling command in the cycle 'awaits' for the preceding one to finish before being submitted to the device), even though I'm not 100% sure I've mastered the Python async model, which is the one the code is based on.

What might happen, for sure, is that the supposedly serialized polling interleaves with other async commands, like those originating from the UI when requesting an action (or from automations interacting with the entities).

I'll definitely do a thorough review of this possible overlapping and try to set a barrier for it.
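The 'barrier' mentioned here would typically be a per-device `asyncio.Lock`, so that UI-originated commands cannot interleave with the polling cycle on the same (single-threaded) device endpoint. A minimal sketch with made-up names, not meross_lan's actual code:

```python
import asyncio

class DeviceClient:
    """One HTTP client per device, with a lock serializing all its requests."""

    def __init__(self, name):
        self.name = name
        self._lock = asyncio.Lock()
        self.log = []

    async def request(self, namespace):
        async with self._lock:  # a poll and a UI command can no longer overlap
            self.log.append("TX " + namespace)
            await asyncio.sleep(0)  # stand-in for the real HTTP round trip
            self.log.append("RX " + namespace)

async def main():
    dev = DeviceClient("mss110")
    # A polling GET and a UI command fired concurrently stay serialized:
    await asyncio.gather(dev.request("Appliance.System.All"),
                         dev.request("Appliance.Control.ToggleX"))
    print(dev.log)

asyncio.run(main())
```

Since each device gets its own lock, requests to different devices still run concurrently; only requests to the same device queue up, which matches the observation that interleaving across devices is harmless while overlapping requests to one device are not.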

gary-sargent commented 8 months ago

@krahabb there definitely seems to be something not right when you have a large number of devices (I have 18). I get many more devices going offline - especially ones that have a higher latency.

For example, here is one from my logs - where device 13 starts polling, but then 8 and 18 jump in, then 13 has an error:

2024-03-15 16:23:19.084 DEBUG (MainThread) [custom_components.meross_lan.mss110_##############################13] Polling begin
2024-03-15 16:23:19.084 DEBUG (MainThread) [custom_components.meross_lan.mss110_##############################13] TX(http) GET Appliance.System.All (messageId:18900a4517c74c6fb728b8c4ab591f5f)
2024-03-15 16:23:19.887 DEBUG (MainThread) [custom_components.meross_lan.mss110_###############################8] Polling begin
2024-03-15 16:23:19.891 DEBUG (MainThread) [custom_components.meross_lan.mss110_###############################8] TX(http) GET Appliance.System.All (messageId:02f177427f7544f2af1425060a2503de)
2024-03-15 16:23:20.094 DEBUG (MainThread) [custom_components.meross_lan.mss110_###############################8] RX(http) GETACK Appliance.System.All (messageId:02f177427f7544f2af1425060a2503de)
2024-03-15 16:23:20.094 DEBUG (MainThread) [custom_components.meross_lan.mss110_###############################8] Polling end
2024-03-15 16:23:20.116 DEBUG (MainThread) [custom_components.meross_lan.mss110_##############################18] Polling begin
2024-03-15 16:23:20.117 DEBUG (MainThread) [custom_components.meross_lan.mss110_##############################18] TX(http) GET Appliance.System.All (messageId:e82eeea587a7447cb71c9ba8dff7f777)
2024-03-15 16:23:20.342 DEBUG (MainThread) [custom_components.meross_lan.mss110_##############################13] HTTP ERROR GET Appliance.System.All (messageId:18900a4517c74c6fb728b8c4ab591f5f ServerDisconnectedError:Server disconnected attempt:0)
2024-03-15 16:23:20.342 DEBUG (MainThread) [custom_components.meross_lan.mss110_##############################13] Polling end
2024-03-15 16:23:20.462 DEBUG (MainThread) [custom_components.meross_lan.mss110_##############################18] RX(http) GETACK Appliance.System.All (messageId:e82eeea587a7447cb71c9ba8dff7f777)
2024-03-15 16:23:20.463 DEBUG (MainThread) [custom_components.meross_lan.mss110_##############################18] Back online!
2024-03-15 16:23:20.467 DEBUG (MainThread) [custom_components.meross_lan.mss110_##############################18] TX(http) GET Appliance.System.Runtime (messageId:648a6e5409eb40c492f77062e0f0ed09)
2024-03-15 16:23:20.697 DEBUG (MainThread) [custom_components.meross_lan.mss110_##############################18] RX(http) GETACK Appliance.System.Runtime (messageId:648a6e5409eb40c492f77062e0f0ed09)
2024-03-15 16:23:20.698 DEBUG (MainThread) [custom_components.meross_lan.mss110_##############################18] Polling end
timnolte commented 8 months ago

@gary-sargent so, I don't really think it's broadly an issue of the number of devices. I just checked and I have exactly 18 Meross devices. However, I have them all connected via local MQTT, not connected to the cloud at all, and they are all set up as "auto" so they switch between HTTP and MQTT automatically as needed. This could be an issue specific to IP-only and cloud setups, with many devices being rate-limited or something.

gary-sargent commented 8 months ago

Not sure; with all devices enabled I'm seeing some kind of issue on roughly 1 in 4 polling cycles, and the polling looks to be interleaved when it happens, rather than in sequence as normal. Smells like a bug to me.

In a normal set of logs each device takes a polling turn from start to end. When something goes wrong, other devices start polling in the middle of an outstanding poll (see my debug logs above).

krahabb commented 8 months ago

Polling cycles are per device, so devices' polling might be interleaved without any issue, since their HTTP clients are separate (trusting the core Python frameworks).

Even if it were an issue: in the log, device 8 starts getting polled 0.8 seconds after device 13, and the latter still hasn't responded, so I'm still strongly convinced the issue lies in the device stack.

It might be that you observe the polling cycles being 'interleaved' only when issues arise because, when a device doesn't reply or returns an error, that polling cycle lasts more than a fraction of a second (see dev 13 in the log), while cycles without issues are relatively fast (lessening the probability of interleaving with others).
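To illustrate the point: each device runs its own independent polling cycle, so a slow (or failing) device naturally makes its cycle overlap with faster ones. A minimal sketch with two simulated pollers (device names and durations are made up for illustration):

```python
import asyncio


async def poll(name: str, duration: float, events: list[str]) -> None:
    # Each device runs its own independent polling cycle; a slow response
    # from one device does not block the others.
    events.append(f"{name} begin")
    await asyncio.sleep(duration)  # simulated HTTP round-trip
    events.append(f"{name} end")


async def main() -> list[str]:
    events: list[str] = []
    # "dev13" is slow (e.g. not responding); "dev8" starts shortly after
    # and finishes first, so the log entries interleave.
    await asyncio.gather(
        poll("dev13", 0.10, events),
        poll("dev8", 0.02, events),
    )
    return events


events = asyncio.run(main())
print(events)
# → ['dev13 begin', 'dev8 begin', 'dev8 end', 'dev13 end']
```

This matches the pattern in the logs above: interleaving by itself is benign and only becomes visible when one cycle takes long enough to overlap the next device's turn.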

krahabb commented 8 months ago

I'm trying to make the device HTTP service hang but I cannot really make it fail. In my tests I tried to 'overflow' the device by sending HTTP requests in parallel (not sure how parallel, since it is done in Python async), building up to 10 'concurrent' connections. I could then see requests and replies randomly interleaved, showing that the device seems able to manage multiple concurrent connections. The device is an old mss310 (hw 2.0.0 - fw 2.1.4).

At any rate, even if the Python runtime is not really issuing concurrent requests (so that what I see is just some fog), the result is that, from the Python execution environment (the one where meross_lan/HA lives), I can 'pump' 10 'almost concurrent' connections, spawned and interleaved within a window shorter than 0.5 seconds, without any disconnection or timeout. At the moment I'm not able to use curl like you said in order to 'hang' the device, since when invoked, curl just exits with an error (the device doesn't reply).
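The burst test described above can be reproduced self-contained with stdlib asyncio streams; here a tiny local server stands in for the device's `/config` endpoint (the server and request bodies are placeholders, not the actual meross_lan code):

```python
import asyncio


async def handle(reader, writer):
    # Minimal stand-in for the device HTTP endpoint: read the request,
    # reply 200, close (the real device sends "Connection: close" too).
    await reader.read(1024)
    writer.write(
        b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\nConnection: close\r\n\r\nok"
    )
    await writer.drain()
    writer.close()
    await writer.wait_closed()


async def request(port: int) -> bool:
    reader, writer = await asyncio.open_connection("127.0.0.1", port)
    writer.write(b"POST /config HTTP/1.1\r\nHost: x\r\nContent-Length: 0\r\n\r\n")
    await writer.drain()
    data = await reader.read(4096)
    writer.close()
    await writer.wait_closed()
    return data.startswith(b"HTTP/1.1 200")


async def main() -> list[bool]:
    server = await asyncio.start_server(handle, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]
    # Fire 10 'almost concurrent' requests, as in the test described above.
    results = await asyncio.gather(*(request(port) for _ in range(10)))
    server.close()
    await server.wait_closed()
    return results


results = asyncio.run(main())
print(sum(results))  # 10 when every request succeeded
```

Against a real device, the interesting question is whether all 10 come back `200` or the firmware starts dropping connections, which is exactly what the MITM logs further down suggest.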

I was in the middle of refactoring the HTTP client code to limit the number of concurrent connections anyway (still noting that, in the current implementation, the only moment we could have 'concurrency' is when a user issues a command from the HA UI (or automations) to the device exactly while the polling loop is in place). But having no other evidence this is really needed, I'm afraid this kind of fix could introduce other bugs (i.e. latency in device command requests, since they would need to be strictly serialized with any eventual polling loop), thereby introducing unneeded annoying behavior.
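The serialization idea amounts to funneling every per-device request through a single lock, so a UI command queues behind an in-flight poll instead of racing it. A minimal sketch (class and field names are made up, not meross_lan's actual implementation):

```python
import asyncio


class SerializedClient:
    # Sketch of the 'single serialized connection' concept: an asyncio.Lock
    # guarantees at most one in-flight request per device.
    def __init__(self) -> None:
        self._lock = asyncio.Lock()
        self._inflight = 0
        self.max_inflight = 0  # instrumentation: peak concurrency observed

    async def request(self, payload) -> str:
        async with self._lock:
            self._inflight += 1
            self.max_inflight = max(self.max_inflight, self._inflight)
            await asyncio.sleep(0.01)  # simulated device round-trip
            self._inflight -= 1
            return f"ack:{payload}"


async def main():
    client = SerializedClient()
    # 10 'concurrent' callers (polling + commands) all funnel through the lock.
    replies = await asyncio.gather(*(client.request(i) for i in range(10)))
    return client.max_inflight, replies


max_inflight, replies = asyncio.run(main())
print(max_inflight)  # 1: requests were strictly serialized
```

The trade-off mentioned above is visible here: the lock caps concurrency at 1, so total latency for queued commands grows linearly with the number of waiters.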

Let's keep this open for further analysis or ideas but still, at the moment, I'm convinced the issue is in the device fw.

krahabb commented 8 months ago

I could start releasing an 'alpha' with this kind of single connection concept and so we could see if that solves this issue. If it works we'll then see if it also impacts command latency too much

fuomag9 commented 8 months ago

I could start releasing an 'alpha' with this kind of single connection concept and so we could see if that solves this issue. If it works we'll then see if it also impacts command latency too much

I'm absolutely testing this asap ^^

timnolte commented 8 months ago

@krahabb could this be an issue of Meross LAN going against the intended infrastructure and connectivity design for devices by Meross, and newer devices being more strict about that?

What I mean by that is: is Meross' design really intended so that all activity happens via HTTP requests? The very fact that you can configure most devices with your own local MQTT server, instead of using the Meross Cloud, leads me to wonder if the HTTP capabilities are really only intended to be used locally, perhaps by the Meross mobile app, while other things, for example automations, are supposed to be performed via MQTT, where requests are queued.

It seems reasonable to me that the expectation for devices would be that if I'm using the mobile app directly the interactions with devices should be perceived to be instant. However, it could be thought of as acceptable that other requests, for example automations or other connected services (Google/Alexa/etc) are required to go through MQTT.

I say all of this because it may be best for device compatibility to ensure that this integration follows Meross' design, whatever that is. If people want true local-only control then they have to run their own local MQTT; otherwise the integration is going to use the Meross Cloud, except for some things that are "acceptable" to run over HTTP by design. This would essentially make the integration's "auto" mode the only way to configure devices.

gary-sargent commented 8 months ago

It would certainly be useful to get more debug logging around what the failure reason is if that's possible. I'm in the situation where at least one device is going offline every other poll cycle which is crazy. It never used to be like this, and Meross haven't updated the firmware. Is it worth looking at recent changes which could have caused this in the integration?

krahabb commented 8 months ago

I cannot figure out which, if any, of the changes could have worsened this issue. The latest release (Moonlight) just introduced 'bulkier' messages. These are bigger in terms of data exchanged per message but should lessen the number of overall HTTP transactions, thus lessening any 'contention' or latency issues. (Say a cycle was requesting 3 messages in 3 separate transactions before; now they're grouped in a single HTTP exchange for all 3.)

I've already migrated the code to use a single serialized connection but I'm in the middle of adjusting the test environment (by the look of it, tests need more effort than the code changes themselves...)

In actual tests, I was indeed able to finally 'crash' my mss310 by increasing the number of simultaneous HTTP requests (to be honest I'm still not sure I can control the Python HTTP concurrency, but I guess I hit some limit when I tried to launch 10 almost-simultaneous connections, which likely ended up being really concurrent at the device level).

With the new code, the same pattern that led to the device crashing shows no issue at all (since the pending requests were correctly and nicely serialized). This looks promising and I'm striving to release the patch so that you can check it.
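For reference, the 'bulkier' grouped requests mentioned above can be pictured as one envelope carrying several inner GETs; the Meross firmware exposes an `Appliance.Control.Multiple` namespace for this, though the exact field layout below is an illustrative assumption, not meross_lan's actual serialization:

```python
def build_multiple(namespaces: list[str]) -> dict:
    # Hypothetical sketch: wrap several GET requests into a single
    # Appliance.Control.Multiple payload so one polling cycle needs one
    # HTTP transaction instead of three. Field names are assumptions.
    return {
        "header": {"namespace": "Appliance.Control.Multiple", "method": "SET"},
        "payload": {
            "multiple": [
                {"header": {"namespace": ns, "method": "GET"}, "payload": {}}
                for ns in namespaces
            ]
        },
    }


msg = build_multiple(
    ["Appliance.System.All", "Appliance.System.Runtime", "Appliance.System.DNDMode"]
)
print(len(msg["payload"]["multiple"]))  # 3 requests, one HTTP exchange
```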

gary-sargent commented 8 months ago

I've got a man-in-the-middle app logging the traffic to one of my devices. When it goes wrong, I see "client connect" logged twice a second apart, and then an error saying "client disconnected". Retries then show "server closed connection".

Home Assistant logs:

2024-03-18 18:10:02.783 DEBUG (MainThread) [custom_components.meross_lan.mss110_##############################18] Polling begin
2024-03-18 18:10:02.786 DEBUG (MainThread) [custom_components.meross_lan.mss110_##############################18] TX(http) GET Appliance.System.All (messageId:fa6cb9a1b93e405096bfd3e4b41c8735)
2024-03-18 18:10:03.062 DEBUG (MainThread) [custom_components.meross_lan.mss110_##############################18] RX(http) GETACK Appliance.System.All (messageId:fa6cb9a1b93e405096bfd3e4b41c8735)
2024-03-18 18:10:03.063 DEBUG (MainThread) [custom_components.meross_lan.mss110_##############################18] Polling end
2024-03-18 18:10:13.064 DEBUG (MainThread) [custom_components.meross_lan.mss110_##############################18] Polling begin
2024-03-18 18:10:13.066 DEBUG (MainThread) [custom_components.meross_lan.mss110_##############################18] TX(http) GET Appliance.System.All (messageId:bd4c59fa13a8440d83ab9b441a3083b0)
2024-03-18 18:10:14.291 DEBUG (MainThread) [custom_components.meross_lan.mss110_##############################18] HTTP ERROR GET Appliance.System.All (messageId:bd4c59fa13a8440d83ab9b441a3083b0 ClientResponseError:502, message='Bad Gateway', url=URL('http://192.168.123.100/config') attempt:0)
2024-03-18 18:10:14.291 DEBUG (MainThread) [custom_components.meross_lan.mss110_##############################18] TX(http) GET Appliance.System.All (messageId:bd4c59fa13a8440d83ab9b441a3083b0)
2024-03-18 18:10:14.377 DEBUG (MainThread) [custom_components.meross_lan.mss110_##############################18] HTTP ERROR GET Appliance.System.All (messageId:bd4c59fa13a8440d83ab9b441a3083b0 ClientResponseError:502, message='Bad Gateway', url=URL('http://192.168.123.100/config') attempt:1)
2024-03-18 18:10:14.377 DEBUG (MainThread) [custom_components.meross_lan.mss110_##############################18] TX(http) GET Appliance.System.All (messageId:bd4c59fa13a8440d83ab9b441a3083b0)
2024-03-18 18:10:14.449 DEBUG (MainThread) [custom_components.meross_lan.mss110_##############################18] HTTP ERROR GET Appliance.System.All (messageId:bd4c59fa13a8440d83ab9b441a3083b0 ClientResponseError:502, message='Bad Gateway', url=URL('http://192.168.123.100/config') attempt:2)
2024-03-18 18:10:14.449 DEBUG (MainThread) [custom_components.meross_lan.mss110_##############################18] Polling end
2024-03-18 18:10:24.450 DEBUG (MainThread) [custom_components.meross_lan.mss110_##############################18] Polling begin
2024-03-18 18:10:24.451 DEBUG (MainThread) [custom_components.meross_lan.mss110_##############################18] Going offline!
2024-03-18 18:10:24.452 DEBUG (MainThread) [custom_components.meross_lan.mss110_##############################18] Polling end

MITM logs:

[18:10:13.062][192.168.123.200:57850] client connect
[18:10:14.063][192.168.123.200:57854] client connect
[18:10:14.088][192.168.123.200:57850] server connect 192.168.123.149:80
192.168.123.200:57850: POST http://192.168.123.149/config
    Host: 192.168.123.149
    User-Agent: HomeAssistant/2024.3.1 aiohttp/3.9.3 Python/3.12
    Accept: */*
    Accept-Encoding: gzip, deflate, br
    Content-Length: 246
    Content-Type: text/plain; charset=utf-8

    {"header":{"messageId":"bd4c59fa13a8440d83ab9b441a3083b0","namespace":"Appliance.System.All","method":"GET","payloadVersion":1,"from":"Meross","timestamp":1710785413,"timestampMs":0,"sign":"513345e3c9ef67d25313af542d8065b5"},"payload":{"all":{}}}

 << Client disconnected.
[18:10:14.097][192.168.123.200:57850] client disconnect
[18:10:14.170][192.168.123.200:57854] server connect 192.168.123.149:80
192.168.123.200:57854: POST http://192.168.123.149/config
    Host: 192.168.123.149
    User-Agent: HomeAssistant/2024.3.1 aiohttp/3.9.3 Python/3.12
    Accept: */*
    Accept-Encoding: gzip, deflate, br
    Content-Length: 246
    Content-Type: text/plain; charset=utf-8

    {"header":{"messageId":"bd4c59fa13a8440d83ab9b441a3083b0","namespace":"Appliance.System.All","method":"GET","payloadVersion":1,"from":"Meross","timestamp":1710785413,"timestampMs":0,"sign":"513345e3c9ef67d25313af542d8065b5"},"payload":{"all":{}}}

 << server closed connection
[18:10:14.284][192.168.123.200:57854] server disconnect 192.168.123.149:80
[18:10:14.285][192.168.123.200:57854] client disconnect
[18:10:14.294][192.168.123.200:57868] client connect
[18:10:14.311][192.168.123.200:57868] server connect 192.168.123.149:80
192.168.123.200:57868: POST http://192.168.123.149/config
    Host: 192.168.123.149
    User-Agent: HomeAssistant/2024.3.1 aiohttp/3.9.3 Python/3.12
    Accept: */*
    Accept-Encoding: gzip, deflate, br
    Content-Length: 246
    Content-Type: text/plain; charset=utf-8

    {"header":{"messageId":"bd4c59fa13a8440d83ab9b441a3083b0","namespace":"Appliance.System.All","method":"GET","payloadVersion":1,"from":"Meross","timestamp":1710785413,"timestampMs":0,"sign":"513345e3c9ef67d25313af542d8065b5"},"payload":{"all":{}}}

 << server closed connection
[18:10:14.370][192.168.123.200:57868] server disconnect 192.168.123.149:80
[18:10:14.370][192.168.123.200:57868] client disconnect
[18:10:14.378][192.168.123.200:57882] client connect
[18:10:14.384][192.168.123.200:57882] server connect 192.168.123.149:80
192.168.123.200:57882: POST http://192.168.123.149/config
    Host: 192.168.123.149
    User-Agent: HomeAssistant/2024.3.1 aiohttp/3.9.3 Python/3.12
    Accept: */*
    Accept-Encoding: gzip, deflate, br
    Content-Length: 246
    Content-Type: text/plain; charset=utf-8

    {"header":{"messageId":"bd4c59fa13a8440d83ab9b441a3083b0","namespace":"Appliance.System.All","method":"GET","payloadVersion":1,"from":"Meross","timestamp":1710785413,"timestampMs":0,"sign":"513345e3c9ef67d25313af542d8065b5"},"payload":{"all":{}}}

 << server closed connection
[18:10:14.442][192.168.123.200:57882] server disconnect 192.168.123.149:80
[18:10:14.443][192.168.123.200:57882] client disconnect

Previous successful poll output for comparison with only one client connect:

[18:10:02.789][192.168.123.200:45896] client connect
[18:10:02.849][192.168.123.200:45896] server connect 192.168.123.149:80
192.168.123.200:45896: POST http://192.168.123.149/config
    Host: 192.168.123.149
    User-Agent: HomeAssistant/2024.3.1 aiohttp/3.9.3 Python/3.12
    Accept: */*
    Accept-Encoding: gzip, deflate, br
    Content-Length: 246
    Content-Type: text/plain; charset=utf-8

    {"header":{"messageId":"fa6cb9a1b93e405096bfd3e4b41c8735","namespace":"Appliance.System.All","method":"GET","payloadVersion":1,"from":"Meross","timestamp":1710785402,"timestampMs":0,"sign":"1ddbbfe0adf2d0aa1276f09de54461d1"},"payload":{"all":{}}}

 << 200 OK 1.2k
    Content-Type: application/json
    Connection: close

    {
        [JSON DELETED]
    }

[18:10:03.052][192.168.123.200:45896] server disconnect 192.168.123.149:80
[18:10:03.055][192.168.123.200:45896] client disconnect
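As an aside on the captured payloads: the `sign` field in these headers is, per community reverse-engineering of the Meross protocol (not an official spec), an MD5 over `messageId + key + timestamp`, where `key` is the empty string for devices paired with the 'hack' empty key. A sketch:

```python
import hashlib


def meross_sign(message_id: str, key: str, timestamp: int) -> str:
    # Community-documented Meross message signature: MD5 of the
    # concatenation messageId + key + timestamp. Treat as a sketch based
    # on reverse-engineered behavior, not an official specification.
    return hashlib.md5(f"{message_id}{key}{timestamp}".encode()).hexdigest()


# messageId and timestamp taken from the failing request logged above;
# the key ("" here) is whatever the device was paired with.
sign = meross_sign("bd4c59fa13a8440d83ab9b441a3083b0", "", 1710785413)
print(sign)
```

If the computed value matches the `sign` in the capture, the request was built with an empty key; a mismatch would point at a key/signature problem rather than a connection-handling one.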