krahabb / meross_lan

Home Assistant integration for Meross devices
MIT License
416 stars 45 forks source link

Intermittent "Meross device key error"s resulting in loss of control #59

Closed SamboyCoding closed 3 years ago

SamboyCoding commented 3 years ago

Version of the custom_component

v2.1.1

Configuration


No configuration written in YAML by me, all done via the home assistant integrations UI.

Describe the bug

I have 6 MSS310 sockets setup via this integration, using your custom fork of the Mosquitto addon for home assistant supervisor. They all work fine 99% of the time, however, after approximately 24 hours of being on, I get, in the log, the following:

WARNING (MainThread) [custom_components.meross_lan] Meross device key error for device_id: 19121809795350251h2848e1e913e299

Which results in home assistant reporting the devices as unavailable for few minutes. However, the issue usually clears itself up, and the entities return to an "available" state, that is, on or off for the switches, and a reasonable value for the sensors.

However, I lose control of the switches at this point. Attempting to toggle the devices off, or switch DNDmode, results in no action being taken on the switch and the toggle returning to the previous state after about 2 seconds.

Debug log

DEBUG (MainThread) [custom_components.meross_lan] MerossApi: MQTT RECV device_id:(19121884072170251h2848e1e913e4ad) method:(ERROR) namespace:(Appliance.Control.ConsumptionX)
DEBUG (MainThread) [custom_components.meross_lan] MerossApi: mqtt_receive exception:('NoneType' object is not iterable) 

Of note immediately though, is the following, for each of my sockets. Not sure if this is intentional:

DEBUG (MainThread) [custom_components.meross_lan] MerossHttpClient(xxx.1.0.89): HTTP Response ({"header":{"messageId":"320990a1652946e8a9a2d08591536617","namespace":"Appliance.System.DNDMode","method":"ERROR","payloadVersion":1,"from":"/appliance/19121809795350251h2848e1e913e299/publish","timestamp":1628593827,"timestampMs":814,"sign":"6bbfac654c4f1d4461f34398c9e72cf8"},"payload":{"error":{"code":5001,"detail":"sign error"}}})
DEBUG (MainThread) [custom_components.meross_lan] Key error on xxx.1.0.89 (GET:Appliance.System.DNDMode) -> retrying with key-reply hack

followed then by

DEBUG (MainThread) [custom_components.meross_lan] MerossHttpClient(xxx.1.0.89): HTTP Response ({"header":{"messageId":"0bef25fee2fc43f29f52d88d46245fe6","namespace":"Appliance.System.All","method":"GETACK","payloadVersion":1,"from":"/appliance/19121809795350251h2848e1e913e299/publish","timestamp":1628593827,"timestampMs":524,"sign":"cf4325533324f07edeb5bf836054ec5a"},"payload":{"all":{"system":{"hardware":{"type":"mss310","subType":"un","version":"2.0.0","chipType":"mt7682","uuid":"19121809795350251h2848e1e913e299","macAddress":"48:e1:e9:xx:xx:xx"},"firmware":{"version":"2.1.16","compileTime":"2020/09/28 15:51:07 GMT +08:00","wifiMac":"2c:b0:5d:xx:xx:xx","innerIp":"xxx.1.0.89","server":"xxx.1.0.55","port":8883,"userId":0},"time":{"timestamp":1628593827,"timezone":"","timeRule":[]},"online":{"status":1}},"digest":{"togglex":[{"channel":0,"onoff":1,"lmTime":1628586000}],"triggerx":[],"timerx":[]}}}})
DEBUG (MainThread) [custom_components.meross_lan] MerossDevice(19121809795350251h2848e1e913e299) back online!
SamboyCoding commented 3 years ago

Okay it happened again, the first debug log (mqtt_receive exception) is the one I think is causing the issue - it's happening at about the right time, at least.

krahabb commented 3 years ago

Hello @SamboyCoding, It all looks very strange! The real issue I think is about the 'key error' in the beginning. The mqtt_receive exception is an indirect consequence since my parsing code is not robust enough to manage anything but correct messages (It fails when parsing the payload for the 'ERROR' message received from the plug, the one stating the key/sign is incorrect).

I think the issue is very subtle due to the fact that the protocol key is not setup in (meross_lan) configuration. If you leave 'key' -> empty and 'protocol' -> default when configuring any meross device (which is the default config and should be ok most of the time) this is what is happening: meross_lan tries to communicate over MQTT but when any issue arise (and the incorrect key setup might be the reason) it fails-over to HTTP using the 'key-reply hack' which makes everything working seemingly correctly

I would first direct you to configure the proper key in each meross_lan device (in the integration UI) if you used a non-empty key when you paired them to the MQTT broker. I suggest you to leave the 'protocol' configuration to deafult/auto instead since it usually doesnt pose any issue and allows the integration to fail-over when the MQTT is down for any good reason.

I will try review the parsing code for that 'mqtt_receive' exception which by the look of it shouldnt be harming: it is an error in the expected logic flow and there's an exception handler in place reporting this event to log.

Let's see if it's only a key configuration issue or not.

You can get a better trace/log by activating the feature in meross_lan integration configuration UI. Pick any of the problem devices and activate the trace-debug from the UI. You'll have 10 minutes of message dump right in the 'traces' subdir of meross_lan component. If my resoning is correct, there you should see a lot of HTTP message flow and only 'rare' MQTT attempts

krahabb commented 3 years ago

Quick update: (not even close to a solution)

The first warning 'Meross device key error for device_id:' is only logged when you explicitly set a 'key' in UI config so forget (in part) my previous consideration(s). This log comes from my code checking the key/sign against the received payload so either the payload is corrupt or the local (meross_lan) key is wrong

I thought the key-reply-hack over HTTP was only enabled with a null 'key' while in-fact it is always in place (i.e. when an HTTP command fails it retries with the hack always) and this explains the HTTP messages flowing after.

I see 2 different device ids in your logs:

I'll further review my code to see if something is 'polluting' the keys or so. By the look of it this is not easy to analyze. I suggest you either send me a full detailed DEBUG log with original sequence and timings (I mean full of meross_lan logs:) or at least 1 device trace (even if the error is not showing up).

You could also try to force 'HTTP' mode for one of the plugs and see if it behaves the same as others. I'll double check the exception handling for mqtt_receive to see if it is maybe a possible source of issues but I don't think so at the moment

krahabb commented 3 years ago

P.S. Do you see any possible pattern in the timing at which the issue occur? You stated it occured after around 24 hours of being on: do you mean after their configuration/HAss restart? Also, if you restart HA do they work correctly for another 24 hours? or maybe it's a 'point in time' like around noon or so

SamboyCoding commented 3 years ago

There is no particular time, no. It's random, it's not always 24 hours - this time I enabled logging, rebooting HA, and it occurred again in less than 12 hours (12pm-9pm). I've had it occur around 7am, 3am, 9pm, it's seemingly random.

Also, what do you mean by manually setting the local keys? I didn't do that, when configuring them, nor did I even know it was an option (though I do have it explicitly set in the integration config - should I just remove that? Is that the problem here?)

krahabb commented 3 years ago

It might be: When you use the tool to re-pair the devices to a local mqtt you can specify the 'key' used to sign protocol headers (this is not evident but I've learned here a lot of users prefer to setup 'their' private key to strengthen a bit the configuration) If you don't, the device is configured with an empty key (still the signing stuff is working but the key is simple enough).

My code works 'by default' with an empty key too so in the integration UI you should leave it empty in order to correctly talk to this re-paired devices (unless their key was set differently).

You set this key (for every device can be set independently - a bit overkill:) in the integration configuration UI for the device.

Now to a possible explanation: If the real (configured) key of the device doesn't match the meross_lan configuration what happens is that meross_lan tries to talk MQTT to the device but it fails so it switches to HTTP. When talking HTTP meross_lan tries (and is mostly able) to overcome any signing difficulty by applying an hack: so , if your keys dont match in configuration, HTTP could be able to talk anyway. That's why, if you configure a mismatched key meross_lan is probably working without issues (unless you inspect the log where you'll find some warnings here and there)

All good, even if not perfect everything seems to work but in reality something could happen: sometimes the code will try to switch back to MQTT protocol (because it is better to use!) and will experience the key problem so, eventually, it will switch back to HTTP and keep going on. This automatic switch, though I've tried to make it really robust, is always a bit 'random' to say the least: in fact, timeouts, latencies or whatever could make things work or not in few seconds.

Not only this. I've noticed the HTTP protocol is failing relatively often in my environment (I don't get any response from the device). I still haven't figured out any reason for this, since it happens like this: you query the device, it timeouts, you requery immediately, it works. So it is probably a device firmware issue (blaming someone else here :) When this failure happens, my code tries to switch to MQTT if it recognizes it could work and so it might enter a dead loop where it keep switching back and forth until something goes right.

Now, anyway, there's an heartbeat inside meross_lan which sooner or later should 'regain' communication with the devices so even if they fall offline because of communication issues they should be back automatically. Like I've said, the HTTP failures are really 'instant' and the device should be able to respond right away. But again, with timing, latencies and automatic failover you could sometimes fall away.

So to recap: if you configured any 'key' in meross_lan remove it (from every device) so you'll be synced with the real 'pairing configuration'. This way we should be able to remove the major cause of failure

SamboyCoding commented 3 years ago

Okay, I've cleared all the keys out of the meross_lan integrations, as I didn't set any on the devices. I'll give it some time and if the issue doesn't reoccur, will close this issue. Thanks for the help and the amazing writeups on what's going on behind the scenes!

SamboyCoding commented 3 years ago

It's not since happened, so I'm going to call this user error on my part. Thanks for the help!

krahabb commented 3 years ago

Thank you, I'm glad we figured this out. I was very scared in the beginning;)