technyon / nuki_hub

Use an ESP32 as a Hub between a NUKI Lock and your smarthome.
MIT License
523 stars 40 forks source link

Inconsistent lock binaryState/state in MQTT #51

Closed alexdelprete closed 1 year ago

alexdelprete commented 1 year ago

Today I noticed that lock sensor in HA was reporting unlocked when Nuki was locked, I checked MQTT and noticed this:

image

You will notice that lock/binaryState is unlocked but lock/state is locked.

I restarted Nuki Hub to force it to reset the states from the lock and everything was good again.

BTW: is it possible to add a restart button in the UI? right now I go into settings and hit save to force it to restart)

technyon commented 1 year ago

Thanks, I'll look into it.

alexdelprete commented 1 year ago

Thanks to you.

Another weird thing, that might be related: I found some strange topics in MQTT that were similar to nuki_hub topics, but with parts missing: _hub_hub with battery info, then I found another weird topic, I made a screenshot yesterday but can't find it...I'll'attach it as soon as I find it. It's like there was some problem in sending data to MQTT from Nuki Hub. It only happened 2 times...I'm on latest version of the firmware.

UPDATE: found the screenshot. :)

image

mundschenk-at commented 1 year ago

Now that you mention it, I have several of those too. Looks like there is some memory corruption going on somewhere (which would also explain why Mosquitto complains about malformed messages every once in a while). Screenshot MQTT topics

alexdelprete commented 1 year ago

I hope it's a sw bug that will be solved. The inconsistency in the status of the Lock I noticed could also depend on this.

technyon commented 1 year ago

There's nothing obvious, the state and binaryState are always updated together. If it's an MQTT problem, it'll get more complicated. I don't have any such malformed topics, but I'm not using mosquitto. As far as I can see the correct topic is passed to the MQTT stack (pubsubclient), what happens from there I don't know since it's not my code.

mundschenk-at commented 1 year ago

Is it possible that memory on the ESP32 is getting corruped, maybe by a parallel thread? (I am thinking of the WiFi stack in particular, as that would be something you would not experience on your own device?)

technyon commented 1 year ago

I've reviewed the code building MQTT paths. It's a rather simple method, still there were some stupid mistakes which should be fixed now, you can try release 6.1. If this fixes the actual issue with the binary state I don't know, please test.

alexdelprete commented 1 year ago

I don't have any such malformed topics, but I'm not using mosquitto.

I don't use mosquitto either, I use EMQX v5. So same issue with 2 different brokers.

there were some stupid mistakes which should be fixed now, you can try release 6.1. If this fixes the actual issue with the binary state I don't know, please test.

That's what I hoped: some sw bug that could easily be solved. Thank you.

Strange that it did it never occur to you, and also strange why did it occur to me only after weeks of using Nuki Hub. Usually these stupid bugs trigger more often. :)

If this fixes the actual issue with the binary state I don't know, please test.

That inconsistency condition only happened 1 time since I installed Nuki Hub, so I will monitor things and report back in case I observe it again. What allows me to be optimistic is that I observed those malformed topics and the inconsistency in the same time-frame.

Will keep you poste, thank you for reacting so quickly.

technyon commented 1 year ago

After fixing those errors, I'm somewhat surprised that it did work before. I'm using iobrokers built-in MQTT broker, maybe it handles garbled messages/paths better than mosquitto.

alexdelprete commented 1 year ago

I'm using iobrokers built-in MQTT broker, maybe it handles garbled messages/paths better than mosquitto.

I think you have to follow a strict protocol/spec in order for an MQTT broker to accept data, so if the integrated broker is a true MQTT broker, it means it received the topics malformed, but correct from an MQTT spec perspective, and so it should process them as EMQX and Mosquitto did.

The fact that it worked before also for me for almost a month will remain a magic mistery. :)

I have some other issues with the web interface (also with previous 6.0 fw), pages render with no formatting often, if I refresh a couple of times I see the formatting (black background, etc.), but sometimes it throws an http error (a connection reset error) then after 2-3s it appears again. It seems like http connections are not managed very well...that's the feeling.

Thanks a lot for your support.

technyon commented 1 year ago

Probably related to issue #48, let's follow up there.

giejay commented 1 year ago

Im still seeing this issue:

front-door/lock/state unlocked
front-door/lock/binaryState locked

Using 6.1

alexdelprete commented 1 year ago

Im still seeing this issue:

didn't occur to me yet, after 6.1. But I used nuki_hub for quite some time without the issue, then one day it happened...when I started seeing malformed topics in mqtt.

Do you have malformed topics? open all nuki_hub subtopics to check for weird names...

technyon commented 1 year ago

Let's keep in mind that the malformed topics aren't necessarily related to the incosistent states.

giejay commented 1 year ago

For me it seems te be related to having multiple nuki locks: https://github.com/technyon/nuki_hub/issues/53 I have moved one back to another project and now I don't see this anymore.

alexdelprete commented 1 year ago

Let's keep in mind that the malformed topics aren't necessarily related to the incosistent states.

Correct, we don't know, that's why I asked, so we can check for patterns.

mundschenk-at commented 1 year ago

I have updated my device to 6.1 and cleaned out all the messed up topics using mosquitto_pub. After a few days, it looks like new broken topics have appeared. Is it possible there is still an issue somewhere else in the MQTT code?

technyon commented 1 year ago

The question is where this happens. As far as I can tell, the paths should be built correctly now. After that comes the MQTT library. Just to verify everything is correct on my side, I've built a version that outputs all paths to the serial console.

nuki_hub-mqtt-dbg-1.zip

alexdelprete commented 1 year ago

I have updated my device to 6.1 and cleaned out all the messed up topics using mosquitto_pub. After a few days, it looks like new broken topics have appeared. Is it possible there is still an issue somewhere else in the MQTT code?

Since 6.1 I haven't seen malformed topics, but I guess it's too soon to say it's fixed. I'll keep monitoring it and keep you posted.

mundschenk-at commented 1 year ago

Unfortunately, for me they are still there. To make sure they are not artifacts, I cleaned up again after posting https://github.com/technyon/nuki_hub/issues/51#issuecomment-1336360175 and this is a screenshot of today: 055F71C7-B800-415A-B7FD-BC2AD1C39B4B

Mincka commented 1 year ago

Few other examples from 6.7: image I am cleaning everything again. Could it be related to automated or triggered restarts?

alexdelprete commented 1 year ago

I am cleaning everything again. Could it be related to automated or triggered restarts?

So the bug is still there. I just checked mine (6.8), and very interestingly, the same Topic is malformed, can't be a coincidence:

image

I also noticed this rssice topic with a child topic...doesn't seem normal:

image

Mincka commented 1 year ago

I attempted to find issues in the code but the only thing I found that could cause memory issues is a RSSI value that would not be not integer (not even sure that this possible). The itoa() would cause unexpected results. https://github.com/technyon/nuki_hub/blob/master/Network.cpp#L376

Or maybe it's another bug in the Network::buildMqttPath function.

It would mean that checking for end of strings with "0x00" would not always work or that the fact that the generated path itself would cause issues because its array is too big and padded with "0x00". I am not familiar with this method of initializing string arrays with 0x00 so sorry if I say stupid things. https://github.com/technyon/nuki_hub/blame/master/Network.cpp#L303

mundschenk-at commented 1 year ago

@Mincka In C (and derived languages like C++), strings are pointers to zero-terminated arrays of characters.

technyon commented 1 year ago

It's a mystery to me why this is happening. There's nothing too special about the rssi node, it's published via the publishInt() method, which is used by for a few other nodes too (batteryDrain, lockDistance, ledBrightness). Maybe the bug is triggered more easily with the rssi node because it's constantly updated ... same goes for wifiRssi and presenceDetection though.

mundschenk-at commented 1 year ago

I have the issue with other nodes as well (see previous screenshots). It is not limited to any single topic.

Mincka commented 1 year ago

@technyon Do you have this issue yourself? Maybe we have something else in common that can be causing this. @giejay was thinking about something related to multiple locks. I have two locks with one ESP M5 Atom per lock. Do you have multiple locks @mundschenk-at and @alexdelprete ?

mundschenk-at commented 1 year ago

Only one lock. Also, the ESP can only be paired to one lock anyway.

Mincka commented 1 year ago

I just saw this in the log: image The ESP rebooted after the configured 30 minutes restart. I'm thinking about a corrupted MQTT command because the ESP is restarted during the publishing of a value. It would happen more frequently with wifiRssi than others because there are more chances that every 30 minutes, when the device is rebooting, it's also updating wifiRssi.

I disabled Restart timer on both devices and will report back if I still have the issue.

alexdelprete commented 1 year ago

Do you have multiple locks @mundschenk-at and @alexdelprete ?

only one lock.

what about external libraries? how does the sw interface the mqtt broker? could it be a bug in the library used for that scope?

alexdelprete commented 1 year ago

It is not limited to any single topic.

true. it's not the same topic. I found 4-5 different topics and subtopics malformed.

technyon commented 1 year ago

NUKI Hub uses the wiedely used PubSubClient to act as an MQTT client.

alexdelprete commented 1 year ago

Does it use TCP to communicate with the broker? If so, corruption can't happen at that level, so it must be the buffer/parameters being passed to PubSubClient originally that sometimes is malformed. The problem is that you can't reproduce the issue to debug it. :(

Mincka commented 1 year ago

@alexdelprete I don't know if you have the automated restart schedule but if this the case, you can try to disable it, delete your MQTT topics and see if it's better after a few days.

technyon commented 1 year ago

The corruption could happen at several layers, I think it's unlikely though that it happens in PubSubClient, the Wifi or TCP stack, that code is used in many other projects, I think it would be fixed by now. Maybe it makes sense to log the topic on each publish to see if anything malformed is passed to PubSubClient, I can build a binary.

alexdelprete commented 1 year ago

I don't know if you have the automated restart schedule

Nope, I have it disabled (-1).

Maybe it makes sense to log the topic on each publish to see if anything malformed is passed to PubSubClient, I can build a binary.

Sounds like a very good idea to understand what is going on.

technyon commented 1 year ago

Here you go. It logs all published topics, only to the serial line though in order to not interfere with MQTT. Not sure how helpful it'll be, because it's a lot of output.

nuki_hub-6.9-log-topics.zip

alexdelprete commented 1 year ago

ouch! on serial only? is it possible to write to a network share or a syslog?

alexdelprete commented 1 year ago

I still have the issue: the malformed topic locntenance seems like a weird merge of the lock topic and the maintenance topic.

image

mundschenk-at commented 1 year ago

@alexdelprete It's a form of memory corruption, so the mashup is expected.

technyon commented 1 year ago

If anyone wants to add their own logs, I've uploaded a virtual machine that can compile the firmware. Have a look at the end of the readme.

alexdelprete commented 1 year ago

@alexdelprete It's a form of memory corruption, so the mashup is expected.

memory corruption? I don't think it's a hw bug. But we should agree on what memory corruption is: being a little old, I tend to think about hw issue when referring to memory corruption.

I think it's a sw issue, not a hw issue, also because it happens on different kind of devices.

technyon commented 1 year ago

I think we're talking about a software problem here. In C++ it's easy to write to a memory location by accident, this would be what you refer to as memory corruption. In this case it looks like bytes are changed somehow at the memory location where the MQTT topic is stored.

alexdelprete commented 1 year ago

I agree Jan, it looks like a classic pointer issue or buffer overflow issue of C++. But I guess you already triple-checked where the topic is created and managed. That's why I was thinking about the library/component more than your code. But since it's a well established component, like you said, it would have come out in other projects.

mundschenk-at commented 1 year ago

I didn't intend to imply a hardware issue.

technyon commented 1 year ago

Yes I double and triple checked the code. The most likely candidate would be buildMqttPath() in Network.cpp, that's where all the components of the path are concatenated. I don't see anything going wrong there though. It seems anyway that something else is writing there. For example "maintenance" becomes "locntenance", so something must have written "loc" to that memory location. Why that happens is a mystery.

mundschenk-at commented 1 year ago

Any possibility of reentrancy issues? (I really don't know much about ESP32 programming.)

technyon commented 1 year ago

Actually yes, still reading:

https://github.com/knolleary/pubsubclient/issues/832

Could indeed be a library problem. As far as I understand if a an incoming messages is received while publishing, it can corrupt memory.

technyon commented 1 year ago

I've added the check from the first post in that topic. Anyone want to give it a try?

nuki_hub-6.11-tryfix-mqtt-1.zip

mundschenk-at commented 1 year ago

I've cleaned out the MQTT topics again and installed the new firmware. Hopefully, there will be no corrupted topics next week.