Closed alexdelprete closed 1 year ago
Thanks, I'll look into it.
Thanks to you.
Another weird thing, that might be related: I found some strange topics in MQTT that were similar to nuki_hub topics, but with parts missing: _hub_hub with battery info, then I found another weird topic, I made a screenshot yesterday but can't find it...I'll'attach it as soon as I find it. It's like there was some problem in sending data to MQTT from Nuki Hub. It only happened 2 times...I'm on latest version of the firmware.
UPDATE: found the screenshot. :)
Now that you mention it, I have several of those too. Looks like there is some memory corruption going on somewhere (which would also explain why Mosquitto complains about malformed messages every once in a while).
I hope it's a sw bug that will be solved. The inconsistency in the status of the Lock I noticed could also depend on this.
There's nothing obvious, the state and binaryState are always updated together. If it's an MQTT problem, it'll get more complicated. I don't have any such malformed topics, but I'm not using mosquitto. As far as I can see the correct topic is passed to the MQTT stack (pubsubclient), what happens from there I don't know since it's not my code.
Is it possible that memory on the ESP32 is getting corruped, maybe by a parallel thread? (I am thinking of the WiFi stack in particular, as that would be something you would not experience on your own device?)
I've reviewed the code building MQTT paths. It's a rather simple method, still there were some stupid mistakes which should be fixed now, you can try release 6.1. If this fixes the actual issue with the binary state I don't know, please test.
I don't have any such malformed topics, but I'm not using mosquitto.
I don't use mosquitto either, I use EMQX v5. So same issue with 2 different brokers.
there were some stupid mistakes which should be fixed now, you can try release 6.1. If this fixes the actual issue with the binary state I don't know, please test.
That's what I hoped: some sw bug that could easily be solved. Thank you.
Strange that it did it never occur to you, and also strange why did it occur to me only after weeks of using Nuki Hub. Usually these stupid bugs trigger more often. :)
If this fixes the actual issue with the binary state I don't know, please test.
That inconsistency condition only happened 1 time since I installed Nuki Hub, so I will monitor things and report back in case I observe it again. What allows me to be optimistic is that I observed those malformed topics and the inconsistency in the same time-frame.
Will keep you poste, thank you for reacting so quickly.
After fixing those errors, I'm somewhat surprised that it did work before. I'm using iobrokers built-in MQTT broker, maybe it handles garbled messages/paths better than mosquitto.
I'm using iobrokers built-in MQTT broker, maybe it handles garbled messages/paths better than mosquitto.
I think you have to follow a strict protocol/spec in order for an MQTT broker to accept data, so if the integrated broker is a true MQTT broker, it means it received the topics malformed, but correct from an MQTT spec perspective, and so it should process them as EMQX and Mosquitto did.
The fact that it worked before also for me for almost a month will remain a magic mistery. :)
I have some other issues with the web interface (also with previous 6.0 fw), pages render with no formatting often, if I refresh a couple of times I see the formatting (black background, etc.), but sometimes it throws an http error (a connection reset error) then after 2-3s it appears again. It seems like http connections are not managed very well...that's the feeling.
Thanks a lot for your support.
Probably related to issue #48, let's follow up there.
Im still seeing this issue:
front-door/lock/state unlocked
front-door/lock/binaryState locked
Using 6.1
Im still seeing this issue:
didn't occur to me yet, after 6.1. But I used nuki_hub for quite some time without the issue, then one day it happened...when I started seeing malformed topics in mqtt.
Do you have malformed topics? open all nuki_hub subtopics to check for weird names...
Let's keep in mind that the malformed topics aren't necessarily related to the incosistent states.
For me it seems te be related to having multiple nuki locks: https://github.com/technyon/nuki_hub/issues/53 I have moved one back to another project and now I don't see this anymore.
Let's keep in mind that the malformed topics aren't necessarily related to the incosistent states.
Correct, we don't know, that's why I asked, so we can check for patterns.
I have updated my device to 6.1 and cleaned out all the messed up topics using mosquitto_pub
. After a few days, it looks like new broken topics have appeared. Is it possible there is still an issue somewhere else in the MQTT code?
The question is where this happens. As far as I can tell, the paths should be built correctly now. After that comes the MQTT library. Just to verify everything is correct on my side, I've built a version that outputs all paths to the serial console.
I have updated my device to 6.1 and cleaned out all the messed up topics using
mosquitto_pub
. After a few days, it looks like new broken topics have appeared. Is it possible there is still an issue somewhere else in the MQTT code?
Since 6.1 I haven't seen malformed topics, but I guess it's too soon to say it's fixed. I'll keep monitoring it and keep you posted.
Unfortunately, for me they are still there. To make sure they are not artifacts, I cleaned up again after posting https://github.com/technyon/nuki_hub/issues/51#issuecomment-1336360175 and this is a screenshot of today:
Few other examples from 6.7: I am cleaning everything again. Could it be related to automated or triggered restarts?
I am cleaning everything again. Could it be related to automated or triggered restarts?
So the bug is still there. I just checked mine (6.8), and very interestingly, the same Topic is malformed, can't be a coincidence:
I also noticed this rssice
topic with a child topic...doesn't seem normal:
I attempted to find issues in the code but the only thing I found that could cause memory issues is a RSSI value that would not be not integer (not even sure that this possible). The itoa() would cause unexpected results. https://github.com/technyon/nuki_hub/blob/master/Network.cpp#L376
Or maybe it's another bug in the Network::buildMqttPath
function.
It would mean that checking for end of strings with "0x00" would not always work or that the fact that the generated path itself would cause issues because its array is too big and padded with "0x00". I am not familiar with this method of initializing string arrays with 0x00 so sorry if I say stupid things. https://github.com/technyon/nuki_hub/blame/master/Network.cpp#L303
@Mincka In C (and derived languages like C++), strings are pointers to zero-terminated arrays of characters.
It's a mystery to me why this is happening. There's nothing too special about the rssi node, it's published via the publishInt() method, which is used by for a few other nodes too (batteryDrain, lockDistance, ledBrightness). Maybe the bug is triggered more easily with the rssi node because it's constantly updated ... same goes for wifiRssi and presenceDetection though.
I have the issue with other nodes as well (see previous screenshots). It is not limited to any single topic.
@technyon Do you have this issue yourself? Maybe we have something else in common that can be causing this. @giejay was thinking about something related to multiple locks. I have two locks with one ESP M5 Atom per lock. Do you have multiple locks @mundschenk-at and @alexdelprete ?
Only one lock. Also, the ESP can only be paired to one lock anyway.
I just saw this in the log:
The ESP rebooted after the configured 30 minutes restart. I'm thinking about a corrupted MQTT command because the ESP is restarted during the publishing of a value. It would happen more frequently with wifiRssi
than others because there are more chances that every 30 minutes, when the device is rebooting, it's also updating wifiRssi
.
I disabled Restart timer on both devices and will report back if I still have the issue.
Do you have multiple locks @mundschenk-at and @alexdelprete ?
only one lock.
what about external libraries? how does the sw interface the mqtt broker? could it be a bug in the library used for that scope?
It is not limited to any single topic.
true. it's not the same topic. I found 4-5 different topics and subtopics malformed.
NUKI Hub uses the wiedely used PubSubClient to act as an MQTT client.
Does it use TCP to communicate with the broker? If so, corruption can't happen at that level, so it must be the buffer/parameters being passed to PubSubClient originally that sometimes is malformed. The problem is that you can't reproduce the issue to debug it. :(
@alexdelprete I don't know if you have the automated restart schedule but if this the case, you can try to disable it, delete your MQTT topics and see if it's better after a few days.
The corruption could happen at several layers, I think it's unlikely though that it happens in PubSubClient, the Wifi or TCP stack, that code is used in many other projects, I think it would be fixed by now. Maybe it makes sense to log the topic on each publish to see if anything malformed is passed to PubSubClient, I can build a binary.
I don't know if you have the automated restart schedule
Nope, I have it disabled (-1).
Maybe it makes sense to log the topic on each publish to see if anything malformed is passed to PubSubClient, I can build a binary.
Sounds like a very good idea to understand what is going on.
Here you go. It logs all published topics, only to the serial line though in order to not interfere with MQTT. Not sure how helpful it'll be, because it's a lot of output.
ouch! on serial only? is it possible to write to a network share or a syslog?
I still have the issue: the malformed topic locntenance
seems like a weird merge of the lock topic and the maintenance topic.
@alexdelprete It's a form of memory corruption, so the mashup is expected.
If anyone wants to add their own logs, I've uploaded a virtual machine that can compile the firmware. Have a look at the end of the readme.
@alexdelprete It's a form of memory corruption, so the mashup is expected.
memory corruption? I don't think it's a hw bug. But we should agree on what memory corruption is: being a little old, I tend to think about hw issue when referring to memory corruption.
I think it's a sw issue, not a hw issue, also because it happens on different kind of devices.
I think we're talking about a software problem here. In C++ it's easy to write to a memory location by accident, this would be what you refer to as memory corruption. In this case it looks like bytes are changed somehow at the memory location where the MQTT topic is stored.
I agree Jan, it looks like a classic pointer issue or buffer overflow issue of C++. But I guess you already triple-checked where the topic is created and managed. That's why I was thinking about the library/component more than your code. But since it's a well established component, like you said, it would have come out in other projects.
I didn't intend to imply a hardware issue.
Yes I double and triple checked the code. The most likely candidate would be buildMqttPath() in Network.cpp, that's where all the components of the path are concatenated. I don't see anything going wrong there though. It seems anyway that something else is writing there. For example "maintenance" becomes "locntenance", so something must have written "loc" to that memory location. Why that happens is a mystery.
Any possibility of reentrancy issues? (I really don't know much about ESP32 programming.)
Actually yes, still reading:
https://github.com/knolleary/pubsubclient/issues/832
Could indeed be a library problem. As far as I understand if a an incoming messages is received while publishing, it can corrupt memory.
I've added the check from the first post in that topic. Anyone want to give it a try?
I've cleaned out the MQTT topics again and installed the new firmware. Hopefully, there will be no corrupted topics next week.
Today I noticed that lock sensor in HA was reporting unlocked when Nuki was locked, I checked MQTT and noticed this:
You will notice that
lock/binaryState
isunlocked
butlock/state
islocked
.I restarted Nuki Hub to force it to reset the states from the lock and everything was good again.
BTW: is it possible to add a restart button in the UI? right now I go into settings and hit save to force it to restart)