Open davydnorris opened 5 years ago
OK I've now looked at the code and a race condition is occurring when QOS > 0 and the network latency is larger than the time between publish commands. If you publish several messages in quick succession then this doesn't have to be very large at all, and in fact my code is only working by good luck rather than on purpose - if I was publishing several messages with different QOSs then this would break very quickly, and right now if a message fails to publish it will likely report the wrong one has failed.
The problem occurs because the MQTT client only maintains one pending message - when you publish more than one message with QOS > 0 close together, you will end up with more than one message in a pending state waiting for an ACK and you lose the state for all but the last message published.
This is a pretty big architectural/design issue - to solve this you'd need to either:
Either way it's quite a big change to the code.
Turns out the library has a problem with any command that requires a reply. If the timing between messages isn't right you lose all but the last message requiring a response.
The rewritten MQTT in the current RTOS SDK is really neat, and they've implemented an outbox for messages, which is a nice solution, but there's no corresponding version for NonOS.
So I'm currently in the process of porting the new library over
@davydnorris Have you been able to get this fixed, I was wondering whether you had a repo with a fix as I am experiencing the same issue.
@maverickchongo - no, at this point I have given up on this repo as it seems to no longer be actively maintained. I pulled down the ESP32 and ESP8266 RTOS MQTT code and got it to the point where it was almost working with NonOS but they have used the lwip socket API and blocking calls for the transport, which is actually a really badly performing approach.
So I have temporarily gone back to QoS 0 to get my product out the door, and then I am going to write my own completely transport agnostic MQTT based on ideas from the other libraries out there, and on messaging protocol best practice. It will use a non blocking transport layer that works with a generic inbox and outbox. These will both be processed independently using event queues and loops - that will mean both RTOS and NonOS can use the same library.
@davydnorris Just ran into this issue myself. Will your new project be open source? If so I'd be interested in helping out and/or testing for NON-OS
@someburner it absolutely will be open, and I'd love any help offered. Just got to get my sensor box out on some poles for a client and then I'll be looking at this.
The sources I've pulled to date are this library, the new RTOS and IDF versions (both ESP8266 and ESP32), the Eclipse paho sources, and the WolfMQTT sources.
My architectural goals for the new library are:
Of course, if this lib could easily be hacked to include an inbox/outbox so that QoS 1 and 2 could be implemented properly then that may do us for now, but there's a lot that is good about the new IDF implementation
I'm currently tracking down a problem with publishing and I think I've hit a possible race condition in the send where a second publish message sent at the wrong time will stop the callbacks from the first happening. This is happening for QOS > 0 - when I look I can see the data from the first send actually makes it to the platform, but it appears the code for the rest of the QOS1 or 2 transaction never happens.
I am trying to do transaction counting in order to work out when I can sleep and if I leave the unit running it all seems fine, which is a bit strange because initially I send 3-4 publish messages all at once and it works fine. These gradually become separated over time and it's when they start to drift apart that they mess each other up. The big problem for me is that the publish callback is not getting called and so my transaction count is not being decremented properly.
I'll capture a log and attach.