tuanpmt / esp_mqtt

MQTT client library for ESP8266
http://tuanpm.net/post/esp_mqtt/
MIT License
1.14k stars 400 forks source link

Race condition in all commands that expect a reply (sub/unsub/publish w QoS >0) #164

Open davydnorris opened 5 years ago

davydnorris commented 5 years ago

I'm currently tracking down a problem with publishing and I think I've hit a possible race condition in the send where a second publish message sent at the wrong time will stop the callbacks from the first happening. This is happening for QOS > 0 - when I look I can see the data from the first send actually makes it to the platform, but it appears the code for the rest of the QOS1 or 2 transaction never happens.

I am trying to do transaction counting in order to work out when I can sleep and if I leave the unit running it all seems fine, which is a bit strange because initially I send 3-4 publish messages all at once and it works fine. These gradually become separated over time and it's when they start to drift apart that they mess each other up. The big problem for me is that the publish callback is not getting called and so my transaction count is not being decremented properly.

I'll capture a log and attach.

davydnorris commented 5 years ago

OK I've now looked at the code and a race condition is occurring when QOS > 0 and the network latency is larger than the time between publish commands. If you publish several messages in quick succession then this doesn't have to be very large at all, and in fact my code is only working by good luck rather than on purpose - if I was publishing several messages with different QOSs then this would break very quickly, and right now if a message fails to publish it will likely report the wrong one has failed.

The problem occurs because the MQTT client only maintains one pending message - when you publish more than one message with QOS > 0 close together, you will end up with more than one message in a pending state waiting for an ACK and you lose the state for all but the last message published.

This is a pretty big architectural/design issue - to solve this you'd need to either:

Either way it's quite a big change to the code.

davydnorris commented 5 years ago

Turns out the library has a problem with any command that requires a reply. If the timing between messages isn't right you lose all but the last message requiring a response.

The rewritten MQTT in the current RTOS SDK is really neat, and they've implemented an outbox for messages, which is a nice solution, but there's no corresponding version for NonOS.

So I'm currently in the process of porting the new library over

maverickchongo commented 5 years ago

@davydnorris Have you been able to get this fixed, I was wondering whether you had a repo with a fix as I am experiencing the same issue.

davydnorris commented 5 years ago

@maverickchongo - no, at this point I have given up on this repo as it seems to no longer be actively maintained. I pulled down the ESP32 and ESP8266 RTOS MQTT code and got it to the point where it was almost working with NonOS but they have used the lwip socket API and blocking calls for the transport, which is actually a really badly performing approach.

So I have temporarily gone back to QoS 0 to get my product out the door, and then I am going to write my own completely transport agnostic MQTT based on ideas from the other libraries out there, and on messaging protocol best practice. It will use a non blocking transport layer that works with a generic inbox and outbox. These will both be processed independently using event queues and loops - that will mean both RTOS and NonOS can use the same library.

someburner commented 5 years ago

@davydnorris Just ran into this issue myself. Will your new project be open source? If so I'd be interested in helping out and/or testing for NON-OS

davydnorris commented 5 years ago

@someburner it absolutely will be open, and I'd love any help offered. Just got to get my sensor box out on some poles for a client and then I'll be looking at this.

The sources I've pulled to date are this library, the new RTOS and IDF versions (both ESP8266 and ESP32), the Eclipse paho sources, and the WolfMQTT sources.

My architectural goals for the new library are:

  1. completely thread agnostic.
  2. inbox and outbox use bip buffers so messages or message pieces are all contiguous
  3. because of 2 above, no message copying, no double allocating, everything is in the buffers
  4. because of 3 above, it will need mqtt_message_create(size) and mqtt_message_dispose(id) functions, but I figure that's an acceptable compromise
  5. transport layer agnostic, even more so than the new IDF version
  6. main control loop agnostic, so can be used with callback based, event based, thread based, RTOS or NonOS
  7. non-blocking
davydnorris commented 5 years ago

Of course, if this lib could easily be hacked to include an inbox/outbox so that QoS 1 and 2 could be implemented properly then that may do us for now, but there's a lot that is good about the new IDF implementation