peterhinch / micropython-mqtt

A 'resilient' asynchronous MQTT driver. Recovers from WiFi and broker outages.
MIT License

Messages arriving out of order #95

Closed bobveringa closed 1 year ago

bobveringa commented 1 year ago

I've been running into an issue where messages arrive in a different order than they are sent. I have not yet been able to find the cause, but I am able to reproduce it consistently. Adding a delay between messages reduces how often the issue occurs but does not solve it, unless the delay is increased to something like 500 ms. Setting the QoS to 1 also reduces it, but does not seem to solve the issue entirely.

The data does seem to indicate that it has something to do with asynchronous behavior, as older software versions using synchronous MQTT drivers do not appear to have this issue.

At this point, I am out of ideas. Is there any reason why this library could cause messages to arrive out of order? Or is this caused by the esp32 and non-blocking sockets?

peterhinch commented 1 year ago

This can happen with qos==1 messages under conditions of unreliable communication - duplicate messages can also occur. If a message's PUBACK is lost, the sender will repeat the message, so the receiver sees a dupe. If a message is lost, the sender may already have sent another before it registers the missing PUBACK for the first message and re-transmits it. This behaviour is allowed under the MQTT spec. qos==2 provides a guarantee that the message will be received exactly once; I doubt it fixes the out-of-order issue, but I'm no expert on this mode.
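(For reference, qos is selected per publish call; a minimal sketch assuming the mqtt_as API, with placeholder credentials, broker and topic:)

```python
# Minimal sketch assuming the mqtt_as API; credentials, broker and topic
# are placeholders. qos is chosen per publish call.
import uasyncio as asyncio
from mqtt_as import MQTTClient, config

config['ssid'] = 'my_ssid'                # placeholder
config['wifi_pw'] = 'my_password'         # placeholder
config['server'] = 'broker.example.com'   # placeholder

async def main(client):
    await client.connect()
    # qos=0: fire and forget - no retransmission, loss is possible.
    await client.publish('test/topic', 'reading 1', qos=0)
    # qos=1: broker must PUBACK; a lost PUBACK causes a re-send, so the
    # broker may see duplicates (permitted by the MQTT spec).
    await client.publish('test/topic', 'reading 2', qos=1)

client = MQTTClient(config)
try:
    asyncio.run(main(client))
finally:
    client.close()
```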

I'm not sure how qos==0 messages can be received out of order. The code which runs the callback is here. Only one instance of .wait_msg can exist because of the lock here.

I've never tested at the high message rates you're using. What is your normal rate?

The only mechanism for this that I can see is the obvious one of messages arriving in the wrong order, perhaps due to a broker issue? Can the broker cope with the rate? I wonder if a tool like wireshark might help?

It may be that uasyncio is running flat-out with 100% utilisation. You might want to use micropython-monitor (or otherwise) to establish how much headroom you have. Running flat-out doesn't explain the symptom, but it has various implications such as wait_ms intervals that grossly exceed the stated value.
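(If micropython-monitor isn't convenient, a crude headroom check is possible with a hypothetical latency probe like the one below - not part of this library - which reports how far a nominal 10 ms sleep overruns when the scheduler is saturated:)

```python
# Crude scheduler-headroom probe (illustrative only, not part of the library).
# Large overruns indicate uasyncio is running close to 100% utilisation, so
# sleep intervals will greatly exceed their nominal values.
import uasyncio as asyncio
import time

async def latency_probe(period_ms=10):
    while True:
        t = time.ticks_ms()
        await asyncio.sleep_ms(period_ms)
        overrun = time.ticks_diff(time.ticks_ms(), t) - period_ms
        if overrun > 5:  # arbitrary threshold
            print('uasyncio lag:', overrun, 'ms')

# Run alongside the application tasks:
# asyncio.create_task(latency_probe())
```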

bobveringa commented 1 year ago

I should have clarified that this is happening when sending messages from the device to the broker.

I've never tested at the high message rates you're using. What is your normal rate?

Normally it is not an issue (it happens only very rarely); in this case the device sends maybe 1 message a second. It only becomes an issue when sending messages in short bursts (after a short period without Wi-Fi, for example). In these bursts the message rate is a lot higher, probably 10 to 20 messages a second. But it all depends on latency: the real message rate is 1 / (time to transmit a message), as we don't call the send message function until the previous call returns.
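(For clarity, the sending pattern is effectively the one sketched below - hypothetical helper name, mqtt_as publish call assumed - so there are never two sends in flight at once:)

```python
# Hypothetical illustration of our sending pattern: each publish is awaited
# before the next one starts. Topic name and backlog contents are examples.
async def drain_backlog(client, backlog):
    for msg in backlog:
        # Effective rate is 1 / (time to transmit one message): the next
        # call is not made until the previous one has returned.
        await client.publish('device/telemetry', msg, qos=1)
```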

The only mechanism for this that I can see is the obvious one of messages arriving in the wrong order, perhaps due to a broker issue? Can the broker cope with the rate?

We are using AWS IoT Core as the broker, so I am assuming it can cope with the rate. It might have something to do with how Amazon handles processing the data internally, but that feels a bit like a cop-out, as this issue is not present when sending messages synchronously.

I wonder if a tool like wireshark might help?

All the traffic to and from the broker is encrypted, which makes it difficult to figure out which packets belong to which messages. Maybe I could figure out some way to match them.
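(One option might be to pad each test payload to a distinctive, predefined length so the corresponding TLS record size stands out in Wireshark; a hypothetical sketch:)

```python
# Hypothetical helper: pad a payload to a distinctive length so the size of
# the encrypted TLS record identifies the message in a Wireshark capture.
def pad_payload(payload, target_len):
    if isinstance(payload, str):
        payload = payload.encode()
    if len(payload) > target_len:
        raise ValueError('payload longer than target length')
    return payload + b' ' * (target_len - len(payload))

# e.g. pad successive test messages to 200, 300, 400, ... bytes
msg = pad_payload('{"seq": 1}', 200)
```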

Our code analysis also found that the driver does not cause the issue (at least not explicitly), as everything is locked while transmitting data.

I was not aware of the existence of micropython-monitor, but I'll use it to investigate this issue further (and other issues, because it looks very useful).

Thanks for your input. If I find anything I'll let you know.

peterhinch commented 1 year ago

we don't call the send message function until the previous call returns

This really should keep things in order. I'm baffled at the moment.

bobveringa commented 1 year ago

After spending some (a lot) more time investigating this, we finally found some answers. We did some tests with messages of predefined lengths, so we could recognize them in the traffic with Wireshark. Whenever multiple MQTT messages end up in the same TLS packet/frame, the MQTT messages arrive out of order on the broker side. This seems to be an error in the processing by our broker. It seems that the esp32 (and maybe TLS, I am not familiar with it) prefers packet sizes of around 1500 bytes, and this coalescing seems to cause issues with our broker. I am uncertain if this impacts any other users, as I do not know what AWS IoT Core uses under the hood.

We are currently investigating ways to resolve this issue, either through changes to the library or to the way we handle messages internally.

@peterhinch If you have any input or suggestion for changes that could fix this within the library, they would be most welcome.

peterhinch commented 1 year ago

I'm afraid this is well above my pay grade :) I have no knowledge of how TLS works internally, nor a detailed knowledge of ESP32 internals.

An interesting test would be to try a different platform: Pico W or Pyboard D, to see if this might be an issue with the ESP32 firmware/SDK. It does seem plausible that somewhere in the chain packets with the same destination, arriving in quick succession, get buffered and concatenated. This would have the effect of making them arrive even closer together. OTOH such a test wouldn't really point you at a solution.

You've proved that packets are arriving in the right order and that the fault is in the broker; any tweaking of the way in which they are sent seems to be something of a hack. If you managed to get something working would you have confidence in its resilience?

bobveringa commented 1 year ago

If you managed to get something working would you have confidence in its resilience?

That would depend on how much of a hack it is 😄, but probably not a lot. Internally, we are leaning towards merging messages ourselves before sending them over MQTT. Any other solution, like flushing the socket, feels more like a workaround than a long-term fix.
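(A rough sketch of what merging could look like on our side - hypothetical class, assuming the backend accepts a JSON array of readings:)

```python
# Hypothetical batching: accumulate readings and publish them as a single MQTT
# message, so only one publish ends up in any given TLS record.
import json

class MessageBatcher:
    def __init__(self, client, topic, max_batch=10):
        self.client = client      # an mqtt_as MQTTClient
        self.topic = topic
        self.max_batch = max_batch
        self._pending = []

    async def add(self, reading):
        self._pending.append(reading)
        if len(self._pending) >= self.max_batch:
            await self.flush()

    async def flush(self):
        if not self._pending:
            return
        payload = json.dumps(self._pending)   # one payload, order preserved
        self._pending = []
        await self.client.publish(self.topic, payload, qos=1)
```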

An interesting test would be to try a different platform: Pico W or Pyboard D

This would certainly be interesting, to see whether this behavior is present on multiple platforms and perhaps with multiple brokers. Currently I don't have time to investigate it any further, but maybe if I have some spare time in the future I can revisit it.

Anyway, thanks for your help, closing the issue.