peterhinch / micropython-mqtt

A 'resilient' asynchronous MQTT driver. Recovers from WiFi and broker outages.
MIT License

"Wifi integrity" check is ad-hoc and power intensive #71

Closed by bgamari 2 years ago

bgamari commented 2 years ago

Currently MQTTClient.wifi_connect will check five times over a five-second period before concluding that the connection is "stable". However, this check offers little value for the power that it burns. While the check can easily increase the power consumption of an application by an order of magnitude (turning a 0.5 second wake-up into a 5.5 second wake-up), it is not at all hard to find examples of unstable connection conditions which this check would fail to catch.

I suggest that this check be dropped. For cases that require reliable delivery, MQTT QoS should be used anyway. For all other cases the check merely burns power while bringing little, if any, benefit.
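For readers following the power argument, here is a minimal sketch of the kind of check being discussed, together with the awake-time arithmetic. It is not the mqtt_as source: `link_is_stable`, `awake_time_ratio`, and the `is_connected` callable are hypothetical names chosen for illustration. The defaults of five polls at one-second intervals come from the description above.

```python
import asyncio

async def link_is_stable(is_connected, checks=5, interval_s=1.0):
    """Poll is_connected() `checks` times, `interval_s` apart.

    is_connected is a hypothetical zero-argument callable standing in for
    whatever reports link state on the target. Returns False on the first
    failed poll, True if every poll succeeds.
    """
    for _ in range(checks):
        if not is_connected():
            return False
        await asyncio.sleep(interval_s)
    return True

def awake_time_ratio(base_s, checks=5, interval_s=1.0):
    """Factor by which the check lengthens a wake-up that otherwise takes base_s."""
    return (base_s + checks * interval_s) / base_s
```

With a 0.5 second baseline, `awake_time_ratio(0.5)` gives 11.0, which is the "order of magnitude" figure quoted above. Note that the check only samples the link at the polling instants, so an outage that falls between polls, or one that starts after the last poll, goes unnoticed.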

peterhinch commented 2 years ago

This check exists for cases where the client is mobile and may be operating near to the limit of wifi range. If it moves out of range, then slowly moves towards the AP, the check has value. In testing under these conditions we have found it to be effective in determining the likelihood of making a usable connection.

bgamari commented 2 years ago

What I am having a hard time understanding is why MQTT's own reliability mechanisms are not sufficient to handle unreliable connections. MQTT has carefully designed mechanisms for ensuring delivery; no further ad-hoc logic should be needed. While QoS 2 semantics cannot be achieved in all cases, QoS 1 should be feasible using only the protocol itself, even on a lossy connection.

peterhinch commented 2 years ago

That is what the author of the official library thought.

MQTT assumes the TCP/IP guarantee of eventual message delivery and works unaided on a wired connection. Radio links cannot in principle provide that guarantee. For example, consider the case where a client moves out of range while a QoS 1 acknowledgement packet is pending. The official library hangs indefinitely in this case, even if the client moves back in range. Bursts of RF interference have similar effects.

Radio communications are entirely different from a wired TCP/IP network. WiFi does a great job of hiding this most of the time, notably with the help of an OS to manage reconnections etc.
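The hang described above can be illustrated with a small sketch. It assumes the pending acknowledgement is modelled as an `asyncio.Event` that is set when the ack arrives; `wait_for_ack` and `demo` are hypothetical names, and this is not how either library is implemented. The point is only that an unbounded wait on a lost ack never returns, while a bounded wait hands control back to the caller.

```python
import asyncio

async def wait_for_ack(ack_event, timeout_s):
    """Wait up to timeout_s for ack_event to be set.

    Returns True if the ack arrived, False if the wait timed out.
    Awaiting ack_event.wait() directly would block forever on a lost ack.
    """
    try:
        await asyncio.wait_for(ack_event.wait(), timeout_s)
        return True
    except asyncio.TimeoutError:
        return False

async def demo():
    lost = asyncio.Event()      # never set: models an ack lost while out of range
    arrived = asyncio.Event()
    arrived.set()               # models an ack that did arrive
    return await wait_for_ack(lost, 0.05), await wait_for_ack(arrived, 0.05)
```

Running `asyncio.run(demo())` returns `(False, True)`: the lost ack is reported after the timeout instead of stalling the application.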

bgamari commented 2 years ago

Sure, this is in fact my point. The "wifi integrity" check can guarantee the integrity of the link no better than TCP/IP itself can. For instance, consider the case where the spectrum is perfectly clear while the integrity check is underway, yet a noise source spontaneously appears the moment that we attempt to publish. As far as I can tell, the integrity check does nothing to avoid this case. The only way to deal with such an adversarial situation is with a timeout-and-retry policy.
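A timeout-and-retry policy of the kind mentioned here might look like the following sketch. `publish_with_retry` and `send_once` are hypothetical: `send_once` stands for any coroutine function that sends the message and returns once it has been acknowledged. The retry count and timeout are arbitrary illustrative values, not settings from mqtt_as.

```python
import asyncio

async def publish_with_retry(send_once, retries=3, timeout_s=1.0):
    """Call send_once() until it completes within timeout_s.

    Returns the number of the attempt that succeeded (starting at 1).
    Raises OSError if every attempt times out or fails with OSError.
    """
    for attempt in range(1, retries + 1):
        try:
            await asyncio.wait_for(send_once(), timeout_s)
            return attempt
        except (asyncio.TimeoutError, OSError):
            continue  # link trouble: abandon this attempt and try again
    raise OSError("publish failed after %d attempts" % retries)
```

Because a retry may follow a message that was delivered but whose acknowledgement was lost, the receiver can see duplicates. That is consistent with QoS 1, which promises delivery at least once.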

peterhinch commented 2 years ago

> a noise source spontaneously appears the moment that we attempt to publish

This can occur at any time, regardless of the integrity check. The qos==1 mechanism handles this. The integrity check is to handle the specific case of a mobile client which I outlined above: the aim is to avoid initiating a connection under conditions of very poor connectivity.

The design of mqtt_as does indeed include ad hoc components, but these were included as a result of a great deal of testing. If you wish to test a solution which assumes TCP/IP "guarantees", feel free to try the official libraries. The mqtt_as library was born out of my inability to keep these running for more than an hour at a time (under conditions of intermittent radio interference which prevailed here at the time).

I was taken aback by the eventual complexity of the solution, to the point where I started an entirely new project to determine whether significant simplification was possible. Its aim was simply to provide a stream-like connection between a client and a server with guaranteed integrity. In the case where the client moves out of range the guarantee clearly implies an arbitrary time delay. It turned out that very similar mechanisms were required to achieve the guarantee.

In collaboration with Kevin Köck this became micropython-iot.

As a final comment, the testing process was so time-consuming that I am extremely reluctant to introduce any changes which would require its repetition. If you wish to remove the integrity check I suggest you create and maintain your own fork.

bgamari commented 2 years ago

> I am extremely reluctant to introduce any changes which would require its repetition. If you wish to remove the integrity check I suggest you create and maintain your own fork.

Fair enough. Thanks for taking the time to explain the reasoning.