peterhinch / micropython-mqtt

A 'resilient' asynchronous MQTT driver. Recovers from WiFi and broker outages.

keepalive race condition #41

Closed: tve closed this issue 2 years ago

tve commented 4 years ago

Another nit ;-)

Unless the user explicitly sets config['ping_interval'], the timeout at which _keep_alive starts a reconnection process is the same as the keep-alive value set at the MQTT protocol level. This produces a bit of a race condition as to which side times out first.

Unless I'm missing something, the two effects of the keep-alive mechanism at the MQTT level are for the broker to send the last will message when it discovers that the keep-alive has timed out, and for the broker to close the connection when 1.5x the keep-alive times out. Nothing else.

So when the _keep_alive coro discovers a time-out there's a bit of a race condition between the client reconnecting and the broker discovering the time-out and triggering the last will message. If no last will message is set, the MQTT keep-alive mechanism is really pretty irrelevant given the client-side timeout. IMHO, for proper last will operation the coro's time-out should be shorter than the MQTT keep-alive time. That can be arranged explicitly using the ping_interval config param, but it's not exactly obvious to someone who just naively sets the last will message.
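For illustration, a minimal sketch of the kind of configuration meant here, assuming the usual mqtt_as config dict (the broker address, timing values and will tuple are placeholders, not taken from this issue):

```python
# Sketch only: keep the client-side ping period well inside the MQTT keep-alive
# so the client detects a dead link before the broker's keep-alive expires.
# All values and the will tuple below are illustrative placeholders.
from mqtt_as import MQTTClient, config

config['server'] = 'test.mosquitto.org'                  # placeholder broker
config['keepalive'] = 60                                 # MQTT-level keep-alive (s)
config['ping_interval'] = 15                             # client pings well inside 60 s
config['will'] = ('status/node1', 'offline', False, 0)   # topic, msg, retain, qos

client = MQTTClient(config)
```

The point is simply that the client's detection time is then decoupled from, and shorter than, the broker's keep-alive.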

peterhinch commented 4 years ago

Thank you for your comments. I will review this module in due course.

The new uasyncio is due for imminent release. I expect to be busy with that for some time, given that almost all my published code makes heavy use of it. My tutorial is also likely to need a substantial update.

peterhinch commented 3 years ago

Sorry for the ridiculous delay in my addressing this. I may be missing something, but I'm unconvinced by this proposed change.

[Heavily edited]

I envisage three situations in which timeouts will occur.

  1. A broker outage. Depending on the cause, the LW may or may not be sent. The client will keep trying to reconnect. No race.
  2. A local WiFi outage. This will be detected quickly (~1 second) by ._keep_connected(); the client will disconnect and reconnect when connectivity is re-established. No ping-related race.
  3. The case in contention is an internet outage. The nature and duration of the outage are critical to examining this issue. It is conceivable that the broker has received the pings and sent the acks, but the acks have gone missing. The LW may or may not be sent depending on what has gone missing. (I know this sort of thing shouldn't happen, but under fault conditions? My knowledge of network hardware is inadequate here.)

For a race to occur, the communication issue causing the broker to miss pings must repair itself at about the keepalive time. If the fault persists for longer, the behaviour of client and broker seems correct: the broker sends the LW and the client keeps trying to reconnect. On the other hand, if the outage clears quickly, communication is re-established transparently.

Assume this worst-case duration, and that all communication was lost. At some point the client successfully reconnects. From the broker's point of view an outage has occurred. If it "thinks" the outage is too long, it sends the LW; otherwise it does not. If the client manages to reconnect quickly enough, the LW may not be sent. There is indeed an uncertainty where a physical outage lasts roughly as long as the keepalive value. Crucially, this uncertainty seems inevitable regardless of client timings.

With your proposed change, an outage of this exact duration would mean that, at the time the outage cleared, the client would already be making reconnect attempts just as the broker was deciding it was time to send the LW. There is still a race, because reconnection is an iterative process. I think marginal timing is inevitable when the outage duration ~= the keepalive time.

One way to enforce deterministic behaviour is as follows. The client still sends four pings in the KA period. If no response has arrived, either the pings or the acks have been lost. The client now does not "know" whether the broker has detected the problem, so it forces an outage. This forced outage must last longer than 1.5x the keepalive time before the client can be sure that the broker has detected the outage, sent the LW and closed the connection. The client then starts a new connection attempt.
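A rough sketch of that scheme, purely to make the timing concrete (this is not the mqtt_as implementation; ping, link_is_up, drop_link and reconnect are hypothetical placeholders):

```python
# Illustrative only: when pings go unanswered, force an outage of more than
# 1.5 * keepalive so the broker is certain to have timed out, published the LW
# and closed the old connection before the client reconnects.
import uasyncio as asyncio

async def keep_alive(keepalive, ping, link_is_up, drop_link, reconnect):
    while True:
        for _ in range(4):                            # four pings spread over the KA period
            await ping()
            await asyncio.sleep(keepalive // 4)
        if await link_is_up():                        # acks arrived: nothing to do
            continue
        await drop_link()                             # force the outage
        await asyncio.sleep(3 * keepalive // 2 + 1)   # wait > 1.5 * keepalive
        await reconnect()                             # broker has acted by now: no race
```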

This is decidedly slow. I'm not convinced that the overall issue is a practical problem.