peterhinch / micropython-mqtt

A 'resilient' asynchronous MQTT driver. Recovers from WiFi and broker outages.
MIT License

Disconnect stability improvements #77

Closed bobveringa closed 2 years ago

bobveringa commented 2 years ago

This PR adds some stability improvements and error handling that help avoid crashes in cases where they could have been handled gracefully. These stability improvements are primarily related to closing the connection and disconnecting from the broker.

bobveringa commented 2 years ago

I looked back into the git history of the file in the project, but it did not really shine a light on the situation. All it really tells me is that the change was made around 9 months ago.

My best guess is that because there is also an additional disconnect in the close function, this sometimes causes a collision with the disconnect in the keep alive. However, right now, I have nothing to support this train of thought.

It is unlikely that I added the exception handler just because I felt like it, and I have seen the "Already disconnected" message in our logs many times. However, I seem to be unable to reproduce it right now. Another option is that something outside the library is causing this state. However, I think that is unlikely, as all of our Wi-Fi handling is done by the library.

I'll investigate this behavior further.

peterhinch commented 2 years ago

Is it possible that there was a firmware bug that's now been fixed?

bobveringa commented 2 years ago

The sta_if.disconnect() method throws an OSError if the interface is not activated.

>>> import network
>>> sta_if = network.WLAN()
>>> sta_if.active(False)
>>> sta_if.active()
False
>>> sta_if.disconnect()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OSError: Wifi Not Started

When the code for handling this error was written, I probably assumed it was only related to the disconnect and not the interface. I think updating the error message to display something that resembles the actual error is likely appropriate here. Something like "Wi-Fi not started, unable to disconnect interface".
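In that vein, a minimal sketch of the guarded disconnect (the helper name is hypothetical; `sta_if` stands in for the station interface object):

```python
def safe_disconnect(sta_if):
    # sta_if.disconnect() raises OSError when the interface has not been
    # started; swallow it and log a message that names the real cause.
    try:
        sta_if.disconnect()
    except OSError:
        print("Wi-Fi not started, unable to disconnect interface")
```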

peterhinch commented 2 years ago

I'm happy with that. However, I still haven't fully thought through the issue of programmatically disconnecting and reconnecting. Do you think this change should be made:

    def _close(self):
        if self._sock is not None:
            self._sock.close()
            self._sock = None

It would make .disconnect idempotent. I haven't spotted any downsides...
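A sketch of why the guard makes the close idempotent (stub class, illustration only):

```python
class ClientSketch:
    # Illustration of the guarded _close(): after the first call self._sock
    # is None, so any further call is a no-op rather than a double close.
    def __init__(self, sock):
        self._sock = sock

    def _close(self):
        if self._sock is not None:
            self._sock.close()
            self._sock = None
```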

bobveringa commented 2 years ago

That is an interesting change. At first glance, it seems like it shouldn't have any impact. I'll make that change and roll it out to our internal test fleet to see what effects this has.

I'll also start my investigation of using socket.setblocking sometime in the next 2 weeks.

peterhinch commented 2 years ago

An alternative approach:

    async def disconnect(self):
        if self._sock is not None:
            if self._sta_if.active():  # Test rather than try..except ?
                self._sock.write(b"\xe0\0")
                await asyncio.sleep_ms(100)
            self._close()  # Or should it call .close() to shut down the WiFi? See below.
        self._has_connected = False

I don't have strong views on whether to test for active or stick with an exception. But there is a bigger issue.

I'd welcome your views on what .disconnect should actually do :) It should definitely send the disconnect packet and close the socket. Historically that's all it did; also .close just closed the socket but https://github.com/peterhinch/micropython-mqtt/issues/60 caused me to have it shut down the WiFi too.

In general the approach to losses of connectivity in communications links is to shut everything down for a period long enough to ensure that the other end of the interface recognises the outage. This ensures that the link re-starts as at power up with both ends in a known state.

Applying this principle to applications which programmatically disconnect then reconnect I would lean towards having .disconnect also take down the WiFi. But you're in the hot seat of actually developing such an application so I'd welcome your views.

peterhinch commented 2 years ago

> I'll also start my investigation of using socket.setblocking sometime in the next 2 weeks.

I'd need a lot of convincing to use blocking sockets anywhere in the library: a primary goal is asynchronous operation.

bobveringa commented 1 year ago

> > I'll also start my investigation of using socket.setblocking sometime in the next 2 weeks.
>
> I'd need a lot of convincing to use blocking sockets anywhere in the library: a primary goal is asynchronous operation.

Ah, I should have clarified that this is just for the .disconnect() function. I want to attempt to set the socket to blocking to ensure that the disconnect packet arrives. In all other cases, it should be non-blocking.
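That idea could be sketched like this (hypothetical helper; `b"\xe0\0"` is the two-byte MQTT DISCONNECT packet the library already sends):

```python
def send_disconnect(sock):
    # Temporarily make the (normally non-blocking) socket blocking so the
    # DISCONNECT packet is written in full, then restore non-blocking mode.
    sock.setblocking(True)
    try:
        sock.write(b"\xe0\0")
    finally:
        sock.setblocking(False)
```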

bobveringa commented 1 year ago

As for the .disconnect() function. The alternative approach is also valid, I am personally of the approach "It is easier to ask for forgiveness than for permission" (when programming not in life). However, your suggested approach is entirely valid, just preference really.

Logically, it seems that .disconnect() should also close the Wi-Fi connection, because .connect() also creates a Wi-Fi connection. But I don't know whether this would break people's existing applications, and the solution of just calling .disconnect() and then .close() seems fine. Logically, though, it would make more sense if .connect() and .disconnect() were opposites of each other.

In our current application, there is an additional step before closing the connection: we wait 1-2 seconds before calling .close(), as the keep-alive task might not have shut down yet. If the system goes to sleep with this task still running, it can cause some really weird behavior.

I've been looking to expand the .disconnect() function to wait until the keep alive is done before returning. But I haven't found anything that works well yet.

peterhinch commented 1 year ago

I am rather concerned about this. When .disconnect() is launched there may be various tasks running. The only circumstances I have tested are prior to a power down or deepsleep - in these cases the tasks are (obviously) killed.

If this library were designed for repeated disconnect and reconnect it would use a different approach, retaining pointers to all tasks and cancelling them. It's actually an ideal use-case for a task group.

This leaves me with a dilemma. To develop a general solution which copes with repeated disconnects and reconnects would be a significant amount of work. It would require a lot of testing and thinking through a range of use-cases. I don't think I have an appetite for this - while it is an interesting type of application you are the first user to report this requirement. There is a further problem in that uasyncio does not play well with lightsleep on some platforms, notably STM. So a general solution would come with the caveat "may not work at all on your platform". I'm leaning towards amending the docs to caution against using lightsleep.

I think the best that can be done is for me to suggest things you might do, with no commitment to accept them in the library.

In that spirit, here are some ideas for ensuring that _keep_alive has terminated by the time you shut down the WiFi.

  1. Wait at least self._ping_interval ms before shutting down the WiFi. Slow.
  2. Save a reference to the ._keep_alive task and cancel it; issue asyncio.sleep(0) to allow cancellation to occur then shut down the WiFi.
  3. Modify ._keep_alive so that during the long wait it repeatedly checks for .isconnected(), say once per second. Then ensure you wait for > 1s between shutting down the WiFi and going to sleep.

I tend to favour option 3 as per ._memory which checks connectivity once per second - this might be worth incorporating in the library as standard.

However you need to think about any other tasks that may be running at this time too. It seems to me that you need to know, in the worst case, which tasks may be running when you issue .disconnect and to consider whether they need to be shut down before you issue .close.
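Option 2 above can be sketched with plain asyncio (names hypothetical; on the device this would be uasyncio):

```python
import asyncio

async def shutdown(keep_alive_task):
    # Option 2: cancel the saved keep-alive task and wait for the
    # cancellation to actually be delivered before taking down the WiFi.
    keep_alive_task.cancel()
    try:
        await keep_alive_task
    except asyncio.CancelledError:
        pass
    # ...now safe to shut down the WiFi and go to sleep...
```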

bobveringa commented 1 year ago

You make some good points here. I did not know that task groups exist, but they seem really cool.

As for the implementation, the "schedule" for uasyncio is already pretty busy. I am uncertain if I want to add yet another system that checks something every second. So option 2 seems like it would fit my use case better.

I'll start to work on it. If I don't have to take apart the whole driver to implement it, I'll submit a PR and we can discuss the changes. We are always interested in contributing our work back to this library; if it doesn't end up being a fit, then we'll see if we can implement something that is in everyone's interest. We don't have that many developers, so we don't have a lot of time to keep our own modifications to this library in sync with this repository when there are changes. This is also why we decided to contribute all the changes we made back to this library, structured so as to avoid maintenance work on them.

peterhinch commented 1 year ago

I had in mind something like this (untested):

    async def _keep_alive(self):
        while self.isconnected():
            pings_due = ticks_diff(ticks_ms(), self.last_rx) // self._ping_interval
            if pings_due >= 4:
                self.dprint('Reconnect: broker fail.')
                break
            pint = self._ping_interval
            while pint >= 0 and (ok := self.isconnected()):
                await asyncio.sleep(1)
                pint -= 1000
            if not ok:
                break  # quick exit on WiFi outage
            try:
                await self._ping()
            except OSError:
                break
        self._reconnect()  # Broker or WiFi fail.

I plan to push a docs update and a version bump tomorrow.

On the general topic of "asking for forgiveness" I'm aware that this is the accepted wisdom. I'm not convinced that it's always optimum in a microcontroller context. In some cases you might execute a significant amount of code before the exception occurs, where a quick test might forestall it. I also think that testing sometimes produces more readable code.

peterhinch commented 1 year ago

On further thought I think there is a bug here. This delay can be long. In the event of short WiFi outages you could get multiple instances of ._keep_alive running. This would be non-fatal but is obviously incorrect. The ._keep_alive task needs to terminate or be cancelled more quickly in the event of an outage.

I will push an update based on cancellation.

bobveringa commented 1 year ago

I was just about to comment something similar. I implemented something quick to test, and ran into the same issue: multiple instances of ._keep_alive can end up running. I was actually in the process of testing to see what effect this has on the state of the device. I suspect little, as we only noticed it now.

peterhinch commented 1 year ago

I have pushed an update with a version bump.

The original design took place before uasyncio acquired a working means of task cancellation. I have amended the code to cancel the ._keep_alive and ._memory tasks (simplifying the latter). Cancellation occurs on losses of connectivity and also on calls to .disconnect.

The test scripts have been modified: as a consequence of your improvements to ._as_read the arguments to subscription callbacks are now bytearray instances. This made the print output rather confusing so I've amended the print statements only.

I'd welcome any comments.

bobveringa commented 1 year ago

This is very close to what I was working on, but you clearly have a much better grasp of how everything in the library works together, so there are a few things in there I hadn't thought of yet. The version looks like it would take care of everything; I'll update our version with this one and deploy it to our internal fleet. If there are any issues related to this library, I'll inform you.

peterhinch commented 1 year ago

Just FYI I tested with ESP8266 to check RAM usage. The figure of free RAM in the docs was 21.8K. It's now 27.4K. A lot has changed since my original measurement and I don't have a record of the firmware used but the bottom line is that I don't think we need to worry about size.

bobveringa commented 1 year ago

@peterhinch I've since rolled out this change to a (large) portion of devices in the field to collect some statistics. The devices with this update rarely report MQTT timeout errors, while devices with an older version still have the issue regularly.

I am not sure what could cause the timeout with these fixes. Initially, I thought it could be that the battery of those devices was low enough to cause a timeout, but the battery percentage of the devices with these errors is almost always above 50%, so that does not seem to be the issue.

In any case, the issue occurs so rarely now that I am not spending any more time investigating it.

peterhinch commented 1 year ago

OK, I'm glad you're getting decent results.