peterhinch / micropython-mqtt

A 'resilient' asynchronous MQTT driver. Recovers from WiFi and broker outages.
MIT License
550 stars 117 forks source link

Correct Error Handling for reconnect (better than range.py?) #3

Closed cefn closed 2 years ago

cefn commented 6 years ago

Adopting the error handling in the example code from range.py seems to obviate the reconnection behaviours of micropython-mqtt

Is there an alternative reference example for invoking micropython-mqtt to include error-handling for long-lived applications where transient disconnections are possible? What is the proper invocation to combine a handful of scheduled async functions in such a way that the (implicit) async functions of micropython-mqtt have their errors handled properly, and the application doesn't terminate?

I created a routine based on range.py, but after a very long, and visibly successful test run (e.g. between 20hours and 2 days), it seems to hit unrecoverable connection errors which look like pages and pages of...

LmacRxBlk:1
LmacRxBlk:1
LmacRxBlk:1
LmacRxBlk:1

...followed by cycles of...

state: 5 -> 2 (2c0)
rm 0
LmacRxBlk:1
reconnect
state: 2 -> 0 (0)
scandone
state: 0 -> 2 (b0)
state: 2 -> 3 (0)
state: 3 -> 5 (10)
add 0
aid 1
cnt 
LmacRxBlk:1
LmacRxBlk:1
LmacRxBlk:1
LmacRxBlk:1
state: 5 -> 2 (2c0)

And the application makes no apparent attempt to resume.

Previously though failures happened, the logic of micropython-mqtt seemed to negotiate a reconnect. The change as far as I can see; I added back the finally clause (as per the original example code at range.py ) to trigger client.close() for KeyboardInterrupt (see this commit).

This avoids LmacRxBlk errors when testing interactively (handles keyboard interrupts more gracefully without the repeated errors spewing to the console), but I suspect the finally and client.close() is also triggered when there is a socket error, which then prevents the system from being able to auto-recover from a disconnection.

I am not sure what triggers the disconnection and it seems probabilistic. There may be some configuration option which would improve things. I am not certain whether access-point mode is fully disabled by the invocations I am making, for example. However, I certainly wouldn't expect the ESP8266 to be 'out of range' given the access point is just a couple of metres away.

The 2metreX2metre display, driven by a single ESP8266 (Wemos D1 Mini or equivalent clone), can be seen running in real-time at https://photos.app.goo.gl/14ryl7bCXeYJbNdy2 which gives you an idea of the data rates involved - just the information for 16 segments of 3 'RGB' bytes re-published lazily (only when segments change), and then forwarded by serial from Micropython to an Arduino which controls the lights.

The issue of lost connection can always be resolved by a hard reboot and the display can be back very quickly on reset, then runs for another several hours. However, I have to eliminate the need for a manual reset, and would like the error recovery and any reset to be handled in software instead, and hopefully therefore a bit faster. This is intended to be part of a display of 20 characters, so if they all have a MTBF of 20 hours, that means a glitch every hour. If we encounter any environmental factors which might trigger wifi/network errors more frequently than once a day, I definitely need better error-handling. The shorter and rarer I can get failures to be, the better.

QUESTIONS

I seem to recall there is a fallback option on ESP8266 of wiring an output pin to its own reset, and triggering a hard reset that way. In your experience, is a 'hardware self-reset' the best way to conceive of a long-lived networked device which might not occasionally enter an unrecoverable loop of errors from some OS-level network issue? Seems heavy-handed for just recovering a transient wifi or socket error.

Is there a way to avoid these networking errors in the first place by configuring the application differently (e.g. turning off some Wifi feature)?

Finally is there a better way to instruct the handling of these errors than the finally clause in range.py. Ideally error handling which is designed to keep the application running rather than terminate it?

Application background information

My code was derived from the range.py example, but adding a bit of translation between byte frames dispatched over MQTT and serial frames sent over to 5V Arduino-compatible circuit. The Arduino acts as a 'co-processor' to service large arrays of WS2811 addressable LEDs, but could also potentially take over the role of a reset watchdog if really necessary (e.g. power cycle the ESP8266 if it doesn't get anything over serial for a while).

My current Micropython code is at uartcharactertest.py and the Arduino code is at uartcharacter.ino (although the ESP8266 receive pin isn't even wired, so the Arduino code is strictly 'downstream' and can't have any effect on the micropython board).

The controller (MQTT publisher) and MQTT broker are running on a Pi 3. They remain functional throughout and have run for many days at a time without visible fault (as demonstrated by other desktop-based subscribers to the display information). The wifi network is served by a GL iNet AR150 running OpenWRT which has also never failed so far. Repowering the micropython board is enough to bring back full function in all cases I have encountered, meaning the error exists somewhere in the ESP8266 or Micropython networking stack, or in my error handling (based on range.py).

The raw log of a failure is visible at https://raw.githubusercontent.com/cefn/retrotextual/master/gist/cockle-longrun-networking-error.log - search for LmacRxBlk to find the start of the failure.

peterhinch commented 6 years ago

As you probably know the LmacRxBlk error results from a buffer overflow in the vendor code on the ESP8266. It is a consequence of excessive throughput, either in terms of the raw capability of the chip or in terms of the application failing to retrieve data at a rate to match that at which it's being received. It is the bane of the ESP8266 which really doesn't work well with high incoming data rates. Sometimes if the throughput stops the ESP recovers, other times it stays stuck in that mode and is effectively crashed.

In general a hardware reset via a hardware watchdog is the only sure way to fix a crashed device. This is the solution I adopt in my code which uses an ESP8266 to bring MQTT to a Pyboard: the Pyboard detects that the ESP8266 has stopped responding and physically resets it.

But this mechanism never operates under normal conditions. In my testing the driver copes with WiFi failures and socket timeouts. I have no doubt that it would operate if I increased the throughput to a high enough rate. The ESP8266 really doesn't cope well under these conditions. Ideally there would be some way of testing whether the buffer was filling so that the application could take remedial action (such as pausing). But I'm not aware of any way to do this. In my testing I use low data rates unless I'm deliberately trying to provoke the watchdog.

Regarding the finally clause in range.py its purpose was to prevent LmacRxBlk errors after the test was interrupted with ctrl-C. These were triggered if the ESP8266 continued to receive data after the application was terminated.

In conclusion I think the solutions are (if possible) to reduce throughput or to implement a hardware watchdog. I have a vague memory that there is something you can do with pin 16 and reset to achieve a watchdog but I haven't tried it. If you go this route I'd be interested to hear the outcome.

cefn commented 6 years ago

I think you are saying that the strategy outlined at https://forum.micropython.org/viewtopic.php?t=3602 (and then presumably re-running the client creation and connection logic) doesn't consistently work to recover a networking error of this kind, since occasionally the device remains 'crashed'.

So, the strategies I could adopt...

However, sounds from your comments like a hardware reset is worthwhile as a fallback anyway, even if I am able to get very long runs, there's no guarantee that something or another in the networking stack won't cause a hang.

Any idea what sort of orders of magnitude we are talking about for throughput or buffer size? It strikes me that in the grand scheme of things (e.g. wifi headline rates) I am really sending very little data. The whole display of 20 16-segment characters only takes 3 x 16 x 20 = 960 bytes to define its color, and this is being managed on a lazy-update basis (only sending changed segments and rate-limited). Each character only handles 1/20 of the overall segment data, and only has to keep up with human visual perception (lets say 20 frames a second). Even taking account of the overhead from MQTT seems like not much throughput. I was guessing that Micropython computations to split the topic strings and send over serial might be the main bottleneck (hence going with overclocking via machine.freq()).

peterhinch commented 6 years ago

The forum reference was to a situation where I got the LmacRxBlk at the REPL, which was solved as described. I was initially baffled as to why these errors occurred when the program had been interrupted. Trapping the keyboard interrupt (or any other error) and closing the socket fixed that particular problem.

However subsequent experience has shown that this buffer overflow can occur at runtime. I don't know any way of characterising the buffer behaviour. Since in the worst case it seems to amount to an effective crash of the ESP8266 I can't envisage a purely software solution. If the chip crashes there will be no exception to intercept.

I don't think I can be much help with this one. Most of my testing was done with quite slow updates - short messages every few seconds for the simple reason that the ESP8266 seemed unable to cope with anything much faster. Perhaps I should amend the docs to make this clear. I think MQTT on the ESP8266 is ill-suited to realtime control.

Whether the ESP32 is any better is moot - the port has its own problems, notably an inability to recover from a WiFi outage. In my view that is a complete showstopper.

peterhinch commented 6 years ago

@cefn I see you've studied this in some depth. I'd be grateful for a summary of your conclusions as I'm rather baffled.

cefn commented 6 years ago

I'm certainly trying. I was hoping to report positive results from a long-running test with a GPIO pin wired to reset, but it was unsuccessful. The hardreset() routine does indeed fully reset the board and reconnect everything successfully if I trigger it interactively. Now I just need to be able to call it in response to a networking error. Unfortunately...

When I returned to the test, it was no longer relaying the MQTT data to the display, but wasn't reporting its 'heartbeat' after publishing to its liveness topic either. I ran GNU screen and pressing <Enter> gave me no prompt, until I pressed <CTRL+C> which returned me to an interactive prompt after printing the Micropython version and all the chevrons from the enter characters I had sent. This suggests the mqttTask had died, but the drawTask (and mqtt_as tasks) were still alive, meaning run_forever() was still going, and hadn't raised any error to trigger the finally clause to call hardreset().

Because of the logging issues, I wasn't able to see the historical error reporting from when the mqttTask died to know what kind of error took place.

BTW I have now figured out that uasyncio doesn't deal in tasks, so I can see the mqttTask.cancel() logic can never be made to work, though I am not sure what my alternatives are for being able to usefully back out from an async routine, triggering the finally clauses in a co-routine and re-running it again.

The main thing is that I am not gathering all the asyncio coroutines in the proper way to be able to handle the errors arising from them. In particular I can't find the equivalent of Cpython asyncio.gather() which combines together the 'result' of multiple coroutines, with the potential to detect and abort on any one of them throwing an error by calling loop.run_until_complete() on the gathered task. Alternatively, something like asyncio.wait() with return_when=FIRST_EXCEPTION which is somewhat equivalent.

https://docs.python.org/3/library/asyncio-task.html#asyncio.gather https://docs.python.org/3/library/asyncio-task.html#asyncio.wait

MicroPython v1.9.3-5-g7c868b52-dirty on 2017-12-16; ESP module with ESP8266
Type "help()" for more information.
>>> import uasyncio
>>> uasyncio.wait
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'module' object has no attribute 'wait'
>>> uasyncio.gather
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'module' object has no attribute 'gather'

How do I block on asyncio in my main(), allowing cooperative scheduling to run like loop.run_forever(), but also waiting on the first exception raised by any one of multiple specific coroutines as per gather or wait from cpython? As I understand it, loop.run_until_complete() can only block until the result of a single coroutine, while loop.run_forever() waits on all coroutines, but will silently swallow errors until they have completed successfully or unsuccessfully, and neither is the needed behaviour.

Currently my code is like...

def run():
    try:
        while True:
            try:
                userInterrupt = False
                service()
            except KeyboardInterrupt:
                userInterrupt = True
                break
            except Exception as e: # in future try to handle some exceptions and re-run service()?
                sys.print_exception(e)
                raise
    finally:
        if not userInterrupt:
            hardreset() # handle case of some kind of unexpected (network) failure

Where service is like...

def service():

    # ensure wifi connection
    checkwifi(ssid, auth)

    try:
        mqttTask = loop.create_task(serviceMqtt())
        drawTask = loop.create_task(serviceDrawing())
        loop.run_forever()
    except Exception as e:
        sys.print_exception(e)
        raise
    finally:
        if mqttTask: mqttTask.cancel()
        if drawTask: drawTask.cancel()

See the full code

In the future, I hoped the service() routine would be re-run on recoverable errors before resorting to hardreset(). In any case, the current implementation was intended to bail to hardreset() on anything except KeyboardInterrupt. However, the hardreset() clause seems to be never arrived at.

So my invocation of loop.run_forever() doesn't do the right kind of exception-raising from the multiple async coroutines, but I don't know what does.

peterhinch commented 6 years ago

I haven't time right now to study the detail, but task cancellation is crucial in applications of this type. It can now be done and I have just finished working on a mechanism to achieve it. I suggest you revisit my uasyncio library and also look at the latest update to my Pyboard MQTT library: the code and example programs use this mechanism.

As I understand it uasyncio will never support CPython's Task and Future classes which are seen as bloated. My approach is intended as a lightweight way to achieve task cancellation using the new code in uasyncio designed to support timeouts.

I'll look at your code in more detail probably tomorrow.

Pete

peterhinch commented 6 years ago

I think the best approach is to handle exceptions within the coroutine in which they occur and to use another means to communicate to the main task the fact that an exception has occurred. The synchronisation primitives (notably the Event class) may be used for this purpose. Alternatively tasks can communicate via global variables or, if they are class members, via bound variables or methods.