peterhinch / micropython-mqtt

A 'resilient' asynchronous MQTT driver. Recovers from WiFi and broker outages.
MIT License
549 stars 116 forks source link

Weird behavior and lockup on ESP32-S2 Mini #110

Closed zcattacz closed 1 year ago

zcattacz commented 1 year ago

I modified the range.py to make a stress test of mqtt_as on ESP32. The output looks weird and the mpremote serial connection randomly breaks with SerialException, in the meantime the message on mqtt broker stops. Could you help elaberate how this happens and what can be done to make the stress test keep going ?

async def main(client):
    try:
        await client.connect()
    except OSError:
        print('Connection failed.')
        return
    #for task in (up, down, messages):
    #    asyncio.create_task(task(client))
    # with/without these tasks it doesn't make difference.
    await client.up.wait()
    client.up.clear()
    n = 0
    while True:
        await asyncio.sleep(2)
        print(f'>{n}',end="")
        for i in range(0, 20):
            print(f"-{i}", end="")
            # If WiFi is down the following will pause for the duration.
            await client.publish(TOPIC, "{" +
                                 '"id": {}, "repubs": {}, "outages": {}'.format(n, client.REPUB_COUNT, outages)
                                 +'}', qos = 0)
            await asyncio.sleep(0)  # doesn't seem to affect the out of order print(i)s
            n += 1
            # await asyncio.sleep(0) 
        print(f"={n}")
        # await asyncio.sleep(0) # with this, upto [160~200)
        gc.collect() # with this upto [500~600)

example output:

$ mpremote run range.py 
Checking WiFi integrity.
Got reliable connection
Connecting to broker.
Connected to broker.
>0-0-1-2-3-4-5-6-7-8-9-10-11-12-13-14-15-16-17-18-19=20
>20-0-1-2-3-4-5-6-8-9-10-11-12-14-15-16-18=40
>40-0-1-2-3-4-5-6-7-8-9-10-11-12-13-14-15-17-18-19=60
>60-0-1-2-3-4-5-6-7-8-9-10-11-12-13-14-15-16-17-18-19=80
>80-0-1-3-4-5-6-7-8-9-11-12-13-15-16-17-19
>100-0-1-2-4-6-8-10-12-13-14-16-17-18=120
RAM free 2027472 alloc 21616
>120-0-1-2-3-4-5-7-8-9-10-11-12-13-14-15-16-17-19
>140-0-1-2-3-4-5-6-7-8-10-11-12-13-14-15-16-18-19=160
>160-0-1-2-3-4-5-6-7-8-9-10-11-12-13-14-15-16-17-18-19
>180-0-2-3-4-5-6-8-10-11-12-14-15-16-17-18-19=200
>200-0-1-2-3-5-7-8-9-11-12-13-15-16-17-18-19=220
>220-0-1-2-4-5-6-8-10-12-13-14-16-17-18-19=240
>240-0-1-2-3-4-5-6RAM free 2026992 alloc 22096
-7-8-9-11-12-13-14-15-16-17-19=260
>260-0-1-2-4-5-6-8-9-10-11-12-13-14-16-18-19=280
>280-0-1-3-4-5-6-7-8-9-10-11-12-13-14-15-16-17-18-19
-0-2-4-6-8-9-10-12-13-14-15-16-17-18-19=320
>320-0-1-2-3-4-5-6-7-8-9-11-12-13-14-15-16-17-18-19
>340-0-1-2-4-5-6-8-10-12-14Traceback (most recent call last):
  File "/home//.local/bin/mpremote", line 8, in <module>
    sys.exit(main())
  File "/home//.local/lib/python3.9/site-packages/mpremote/main.py", line 502, in main
    handler_func(state, args)
  File "/home//.local/lib/python3.9/site-packages/mpremote/commands.py", line 217, in do_run
    _do_execbuffer(state, buf, args.follow)
  File "/home//.local/lib/python3.9/site-packages/mpremote/commands.py", line 190, in _do_execbuffer
    ret, ret_err = state.pyb.follow(timeout=None, data_consumer=pyboard.stdout_write_bytes)
  File "/home//.local/lib/python3.9/site-packages/mpremote/pyboard.py", line 365, in follow
    data = self.read_until(1, b"\x04", timeout=timeout, data_consumer=data_consumer)
  File "/home//.local/lib/python3.9/site-packages/mpremote/pyboard.py", line 310, in read_until
    elif self.serial.inWaiting() > 0:
  File "/usr/lib/python3/dist-packages/serial/serialutil.py", line 594, in inWaiting
    return self.in_waiting
  File "/usr/lib/python3/dist-packages/serial/serialposix.py", line 531, in in_waiting
    s = fcntl.ioctl(self.fd, TIOCINQ, TIOCM_zero_str)
OSError: [Errno 5] Input/output error

On the broker the count dies at 505. I tried adding sleep, increase, decrease, sleep value but never managed to keep it runing for too long, somewhere under 600, the board always stops responding.

platform info:
(sysname='esp32', nodename='esp32', release='1.20.0', version='v1.20.0 on 2023-04-26', machine='LOLIN_S2_MINI with ESP32-S2FN4R2')
zcattacz commented 1 year ago

On ESP32 I can't replicate this issue with the same script as above: there is no missing line, there is no broken serial connection. everything works ok. though there are some visable occasional backlog on MQTT Explorer graph.

$ mpremote run range.py 
Checking WiFi integrity.
Got reliable connection
Connecting to broker.
Connected to broker.
>0-0-1-2-3-4-5-6-7-8-9-10-11-12-13-14-15-16-17-18-19=20
>20-0-1-2-3-4-5-6-7-8-9-10-11-12-13-14-15-16-17-18-19=40
>40-0-1-2-3-4-5-6-7-8-9-10-11-12-13-14-15-16-17-18-19=60
>60-0-1-2-3-4-5-6-7-8-9-10-11-12-13-14-15-16-17-18-19=80
>80-0-1-2-3-4-5-6-7-8-9-10-11-12-13-14-15-16-17-18-19=100
>100-0-1-2-3-4-5-6-7-8-9-10-11-12-13-14-15-16-17-18-19=120
RAM free 89408 alloc 21760
>120-0-1-2-3-4-5-6-7-8-9-10-11-12-13-14-15-16-17-18-19=140
>140-0-1-2-3-4-5-6-7-8-9-10-11-12-13-14-15-16-17-18-19=160
...
>21320-0-1-2-3-4-5-6-7-8-9-10-11-12-13-14-15-16-17-18-19=21340
>21340-0-1-2-3-4-5-6-7-8-9-10-11-12-13-14-15-16-17-18-19=21360
>21360-0-1-2-3-4-5-6-7-8-9-10-11-12-13-14-15-16-17-18-19=21380
RAM free 89216 alloc 21952
>21380-0-1-2-3-4-5-6-7-8-9-10-11-12-13-14-15-16-17-18-19=21400
>21400-0-1-2-3-4-5-6-7-8-9-10-11-12-13-14-15-16-17-18-19=21420
>21420-0-1-2-3-4-5-6-7-8-9-10-11-12-13-14-15-16-17-18-19=21440
...

# (sysname='esp32', nodename='esp32', release='1.20.0', version='v1.20.0 on 2023-04-26', machine='ESP32 module with ESP32')

image

peterhinch commented 1 year ago

We have never attempted to send messages at this rate. The only explanation I can think of for lines like >100-0-1-2-4-6-8-10-12-13-14-16-17-18=120 is that serial communications are being disrupted and characters are being dropped. Investigating this extreme use case would be quite challenging. In my view the module is not suitable for very high message rates (or extremely long messages, another issue which has come up).

zcattacz commented 1 year ago

Not seeking high message rate here. The purpose of the test was to identity a base line for reliable operation. as you can see the burst in test is paced at 2s. In my use case the message reported is driven by event, I can imagine minor situations of such burst. delayed message is not an issue. But I didn't expect the crash on S2. The extreme long message case doesnt apply to this test.

As you can see from the 2nd post the random loss, and finally crash/lockup seems to only happen on S2(2M ram) mini only. Could you give some advice on possible causes on the crash, how to debug?

zcattacz commented 1 year ago

Just another day, it ... works now on S2, without any change (hardware/software) ...

20pub burst pacing down to @0.2s image

20pubs burst pacing down to @0.1s image

Now pacing @2s or @0.1s it runs 12K cycles without crash. I'll try the lib and see how it works. I don't see how it could happened yesterday. I pressed reset endless times, also plug and unplug the USB. The board is also powered by external power. Except for a cold full power off, I don't see anything I could have missed... Do you have any similar experience ?

peterhinch commented 1 year ago

I'm afraid I don't. Our testing was focussed on long term running (weeks) and running under conditions of poor WiFi RSSI with disconnections. Message rates were low, as in the published demos. We never tested bursts of QOS0 pubs. However I can't see anything in the code which makes such usage problematic.

One very general point about hosts with SPIRAM: garbage collect blocks for at least 100ms. That is my measurement: other users have seen twice that. This shouldn't break anything, but it will obviously affect realtime performance. The module performs GC once per second. This is because of its history as a solution for ESP8266. Even if this was removed, GC would happen eventually, so I think regular GC should stay. None of this should affect reliability, but may explain gaps in the bursts of messages.

zcattacz commented 1 year ago

@peterhinch , GC help the code go through twice more loops on that weird day, maybe it's just that delay you mentioned.

I am integrating a card reader to report id detected, the new IDs are reported immediately, existence IDs are reported at paced interval to monitor their existence. So up to 3-5 pubs in a queue in very short time is inevitable in this case, I just hope the program won't choke to death, though the burst should be in very rare case. To me some stress test is still essential for long term reliable operation.

peterhinch commented 1 year ago

Two comments:

  1. Given that you are using qos 0 I believe that rapidly repeated publications should work. GC will cause gaps between successive pubs, but should not cause anything to break or pubs to be lost.
  2. However, I may be wrong: as far as I know, you are the first to test this. If so, you could implement a queue. Intended pubs would be put on the queue. A single task would retrieve pubs from the queue and do the publication, pausing for an interval before retrieving the next. This would limit the burst rate.
zcattacz commented 1 year ago

Hi thank you for the tips. I think I find the cause now. Under stress, the webrepl activated in main.py just start to get in the way. Not sure how they interacted, but running the script in webrepl (usb not connected, copy+paste 5line by 5line ... seriously why not more lines ...) won't trigger any crash, just some really bad delay. It only crash mpy after ^C is pressed to stop execution. I tried to use machine.WDT() to reset after crash, but it won't help. Maybe the crash has cause some deeper trouble... adding werepl.stop() to the begining of the test script seems to fixe the problem.

peterhinch commented 1 year ago

I do not use webrepl or any IDE. I recommend a command line approach with mpremote or, for some purposes, my fork of rshell.

zcattacz commented 1 year ago

The board operation is interrupted when 3V3 and USB are both powered. plus my external VCC is set to 3.58v to compensate voltage drop for a separate PCB over a long wire. Thus I prefer wireless connection for minor testing (no USB), serial for complete script debugging (no external power)...

peterhinch commented 1 year ago

I use an FTDI adaptor in such situations. Having webrepl and the application squabbling over the interface has always struck me as sub-optimal :)

zcattacz commented 1 year ago

I think maybe I can cut off VDD line in type-c cable as poor man's USB power isolator :-p, not sure if this going to work, since type-c connector has so many pins not sure how they are connected.

zcattacz commented 1 year ago

yep, taping the VDD pin in the connector works as expected.