Closed zcattacz closed 1 year ago
On ESP32 I can't replicate this issue with the same script as above:
there is no missing line, there is no broken serial connection. everything works ok.
though there is some occasional visible backlog on the MQTT Explorer graph.
$ mpremote run range.py
Checking WiFi integrity.
Got reliable connection
Connecting to broker.
Connected to broker.
>0-0-1-2-3-4-5-6-7-8-9-10-11-12-13-14-15-16-17-18-19=20
>20-0-1-2-3-4-5-6-7-8-9-10-11-12-13-14-15-16-17-18-19=40
>40-0-1-2-3-4-5-6-7-8-9-10-11-12-13-14-15-16-17-18-19=60
>60-0-1-2-3-4-5-6-7-8-9-10-11-12-13-14-15-16-17-18-19=80
>80-0-1-2-3-4-5-6-7-8-9-10-11-12-13-14-15-16-17-18-19=100
>100-0-1-2-3-4-5-6-7-8-9-10-11-12-13-14-15-16-17-18-19=120
RAM free 89408 alloc 21760
>120-0-1-2-3-4-5-6-7-8-9-10-11-12-13-14-15-16-17-18-19=140
>140-0-1-2-3-4-5-6-7-8-9-10-11-12-13-14-15-16-17-18-19=160
...
>21320-0-1-2-3-4-5-6-7-8-9-10-11-12-13-14-15-16-17-18-19=21340
>21340-0-1-2-3-4-5-6-7-8-9-10-11-12-13-14-15-16-17-18-19=21360
>21360-0-1-2-3-4-5-6-7-8-9-10-11-12-13-14-15-16-17-18-19=21380
RAM free 89216 alloc 21952
>21380-0-1-2-3-4-5-6-7-8-9-10-11-12-13-14-15-16-17-18-19=21400
>21400-0-1-2-3-4-5-6-7-8-9-10-11-12-13-14-15-16-17-18-19=21420
>21420-0-1-2-3-4-5-6-7-8-9-10-11-12-13-14-15-16-17-18-19=21440
...
# (sysname='esp32', nodename='esp32', release='1.20.0', version='v1.20.0 on 2023-04-26', machine='ESP32 module with ESP32')
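The range.py script itself isn't reproduced in this thread; for context, here is a minimal sketch of the kind of loop that would produce progress lines in the shape above (the `publish` coroutine, topic name, and pacing value are assumptions, not the actual script):

```python
import asyncio

async def burst(publish, count, n=20, pace_s=2.0):
    # Send n QoS 0 publishes back to back, echoing progress as
    # ">start-0-1-...-(n-1)=end", then pause before the next burst.
    line = ">{}".format(count)
    for i in range(n):
        await publish("shed/result", str(count + i))  # hypothetical topic
        line += "-{}".format(i)
    count += n
    print("{}={}".format(line, count))
    await asyncio.sleep(pace_s)  # burst pacing (2 s in the log above)
    return count

async def main(publish, cycles=3, pace_s=2.0):
    count = 0
    for _ in range(cycles):
        count = await burst(publish, count, pace_s=pace_s)
    return count
```

On the device, `publish` would be `mqtt_as`'s `client.publish` wrapped with the topic; here it is left injectable so the loop shape is clear.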
We have never attempted to send messages at this rate. The only explanation I can think of for lines like
>100-0-1-2-4-6-8-10-12-13-14-16-17-18=120
is that serial communications are being disrupted and characters are being dropped. Investigating this extreme use case would be quite challenging. In my view the module is not suitable for very high message rates (or extremely long messages, another issue which has come up).
I'm not seeking a high message rate here. The purpose of the test was to identify a baseline for reliable operation. As you can see, the bursts in the test are paced at 2 s. In my use case the reported messages are event-driven, and I can imagine minor situations producing such bursts; delayed messages are not an issue. But I didn't expect the crash on the S2. The extremely long message case doesn't apply to this test.
As you can see from the 2nd post, the random loss and the eventual crash/lockup seem to happen only on the S2 (2 MB RAM) mini. Could you give some advice on possible causes of the crash and how to debug it?
Just another day later, it ... works now on the S2, without any change (hardware or software) ...
20-pub bursts, pacing down to 0.2 s
20-pub bursts, pacing down to 0.1 s
Now, paced at 2 s or 0.1 s, it runs 12K cycles without a crash. I'll try the lib and see how it works. I don't see how it could have happened yesterday. I pressed reset endless times, and plugged and unplugged the USB. The board is also powered externally. Short of a cold full power-off, I don't see anything I could have missed... Have you had any similar experience?
I'm afraid I don't. Our testing was focussed on long term running (weeks) and running under conditions of poor WiFi RSSI with disconnections. Message rates were low, as in the published demos. We never tested bursts of QOS0 pubs. However I can't see anything in the code which makes such usage problematic.
One very general point about hosts with SPIRAM: a garbage collection pass blocks for at least 100 ms. That is my measurement: other users have seen twice that. This shouldn't break anything, but it will obviously affect realtime performance. The module performs GC once per second. This is because of its history as a solution for ESP8266. Even if this was removed, GC would happen eventually, so I think regular GC should stay. None of this should affect reliability, but it may explain gaps in the bursts of messages.
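The GC blocking time is easy to check on a given board. A small sketch (the `time.ticks_ms` branch applies on MicroPython; the `time.monotonic` fallback is only there so the same function runs on desktop Python):

```python
import gc
import time

def gc_block_time_ms():
    # Time a single garbage collection pass. On SPIRAM boards this is
    # reportedly 100 ms or more, during which the scheduler is blocked.
    if hasattr(time, "ticks_ms"):            # MicroPython
        t0 = time.ticks_ms()
        gc.collect()
        return time.ticks_diff(time.ticks_ms(), t0)
    t0 = time.monotonic()                    # desktop CPython fallback
    gc.collect()
    return int((time.monotonic() - t0) * 1000)
```

Calling this periodically on the S2 alongside the stress test would show whether the gaps in the bursts line up with collection pauses.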
@peterhinch, GC helped the code get through twice as many loops on that weird day; maybe it was just the delay you mentioned.
I am integrating a card reader that reports detected IDs: new IDs are reported immediately, while existing IDs are reported at a paced interval to monitor their presence. So a queue of up to 3-5 pubs within a very short time is inevitable in this case. I just hope the program won't choke to death, though such bursts should be very rare. To me some stress testing is still essential for long-term reliable operation.
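One way to absorb such bursts without flooding the link is a small paced sender in front of the publish call. A sketch, assuming an injectable `publish` coroutine (on the device, `mqtt_as`'s `client.publish` would slot in, and a `Queue` primitive would come from micropython-lib since it is not built into MicroPython's asyncio):

```python
import asyncio

async def paced_sender(queue, publish, gap_s=0.1):
    # Drain queued (topic, msg) pairs one at a time, inserting a short
    # gap so a burst of card detections is spread out on the wire
    # instead of being sent back to back.
    while True:
        topic, msg = await queue.get()
        await publish(topic, msg)
        await asyncio.sleep(gap_s)
```

On the device this would run as a background task via `asyncio.create_task`, and the ID-detection event handler would simply `queue.put_nowait((topic, msg))` and return immediately.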
Two comments:
Hi, thank you for the tips.
I think I've found the cause now. Under stress, the WebREPL activated in main.py starts to get in the way. I'm not sure how they interact, but running the script over WebREPL (USB not connected, copy+pasting 5 lines at a time ... seriously, why not more lines ...) doesn't trigger any crash, just some really bad delay. It only crashes MicroPython after ^C is pressed to stop execution.
I tried using machine.WDT() to reset after the crash, but it doesn't help. Maybe the crash causes some deeper trouble...
Adding webrepl.stop() at the beginning of the test script seems to fix the problem.
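A defensive version of that stop call could look like this sketch (guarded so the same script also runs where the `webrepl` module isn't present; the function name is my own):

```python
def stop_webrepl():
    # Shut down WebREPL before the stress test so it cannot contend
    # with the application for the network and REPL. Returns True if
    # WebREPL was importable and has been stopped, False otherwise.
    try:
        import webrepl
        webrepl.stop()
        return True
    except ImportError:
        return False

stop_webrepl()
```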
I do not use webrepl or any IDE. I recommend a command line approach with mpremote or, for some purposes, my fork of rshell.
Board operation is interrupted when 3V3 and USB are both powered. Also, my external VCC is set to 3.58 V to compensate for the voltage drop to a separate PCB over a long wire. So I prefer a wireless connection for minor testing (no USB), and serial for complete script debugging (no external power)...
I use an FTDI adaptor in such situations. Having webrepl and the application squabbling over the interface has always struck me as sub-optimal :)
I think maybe I can cut the VDD line in a Type-C cable as a poor man's USB power isolator :-p. Not sure if this is going to work, since the Type-C connector has so many pins and I'm not sure how they are connected.
Yep, taping over the VDD pin in the connector works as expected.
I modified range.py to make a stress test of mqtt_as on ESP32. The output looks weird and the mpremote serial connection randomly breaks with SerialException; meanwhile the messages on the mqtt broker stop. Could you help explain how this happens and what can be done to keep the stress test going?
example output:
On the broker the count dies at 505. I tried adding a sleep, and increasing or decreasing the sleep value, but never managed to keep it running for long; somewhere under 600, the board always stops responding.