peterhinch / micropython-mqtt

A 'resilient' asynchronous MQTT driver. Recovers from WiFi and broker outages.
MIT License

Malformed packets and subnet traversal #126

Closed zcattacz closed 7 months ago

zcattacz commented 7 months ago

I guess this is more of a discussion than an issue. In my use case the MQTT library needs to send stats data to the server continuously, while other async tasks continuously pull data from a serial device and compute stats on it (the data is queued and the stats are computed in one-shot tasks). The publish payload throughput is around 400~500 bytes per second. I see occasional malformed packets at random intervals in the mosquitto verbose output, which cause mosquitto to disconnect the client. Looking at the packets in Wireshark, they are corrupted in several ways: sometimes a long publish payload is truncated to a short length, sometimes the topic is misaligned with missing bytes. As this goes on, the ESP32-C2 freezes within hours. I wonder if there is anything I can try to remedy this.

The mpy network stuff seems to have some element beyond linear thinking :_) Another interesting thing I noticed: if the ESP32-C2 and the test server are on the same LAN, the disconnects happen less frequently, but they are much more frequent if the server is on a different network segment with only one extra hop in the routing path (sadly I can't sniff traffic on that server). Sometimes it even has difficulty connecting. Has anyone seen anything like this? Is there anything I can tweak to stabilize it?

My serial device driver code can pull and compute stats at 3 reads per 2 seconds, and even faster, with no problem; in the above tests I set it to only 2 reads per 7 seconds. It shouldn't put much load on the device. No interrupt code is used.

peterhinch commented 7 months ago

A few general observations.

When the device "freezes", can it respond to Ctrl-C? If not, this suggests a firmware crash. If it does, the traceback might tell us something. [EDIT] Might https://github.com/micropython/micropython/issues/12819 be relevant?

zcattacz commented 7 months ago

Thank you for the inputs.

As far as I know nobody has tested on ESP32-C2 - only ESP32-S3 and ESP32. It might be worth running one of the demos for a period to build confidence in the platform.

This was a typo, I was working on ESP32-S2 Mini.

This throughput is very high. You quote 400-500 bytes/sec, how many publications/sec?

It varies, 1~3 pub/sec, since the 3 topics report at different intervals.

The primary design aim of mqtt_as was reliable transport of low bandwidth data under adverse wireless conditions. High throughput and large message lengths were not tested.

Acknowledged, which is why this is only a discussion.

As I'm sure you're aware TCP/IP should not allow misalignment. There are conceivable code failure modes that might do this if multiple tasks were to simultaneously transmit. There are interlocks to prevent this and we never observed it in testing.
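The kind of interlock described above can be illustrated with a minimal sketch. This is not the mqtt_as implementation: `FakeSocket`, `send_packet` and the chunk size are hypothetical names chosen for illustration. It only shows how an `asyncio.Lock` held for the whole packet keeps two tasks from interleaving their bytes on a shared stream.

```python
import asyncio

class FakeSocket:
    """Hypothetical stand-in for a stream socket: records bytes in arrival order."""
    def __init__(self):
        self.wire = bytearray()

    async def write(self, chunk):
        self.wire += chunk
        await asyncio.sleep(0)  # yield, as a non-blocking write may do mid-packet

async def send_packet(sock, lock, packet, chunk_size=4):
    # Hold the lock for the entire packet so no other task can write in between.
    async with lock:
        for i in range(0, len(packet), chunk_size):
            await sock.write(packet[i:i + chunk_size])

async def demo():
    sock = FakeSocket()
    lock = asyncio.Lock()
    # Two tasks try to transmit at the same time.
    await asyncio.gather(
        send_packet(sock, lock, b"A" * 12),
        send_packet(sock, lock, b"B" * 12),
    )
    return bytes(sock.wire)

wire = asyncio.run(demo())
print(wire)
```

Without the `async with lock:` line the chunks of the two packets would alternate on the wire, which is one way a broker could end up seeing a misaligned topic or a truncated payload.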

You are correct. I reviewed my code over the last few days and checked every while True style loop, including those in wifi.py, uping.py and my own reader protocol parser. I did find some possible tight loops with no asyncio.sleep_ms(0).

This workaround for the current connect() lockup, https://github.com/micropython/micropython/issues/8326, was also applied to mqtt_as and uping. All time.sleep() calls in wifi.py and uping.py were replaced with asyncio.sleep_ms().
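As a rough illustration of why tight loops and blocking sleeps matter, the sketch below compares a loop that calls `time.sleep()` with one that awaits `asyncio.sleep()`. The task names (`heartbeat`, `poll_blocking`, `poll_cooperative`) are hypothetical and are not taken from wifi.py or uping.py. It uses `asyncio.sleep()` so that it also runs under CPython; on MicroPython, `asyncio.sleep_ms()` would be the usual choice.

```python
import asyncio
import time

async def heartbeat(ticks):
    # Stands in for housekeeping (e.g. network servicing) that must run regularly.
    while True:
        ticks[0] += 1
        await asyncio.sleep(0.01)

async def poll_blocking(n):
    for _ in range(n):
        time.sleep(0.01)           # blocks the scheduler: no other task can run

async def poll_cooperative(n):
    for _ in range(n):
        await asyncio.sleep(0.01)  # yields: other tasks run while this one waits

async def count_ticks(poller):
    ticks = [0]
    hb = asyncio.create_task(heartbeat(ticks))
    await poller(10)
    hb.cancel()
    return ticks[0]

blocked = asyncio.run(count_ticks(poll_blocking))
cooperative = asyncio.run(count_ticks(poll_cooperative))
print(blocked, cooperative)
```

In the blocking case the heartbeat never gets a chance to run at all, whereas in the cooperative case it runs alongside the polling loop.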

It also turned out there was a typo from my editing of mqtt_as._publish(): the first dup=0 had been changed to dup=1 for some unknown reason.
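For context on why the dup value matters: in MQTT 3.1.1 the first byte of a PUBLISH packet carries the packet type in the upper four bits, then DUP (bit 3), QoS (bits 2-1) and RETAIN (bit 0). DUP should be 0 on the first transmission and is only set on a retransmission; it must be 0 for QoS 0. The helper below is a hypothetical sketch of that layout, not the code in mqtt_as.

```python
def publish_first_byte(qos, retain=False, dup=False):
    """Build the first fixed-header byte of an MQTT 3.1.1 PUBLISH packet."""
    if qos not in (0, 1, 2):
        raise ValueError("qos must be 0, 1 or 2")
    if dup and qos == 0:
        # The specification requires DUP = 0 for all QoS 0 messages.
        raise ValueError("DUP must be 0 for QoS 0")
    # 0x30 = packet type 3 (PUBLISH) in the upper nibble.
    return 0x30 | (int(dup) << 3) | (qos << 1) | int(retain)

first_send = publish_first_byte(qos=1)             # DUP clear on first attempt
resend = publish_first_byte(qos=1, dup=True)       # DUP set only when retransmitting
print(hex(first_send), hex(resend))
```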

After these steps, packet corruption became rare. But ... it still randomly freezes within hours without any hint of abnormality.

When the device "freezes", can it respond to Ctrl-C? If not, this suggests a firmware crash. If it does, the traceback might tell us something.

The device loses its USB connection when frozen, and the node won't respond to ping/ARP. The WDT is set and fed, but no reset is triggered.

Out of despair, I moved my code base to a generic ESP32 board for a last try, hoping to capture the traceback you mentioned (not sure if this expectation is valid, just because it seems to have a separate USB-serial chip?).

Magically, it has run for 8h+ without a problem. I wonder if there is anything peculiar about the S2 Mini.

Might https://github.com/micropython/micropython/issues/12819 be relevant?

Maybe I'll try 1.19.1 on my S2 mini.

ebolisa commented 7 months ago

Maybe I'll try 1.19.1 on my S2 mini.

That explains why my devices hang every so often. I solved the issue with a band-aid: doing a reset around 3am :))

peterhinch commented 7 months ago

Reverting to the original ESP32 reference board is a useful debugging approach. There have been issues with the later ESP32 variants such as problems with SPIRAM and with the built-in USB disagreeing with RF output on certain channels. A basic ESP32 is a good reference point.

zcattacz commented 7 months ago

I uploaded my adapted script. It has been tested sending continuously for several nights, and I think it's OK now. Closing this, thanks for the tips and ideas.