mlesniew / PicoMQTT

ESP MQTT client and broker library
GNU Lesser General Public License v3.0
236 stars 25 forks source link

Websocket disconnects #36

Closed mhaberler closed 4 months ago

mhaberler commented 4 months ago

I cant quite make sense of this, and where this belongs

Setup: MQTT Explorer master commit 65b86a built with:

yarn
yarn build
yarn start

PicoMQTT is 172.16.0.159, commit 460c258 on an ESP32-S3 PicoWebsockets commit 42092f MQTT Explorer is 172.16.0.191

Traffic ca maybe 30-50 publishes/sec, memory not an issue (> 190kB free heap)

PicoMQTT console logs:

[612870][D][NetworkClient.cpp:568] connected(): Disconnected: RES: 0, ERR: 128
[616862][D][NetworkClient.cpp:568] connected(): Disconnected: RES: 0, ERR: 128

also:

IncomingPacket read incorrect number of bytes: 0/2
IncomingPacket read incorrect number of bytes: 0/2

MQTT Explorer console log:

RangeError: Invalid WebSocket frame: RSV2 and RSV3 must be clear
    at Receiver.getInfo (/Users/mah/Ballon/src/BalloonWare/networking/MQTT-Explorer/node_modules/ws/lib/receiver.js:171:14)
    at Receiver.startLoop (/Users/mah/Ballon/src/BalloonWare/networking/MQTT-Explorer/node_modules/ws/lib/receiver.js:131:22)
    at Receiver._write (/Users/mah/Ballon/src/BalloonWare/networking/MQTT-Explorer/node_modules/ws/lib/receiver.js:78:10)
    at writeOrBuffer (node:internal/streams/writable:447:12)
    at _write (node:internal/streams/writable:389:10)
    at Writable.write (node:internal/streams/writable:393:10)
    at Socket.socketOnData (/Users/mah/Ballon/src/BalloonWare/networking/MQTT-Explorer/node_modules/ws/lib/websocket.js:1162:35)
    at Socket.emit (node:events:514:28)
    at addChunk (node:internal/streams/readable:376:12)
    at readableAddChunk (node:internal/streams/readable:349:9) {
  code: 'WS_ERR_UNEXPECTED_RSV_2_3',
  [Symbol(status-code)]: 1002
}

Pcap: https://static.mah.priv.at/cors/pcap/mqttws-explorer.pcapng

NB from packet 774 onwards strange things happen:

mhaberler commented 4 months ago

I also tried MQTTX, very similar symptom: https://static.mah.priv.at/cors/pcap/mqttx.pcapng packet 697

mlesniew commented 4 months ago

Thanks for raising this. It could actually be a bug in PicoWebsocket, but it might be hard to reproduce without PicoMQTT.

It looks like in packet 776 in the first capture file there's a websocket frame with invalid length. The Websocket frame length is set to 100 bytes, but data seems to continue beyond that.

The client expects another websocket frame header after the first 100 bytes, but it receives payload data, which can't be decoded.

So this could mean that the bug is related to how the payload length is encoded. I can look into that.

If you're using ESP32, can you check if publishes are always fired from the same core/thread/task as the one which calls the loop() method? I'm asking because the payload data looks like measurements captured via BLE. Depending on how this is implemented, the BLE advertisement callbacks on the ESP32 can be called from the 2nd core. This can easily lead to problems like the ones we're observing here, because PicoMQTT and PicoWebsocket are not reentrant.

mhaberler commented 4 months ago

re ESP32: no, that cant be it

the BLE ads are passed into a ringbuffer (probably on core0) which are read and decoded, then published from the main loop() which is core1

mhaberler commented 4 months ago

the payload for packed 776 was published here

mhaberler commented 4 months ago

oh, now that you say it I smell something.. stop the presses ;)

the second level interrupt handler is a FreeRTOS thread, and is not bound to a core it also calls the GPS polling code

so it might well be the case what you're suspicing

let me tack that thread to core1 or move the GPS polling code to loop() one way or the other

mhaberler commented 4 months ago

moved that code to be called by loop() and no, pretty sure this was not the cause

the error reappears with other publishes

interestingly a similar issue arises when using MQTT-over-TCP with the same code

see this pcap https://static.mah.priv.at/cors/pcap/tcp.pcapng packet 710 onwards

image
mhaberler commented 4 months ago

I guess this excludes the Websockets code for now

mhaberler commented 4 months ago

well this is strange.. I switched to the PicoSettings begin/live scheme which activates settings subscriptions later in the startup process, and gets around the chicken-and-egg problem https://github.com/mhaberler/PicoSettings/issues/1

and I cannot observe the errors reported above any more, no more disconnects

MQTTX, MQTT Explorer and a web client connected, turned up reporting rates.. nothing

will let run over night and check in the morning

all that changed is - the Settings subscribes and publishes happen later in the setup phase

mhaberler commented 4 months ago

at this point I am inconclusive - after the mentioned change in PicoSettings I cannot observe timeouts as before

I have no idea what could be at fault and what "fixed" it - maybe initialisation order

I'll close this issue and reopen once I have a better handle on it

mlesniew commented 4 months ago

You closed before I could send more tips... Here are a few thoughts:

mhaberler commented 4 months ago

thanks! I'll go through the callbacks just to make sure

re ringbuffer: I use the ESP-IDF ringbuffer, for instance here and here - very nice zero-copy scheme and robust, never given me troubles

mutexes branch - thanks, will give it a try. The timer task and the BLE task are always suspicious

PicoSettings - will give it a stab

mhaberler commented 4 months ago

in case someone else has this problem: I had a second such case due to optimistic coding..

I tried the mutexes branch and it immediately deadlocked at the lock in begin_publish()

very easy to find in the debugger and fix - recommended

mhaberler commented 4 months ago

a plea for help: I am unable to fix this, either master or mutexes branch - and I've used every tool in my box

I took pains to make sure only loopTask will call mqtt.* methods I'm using Arduino 3.0.1 with libraries lifted from Tasmota but that seems to have no discernible impact

the symptoms:

here's a pcap

not sure what to do; running short on time for a demo :-/

the current status is here - it might be too convoluted to build

I can try a pared-down example to reproduce if you're stuck

edit: seems I have a moving target - getting these errors with MQTT-over-TCP now and then:

IncomingPacket read incorrect number of bytes: 0/2
..

edit: I moved the very same code to a different ESP32S3 (the above is from an M5StackCoreS3) and the whole thing runs without a hitch

please put on hold until I know better

mhaberler commented 4 months ago

I might have looked like an idiot above, but I finally got it working on any ESP32-S3 and C3 device.

It was a most unlikely cause: the M5Unified/M5GFX support library.

The symptoms above only show when M5Unified is in play - it works perfectly if one runs using "pure Arduino".

I have no idea yet what the causality is, but I am positive - the last remaining difference.

if the M5Unified is at fault, the symptom might show with a stock example - need to try.

mlesniew commented 4 months ago

No worries, great that we know what component caused it.

Do you still need help with this?

mhaberler commented 4 months ago

no, I do not thinks so - we can close for now and reopen if something surfaces

thanks!

mlesniew commented 4 months ago

Sure, no problem