Closed mhaberler closed 5 months ago
I also tried MQTTX, very similar symptom: https://static.mah.priv.at/cors/pcap/mqttx.pcapng packet 697
Thanks for raising this. It could actually be a bug in PicoWebsocket, but it might be hard to reproduce without PicoMQTT.
It looks like in packet 776 in the first capture file there's a websocket frame with invalid length. The Websocket frame length is set to 100 bytes, but data seems to continue beyond that.
The client expects another websocket frame header after the first 100 bytes, but it receives payload data, which can't be decoded.
So this could mean that the bug is related to how the payload length is encoded. I can look into that.
If you're using ESP32, can you check if publishes are always fired from the same core/thread/task as the one which calls the loop()
method? I'm asking because the payload data looks like measurements captured via BLE. Depending on how this is implemented, the BLE advertisement callbacks on the ESP32 can be called from the 2nd core. This can easily lead to problems like the ones we're observing here, because PicoMQTT and PicoWebsocket are not reentrant.
re ESP32: no, that cant be it
the BLE ads are passed into a ringbuffer (probably on core0) which are read and decoded, then published from the main loop() which is core1
the payload for packed 776 was published here
oh, now that you say it I smell something.. stop the presses ;)
the second level interrupt handler is a FreeRTOS thread, and is not bound to a core it also calls the GPS polling code
so it might well be the case what you're suspicing
let me tack that thread to core1 or move the GPS polling code to loop() one way or the other
moved that code to be called by loop() and no, pretty sure this was not the cause
the error reappears with other publishes
interestingly a similar issue arises when using MQTT-over-TCP with the same code
see this pcap https://static.mah.priv.at/cors/pcap/tcp.pcapng packet 710 onwards
I guess this excludes the Websockets code for now
well this is strange.. I switched to the PicoSettings begin/live scheme which activates settings subscriptions later in the startup process, and gets around the chicken-and-egg problem https://github.com/mhaberler/PicoSettings/issues/1
and I cannot observe the errors reported above any more, no more disconnects
MQTTX, MQTT Explorer and a web client connected, turned up reporting rates.. nothing
will let run over night and check in the morning
all that changed is - the Settings subscribes and publishes happen later in the setup phase
at this point I am inconclusive - after the mentioned change in PicoSettings I cannot observe timeouts as before
I have no idea what could be at fault and what "fixed" it - maybe initialisation order
I'll close this issue and reopen once I have a better handle on it
You closed before I could send more tips... Here are a few thoughts:
begin()
. Subscriptions can actually be added at any time. The only place where they shouldn't be changed is when one of the callbacks is currently running (so within the callbacks themselves). I understand that delaying the subscriptions until after the call to begin()
seemed to fix the problem, but I think it's just a coincidence.mutexes
. It introduces basic thread-safety. A mutex is locked in loop()
and when any publish is in progress. Give it a try and see what happens. If PicoMQTT is actually used from multiple threads and we're missing it, this might magically fix it or cause a deadlock. PicoSettings:SettingsBase
non-copyable and non-copy constructible (defined here. The subscription callbacks of its subclasses capture this
, which can be pointing to the wrong place after such operations. Unsubscribing in the destructor is also a good idea. This is probably unrelated, but still worth fixing.thanks! I'll go through the callbacks just to make sure
re ringbuffer: I use the ESP-IDF ringbuffer, for instance here and here - very nice zero-copy scheme and robust, never given me troubles
mutexes
branch - thanks, will give it a try. The timer task and the BLE task are always suspicious
PicoSettings - will give it a stab
in case someone else has this problem: I had a second such case due to optimistic coding..
I tried the mutexes
branch and it immediately deadlocked at the lock in begin_publish()
very easy to find in the debugger and fix - recommended
a plea for help: I am unable to fix this, either master
or mutexes
branch - and I've used every tool in my box
I took pains to make sure only loopTask will call mqtt.* methods I'm using Arduino 3.0.1 with libraries lifted from Tasmota but that seems to have no discernible impact
the symptoms:
here's a pcap
not sure what to do; running short on time for a demo :-/
the current status is here - it might be too convoluted to build
I can try a pared-down example to reproduce if you're stuck
edit: seems I have a moving target - getting these errors with MQTT-over-TCP now and then:
IncomingPacket read incorrect number of bytes: 0/2
..
edit: I moved the very same code to a different ESP32S3 (the above is from an M5StackCoreS3) and the whole thing runs without a hitch
please put on hold until I know better
I might have looked like an idiot above, but I finally got it working on any ESP32-S3 and C3 device.
It was a most unlikely cause: the M5Unified/M5GFX support library.
The symptoms above only show when M5Unified is in play - it works perfectly if one runs using "pure Arduino".
I have no idea yet what the causality is, but I am positive - the last remaining difference.
if the M5Unified is at fault, the symptom might show with a stock example - need to try.
No worries, great that we know what component caused it.
Do you still need help with this?
no, I do not thinks so - we can close for now and reopen if something surfaces
thanks!
Sure, no problem
I cant quite make sense of this, and where this belongs
Setup: MQTT Explorer master commit 65b86a built with:
PicoMQTT is 172.16.0.159, commit 460c258 on an ESP32-S3 PicoWebsockets commit 42092f MQTT Explorer is 172.16.0.191
Traffic ca maybe 30-50 publishes/sec, memory not an issue (> 190kB free heap)
PicoMQTT console logs:
also:
MQTT Explorer console log:
Pcap: https://static.mah.priv.at/cors/pcap/mqttws-explorer.pcapng
NB from packet 774 onwards strange things happen: