Closed nerdoug closed 4 months ago
I ran 10 tests in which I downloaded a test script (gait3x2) with the 202 msec check interval and checked 2 things: 1) that no lines in the script were dropped, and 2) whether there were any gaps of 2 check times (the interval was 202, so I was looking for gaps of 404) in the arrival times in mqttBroker.
In all cases there were no dropped script lines, although I thought I saw that once earlier today.
In 7 of the 10 tests, there was exactly one double length gap in the 44 lines of the script.
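The gap check described above can be sketched in a few lines. This is only an illustration of the idea, not code from the project: the function name `findLongGaps`, the `multiple` parameter, and the 5 msec tolerance are all assumptions, and the timestamps in the usage note are made up.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: returns the indices i where the gap between
// arrival i-1 and arrival i is at least `multiple` check intervals long.
// toleranceMs (an assumed value) allows for small timing jitter.
std::vector<std::size_t> findLongGaps(const std::vector<uint32_t>& arrivalMs,
                                      uint32_t checkIntervalMs,
                                      uint32_t multiple = 2,
                                      uint32_t toleranceMs = 5) {
    assert(checkIntervalMs > 0);
    std::vector<std::size_t> longGaps;
    const uint32_t threshold = checkIntervalMs * multiple;
    for (std::size_t i = 1; i < arrivalMs.size(); ++i) {
        // Unsigned subtraction stays correct even if the msec counter wraps.
        const uint32_t delta = arrivalMs[i] - arrivalMs[i - 1];
        if (delta + toleranceMs >= threshold) {
            longGaps.push_back(i);
        }
    }
    return longGaps;
}
```

For example, with invented arrival times of 0, 202, 404, 808 and 1010 msec and a 202 msec check interval, only the fourth message (index 3) follows a double-length gap of 404 msec.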
I changed the control parameter from 202 to 41 by editing the line in global_variables.cpp that starts with loopTaskVars(checkMqtt,
I'm about to run another 10 script download tests to see what the MQTT arrival times look like.
Did a download test with the MQTT check interval set to 41 msec, and didn't lose any script lines. However, the time intervals between the bot's initial processing of the commands varied from 41 to 287 msec, always a multiple of 41. The most common interval was 246. Need to put more debug statements in to get more timing info. Attaching a spreadsheet with delta times for MQTT command arrivals at the bot, and the full console log from the same run.
Questions:
Captured a tcpdump of broker I/O during a script download (attached). Script lines come in at even 200 msec intervals, but are sent to the robot at irregular intervals. It looks like the broker is waiting for an ack from the robot before it sends the next msg. Every delayed msg comes right after the robot sends an ack to the broker. I can't find the robot code that initiates sending of the ack (possibly buried in a library), but I'm wondering if ack transmission is delayed by our loop dispatcher, which only checks for work to do periodically.
Hmmm. Could be TCP/IP acks that are delayed, rather than MQTT acks. Did traces with qos = 0, 1, and 2. Will check to see if they all have the delayed ack issue, which would tend to say it's not at the MQTT level.
traces for wireshark analysis: mqttfx-qos.zip
good display filter: ip.addr == 10.0.0.200 or ip.addr == 10.0.0.165 or ip.addr == 10.0.0.186
The players, by last IP byte: .200 = broker, .165 = laptop (MQTT.fx), .186 = Doug's robot
For the qos 0 trace, see frames 99 - 102; for the qos 1 trace, see frames 47 - 50; for the qos 2 trace, see frames 55 - 58.
This might be relevant. Need to read some of the references: https://github.com/eclipse/mosquitto/issues/1590 Nope, don't think that applies to us, because we're not using secure HTTP, and it's not the Windows side that's delaying acks for us.
Need to run some more traces with power saving disabled as in Andrew's earlier reference, which was: https://superuser.com/questions/1393936/mqtt-large-delay-between-messages#1394315
made the change to aaNetwork.cpp, using esp_wifi_set_ps (WIFI_PS_NONE); (note initial lower case e) with this near start of file:
// needed per https://superuser.com/questions/1393936/mqtt-large-delay-between-messages#1394315
And made 3 more tcpdump traces on broker using this command: tcpdump -i any -s 65535 -w /tmp/mqttfxM-qos2
think I did a sudo su before that to get superuser power. Had to rename files to have .cap so Wireshark could read them - should have created them with something like: tcpdump -i any -s 65535 -w /tmp/mqttfxM-qos2.cap
following zip file has traces for qos 0, 1, and 2
MQTT broker code always copies an incoming message to one of the buffers (always the top one), after shuffling all other buffers down to make space for it (even if they didn't hold messages). I added code (in aaStringQueue::push) to output the buffers-in-use count if it exceeds 1. This happens frequently with message spacing of 200 msec, but does not occur with 300 msec spacing. Also strange is that the arrival time of the messages that are queued (as seen in aaMqtt.cpp/onMqttMessage()) is the same. So it looks like the arrival time of the message is changing, which seems strange. I should do a simultaneous network capture to see if that's actually the case.
This implies that processing of some MQTT messages is taking more than 200 msec. That seems a bit long, and maybe some optimization is in order. Changing message buffering to a ring buffer is a candidate, to reduce the unneeded buffer copying.
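To make the ring buffer idea concrete, here is a minimal sketch of what such a queue could look like. It is not the existing aaStringQueue code: the class name `CommandRing` and its interface are hypothetical, and the default sizes of 5 slots and 200 characters are taken from the BUFFER_MAX_SIZE and COMMAND_MAX_LENGTH values reported later in this thread. It also ignores the fact that the MQTT callback runs asynchronously to loop(), which a real version would need to guard against.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical fixed-capacity ring buffer for command strings.
// push() writes into the next free slot and pop() reads the oldest one,
// so no slot is ever shuffled or copied to make room.
template <std::size_t Capacity = 5, std::size_t MaxLen = 200>
class CommandRing {
public:
    // Copies msg into the next free slot. Returns false (dropping the
    // message) if the ring is full or the message does not fit.
    bool push(const char* msg) {
        const std::size_t len = std::strlen(msg);
        if (count_ == Capacity || len >= MaxLen) return false;
        std::memcpy(slots_[tail_].data(), msg, len + 1);  // include '\0'
        tail_ = (tail_ + 1) % Capacity;
        ++count_;
        return true;
    }

    // Copies the oldest message into out (which must hold at least MaxLen
    // bytes) and frees its slot. Returns false if the ring is empty.
    bool pop(char* out) {
        if (count_ == 0) return false;
        std::strcpy(out, slots_[head_].data());
        head_ = (head_ + 1) % Capacity;
        --count_;
        assert(count_ < Capacity);  // invariant after a successful pop
        return true;
    }

    // Number of messages currently queued.
    std::size_t size() const { return count_; }

private:
    std::array<std::array<char, MaxLen>, Capacity> slots_{};
    std::size_t head_ = 0;   // index of the oldest queued message
    std::size_t tail_ = 0;   // index of the next free slot
    std::size_t count_ = 0;  // messages currently queued
};
```

Each push costs one copy of the incoming message and each pop one copy out, regardless of how many messages are queued. A failed push gives a natural place to log that the queue was overrun.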
Did a tcpdump trace at the same time as I did a MQTT.fx TEST command followed by script download. The delay between script lines was 200 msec, and the dispatching interval for checkMqtt was 41 msec. Here's the network trace, and the serial console output:
The first case of queuing up an MQTT command comes at console time 56637:
H) processCmd{6}[56162]> command received: FLOW
onMsg@ 56377
H) processCmd{6}[56408]> command received: FLOW
onMsg@ 56637
onMsg@ 56637
Some quick notes before I forget them:
- I'm seeing some IPv6 traffic in network traces, but the IP addresses are all ::1.
- Wondering if Nagle's algorithm is slowing down the transmissions from broker to robot. There's a mosquitto option to disable it, which would be worth trying. Info on Nagle's algorithm: https://networkencyclopedia.com/nagles-algorithm/
Here's another explanation that may be clearer: https://en.wikipedia.org/wiki/Nagle%27s_algorithm
I have modified the mosquitto configuration to disable Nagle's algorithm in 2 places: /etc/mosquitto/mosquitto.conf and /etc/mosquitto/conf.d/mosquitto.conf
In each file I added the text: set_tcp_nodelay true
I didn't see any messages saying it was disabled in the log: /var/log/mosquitto/mosquitto.log
I suspect the problem is a well-known TCP optimization called delayed ACKs, which is built in to either Mosquitto or the Raspberry TCP/IP stack. I can't find any way to disable it.
After some thought, I'm not sure this is a real problem. My previous observations that we're losing MQTT data were wrong, and were due to my own actions to reduce the work done in the script for simplified debugging, and an error in implementing command identifiers in the milliseconds field of the FLOW command.
To verify this, I cranked the time between MQTT.fx transmission of commands down to 100 msec. Normal operation is that every MQTT command is entered into the multi-buffer structure, and is extracted when the command is processed. This means that having 1 command "queued" is normal. With 200 msec delays between MQTT.fx transmissions, I occasionally saw 2 commands buffered. With 100 msec delays I saw quite a few cases with a queue of 2, and occasionally, a queue of 3. However, I never saw any data loss, as demonstrated by debug displays of what arrived at the robot.
With a delay of 50 msec between MQTT.fx transmissions, we overran the depth of the queuing system. I saw 6 commands queued, and one command was lost. The number of buffers available is controlled by BUFFER_MAX_SIZE in aaStringQueue.cpp, and is set to 5. Seeing this get to 6 confirms our queue was overrun. The other control parameter is COMMAND_MAX_LENGTH which is 200.
I'm going to enter another issue for a review of the buffer handling routines. There seems to be more buffer shuffling, copying, and zeroing than is actually needed. This is sensitive, because much of the buffer handling is asynchronous to loop(), and acts like an interrupt that can happen anywhere.
I did make some code changes while investigating this, and I'll leave the issue open until I clean out the debug stuff and document the actual changes.
In Loop, a check is done every 202 msec to see if there is a queued MQTT message to be processed. We send messages from MQTT.fx scripts every 200 msec, and it makes sense that we check for work more frequently than the work arrival rate. We don't seem to have any problems processing the messages, so it's probably safe to decrease the checking interval.
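The effect of the check interval on latency can be illustrated with a small sketch. The names `PeriodicTask` and `waitUntilNextCheck` are hypothetical stand-ins, not the project's dispatcher, and the second function assumes for simplicity that checks land on exact multiples of the interval.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for one entry in the loop dispatcher.
struct PeriodicTask {
    uint32_t intervalMs;  // how often the task should run
    uint32_t lastRunMs;   // time of the most recent run

    // Returns true when at least intervalMs have elapsed since the last
    // run, and records nowMs as the new run time.
    bool due(uint32_t nowMs) {
        assert(intervalMs > 0);
        // Unsigned subtraction stays correct across counter wraparound.
        if (nowMs - lastRunMs < intervalMs) return false;
        lastRunMs = nowMs;
        return true;
    }
};

// Time a message waits from arrival until the next check, assuming the
// checks happen at exact multiples of intervalMs.
uint32_t waitUntilNextCheck(uint32_t arrivalMs, uint32_t intervalMs) {
    assert(intervalMs > 0);
    const uint32_t remainder = arrivalMs % intervalMs;
    return remainder == 0 ? 0 : intervalMs - remainder;
}
```

Under that assumption a message can wait almost a full interval before it is noticed: up to 201 msec with a 202 msec check, but at most 40 msec with a 41 msec check. That is the arithmetic behind checking more often than the 200 msec arrival rate.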