theAgingApprentice / HexaFloorRide

Arduino based six legged robot project
MIT License
0 stars 0 forks source link

Reduce MQTT processing delay #241

Closed nerdoug closed 4 months ago

nerdoug commented 5 months ago

In Loop, a check is done every 202 msec to see if there is a queued MQTT message to be processed. We send messages from MQTT.fx scripts every 200 msec, and it makes sense that we check for work more frequently than the work arrival rate. We don't seem to have any problems processing the messages, so it's probably safe to decrease the checking interval.

nerdoug commented 5 months ago

I did 10 tests where I downloaded a test script (gait3x2) with the 202 msec interval and checked 2 things: 1) that no lines in the script were dropped 2) if there were any gaps of 2 check times ( was 202, so looking for gaps of 404) in the arrival times in mqttBroker

In all cases there were no dropped script lines, although I thought I saw that once earlier today.

In 7 of the 10 tests, there was exactly one double length gap in the 44 lines of the script.

I changed the control parameter from 202 to 41 by editing the line in global_variables.cpp that starts with loopTaskVars(checkMqtt,

I'm about to run another 10 script download tests to see what the MQTT arrival times look like.

nerdoug commented 5 months ago

Did a download test with the MQTT check interval set to 41 msec, and didn't lose any script lines. However, the time intervals between the bot's initial processing of the commands varied frm 41 to 287 msec, always a multiple of 41. The most common interval was 246. Need to put more debug statements in to get more timing info. Attaching spreadsheet with delta times for MQTT command arrivals at bot, and the full console log from the same run

mqtt-timing-41.ods

mqtt-timing-200-41.txt

nerdoug commented 4 months ago

Questions:

Captured tcpdump of broker I/O during a script download (attached) script lines come in at even 200 msec intervals, but are sent to robot at irregular intervals. Looks like broker is waiting for an ack from robot before it sends next msg. Every delayed msg comes right after robot sends an ack to broker. Can't find the robot code that initiates sending of ack (possibly buried in a library) but wondering if ack transmission is delayed by our loop dispatcher that only checks for work to do periodically?

Hmmm. could be tcp/ip acks that are delayed, rather than mqt acks. did traces with qow = 0,1, and 2. Will check to see if they all have the delayed ack issue, which would tend to say it's not at mqtt level.

traces for wireshark analysis: mqttfx-qos.zip

good display filter: ip.addr == 10.0.0.200 or ip.addr == 10.0.0.165 or ip.addr == 10.0.0.186

The player's last IP byte: .200 = broker .165 = laptop (MQTT.fx) .186 = Doug's robot

for qos 0 trace, see frames 99 - 102 for qos 1 trace, see frames 47 - 50 for qos 2 trace, see frames 55 - 58

This might be relevant. Need to read some of the references: https://github.com/eclipse/mosquitto/issues/1590 Nope, don't think that applies to us, because we're not using secure HTTP, and it's not the Windows side that's delaying acks for us.

Need to run some more traces with power saving disabled as in Andrew's earler reference, which was: https://superuser.com/questions/1393936/mqtt-large-delay-between-messages#1394315

nerdoug commented 4 months ago

made the change to aaNetwork.cpp, using esp_wifi_set_ps (WIFI_PS_NONE); (note initial lower case e) with this near start of file:

include // Header file for linking.

// needed per https://superuser.com/questions/1393936/mqtt-large-delay-between-messages#1394315

include "esp_wifi.h"

And made 3 more tcpdump traces on broker using this command: tcpdump -i any -s 65535 -w /tmp/mqttfxM-qos2

think I did a sudo su before that to get superuser power. Had to rename files to have .cap so Wireshark could read them - should have created them with something like: tcpdump -i any -s 65535 -w /tmp/mqttfxM-qos2.cap

following zip file has traces for qos 0, 1, and 2

mqttfxM.zip

nerdoug commented 4 months ago

MQTT broker code always copies incoming message to one of the buffers (always the top one) , after shuffling all other buffers down to make space for it (even if they didn't hold messages.) Added code (in aaStringQueue::push) to output the buffers in use count if it exceeds 1, and this happens frequently with message spacing of 200 msec, but does not occur with 300 msec spacing. Also strange is that the arrival time of the messages that are queued (as seen in aaMqtt.cpp/onMqttMessage() ) is the same. So it looks like the arrival time of the message is changing, which seems strange. should do a simultaneous network capture to see if that's actually the case.

This implies that process of some MQTT messages is taking more than 200 msec. Seems a bit long, and maybe some optimization is in order. Changing message buffering to a ring buff er is a candidate, to reduce the unneeded buffer copying.

nerdoug commented 4 months ago

Did a tcpdump trace at the same time as I did a MQTT.fx TEST command followed by script download. The delay between script lines was 200 msec, and the dispatching interval for checkMqtt was 41 msec. Here's the network trace, and the serial console output:

mqtst.zip

The first case of queuing up an MQTT command comes at console time 56637:

H) processCmd{6}[56162]> command received: FLOW onMsg@ 56377 H) processCmd{6}[56408]> command received: FLOW onMsg@ 56637 onMsg@ 56637

Buffer Count = 2 H) processCmd{6}[56654]> command received: FLOW H) processCmd{6}[56695]> command received: FLOW The network trace is very ineresting. The messages arrive at the broker with the expected 200 msec spacing. (network frames 69, 77, 86). However, the robot is slow to acknowledge, and the messages received by the broker in frames 77 and 86 are merged together in frame 92 when they're sent to the robot. This frame goes out as soon as the robot acknowledges the outstanding frames. 68 17:50:56.374701 10.0.0.200 10.0.0.186 211 Publish Message [hfr48B8/commands] MQTT 69 17:50:56.448825 10.0.0.165 10.0.0.200 209 Publish Message [hfr48B8/commands] MQTT 70 17:50:56.448850 10.0.0.200 10.0.0.165 56 1883 → 57887 [ACK] Seq=1 Ack=905 Win=1002 Len=0 TCP 74 17:50:56.502125 10.0.0.165 10.0.0.200 56 57883 → 22 [ACK] Seq=1 Ack=1185 Win=509 Len=0 TCP 75 17:50:56.625977 10.0.0.186 10.0.0.200 56 53742 → 1883 [ACK] Seq=3 Ack=754 Win=4968 Len=0 TCP 76 17:50:56.626020 10.0.0.200 10.0.0.186 209 Publish Message [hfr48B8/commands] MQTT 77 17:50:56.649627 10.0.0.165 10.0.0.200 211 Publish Message [hfr48B8/commands] MQTT 78 17:50:56.649652 10.0.0.200 10.0.0.165 56 1883 → 57887 [ACK] Seq=1 Ack=1060 Win=1002 Len=0 TCP 82 17:50:56.704409 10.0.0.165 10.0.0.200 56 57883 → 22 [ACK] Seq=1 Ack=1373 Win=508 Len=0 TCP 86 17:50:56.849540 10.0.0.165 10.0.0.200 86 Publish Message [hfr48B8/commands] MQTT 87 17:50:56.849626 10.0.0.200 10.0.0.165 56 1883 → 57887 [ACK] Seq=1 Ack=1090 Win=1002 Len=0 TCP 91 17:50:56.885016 10.0.0.186 10.0.0.200 56 53742 → 1883 [ACK] Seq=3 Ack=907 Win=4815 Len=0 TCP 92 17:50:56.885077 10.0.0.200 10.0.0.186 241 Publish Message [hfr48B8/commands], Publish Message [hfr48B8/commands] MQTT 93 17:50:56.906807 10.0.0.165 10.0.0.200 56 57883 → 22 [ACK] Seq=1 Ack=1441 Win=508 Len=0 TCP 94 17:50:57.051163 10.0.0.165 10.0.0.200 216 Publish Message [hfr48B8/commands] MQTT 95 17:50:57.051192 10.0.0.200 10.0.0.165 56 1883 → 57887 [ACK] Seq=1 Ack=1250 Win=1002 Len=0 TCP 99 17:50:57.108055 10.0.0.165 10.0.0.200 56 57883 → 22 [ACK] Seq=1 Ack=1637 Win=513 Len=0 TCP 100 17:50:57.130986 10.0.0.186 10.0.0.200 56 53742 → 1883 [ACK] Seq=3 Ack=1092 Win=4630 Len=0 TCP 101 17:50:57.131037 10.0.0.200 10.0.0.186 216 Publish Message [hfr48B8/commands] MQTT 102 17:50:57.253888 10.0.0.165 10.0.0.200 214 Publish Message [hfr48B8/commands] MQTT 103 17:50:57.253918 10.0.0.200 10.0.0.165 56 1883 → 57887 [ACK] Seq=1 Ack=1408 Win=1002 Len=0 TCP 107 17:50:57.310415 10.0.0.165 10.0.0.200 56 57883 → 22 [ACK] Seq=1 Ack=1833 Win=512 Len=0 TCP 108 17:50:57.378540 10.0.0.186 10.0.0.200 56 53742 → 1883 [ACK] Seq=3 Ack=1252 Win=4470 Len=0 TCP ``` 0000 00 04 00 01 00 06 b8 27 eb 2e f5 8a 00 00 08 00 .......'........ 0010 45 00 00 e1 97 3f 40 00 40 06 8d 56 0a 00 00 c8 E....?@.@..V.... 0020 0a 00 00 ba 07 5b d1 ee 3d da fc 47 2d 34 d2 4b .....[..=..G-4.K 0030 50 18 fa 4c be 60 00 00 30 98 01 00 10 68 66 72 P..L.`..0....hfr 0040 34 38 42 38 2f 63 6f 6d 6d 61 6e 64 73 46 6c 6f 48B8/commandsFlo 0050 77 2c 33 30 30 2c 4d 47 43 2c 31 30 2c 30 2c 30 w,300,MGC,10,0,0 0060 2c 30 20 2c 31 36 2e 32 2c 2d 31 32 2e 33 39 2c ,0 ,16.2,-12.39, 0070 2d 31 31 2e 30 36 20 2c 30 2c 2d 31 36 2e 38 36 -11.06 ,0,-16.86 0080 2c 2d 31 31 2e 30 36 20 2c 2d 31 36 2e 32 2c 2d ,-11.06 ,-16.2,- 0090 31 32 2e 33 39 2c 2d 31 31 2e 30 36 20 2c 31 36 12.39,-11.06 ,16 00a0 2e 32 2c 31 32 2e 33 39 2c 2d 31 31 2e 30 36 20 .2,12.39,-11.06 00b0 2c 30 2c 31 36 2e 38 36 2c 2d 31 31 2e 30 36 20 ,0,16.86,-11.06 00c0 2c 2d 31 36 2e 32 2c 31 32 2e 33 39 2c 2d 31 31 ,-16.2,12.39,-11 00d0 2e 30 36 30 1c 00 10 68 66 72 34 38 42 38 2f 63 .060...hfr48B8/c 00e0 6f 6d 6d 61 6e 64 73 66 6c 6f 77 2c 30 2c 42 4f ommandsflow,0,BO 00f0 4d M ``` This explains why there are two incoming messages in the console trace with the same timestamp - they're in the same message. Guess I need a debug message when each message is extracted from the queue. It still seems strange to me that processing these message is taking so long - maybe there's something else going on that's making the CPU busy?
nerdoug commented 4 months ago

Some quick notes before I forget them: -I'm seeing some IPV6 traffic is network traces, but the IP addresses are all ::1. -wondering if Nagles algorithm is slowing down the transmissions from broker to robot. There's a mosquitto option to disable it, which would be worth trying. Info on Nagle's stuff: https://networkencyclopedia.com/nagles-algorithm/

Here's another explanation that may be clearer: https://en.wikipedia.org/wiki/Nagle%27s_algorithm

nerdoug commented 4 months ago

I have modified the mosquitto configuration to disable Nagles algorithm in 2 places: /etc/mosquitto/mosquitto.conf /etc/mosquitto/conf.d/mosquitto.conf

In each file I aded the text: set_tcp_nodelay true I didn't see any messages in the log saying it was disabled: /var/log/mosquitto/mosquitto.log

I suspect the problem is a well known TCP optimization called delayed ACKs is built in to either Mosquitto or the raspberry TCP/IP stack. I can't find any way to disable it.

After some thought, I'm not sure this is a real problem. My previous observations that we're losing MQTT data was wrong, and were due to my own actions to reduce the work done in the script for simplified debugging, and a error in implementing command identifiers in the milleseconds field of the FLOW command.

To verify this, I cranked the time between MQTT.fx transmission of commands down to 100 msec. Normal operation is that every MQTT command is entered into the multi-buffer structure, and is extracted when the command is processed. This means that having 1 command "queued" is normal. With 200 msec delays between MQTT.fx transmissions, I occasionally saw 2 commands buffered. With 100 msec delays I saw quite a few cases with a queue of 2, and occasionally, a queue of 3. However, I never saw any data loss, as demonstrated by debug displays of what arrived at the robot.

With a delay of 50 msec between MQTT.fx transmissions, we overran the depth of the queuing system. I saw 6 commands queued, and one command was lost. The number of buffers available is controlled by BUFFER_MAX_SIZE in aaStringQueue.cpp, and is set to 5. Seeing this get to 6 confirms our queue was overrun. The other control parameter is COMMAND_MAX_LENGTH which is 200.

I'm going to enter another issue for a review of the buffer handling routines. There seems to be more buffer shuffling, copying, and zeroing than is actually needed. This is sensitive, because much of the buffer handling is asynchronous to loop(), and acts like an interrupt that can happen anywhere.

I did make some code changes while investigating this, and I'll leave the issue open until I clean out the debug stuff and document the actual changes.