Fix flow control broken by USE_FEATURE_MAVLINKX

brad112358 commented 10 months ago

USE_FEATURE_MAVLINKX had a couple of problems with respect to txbuf calculations which badly broke flow control.

1) The calculation of buffered received serial data in serial_in_available() needs to include the data which is in the process of being parsed as well as the data which has already been parsed but not yet sent over the radio link.

2) We should not parse and transfer to fifo_link_out more than one message in the while loop since we only checked that there is space for one message. We also shouldn't spend too much time here without doing flow control.

olliw42 commented 10 months ago

interesting I thought I did got it right, but maybe not could you pl share what sympthoms you were seeing which made you conclude that it badly broke flow control?

brad112358 commented 10 months ago

The symptoms without these fixes were lost messages at all baud rates between 38400 and 230400 and with all combinations of mavlink, mavlinkx, and the two txbuf methods. The message loss was 3 to 40% depending on settings as measured by QGC in the application->mavlink page. Slow and failed parameter download was also observed. Message loss was highest during bulk parameter download, but non-zero in steady state as well. As you know, I only have R9, so 19Hz. My QGC configuration requests flow rates slightly above what the radio link can handle (some flows run at about half the requested rate because of flow control). This should be enough info for you to reproduce the problems if you want proof. I fixed the byte counting problem first, but that wasn't enough to eliminate the symptoms, so I added the break which completely solved the problem.

Unless you check for lost messages based on sequence numbers, parameter download might be the only place you would notice the problem. Perhaps you didn't notice the problems because you use higher radio message rates which allows ftp for parameter download at some baud rates.

Note that I normally use 230400 baud to minimize the latency caused by needing to parse mavlink messages at both ends of the radio link and I normally use method_b for flow control (even for ardupilot). But other configurations also lost messages.

I'm not sure why it's not obvious to you as it is to me that both of these changes are required. I will try to outline my thinking.

Without the counting fix, we will never trigger on the tests for serial_in_available() > 1024 because the buffer is only 512 in size and that's as big as it can get, so we will overflow our buffers during parameter download which won't stop until we send txbuf < 50. We also won't slow transmission as quickly for other flows.

Even with the counting fix but without the break or some other code change inside the loop we will loose messages. Anytime the autopilot is sending faster than the radio link, the two buffers, serial, and fifo_link_out will buffer more and more data over time. Over the course of several calls to Do(), first, fifo_link_out will first fill to less than 290 bytes remaining. Then, eventually in subsequent calls to Do(), serial will also begin to fill until serial_in_available() gets higher than 1024. Only then, we will stop parameter flow or rapidly slow down normal flows. At some point, we will have more than 290 bytes (perhaps up to about 1024-512+290) of data waiting in serial and the while loop will attempt to parse all of it in one call to Do() and will overflow fifo_link_out, loosing at least one message.

The break is better than adding an extra test inside the loop because it limits the time we spend in the loop a bit and reduces the time we parse data without checking to see if we need to set TXBUF_STATE_BURST or do other txbuf related checks. Parsing at most one message helps reduce our flow control response time to bursts of messages from the FC (remember, ardupilot is very bursty) and reduces the chances of buffer overflow and reduces buffering related message latency. The significance of these effects depends on the time it takes to parse and copy a couple of hundred bytes. Unless you know for sure that this time is much shorter than the time it takes to receive a message at 230400 baud, the break is better. Even if you know this timing precisely, the break is simpler.

olliw42 commented 10 months ago

many thx for your explanation. Much appreciated. I'll merge, and do some minor cosmetics on it

I do regularly also look at the behavior in 19 Hz mode, but I hadn't noticed anything like "uuuupppssss". What you describe sounds like rather catastrophic, so I'm not sure why I didn't notice but i won't rule out that I may not have investigated as carefully as you may have done.

olliw42 commented 10 months ago

@brad112358 maybe you recall, I'm quite confused momentarily: what was bytes_serial_in for? Why isn't it relevant to take into account either? It's used to calculate the rate on which the txbuf value is determined. So I would think it shoudl relate to the bytes as they are flowing in. To me it seems however that it actually should better be called bytes_link_out, since it's counted in MavlinkBase::getc(), which is called to get the bytes for the link. Which is not the count of the inflowing bytes. It somehow seems to me we may have a more serious bug here.

brad112358 commented 10 months ago

Isn't it amazing how quickly code we write becomes less than obvious? As you may remember (or, perhaps not, but you will see in the code), this variable is used in calculating the percentage of the radio link used in the last second so we can base txbuf on data rate rather than buffer used. It's name seemed fine to me when you wrote the code, but, I had the same thought when reviewing the code while investigating the message loss last week. Unfortunately, MAVLinkX translation complicates things. I still don't like this approach which is why I hadn't tested much with MAVLINKX code enabled and missed the flow control problem earlier.

In the presence of compression and to a lesser extent, even with just MAVLinkX translation, the amount of data we want to send over the radio link in a second no longer exactly matches the number of bytes we read from the serial link. We can probably make a more accurate estimate of average radio link usage by counting bytes after the translation to MAVLinkX. So, not only should the name change, the calculation should also change to count bytes placed into fifo_link_out.

Unfortunately, this will also introduce latency in responding to surges in traffic because fifo_link_out is relatively small and bytes received can't always be translated and counted as soon as they come in over the serial link. It's not clear to me how much this latency might slow down our response (via txbuf) to changes in flow rates. But, if we are only counting serial input bytes and not compressed size, I don't see how compression was actually providing any useful extra link capacity since flow control was based on uncompressed size. If this change causes problems, we might need to increase the size of fifo_link_out. I also have some other ideas on how to increase flow control responsiveness in the case of bursts of traffic which I'll save for another discussion.

olliw42 commented 10 months ago

Isn't it amazing how quickly code we write becomes less than obvious?

I've learned this lesson decades ago. That's why I strive to find names and code architectures which are easier to understand (not always successful for sure but it's a constant consideration). I realized that I need 1/2 to 1 day to immerse back into a project, currently I'm not with mLRS but doing much STorM32 and ArduPilot stuff.

this variable is used in calculating the percentage of the radio link used in the last second so we can base txbuf on data rate rather than buffer used.

I think I was saying this :). The question I wonder about rather is, should it be related to out flow on the link or something else.

I had the same thought when reviewing the code while investigating the message loss

so I was not very off. I will rename it then to bytes_link_out, as it seems we both think that's what it is.

So, not only should the name change, the calculation should also change to count bytes placed into fifo_link_out.

I read this to confirm that you too are suspecting that one might not base it off out flowing. Before mavlinkx it was kind of irrelevant, since - as you also say - it was the same. With mavlinkx it made sense to me to base it on the out flowing bytes, since that's what is really on the link, so the modified data amount is accounted for. It clearly should not be based on serial_in flow. So maybe on something intermediate, as you indicate. I guess I'll need to think about this at some point more deeply.

In the presence of compression and to a lesser extent, even with just MAVLinkX translation,

I think you miss a bit that the code enabled with USE_MLRS_MAVLINKX is not at all only about mavlinkx. It's about adding a parser into this data path. This parser, and thus very same code structure, is e.g. also needed in order to handle commands to put the reciever into boot mode ... for flashing via fc ... wasn't this what you voted for strongly? ;) :) :D You won't get this without such a parser ...

Unfortunately, this will also introduce latency in responding to surges in traffic because fifo_link_out is relatively small and bytes received can't always be translated and counted as soon as they come in over the serial link.

I'm still not on the same page here with you. Fact is that the parser loop is called very very often, in comparsion to the time for a byte, so it can very quickly process any incomming byte. In my opinion there are then two scenarios, (a) we empty serial_in very quickly (meaning small buffer needed on serial_in) and put the data into the parser and fifo_link_out (meaning large fifo buffer), or (b) we empty serial_in 'slowly', message by message (meaning large buffer needed on serial_in) and put the data into the fifo (meaning small fifo buffer). I took the (b) approach since it's less intrusive code wise. Both approaches should have however exactly the same effect, i.e. the message should be avaialable to the link at the very same points in time. There is no latency coming from this (i.e., any latency is really small compared to the time for a byte). A case (c) with two large buffers is obviously again equal. So, the main factor for latency is not a small fifo buffer size, it's the fact that there is a parser, and this would be so for both (a) or (b) (or (c)). Sure, the counting of the rate for getting to txbuf might be very different for (a) or (b), but we this paragraph was about the sources and effects of latency. Another piece of latency comes in btw from the fact that the out flow rate is calculated in getc() (which is called only at the rate of the mode), but this was so before, and I guess we implicitely deduced/assumed that this latency isn't very relevant. So, I conclude, we don't have a fundamental challenge from mavlinkx, the fundamental challenge comes from that we have a parser. It might however tunr out that for counting correctly we better should choose (a) instead of (b). I guess I'll need to think about this too more deeply. LOL

olliw42 / mLRS

Fix flow control broken by USE_FEATURE_MAVLINKX #141