olliw42 / mLRS

2.4 GHz & 915/868 MHz & 433 MHz/70 cm LoRa based telemetry and radio link for remote controlled vehicles
GNU General Public License v3.0
295 stars 67 forks

Try harder to align MAVLink frame to start of UDP payload #71

Closed brad112358 closed 1 year ago

brad112358 commented 1 year ago

This change further reduces the UDP packet rate by eliminating the 1-byte payloads that occurred at baud rates below 115200, and reduces average latency by detecting the gaps between serial writes on the mLRS side.

It makes protocol tracing on the WiFi side much easier at most baud rates. See the attached Wireshark screenshots (tested at 57600 baud).

This has been tested at 19 Hz with all baud rates from 19200 to 115200 and found to improve on the previous code in all cases. I expect it to work better for 50 Hz as well, especially at 115200 baud.

(screenshots: after, before)

olliw42 commented 1 year ago

interesting

few quick comments

The mechanism is somewhat hard to understand just from looking at the code. I wonder, is it somehow possible to modify the code so it's functionally identical but better exposes how it works? I mean, it looks somewhat odd that you are trying to detect the time gaps while the code is structured around the last availability; one would first think it should (or could) be the other way around.

If I get it right, you exploit that complete MAVLink packets are always sent, and so try to sync on the frame start. Wouldn't it be easier to spy the MAVLink STX? Shouldn't the 600 us be dependent on the baud rate?

You say it avoids 1-byte payloads, hence: why not just not send when avail is below a certain margin, like one MAVLink header?

olliw42 commented 1 year ago

haven't tried, just guessing, wouldn't something like this do the same?

    int avail = SERIAL.available();
    if (avail > avail_last) {
        received_serial_tlast_us = tnow_us; // we received something
    }
    if ( ((avail == avail_last) && (tnow_us - received_serial_tlast_us > 600) && avail) || // didn't receive anything for 600 us, so send
         (avail > 255) || // we have a lot to send, so send
         ((tnow_us - send_wifi_tlast_us > 22000) && avail) // didn't send for a long while, so send
       ) {
        int len = SERIAL.read(buf, sizeof(buf));
        udp.beginPacket(ip_udp, port_udp);
        udp.write(buf, len);
        udp.endPacket();
        send_wifi_tlast_us = tnow_us;
        avail_last = avail - len;
    } else {
        avail_last = avail;
    }
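The branching in this sketch can also be pulled out into a pure helper, which makes the three send conditions explicit and checkable on the host (hypothetical names, untested against hardware):

```cpp
#include <cassert>
#include <cstdint>

// Untested sketch, hypothetical names: the same three send conditions
// as above, pulled out into a pure function so they can be exercised
// on the host without hardware.
bool should_send(int avail, int avail_last,
                 uint32_t idle_us,        // time since a byte last arrived
                 uint32_t since_send_us)  // time since the last UDP send
{
    if (avail <= 0) return false;                           // nothing buffered
    if (avail == avail_last && idle_us > 600) return true;  // 600 us gap seen
    if (avail > 255) return true;                           // lots to send
    if (since_send_us > 22000) return true;                 // 22 ms latency guard
    return false;
}
```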
olliw42 commented 1 year ago

btw, I did a somewhat similar thing for UART tunneling via UAVCAN several years back: https://github.com/olliw42/BetaPilot/blob/BetaCopter-3.6.7-v020u/libraries/AP_UAVCAN/BP_UavcanTunnelManager.cpp#L122-L152 I find the logic there quite a bit easier to read/follow

brad112358 commented 1 year ago

Your suggested code sample does not do the same, because there may be more than a 22 ms gap between the reception of bytes on the serial port. If this happens, your code will send a 1-byte payload. This is one of the problems I fixed. It's better to limit the time since the first byte received in a burst, rather than since the last byte sent in the previous burst, because our goal is to limit latency. Keeping track of the first-byte receive time of a burst instead of the last WiFi send time lets us know how old the oldest data we are holding is.

In your sample code, you set avail_last to allow for short reads and for the serial buffer to grow larger than the UDP buffer. I don't expect this to ever happen, but I agree we should allow for it. I need to consider how best to handle this case.
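To illustrate the bookkeeping I mean, a host-side model might look like this (hypothetical names, not the actual PR code): the timestamp is armed when the buffer goes from empty to non-empty, so it always refers to the first byte of the burst currently held.

```cpp
#include <cassert>
#include <cstdint>

// Host-side model (hypothetical names, not the actual PR code):
// track the receive time of the first byte of the burst we are
// currently holding, so we always know the age of the oldest data.
struct BurstAge {
    bool holding = false;
    uint32_t first_rx_us = 0;

    // Call once per loop. Returns the age of the oldest unsent
    // byte in microseconds (0 when the buffer is empty).
    uint32_t update(int avail, uint32_t now_us) {
        if (avail <= 0) { holding = false; return 0; }
        if (!holding) { holding = true; first_rx_us = now_us; }
        return now_us - first_rx_us;
    }
};
```

Comparing the returned age against a latency limit replaces the last-send-time comparison: a burst that has been held too long gets flushed even if bytes are still trickling in.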

brad112358 commented 1 year ago

please revert the parts which are otherwise unrelated, but where you go from ms to us to avoid calling millis(). Rather, call both millis() and micros().

OK, no problem, but I'm curious why you prefer the current time in two forms. If it's for readability, would it be OK to just replace the constants like 2000000 with #defines to make it more clear what the time limits are? Or, perhaps 2000*MICROSECONDS_PER_MILLISECOND? If the reason is you just prefer a minimal diff, OK. If it's something else, please share.

please go with the naming convention name_txxx_us, i.e. e.g., received_serial_tfirst_us instead of tfirst_received_serial_us

OK

If I get it right, you exploit that complete MAVLink packets are always sent, and so try to sync on the frame start. Wouldn't it be easier to spy the MAVLink STX?

I did it this way because it also works reasonably well when the transported data is transparent serial since mLRS supports this. If you don't feel the need to support anything other than MAVLink here, I agree using the MAVLink framing would be easier and more reliable, since we can rely on the message length field and don't need to check the CRC, so we don't need a full parser. Let's decide on this first; perhaps we can then skip this PR.
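For reference, if we did go the MAVLink route, the frame-boundary probe would only need the STX and length fields, roughly like this sketch (no CRC check, so not a full parser; header sizes per the MAVLink v1/v2 wire format):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Sketch only: given a buffer believed to start at a frame boundary,
// return the total frame length implied by the MAVLink header, or 0
// if the first byte is not a MAVLink STX.
size_t mavlink_frame_len(const uint8_t* buf, size_t n)
{
    if (n < 2) return 0;
    if (buf[0] == 0xFE)               // v1: 6 header bytes + payload + 2 CRC
        return 8u + buf[1];
    if (buf[0] == 0xFD) {             // v2: 10 header bytes + payload + 2 CRC
        if (n < 3) return 0;
        size_t len = 12u + buf[1];
        if (buf[2] & 0x01) len += 13; // signed frame carries a 13-byte signature
        return len;
    }
    return 0;
}
```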

olliw42 commented 1 year ago

Your suggested code sample does not do the same

I guess I wasn't trying to say that it does exactly the same, but was wondering if it could achieve the same. But sure, you are right, it doesn't do the same. Concerning looking at the first or "last" byte's timestamp, see the time-structure discussion below; it could be that it comes to the same thing in the end.

Did you have a chance to look at the code in the GitHub link? I think it targets the same problem (and in this sense more closely matches what your code is doing).

My general point is still that, while the first part of the if is pretty clear, the second part is somewhat difficult to digest. I.e., I wonder if the code can't be restructured to make its logic more immediately obvious.

I'm curious why you prefer the current time in two forms.

No deep reason here. It's partly indeed to have a minimal diff, partly to have more natural numbers, especially as it's really negligible cost to call both; but it's also to have it more contained: the us is only needed in the "function" for UDP and in this sense is local to it (and eventually may in fact become truly local).

I did it this way because it also works reasonably well when the transported data is transparent serial since mLRS supports this.

that's indeed a relevant argument. At some point we should have some way of communication between the ESP and the mLRS module, to e.g. configure the settings in the ESP code, and then we could make it more MAVLink specific. But we don't have that yet, so it's better not to do it.

However, the comment that it should also work for transparent mode made me start to wonder what you are actually synchronizing to in the code. We have two types of time structure: the one from sending full MAVLink messages, which from your text appears to be the structure you want to sync on. But there is also the time structure due to the 19 Hz link rate, which results in a bunching into packets since the link rate is so much lower than the baud rate (14 ms packets at 57600). Since 1/19 is 52 ms and thus larger than the 22 ms in the code, one could think it's syncing on the 19 Hz time structure. With that in mind, I think one needs to at least test this carefully also for the other modes, 31 Hz and especially 50 Hz.
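To put rough numbers on the two time structures (sketch, assuming 8N1, i.e., 10 bits per byte on the wire):

```cpp
#include <cassert>

// Sketch arithmetic, assuming 8N1 (10 bits per byte on the wire).
constexpr double burst_ms(int bytes, int baud) {
    return bytes * 10.0 * 1000.0 / baud;   // time to ship `bytes` at `baud`
}
constexpr double link_period_ms(double rate_hz) {
    return 1000.0 / rate_hz;               // spacing of the link frames
}
```

At 57600 an 82-byte link frame takes about 14 ms to drain, while the 19 Hz link delivers one every ~53 ms, which is where the ~38 ms silent gap comes from.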

With that in mind, I'm also not sure I see why your code would prevent sending a 1-byte payload. There is a time gap of 52 - 14 = 38 ms, so the code will always have sent out all bytes by the time the next link frame comes in. And if a frame has just a 1-byte payload, this 1 byte will be sent. I probably have missed a key aspect. Could you clarify?

You are repeatedly mentioning 1-byte payloads as being harmful. Why exactly 1-byte payloads? Why not 2 bytes? What's the argument here?

brad112358 commented 1 year ago

Did you have a chance to look at the code in the GitHub link? I think it targets the same problem (and in this sense more closely matches what your code is doing).

I did look at this. I believe the problem, while interesting, is not similar enough to draw useful logic from.

However, the comment that it should also work for transparent mode made me start to wonder what you are actually synchronizing to in the code. We have two types of time structure: the one from sending full MAVLink messages, which from your text appears to be the structure you want to sync on. But there is also the time structure due to the 19 Hz link rate, which results in a bunching into packets since the link rate is so much lower than the baud rate (14 ms packets at 57600). Since 1/19 is 52 ms and thus larger than the 22 ms in the code, one could think it's syncing on the 19 Hz time structure.

If the data is transparent serial, it is sent out the serial port at 19 Hz intervals and we sync to that. If it is MAVLink, the data is sent out as one or more complete MAVLink frames, possibly with more than 1/19th of a second between them (in the case of a frame larger than 82 bytes). We sync to the gaps either way, and in the case of MAVLink the UDP frame will usually contain complete frames, because complete frames are written to the serial link in one go.

With that in mind, I'm also not sure I see why your code would prevent sending a 1-byte payload. There is a time gap of 52 - 14 = 38 ms, so the code will always have sent out all bytes by the time the next link frame comes in. And if a frame has just a 1-byte payload, this 1 byte will be sent. I probably have missed a key aspect. Could you clarify?

With the existing code, because the time gap is larger than 15 ms, the first byte received in a burst will be sent out alone in its own UDP message even when the frame has more than 1 byte. This is not what we want, and I consider it a bug in the previous code. The problem is caused by comparing the time with the time of last UDP message rather than the time of first character received since the last UDP message.

The new code corrects this. You can see the behavior before and after this change in the screen shots I attached. Did you look at them carefully?

brad112358 commented 1 year ago

Shouldn't the 600 us be dependent on baudrate?

Yes, it probably should. Two character times should be enough to be sure it's a real gap.
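A baud-dependent threshold could be computed along these lines (sketch; two character times at 8N1, i.e., 10 bits per byte on the wire):

```cpp
#include <cassert>
#include <cstdint>

// Sketch: two character times in microseconds, assuming 8N1
// (10 bits per byte on the wire). Integer division truncates.
constexpr uint32_t gap_threshold_us(uint32_t baud) {
    return 2u * 10u * 1000000u / baud;
}
```

That gives ~1041 us at 19200, ~347 us at 57600, and ~173 us at 115200, so a fixed 600 us constant sits on different sides of the two-character mark depending on the baud rate.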

olliw42 commented 1 year ago

but why is this 1 byte such a problem? Is it because UDP with a 1-byte payload works badly, is it because sending two UDP packets in short sequence is bad, or what is it?

it sounds to me that this is the basis of what you are trying to achieve, to avoid these 1-byte UDP packets, so I'd like to understand it :)

comparing the time with the time of

but this part of the if should never trigger in your code, right, and is more a "let's play it safe and let's feel good" piece, isn't it?

brad112358 commented 1 year ago

It's not "such" a problem, just a bug. It doubles the number of UDP packets for no good reason, and it splits MAVLink messages into two parts, preventing simple decode by Wireshark. Note that having a tool like Wireshark for decoding this traffic can be very useful for debugging other MAVLink-related issues.

brad112358 commented 1 year ago

but this part of the if should never trigger in your code, right, and is more a "let's play it safe and let's feel good" piece, isn't it?

Not at all. It's happening after every gap in the screenshot I sent.

olliw42 commented 1 year ago

Not at all. It's happening after every gap in the screenshot I sent.

I don't understand this. The 600 us thing in the if should have found the end of the packet, so everything should have been sent out by the time the xx ms have passed; but since avail = zero, or the time has been retriggered in case of new data, the xx ms part of the if should never trigger.

brad112358 commented 1 year ago

You seem to have missed that the second screen shot and my discussion of the one byte messages are with the previous version of this code, not this PR.

olliw42 commented 1 year ago

??? not sure how there could be a confusion of the context

you said

The problem is caused by comparing the time with the time of last UDP message rather than the time of first character received since the last UDP message.

obviously you comment on what is wrong in the existing code and how it is allegedly corrected in your PR

I said

but this part of the if should never trigger in your code, right, and is more a "let's play it safe and let's feel good" piece, isn't it?

obviously, in this part of the conversation I discuss this PR, referring to the piece where you compare the time with the time of the first character received since the last UDP message

you then said

Not at all. It's happening after every gap in the screenshot I sent.

I then said

I don't understand this. The 600 us thing in the if should have found the end of the packet, so everything should have been sent out by the time the xx ms have passed; but since avail = zero, or the time has been retriggered in case of new data, the xx ms part of the if should never trigger.

you then said

You seem to have missed that the second screen shot and my discussion of the one byte messages are with the previous version of this code, not this PR.

does not look to me like I lost the context ;)

anyway, let me try again

In YOUR code, you have three parts in the first if: the 600 us piece, the >255 piece, and the time-comparing piece. I was commenting that the last part never triggers. Question and context clear now? Do you still answer "It's happening after every gap"?

brad112358 commented 1 year ago

Sorry for the misunderstanding. I guess I should have said it was happening after every gap with the bug in the previous version.

Yes, neither of the last two tests is likely to trigger unless a MAVLink message is very large or the baud rate is so low that we don't have gaps. One could argue that they are not needed, or that we should be testing for 255 + MAVLink header size instead of 255, but these tests are here as insurance.

olliw42 commented 1 year ago

the 2nd test is needed, since the buffer we have is limited, and it will also capture the cases where we have no 600 us gaps. My point was that the third part never triggers. You seem to agree only half-heartedly when you say "not likely to trigger". So, again: did you see a case where the third part triggered, or do you have a theoretical case, or can you think of a case where it triggers?

brad112358 commented 1 year ago

Are you looking at your code while talking about my code? The third part is the length test, not the time test as you seem to be thinking.

The comments explain each test in order. The first test is to find the gaps if they exist. The second test constrains the latency; it would only be triggered if the baud rate were low enough that a message takes more than 22 ms to send over the serial port. At 57600 baud that's 128 bytes; at 115200, 256 bytes. The test for 255 bytes is needed if there are no gaps at high baud rates.
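The crossover between the two tests can be sketched numerically (assuming 8N1, 10 bits per byte on the wire):

```cpp
#include <cassert>

// Sketch: bytes accumulating during the 22 ms latency window at a
// given baud rate (8N1, 10 bits per byte). Below the crossover the
// 22 ms test fires first; above it, the 255-byte test fires first.
constexpr int bytes_in_22ms(int baud) {
    return baud / 10 * 22 / 1000;
}
```

This gives 126 bytes at 57600 and 253 at 115200, close to the round 128/256 figures above; only from roughly 116000 baud upward does the 255-byte test win.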

Yes, I think I can make all of these things happen in my testing.

I also see some strange behavior where sometimes a gap seems to be detected where there should be none, and I get short packets even with this code. Perhaps the serial interrupt is delayed or masked for a while, but no data is lost, and it only rarely affects the ability of Wireshark to decode a message easily.

olliw42 commented 1 year ago

@brad112358

first off, you will find that I have reverted your earlier PR "Don't send so many very short UDP messages to avoid message loss", as I found it to NOT work for me! I in fact spent more than an hour on a new build trying to find why UDP wouldn't work, and only then did it occur to me that maybe it's this. And indeed, the second I undid it, everything started to work fine for me. I'm using Ardu 4.4, MissionPlanner 1.3.80.

I don't know what the issue is, and sure admit that I wouldn't have expected it from what the code does, but that's what it is.

What it should tell us, and be a warning: this sort of stuff needs to be tested much, much better. We have so many different setups and use cases now, and it needs to work for all of them. Testing on just one bench apparently doesn't allow us to ensure that, and even if the code doesn't look like it could cause issues, it must be done.

olliw42 commented 1 year ago

now back to the discussion

The third part is the length test, not the time test as you seem to be thinking.

oh sorry, indeed, I mixed that up. I'm still talking here about your code.

Yes, I think I can make all of these things happen in my testing.

I think it's obvious that you can't (at a given baud rate, either the timeout limit is longer than it takes to receive 255 bytes, or the byte limit is reached before the timeout limit).

I think you don't see where I'm heading. The point is that in your existing code you don't need that 2nd part/timeout part for it to work. The 3rd part is fully sufficient.

So, my conclusions: some of your statements and conclusions are incorrect. This 2nd part is not functionally essential. Therefore "The problem is caused by comparing the time with the time of last UDP message rather than the time of first character received since the last UDP message." is incorrect, as this is not the only approach to resolve the issue. Also the statement that my suggested code would not work similarly is incorrect.

In the above, two approaches for handling the issue have come up; let's call them "time-stamping the first received byte", or just "timestamp", and "find the packet end by time gap", or just "timegap". The timestamp approach is what the 2nd part of your if relates to; the timegap approach is what the 1st part relates to. Either will work, in the sense of accomplishing the goal. Your code is confusing and overly complicated (which is the root of the discussion), as it intermixes the two. Therefore "I believe the problem, while interesting, is not similar enough to draw useful logic from." is also incorrect, given that the linked code is exactly about the timestamp approach.

So, we appear to have two approaches now. I can't tell which one should be better, as I (currently) don't see a clear reason why one should be preferred over the other. But I think we now have the means to simplify the code, at least with respect to making it easier to understand.

And, with the example of the previous PR in mind, we need to test it really carefully. Like in CAREFULLY.

:)

brad112358 commented 1 year ago

I don't know what the issue is, and sure admit that I wouldn't have expected it from what the code does, but that's what it is.

That is strange indeed. I did later test it with the latest mission planner so the only thing that seems much different would be the mLRS message rate. But I also can't see why that matters. Are you able to reproduce the failure by applying the change again? I find it's surprisingly easy to introduce an unknown variable in testing.

I guess I should invest in additional hardware so I can test 31 and 50 Hz.

I think it's obvious that you can't (then the timeout limit is longer than it takes to receive 255 bytes, or the bytes limit is smaller than the timeout limit).

It sounds like you are assuming a specific baud rate (unless my math is wrong). I already mentioned that which test triggers depends on the baud rate. If you want to fix the baud rate at 115200, then I agree we shouldn't need both tests. Or, if you are OK with a latency sometimes higher than 22ms at lower baud rates, we can eliminate the latency test.

I prefer to keep all three tests, but we can eliminate the latency test if you don't want to constrain latency or if you want to fix the baud rate at 115200 or higher.

This discussion reminds me. Is there a reason you have constrained the serial speed on the Tx side to 115200? Is there any reason not to allow 230400 baud? Higher rates do reduce latency just a little.

Of course, I also agree that rigorous testing is important. Have you tested this PR yet? Given your result with the previous version, it would be very interesting to see what happens with this PR. If we missed something in the previous version, this code might be broken too, though I can't (yet) see how.

olliw42 commented 1 year ago

I did quite a lot of testing today, with an 868 MHz system in both 19 Hz and 31 Hz mode, using code from simple to more complicated, including yours, and also variations of it. I used the M5 C3, since its USB-TTL adapter allows easy debugging (the code below should give you an idea of how I "measured" what each code variant is doing). Setup as usual: ArduPilot 4.4, MissionPlanner 1.3.80. Did not test QGC.

In some ways I like the simplest variant best. It for sure does not always do what would be "perfect" for every single packet, but it clusters the data massively better than the original code. The original code indeed most of the time sends packets with 1, 2, 3 bytes, with larger packets less often.

The simplest code I came up with is

    static unsigned long tfirst_us = 0; // in real code should not be static and be better named ;)

    unsigned long t_us = micros();
    int avail = SERIAL.available();
    if (avail <= 0) {
        tfirst_us = t_us;      
    } else 
    if ((t_us - tfirst_us) > 10000 || avail > 128) { // 10 ms at 57600 bps corresponds to 57 bytes, no chance for 128 bytes
        tfirst_us = t_us;

        int len = SERIAL.read(buf, sizeof(buf));
        udp.beginPacket(ip_udp, port_udp);
        udp.write(buf, len);
        udp.endPacket();

        static unsigned long send_tlast_us = 0;
        DBG_PRINT(t_us);
        DBG_PRINT(", ");
        DBG_PRINT(t_us - send_tlast_us);
        DBG_PRINT(", ");
        DBG_PRINTLN(len);
        send_tlast_us = t_us;
    }

As said, the number of UDP packets is massively smaller, and most packets are around 50 bytes (for a baud rate of 57600), with sometimes smaller packets of 6 bytes or so. The numbers make sense. One could get larger average packets by choosing a longer timeout (or a higher baud rate), but a larger timeout means higher latency.

Note that the "find the end of a packet" which you introduced is typically not very effective in reducing latency, since most of the time the time guard triggers (in my testing at least), so the benefit is only sporadic. Larger packets always mean higher latency, since, at a given baud rate, packet size and time are proportionally related.

The code works essentially equally for 19 Hz and 31 Hz mode, i.e., it packs into typically ca. 50-byte packets, which is clear, since that's the 10 ms at 57600 (and analogously for other baud rates). I thus conclude that any approach which does better packing, in terms of larger average UDP packet sizes, necessarily has more latency. Which kind of implies that the simplest approach isn't really much less optimal than "better" approaches. I have not tested 50 Hz, but I wouldn't expect any surprises, given that things scale as expected for 19 & 31 Hz.
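The ~50-byte packet size follows directly from the 10 ms window (sketch, assuming 8N1, 10 bits per byte on the wire):

```cpp
#include <cassert>

// Sketch: bytes arriving within one timeout window at a given baud
// rate (8N1, 10 bits per byte on the wire). Integer division truncates.
constexpr int bytes_per_window(int baud, int window_ms) {
    return baud / 10 * window_ms / 1000;
}
```

A 10 ms window holds 57 bytes at 57600, matching the observed ~50-byte packets, and 115 bytes at 115200.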

I think the best approach to reduce latency is to go to a higher baud rate. I think I voiced this somewhere before: I wonder if we shouldn't decide to use 115200 as the default on the Tx side. This would make it ca. 100 bytes per UDP packet, if we go with 10 ms.

Given the simplicity, effectiveness, and transparency of the code, this would be my suggestion. Would be interesting to hear your experience with it (the C3U may not be so easy to use with the debug; at least I found the PyQT, which also has a native USB, quite tricky to use).

brad112358 commented 1 year ago

I strongly prefer something that syncs to the gaps since it allows the standard MAVLink protocol decode to work on the UDP messages

brad112358 commented 1 year ago

What did you see when you tested the version in this PR? Did it work well for you? Have you tried wireshark or another protocol analyzer to decode the MAVLink messages?

olliw42 commented 1 year ago

it worked, but in my debug output I didn't see a significant advantage; I've described my "analysis tool" in the previous post. No, I did not use Wireshark or any other protocol analyzer; I've never used Wireshark or similar and am not going to.

olliw42 commented 1 year ago

@brad112358 you may have seen the last commit, where I pushed the above code to main. I have done much more testing for 19, 31, 50 Hz and 57600, 115200, and as simple as it is, it works fine for me and does a relatively good job of pooling the bytes. Given that what was in main wasn't good, as you pointed out, I figured it's way better to have this than nothing. This does not mean it closes the door on further improved code as discussed in this PR (although I think it will not be so easy to make it significantly better). Just wanted to share this :) Olli

brad112358 commented 1 year ago

@olliw42

Thanks for the note. I may have overstated the problem with my first PR. It had a bug, but it was not broken in a way anyone would ever notice unless they used a protocol analyzer like Wireshark. Your current version fixes the bug, but it still leaves Wireshark fairly useless, as it is unable to decode MAVLink in UDP when the MAVLink frames are split between UDP packets.

I do like the improved LED use though! I was thinking about doing exactly that for the boards that support it. And the new ssid is also a good idea.

I have pushed a new slightly simplified version of this branch. I have increased the default baud rate. Since it did nothing at higher baud rates, I removed the latency guard test which eliminates a state variable and simplifies the logic a bit.

I also now compute the gap time from the baud rate. Note that there seems to be a strange bug in the Arduino base code which causes serial reception to pause for up to about 2 ms. This is why I had bumped the gap time up to 600 us (which still wasn't enough). Luckily, the delay only seems to happen every few seconds. It does cause a few MAVLink frames to be split between UDP packets, but few enough that I don't think it reduces the utility of the protocol analyzers. It obviously has no impact on the GCS, which is happy to reassemble the MAVLink frames as long as the message rate doesn't get too high.

You said you don't use Wireshark. I find that having such a tool really helps in understanding what's going on; I wish I had had it sooner for use with mLRS. If you don't have a better way to examine MAVLink traffic, you should try it! If you have a better way, please share.

A few (actually, many) words about the latency introduced by the WiFi bridge in the radio to GCS direction...

The latency calculations here are mostly independent of the 19, 31, or 50Hz radio link. I'm going to assume our radio link is nearly fully utilized because that's the worst case. I'm also going to assume 115200 baud here since anyone who cares about latency should use the fastest rate available.

I'd also like to see the Tx module support at least 230400 baud.

It is non-trivial to accurately measure the one-directional latency of the WiFi side of the connection. Wired Ethernet is very low latency, but WiFi is higher. I'm seeing about 2.8 ms average round trip for ping from my Linux box to the ESP32. Let's guess the latency is half of that, about 1.5 ms. The WiFi connection is more than an order of magnitude faster than our serial links, so I think we can ignore the effect of UDP message size on the latency. This means there shouldn't be much difference in the WiFi component of latency between our two approaches, so I will ignore this factor in the following discussion.

I will also ignore the serial transmission latency from the Tx module to the WiFi bridge. This latency is unavoidable and depends on the MAVLink message size and the baud rate. The latency of a typical burst averaging 82 bytes at 115200 would be 7.12 ms. Note that without the WiFi bridge, we would still incur this latency in the GCS. Again, why not 230400?

First, let's remember that since it's MAVLink, the GCS can't use any of the bytes sent from a frame until it receives the last (CRC) byte. So what we need to compare is the additional processing delay incurred by the last byte in each MAVLink frame.

As we know, with the original code, which sent way too many 1-3 byte UDP payloads and overwhelmed QGC on Android, the (additional processing) latency was 1 to 3 character times, but I found it unusable with QGroundControl on both Linux and Android.

In your current approach, the last byte may be at any position in the 115-byte UDP payload. The minimum latency is 1 character time, or 0.087 ms (or perhaps 0, since this latency is included in the serial transmission above). The maximum latency is 10 ms. The average latency is about 5 ms.

In my approach, a burst of 1 to 6 MAVLink messages is sent in each UDP packet. In practice, for common message types, I see 2-3 MAVLink messages in each UDP packet. The maximum latency of the last of these MAVLink messages is 3 times 0.087 ms, or about 0.26 ms. The maximum latency of the first MAVLink message is 82 times 0.087 ms, or about 7.12 ms. The middle message will be somewhere in between. We can estimate the average latency of the last byte of a MAVLink message to be about 3.7 ms.
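The character-time figures used here come from (sketch, assuming 8N1, 10 bits per byte on the wire):

```cpp
#include <cassert>

// Sketch: one character time in milliseconds at 8N1
// (10 bits per byte on the wire).
constexpr double char_time_ms(int baud) {
    return 10.0 * 1000.0 / baud;
}
```

At 115200 one character takes ~0.087 ms, so 3 characters take ~0.26 ms and an 82-byte message ~7.1 ms.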

In your current approach, UDP message payloads may contain from 1 to 115 bytes. I'm not sure what average you would see; perhaps 50-60 bytes?

In my approach, UDP message payload size could in theory vary from about 15 bytes to 256 bytes, but for a busy radio link, in practice, I observe 21 to 121 byte payloads with an average message size of perhaps 80 bytes.

From this analysis and these observations, I conclude that my approach has similar or lower WiFi link utilization (less interference with 2.4 GHz) and slightly lower latency.

You could lower the latency with your method at the expense of higher UDP message rate by reducing your wait time.

My method can't really be adjusted much without resorting to finding the MAVLink frame boundaries by looking for STX and using the length byte. I'd be happy to code this approach up if you want to further reduce latency to near 0. This would also improve protocol decode from 99% to 100% in Wireshark and other analyzers. And I can also make it work well enough with transparent serial mode. But if you didn't like the complexity of the first version of this PR, you may not like the complexity of this theoretically better approach either.

Let me be clear that as long as the latency is reasonable, as it is with your current version, latency is far less important to me than the ability to use Wireshark and other simple protocol analyzers to snoop on MAVLink traffic. This is why I included the screen shots in my initial comment.

At this point, Wireshark is still my reason for pushing this PR along. If you don't like this approach and also don't want to leverage MAVLink framing detection, perhaps I should just give up and keep my code as a private fork.

Thanks for reading this long comment!

--Brad

olliw42 commented 1 year ago

@brad112358

yo, the led thing really annoyed me, simple but so much better now LOL

your "slightly simplified" version IMHO has in fact simplified things quite a bit, and I guess it would not have raised the discussion (although I think that the discussion was/is very useful; at least on my side it sharpened my understanding). I actually would say it converged a bit towards points mentioned before; that's really great.

many thx for the detailed latency analysis. I never went into that much detail; especially, I ignored any aspects of the wireless link.

first off, I'm sure that's not what you were implying, but I just want to mention: your earlier (and also current) code did/does not detect the end of a MAVLink packet, but the end of a burst of MAVLink packets. In fact, not even this can be said with 100% certainty, due to the effects of the mLRS link rate, which produces time gaps since its rate is lower than the baud rate.

My reasoning goes much simpler: Waiting for the end of the burst will - typically - take more time than chopping the stream in shorter time slices. In other words, if a method is "tuned" such that udp packets are typically of size 21 - 121 bytes, or (the same or a different method) is "tuned" such that the size is typically 6 - 50 bytes, the latter will have (on average) less latency.

Ergo, I come to the conclusion that the two current techniques are not really different concerning latency; they both can be tuned (by adjusting their parameters) to similar typical packet sizes. I however would go with a 6-50 bytes tuning, since I cannot see an argument for why 21-121 should generally give a more reliable UDP transmission than 6-50.

As you point out, your current technique does somewhat better at packing full MAVLink messages into one UDP packet. I don't see any significant advantage here however, since the parser on the recipient side will assemble correctly either way, i.e., all that prevents successful assembly is missing UDP packets, and as much as I'm aware we have no argument that UDP is more reliable for 21-121 than for 6-50.

So, at the end of the day it seems to me that the only difference is that there is one analysis tool which gets along somewhat better with one method than the other. If I read correctly that's also what you say. I'm not convinced that's a good argument, I must say. MissionPlanner/QGC will just not care, and that's what's relevant to any user. (Also, the time-gap technique doesn't really achieve its goal, but overall just does somewhat better, so it's not like a whoohaa effect :)).

To me, it's a tie. Not sure what to conclude from here on though.

Concerning analysis tools, I use a python script which is based on a modified version of pymavlink, and which reads data from a COM port. Most of the time this COM port is a USB-TTL adapter with only the tx connected to the line I want to spy on. You also can spy on what MissionPlanner/QGC get by using things like com2com. Technically the py can also read tcp and udp, but I never use that. Lastly, I often just add debug code, like in the example above. I simply never felt like having a need for a thing like Wireshark, and now that you say it can't even parse across packet boundaries it looks even more useless to me. :) Maybe you also want to try something pymavlink based.

brad112358 commented 1 year ago

@Olliw42

It sounds like we might finally be converging with this PR. That's excellent!

your earlier (and also current) code did/does not detect the end of a MAVLink package, but the end of a burst of MAVLink packages. In fact, not even this can be said with 100% certainty, due to the effects of the mLRS link rate, which produces time gaps since its rate is lower than the baud rate.

Actually, unless I misunderstand you, the last part of your statement is not accurate at all. mLRS never produces time gaps in the middle of a MAVLink message. This is because each MAVLink message is assembled and checked by your fast MAVLink library before it is sent by the Tx. The Tx always waits for a complete message before sending it on via the serial port. Of course, because multiple MAVLink messages may be sent together by the receiver over the radio link, the Tx may send us more than one MAVLink message in a burst. Except in transparent serial mode, a gap can only occur at the end of a MAVLink message. The first part of your statement is true. My code detects the end of the last MAVLink message in a burst of MAVLink messages.

My reasoning is much simpler: Waiting for the end of the burst will - typically - take more time than chopping the stream into shorter time slices. In other words, if a method is "tuned" such that UDP packets are typically of size 21 - 121 bytes, or (the same or a different method) is "tuned" such that the size is typically 6 - 50 bytes, the latter will have (on average) less latency.

You seem to still be considering the latency of each byte, but what matters is only the latency of each full MAVLink message. Because the GCS also needs to assemble bytes into a full MAVLink message before it can be used, the only latency that matters is that of the last byte in a MAVLink message. This is why my code averages slightly larger UDP payloads while still managing slightly lower maximum and average latency of MAVLink messages.

Ergo, I come to the conclusion, that the two current techniques are not really different concerning latency, they both can be tuned (by adjusting their parameters) to similar typical packet sizes

Did you miss where I explained that, except for increasing the baud rate, no tuning is possible with my strategy? Any parameter change which decreases the UDP payload size would not meet my goal of sending only one or more full MAVLink messages in a UDP message.

However, if we complicate the code a bit more, we can detect the end of every MAVLink message (not just the last one in a burst) and reduce this component of the latency to 0. But I wouldn't consider this to be just tuning.

As you point out, your current technique does somewhat better at packing full MAVLink messages into one UDP packet. I don't see any significant advantage here however, since the parser on the recipient side will assemble correctly either way, i.e., all that prevents successful assembly is missing UDP packets, and as much as I'm aware we have no argument that UDP is more reliable for 21-121 than for 6-50.

"Somewhat" is not an appropriate word here at all.

As long as we don't send so many UDP messages that some are lost, I never made any claim that larger UDP payloads are more reliable. I did mention decreased interference with/from WiFi for a 2.4 GHz radio link.

However, to me the ability to use a protocol analyzer to decode MAVLink messages is a very significant advantage! And I don't think that is just because of my background as a network protocol developer. I think the way a protocol analyzer can help any MAVLink user better understand what's going on is the reason the MAVLink library includes a Lua extension for Wireshark to decode these messages.

I think we shouldn't be too critical of Wireshark's limitations. There is a reason: UDP stands for "User Datagram Protocol". It was never intended to carry a stream of bytes. It is not expected that a short message would ever be split between two UDP packets. That's what TCP is for. Because of this, Wireshark's Lua-based decode extension engine does not make it easy to decode a MAVLink message if it's split between two UDP messages, and the Lua script which the MAVLink library includes does not attempt reassembly. In fact, I'm not aware of any network protocol analyzer which would reassemble UDP messages to decode a higher-layer message. What the GCS does in reassembling UDP payloads back into a byte stream and then decoding a higher-layer message from that byte stream is not the way UDP is normally used.

So, at the end of the day it seems to me that the only difference is that there is one analysis tool which gets along somewhat better with one method than the other.

Again, I strongly disagree with your use of the word "somewhat". Wireshark is vastly better with my version. With the original code, 0% of the MAVLink messages from the radio to the GCS could be decoded. Your current version happens to do a little better by virtue of larger payloads, and perhaps 40% of the MAVLink messages can be decoded. With all versions of this PR, I see over 99.5% of the messages decoded. That is a huge advantage to me. The only reason we don't get 100% decode with this PR is probably a bug in an esp32 library.

Concerning analysis tools, I use a python script which is based on a modified version of pymavlink

I can see that this might be useful. Are you interested in sharing it?

But your script sounds quite different from a protocol analyzer which can interactively display a time stamped decode of all MAVLink messages, allowing drilling down to each part of the MAVLink message all the way down to byte level details from a high level summary.

I think a good protocol analyzer would have been very useful when developing things like mLRS or your MAVLink library. And I know from experience that it is invaluable when investigating the interaction between a flight controller and a GCS. Have you ever tried one?

Though it seems you don't personally find this PR as useful as I do, I don't see any negative impact and I highly doubt I'm the only mLRS user who will ever appreciate the ability to use a protocol analyzer.

Again, thanks for reading.

--Brad

olliw42 commented 1 year ago

Some of your responses read to me as if you want to misread what I said. Not commenting on them.

mLRS never produces time gaps in the middle of a MAVLink message.

you pointed this out before, and I indeed wasn't clear: This is of course so in the CURRENT code.

What I failed to bring across: This doesn't necessarily have to be so. I'm playing with a version where the bytes are forwarded immediately when they are received, and the parser so to say runs "in parallel". The advantage would be that this indeed would cut down latency, i.e., the last byte of the message which arrives at the recipient (GCS) will (on average) arrive sooner (and this could be applied also on the receiver side, where we have the same delay mechanism :)). This may be even more relevant when we would have a router inside the tx module. A side effect would be however that also corrupted data would be sent (which is so to say filtered out by the parser currently, which ensures only full and valid mavlink frames). While this may sound like a disadvantage, I'm not yet sure it really is a disadvantage, since that's actually what e.g. routers should do (by the spec).

The bottom line: This is so currently, but it doesn't have to be so, and it in fact may change.

Any parameter change which decreases the UDP payload size would not meet my goal of sending only one or more full MAVLink messages in a UDP message.

pl reread in what context I was making the tuning statements. Hint: not in the context of getting wireshark to work.

Wireshark is vastly better with my version.

well, I'm not going to argue here and thus take back all my comments concerning this :)

The fact however stands that this is to make one particular protocol analyzer happy, which may be needed at best for very detailed in-depth dev work, and not to make the functioning or user experience better, and which furthermore doesn't work as it would have to work in order to actually do what one wants to do (i.e. do what every mavlink parser does). I mean, I also do not leave all my debug code, probably tailor-made, in place. I add it as I need it but keep it with my own.

Concerning its presumed advantages, all I say is we all have our preferred methods to get along. As I indicated, I never felt I need to "interactively display a time stamped decode of all MAVLink messages, allowing drilling down to each part of the MAVLink message all the way down to byte level details from a high level summary". As much as I understand it, this tool may only help with the mavlink traffic, but that's really the simplest part of mLRS.

If it would help with spying on the LoRa traffic and be a tool which would allow us to diagnose errors and so on, then I would be excited, since that's the traffic which is the critical part in the game.

Are you interested in sharing it?

sure

although I must admit I'm hesitating now since it seems predictable that all you are going to say is that it isn't "interactively display a time stamped decode of all MAVLink messages, allowing drilling down to each part of the MAVLink message all the way down to byte level details from a high level summary" ... we all have our own preferred ways, and that's how I do it.

Have you ever tried one?

I think I did answer this more than one time now, didn't I :)

brad112358 commented 1 year ago

Some of your responses read to me as if you want to misread what I said. Not commenting on them.

I find this a little insulting. Why would I want to misread what you say? If I appear to have misunderstood what you said, then that's what happened. I hold no malice in technical discussion, and I don't argue for the sake of argument; I assume you don't either.

What I failed to bring across: This doesn't necessarily have to be so. I'm playing with a version where the bytes are forwarded immediately when they are received

I wish you had mentioned this earlier so I might not have wasted my time on this conversation.

On the other hand, I think increasing the baud rate on both ends of the radio link would be an easier way to reduce latency. And as you have said in the past, code can always be changed, so I'd rather consider what we have today.

Have you ever tried one?

I think I did answer this more than one time now, didn't I :)

I thought you only said you had not used Wireshark. There are several other similar protocol analyzers. Since you haven't used any of them, I suspect you might find them more useful than you think if you ever do try one.

olliw42 commented 1 year ago

I wish you had mentioned this earlier so I might not have wasted my time on this conversation.

I'm sorry, all I can say.

And as you have said in the past, code can always be changed, so I'd rather consider what we have today.

at least I've mentioned now that all your Wireshark efforts may break in the future, so you won't have to wish I had told you earlier ;)

I thought you only said you had not used Wireshark. There are several other similar protocol analyzers. Since you haven't used any of them, I suspect you might find them more useful than you think if you ever do try one.

well, I haven't read the question that generally ... I use e.g. the protocol analyzer in the sigrok logic analyzer software, I also remember using a GPIB protocol analyzer, and a labview protocol analyzer, and ... I'm not sure how this would be helpful for the discussion however ... concerning MAVLink, and this is how I read the question, I think I have said what tool I use to do my analyses :)

(and as simple as you may call it, it has one really unbeatable advantage: I can code it to do what I need for a specific need, without having to learn yet another "language", simply unbeatable)

olliw42 commented 1 year ago

is there anything to be added, or is the implementation in the current sketch sufficiently ok and the issue "solved"? Could this here be closed?

olliw42 commented 1 year ago

assuming it's all ok. Hence closing. If there are still points to resolve pl reopen or raise a new issue/pr. THX for bringing up the topic, which helped to improve the sketch! :)