olliw42 / mLRS

2.4 GHz & 915/868 MHz & 433 MHz/70 cm LoRa based telemetry and radio link for remote controlled vehicles
GNU General Public License v3.0
266 stars 55 forks source link

dev ARQ (retransmit) #185

Open olliw42 opened 1 month ago

olliw42 commented 1 month ago

this is a dev/test branch for working on retransmission

this one is simplified in two ways, to not make it too complicated and facilitate easier testing:

@brad112358 this might interest you, given you showed interest in this before. I actually would much appreciate you cheking it out; your strength in finding the little loopholes/bugs would be useful :)

anyone else is of course also massively welcome !!!!!

olliw42 commented 1 month ago

a pic I made to help me mlrs-arq

brad112358 commented 1 month ago

Thanks for including the very helpful and, I think clear, diagram. So far, I've only looked at that, but I already have a few comments/observations/suggestions.

1) You seem to be sending back only ack or nack and not the received sequence number. I believe this is not optimal as we observe below.

2) Sequence number 5 shows sub-optimal behavior which is a direct result of 1). In the 8th response, we send a nack with no indication of which sequence number was lost, even though, we have previously received and accepted the frame which was lost.

3) Instead of responding with just ack or nack, it would be better to send back a received sequence number and to always just acknowledge the last successfully received data. In other words, it doesn't matter much that we haven't received the most recent message. What matters most is what was the last message we successfully did receive. This change would have eliminated an unnecessary retransmission for sequence number 5.

4) The diagram doesn't specify the size of the sequence number field, but they seem to be at least 3 bits. I think the sequence numbers don't need to be so large. In fact, I think they can be 1 bit. This is because we have alternating transmission direction with no chance of out of order reception. This reduces the overhead for the sequence number and has the added advantage that responding with the last received sequence number rather than ack or nack still only requires 1 bit.

5) What I describe above is actually the retransmission method ELRS uses (last time I checked) and I think it is probably optimal in terms of overhead. They can only use it in one direction because they don't send an equal number of frames in both directions, but we can use this method in both directions.

6) There might be some small utility to using sequence numbers larger than 1 bit or to also send the ack/nack flag in terms of allowing each side to more accurately estimate the loss in both directions.

7) I agree we should limit the number of retransmissions. Even with 1 bit sequence numbers, I think we can limit the number of retransmissions to any value we like as long as both sides agree what the limit is (both sides would need a counter for each direction).

Does this make sense to you? Am I missing something?

olliw42 commented 1 month ago

many thx for the comments

concerning all points related to seq not 1 bit: YES, 1 bit is fully sufficient for a basic mechanism. Eventually we may want to change. However, for purely historical reasons the seq number just happens to be 3 bit, and at this point I see no reason to change that. We know we could/will. Not a relevant point IMHO :)

there is btw always a situation which is not optional. See e.g seq 3

the protocol is just the standard protocol as on any web page (there are various names for it, so no name here)(all 1 bit). There are two versions, those which send an ack, and those which send the next desired number. The main challenge is handling the various non-protocol related states, like disconnects, frames which carry commands not serial data, etc. pp. Hence these states, and I figured that the send ack version is easier to handle these states.

it's possible that with > 1 bit one may reduce few edge cases

the problem with 5 is probably solved by using the other method, to send the next desired no. Hm. Maybe I shoud convert to that. EDIT: yes, I think that's what we should go to ...

brad112358 commented 1 month ago

Can you elaborate a bit on what cases in mLRS the ack/nack is better than sending the sequence number of the last received frame? (You are faster at edits than I am at replys) If I understand your edit correctly, that was my point. Is there any practical difference between sending the next desired number vs the last received number?

brad112358 commented 1 month ago

Also, 3 is only sub-optimal if you allow for sending ahead of acknowledged (whether by sequence number or ack) frames which requires more buffering and can result in out of order delivery.

I think such type of protocols are out of question

olliw42 commented 1 month ago

sorry, I edited your post ... grrrr, this damed github, not the first time I stepped into ths trap

olliw42 commented 1 month ago

Is there any practical difference between sending the next desired number vs the last received number?

the book keeping and state handling looked easier to me

brad112358 commented 1 month ago

the book keeping and state handling looked easier to me

To me, the opposite. Sending back the last received sequence number means you just check if the acknowledged sequence number matches what you last sent. If so, move on, if not, retransmit. The other way seems to require you to add when you respond and subtract when you compare. But, I suppose it's mostly a matter of how you think about it. Either way, there is not much state involved except for deciding when to give up on retransmission of a given frame.

Many of the algorithms found online and in text books are intended for more complex systems which don't just ping-pong messages at a constant interval like we do, so a very simple method can be optimal for us if we rule out buffering more than the most recent transmission as you have.

olliw42 commented 1 month ago

I was trying this initially, but then concluded for the ack, I will retry, could be that at the early stages I also had too much of how to abstract the code in mind. I didn't had sorted out the states initially. Anyway, it has benefits, so whatever what it's going to be so :)

olliw42 commented 1 month ago

@brad112358 so, changed it now to send the last received seq no as ack, instead of ack/nack-ing reception

seemingly works, in that it connects etc. pp, but the symptom described by @jlpoltrack also exists, i.e. MP shows lost packets ... so, appears something is still not working as expected ...

here the time plan for the changed protocol (note, ack is only 1 bit, seq is 3 bits)

grafik

olliw42 commented 1 month ago

with

#define USE_ARQ_TX_SIM_MISS 9 //9
#define USE_ARQ_RX_SIM_MISS 5 //5

I do see continuous packet losses in MP, MP tells pretty stably 95-96% link quality, so around every 20iest packet is lost

the mLRS LQ metric on the OLED display tells something around 75% ... not sure if that means that the mechanism is helping

I do my tests btw in 19 Hz mode (with a 2.4 GHz system)

brad112358 commented 1 month ago

And you didn't add a retransmission limit yet? So something must be wrong if MP is correct. When I get some time, I'll try to reproduce the problem with QGC.

brad112358 commented 1 month ago

Do you use Bluetooth or UDP or TCP WiFi or wired serial for the GCS connection?

olliw42 commented 1 month ago

And you didn't add a retransmission limit yet?

yes, no retransmission limit

So something must be wrong if MP is correct.

yes :)

Do you use Bluetooth or UDP or TCP WiFi or wired serial for the GCS connection?

wired serial connection, from tx serial via usb-ttl to PC

one potential source of problem which I have not yet ruled out is that the stream flow control isn't good enough, so that AP sends too many messages, so that some are dropped every once in a while ... I'm using 19 Hz, so there is some restriction. It's also curious that 95% is close to 1/19 ... can't see though how that could correlated. Some more tests should clear up some speculations.

brad112358 commented 1 month ago

Looks like that fixed it. QGC is now reporting 0 lost messages

olliw42 commented 1 month ago

Looks like that fixed it. QGC is now reporting 0 lost messages

YES :) @jlpoltrack made the relevant comment

not sure you also follow the discussion at discord

brad112358 commented 1 month ago

Well, It ran well for a while and then it started dropping a lot of messages and it seemed to get worse over time. I've power cycled both ends of the link (one at a time) and restarted the GCS, but it hasn't recovered. I'm not sure what happened.

brad112358 commented 1 month ago

I had the baud rate too high for the crap R9M inverter with weak pullup. Working fine at 115200 serial speed on the Tx

olliw42 commented 1 month ago

@brad112358 @jlpoltrack @tmcadam I think that this branch could get now serious review and testing. Your opinions and comments would be much appreciated. :)

olliw42 commented 1 month ago

MANY THX. Much appreciated

olliw42 commented 1 month ago

@brad112358 maybe you can help me out a bit with your brain. I struggle a bit over this frame_lost thing.

I do see generally two options:

  1. on the tx module side we also have a retry counter, which somehow is managed to be always in sync with the receiver side.
  2. the tx module can somehow retrieve the information when the receiver side gave up sending a palyoad and jumped to the next, which would signal that the mavlink parser needs to be reset

I don't like option 1. at all. First, it doesn't sound very robust to me on general grounds, i.e. looks difficult to ensure that. The sort of delicate code I like to avoid if possible. Especially when we do adaptive retries per payload, as we do (and IMHO want do).

So, leaves option 2.

The problem here is that I come to the conclusion that we need an extra bit, or some extra mechanism to reliably ensure the tx module knows when to reset.

Let's assume we have set retry_cnt = 1 (i.e. one retry), and let's start with the receiver sending payload no = 5. A sequence could then look as follows ('->' means from receiver to tx module, a 'x' means missed on tx side):

-> 5 => accept -> 5 x => missed (receiver has not moved on because it didn't got ack from tx module) -> 6 x => missed (receiver has given up since it was 2nd attempt, so went ahead with next payload) -> 6 x => missed (receiver has not moved on because it didn't got ack) -> 7 x => missed (receiver has given up since it was 2nd attempt, so went ahead with next payload) -> 7 => accept (receiver has not moved on because it didn't got ack)

Here the tx module can detect easily that a payload was missed since 7 - 5 > 1. This it could do even then the seq no would be just 1 bit.

However, you may now start to see where I see the issue. Let's consider this sequence:

-> 5 => accept -> 5 x => missed (receiver has not moved on because it didn't got ack) -> 6 x => missed (receiver has given up since it was 2nd attempt, so went ahead with next payload) -> 6 x => missed (receiver has not moved on because it didn't got ack) -> 7 x => missed (receiver has given up since it was 2nd attempt, so went ahead with next payload) -> 7 x => missed (receiver has not moved on because it didn't got ack) -> 8=0 => accept (receiver has given up since it was 2nd attempt, so went ahead with next payload)

Here the tx module can also detect easily that a payload was missed since 8 - 5 > 1 (ignoring the required modulo for the sake of simplicity). But it can do so ONLY because our seq no is more than 1 bit. If it were 1 bit, we could not distinguish between whether all is good or if the receiver has skipped some payloads.

Now we can go on, and eventually will see that even the 3 bit seq no will not be sufficient to unambigiously detect the required parser reset.

So, I come to the conclusion, we need to do something here.

Do you agree with that conclusion?


In case you agree:

So far, few approaches to resolve the situation came to my mind

  1. the simplest: we decide to be happy with the fact that with a chance of ca 1/8 we may not do a parser reset even though required
  2. the next "simplest": we add a bit which allows the receiver to tell the tx module that it needs to reset the parser
  3. I'm not sure if that is possible: the receiver doesn't advance the seq no beyond a difference of 7, so that the tx module could always detect by seq_no - seq_no_last > 1

What do you think? A better idea in mind?

brad112358 commented 1 month ago

What do you think? A better idea in mind?

I don't think it was better, but you may remember I did, long ago, propose a different solution which would work here at the expense of an extra byte. And I think I claimed at that time that it would also work well with retransmission.

However, lets ignore that and see if we can do this without increasing the overhead.

I think the simplest solution is to agree to always limit the number of retransmission attempts to less than 8. As long as the sender (I'll avoid saying Rx or Tx so it applies to either direction) never retransmits more than 7 times in a row, the receiving side can detect when and usually how many frames have been lost (better metrics).

  1. Count the number of frame times (FrameMissed events) and if > 7, set frame_lost=true
  2. When a frame is received, even if not accepted, reset this counter
  3. When a frame is accepted, compare the sequence number to the previous frame accepted and if a gap exists, set frame_lost=true

I think the addition of the counter prevents the issues you found with just comparing sequence numbers and we don't need to know exactly how many attempts the sender might make as long as both sides agree it will never be more than 7.

This does restrict us from reducing size of the sequence number, but I'm not sure we would do that anyway. And, I doubt we would ever want to retry more than 7 times because this would cause high frame loss rates in only one direction to cripple frame flow in the other direction.

What do you think?

olliw42 commented 1 month ago

I don't think it was better, but you may remember I did, long ago, propose a different solution which would work here at the expense of an extra byte. And I think I claimed at that time that it would also work well with retransmission.

tz. If I would have consistently done what I was suggesting this issue wouldn't exist. So, you are complaining that I was doing the parser reset which you were suggesting causing the issue. Pretty distorted logic, frankly.

I think the simplest solution is to agree to always limit the number of retransmission attempts to less than 8. As long as the sender (I'll avoid saying Rx or Tx so it applies to either direction) never retransmits more than 7 times in a row

It might be that you have missed the point. In the above retransmission was limited to 1, much less than 7. The issue is due to that more than twice the retransmission cnt frames can be lost.

Also, on the tx module side you don't know the retransmission count setting, so arbitrarily reseting at 7 won't resolve the issue at all.

Probably it's best to just add a bit.

brad112358 commented 1 month ago

tz. If I would have consistently done what I was suggesting this issue wouldn't exist. So, you are complaining that I was doing the parser reset which you were suggesting causing the issue. Pretty distorted logic, frankly.

I'm sorry. I don't fully understand what you are trying to say above but you seem to be interpreting what I wrote in a far more adversarial tone than I intended. Please re-read what I wrote. I did not complain about anything! This problem just reminded me of another approach (including a pointer to the start of the first MAVLink message) we once discussed to solve the problem of identifying the start of MAVLink messages in the face of lost radio frames. I don't know what to think of your "distorted logic" comment except that you sound annoyed and I don't understand why. I guess I shouldn't have mentioned it at all since I did say I don't think it is a good solution now.

brad112358 commented 1 month ago

Also, on the tx module side you don't know the retransmission count setting, so arbitrarily reseting at 7 won't resolve the issue at all.

I left out some assumptions and math details I should have discussed. A frame we accept may not have been the last of that sequence number to be sent, so waiting 7 frame times is not enough to account for up to 7 retries. But I still think my idea works without needing an extra bit.

Let me try explaining in a different way. First, as a thought experiment, let's assume we know that the maximum retry count on the sending side is 2, just as an example, and lets assume the receiving side just accepted sequence number 5 as in your example. It doesn't matter which frames are lost in which direction or even if all frames are lost in both directions. We know that the sending side is now still attempting to send either seq 5 (If it didn't get our ACK) or seq 6. If it doesn't get the ack for 5, it will give up sending seq 5 in no more than 2 more attempts (we don't know if the one we accepted was the first, second, or third time it sent seq 5). Then, we know that once it gives up on seq 5, it will send seq 6 no more than 3 times. Thus, from our perspective on the receiving side, we know that if we haven't received and accepted seq 6 within 5 (2+3) frame times, we must have lost something and we can set frame_lost=true.

The math is harder for the receiving side if we don't know how many times the sending side might resend so we need to put some bounds on it.

If the sending side is limited to 1 resend, we know that we have irrecoverably lost at least one frame if we receive nothing for 1+2=3 frame times. If the sending side is limited to 2 resends, we know that we have irrecoverably lost at least one frame if we receive nothing for 2+3=5 frame times. If the sending side is limited to 3 resends, we know that we have irrecoverably lost at least one frame if we receive nothing for 3+4=7 frame times. If the sending side is limited to 4 resends, we know that we have irrecoverably lost at least one frame if we receive nothing for 4+5=9 frame times. And so on.

If we further know the sending side will retry at least once if it doesn't receive an ACK for the first attempt, then we know that in the worst case where all messages are lost, the sending side will take at least 7+8=15 frame times to wrap sent sequence numbers back to the last one we accepted.

If we further restrict the sending side from retrying more than 7 times (8 sends total before giving up) then, if we receive nothing for 7+8=15 frame periods, we know we lost something.

This is the basis for the method I outlined in the 3 steps above.

Does the logic make sense to you now? I think we can handle a range of 1-7 retransmissions by setting the threshold for our frame counter to 15 lost frames on the receiving end. Or, we can handle a range of 0-3 retransmissions by setting the threshold for our frame counter to 7 lost frames on the receiving end.

olliw42 commented 1 month ago

I think the 3rd suggestion I made should also work, Avoids a bit and has no limitations on the retries.

brad112358 commented 4 weeks ago

The only one of your 3 approaches which I'm confident in is 1 and without adding the counter I proposed, it does have that small problem of not always detecting loss.

Does 2 actually work given the unreliable delivery of frames in both directions? How does the sending side know when to set the bit and when to clear it?

I'm also not understanding the details of how 3 would work. Can you provide some more details? Again, since messages may be lost in both directions, how does the sending side know when it is advancing the sequence number too far?

olliw42 commented 4 weeks ago
  1. doesn't work fully, like with 1., cases exist where it fails. An example sequence is that tx receives but rx doesn't ge the acks ... if that happens in a row of 7 or more times, the tx will not accept valid frames. I could push a code, but not sure it's worth it.

I too have doubts that 2., i.e. adding a bit, works without issues (though have no proof yet).

I don't like any of all these schemes. Too much involved, too much dependencies.

olliw42 commented 4 weeks ago

I think that the problem is that the whole retry thing doesn't work in the first place, and we are always struck by the issue that the seq is only 3 bit and the ack only 1 bit. Not talking about framelost/reset. I come to the conclusion, we can't ensure that we always do the right thing, once we have failure sequences which are longer than 8 times, the seq_no wraps and we have an issue. So, the only way out is to kind of model the behavior of the other side, and from that reconstruct the actual seq_no. It's in some way what your counter does. The seq_no would have to be longer than the number of frames for disconnect. I must admit I didn't anticipate how difficult this "catch the frame losses/reset" thing would be. I also admit that your added byte would resolve that easily. Issue with this is that the code to fill that byte would be pretty nasty. So, we can have nasty code here or there.

For the moment I did option 1. See next push.

Maybe we also simply need to come to the conclusion that if we have long failure sequences, that we simply have to accept that we can't recover everything to 100 %. I mean, long failure sequences mean pretty low LQ anyway. I think experience is anyway that below ca 40% the link is pretty "erratic" anyway, in the sense that it's not like smoothly around a stationary value. Maybe we just want too much.

olliw42 commented 4 weeks ago

another point: there are two optins for the returned seq_no. (a) the recipient can return which seq_no it has received, (b) the recipient can return which next seq_no it wants to have. (both are different from the inital ack/nack) Currently it's (a). I looked at both schemes and concluded they are equal. But I haven't taken into account the frame lost detection. And maybe my conclusion is also wrong. I.e., maybe one should think through that procedure. just to have mentioned that.

brad112358 commented 4 weeks ago

I don't think your (a) or(b) above can have any real impact on lost frame detection since they convey exactly the same information and either value can trivially be calculated from the other.

With regard to lost frame detection, I agree that finding a workable method depends on the behavior of the sending side so we need to nail that down first.

On the sending side, I think we will want more than a simple fixed limit on the number of retries for each frame.

1) The two cases where we resend a frame are a) loss of the frame which would have contained the ack and b) receipt of an ack for the older (not most recently sent) frame (call it nac). 2) When the ack is lost (a), we have no information to tell us if the last frame we sent was received so we resend, but we should probably have a small limit on the number of retries in this case because it is still possible that the frame was already received and we don't want to totally cripple message flow in one direction just because the other direction is unreliable. I would suggest that one or two retries may be all we want to do in this case. 3) When we do receive an ack which indicates the last frame we sent was lost (b) (nac), the limit on retries should be larger because we know we are not just wasting bandwidth by resending. But, there should still be a limit on the number of retries because the data contained in the frame gets more stale with each resend and eventually it is more useful to give up and send a newer message if we have one waiting to be sent. I'm thinking something on the order of 1 second worth of retries at the most. Or, perhaps how long we try should depend on how many messages are waiting behind it. Perhaps we should even discard more than one message if there are many messages waiting so we send fresher information. 4) So I think we need to increment two retry counters when we send a frame, one which is reset when we receive any message (ack or nac) and the other which is reset only when we receive an ack (for the last frame sent). 5) When we give up on retry and drop a frame, if it's because of a nac, we should also flush any partial MAVLink message and start the next frame with a new MAVLink message. If we gave up because of a lost ack, we should not flush because the frame we are discarding might have been received so we should continue to send the rest of the MAVLink message.

Does this sound correct?

olliw42 commented 4 weeks ago

interesting and creative idea, but I think it's way too complicated really. It also doesn't solve the main issue, AFAICS. I think the general strategy to have just 1 retry strikes a pretty good balance, is simple and also converges well into low LQ area. Doing few more retries when LQ is high or for the high speed link I think is also acceptable, but even here I would think one has to ask if it's worth the added complexity. I would think that the factual improvement in terms of user experience of these super sophisticated schemes will be minor. The big step is going to have one retry at all, IMHO. Retrying for 1 sec is out of my mind.

As regards frame loss, with the simple option 1. added we cover most situations well. It's failures really matter only in very low LQ situations, and here things are not nice anyway.

So, I guess I consider the topic largely "solved", i.e., my focus shifts towards doing more testing of the code.

I'm not sure you could find this agreeable.

My mind is also shifting towards adding a full reparsing to the mavlink parser, as this would resolve the frame loss issue completely. In fact, my mind is slightly different. Since quite some while, I would want to change the transmission such that the mavlink frame marker is not at the end of a mlrs frame payload, or split across two in the case of the two byte frame marker. Unfortunately, because of all the parsing and fifos etc it's not so easy to implement this. As a side effect of that it would also become easy to do the frameloss detection, pretty much like your old idea, namely simply scan the payload for the marker, compare to the parser expectation, and reset if no match. So, my mind is currently on the marker-not-at-end thing :)

olliw42 commented 3 weeks ago

@brad112358 I've pushed a code addition, which currently is nfc, but shows an idea which crossed my mind, to solve the parser reset thing

your old extra byte idea was to know where the next frame marker is, and compare to the parser state, and from that work out if something is missed. I mentioned in the above the idea, that the same could - in principle - be achieved by scanning the payload for a marker, and then do the same. Problem is a bit that the mavlinkX two-byte marker can be split accross frames (the normal mavlink marker is useless as it is just one byte). However, I know realized that this issue isn't really an issue because the mavlinkX parser does retract the header. Which means that if the parser is in the state of scanning the data in the mavlink frame, the marker is safely not split across frames. Likewise, if it is split, we can let the parser do it's job. Long story short, we only need to scan the data area of the mavlink frame for whether where is a marker. The added code does that. It doesn't currently do the actual reset, but the comments say where this should happen. Wanted to show that for you to see it, and possibly have and opinion on it :)

olliw42 commented 2 weeks ago

merged to main