olliw42 / mLRS

2.4 GHz & 915/868 MHz & 433 MHz/70 cm LoRa based telemetry and radio link for remote controlled vehicles
GNU General Public License v3.0
295 stars 67 forks source link

Bug? | Channels 13+ Odd behavior at low signal #78

Closed jlpoltrack closed 1 year ago

jlpoltrack commented 1 year ago

When I was flying, I noticed that a mode I had assigned to Channel 13 got triggered when I had low signal. Should this this be expected? Not 100% sure on the meaning of '4 channels with three steps (CH13 - CH16), 2 of them with a higher reliability margin (CH13, CH14)'

Here's a graph from the FC log with LQ in green and Channel 13 in blue and Channel 14 in red. Both of these channels have values that change at low LQ. Channel 13 is a trim switch and Channel 14 isn't even assigned on my Tx so I'm not sure why these would change (perhaps corruption?). I went through all of the other channels and didn't see this behavior.

image

olliw42 commented 1 year ago

this is strange

first off, concerning "with a higher reliability margin": The frame from tx to rx is split into four parts, a header, a first rc data section, a second rc data section, and lastly the serial payload section. The first two parts (header+1st rc data) are protected by their own crc (so-called crc1), and the whole frame is protected as usual by a final crc. See https://github.com/olliw42/mLRS/blob/main/mLRS/Common/frame_types.h#L102-L112. The idea is that the shorter length of the first two parts (header+1st rc data) may make it less likely to be disturbed than the long complete packet, so that there is a higher probablilty for this piece of data to be received correctly than for the full frame. And this indeed seems to work. There is btw also an early yt video there I talk about this.

concerning your issue, few thoughts

is it possible that this is a failsafe effect? I.e., that the failsafe setting for these channels are not their normal state, so that in failsafe they would flip? I admit the data you show wouldn't suggest this immediately, but I guess one would want to check.

is it possible that your switch settings are just at the edge of switching threshold, so that e.g. some noise in the channel data as it comes from the radio could do that. I e.g. note that the last blue flip is not really in an area of low LQ.

the only way I see for invalid data to make it through the syncword and crc checks is that the data got disturbed in a way that it is not picked up by the crc1. So, it hardly can be just a modifcation in the ch13/ch14 data, but there should be also some modification in some other channel(s). Since ch13/ch14 is just 3 state, a distubance would have the strong effect, while in the other channels it might be just a very small modification which slips attention. The point, if this is what is happening then, if you would analyze the other channels, you also should see some abnormity there.

jlpoltrack commented 1 year ago

Thanks for the details here, notes back:

is it possible that this is a failsafe effect? I.e., that the failsafe setting for these channels are not their normal state, so that in failsafe they would flip? I admit the data you show wouldn't suggest this immediately, but I guess one would want to check.

This was the first thing I checked but I have the Rx setup to 'no sig' for failsafe.

is it possible that your switch settings are just at the edge of switching threshold, so that e.g. some noise in the channel data as it comes from the radio could do that. I e.g. note that the last blue flip is not really in an area of low LQ.

This would make sense but Channel 14 (red above) wasn't configured on the Tx mixer at all - so it should be solid.

the only way I see for invalid data to make it through the syncword and crc checks is that the data got disturbed in a way that it is not picked up by the crc1. So, it hardly can be just a modifcation in the ch13/ch14 data, but there should be also some modification in some other channel(s). Since ch13/ch14 is just 3 state, a distubance would have the strong effect, while in the other channels it might be just a very small modification which slips attention. The point, if this is what is happening then, if you would analyze the other channels, you also should see some abnormity there.

I looked at another log and found another time this happened on Channel 13 (if I recall correctly, Channel 13 wasn't setup in the Tx mixer during this flight) and at the same time Channel 4 moved as well. The excursion was very short but synced. Green + Pink then Green + Blue in below images.

Channel 4 + 13 Channel 4 + 13 Zoom

Lastly, any issues if I borrow 4 bits from the 11-bit channels and reassign them to the 2-bit channels for testing (making them 4-bit):

image

jlpoltrack commented 1 year ago

@olliw42

olliw42 commented 1 year ago

fundamentally there is no reason to not do anything one wants the current state is the result of some lengthier discussion of what channles would satisfy most users, and to keep the rc1 small so that the crc1 works better pl realize that this is the kind of thing there when you ask 100 people you will get 110 opinions about what one should do ... democracy doesn't work here, it needs a dictator to get to something :)

concerning the strategy to changing things here and see if things get "better": It would be super uncomfortable if one would find that with some change it "suddenly" would work, based on pure empirical evidence, and not know & understand why the issue did happen for the current setup. It would not raise my confidence level at all. I.e., what I'm trying to say is that I would want to know why it happens currently, how it can pass the 16 bit crc checks, and - sorry to say - we don't have plenty of reports of this happening (in my understanding whsqz on discord kind of represents dozens of users).

espeically the how can it pass the 16bit crcs I find strange

jlpoltrack commented 1 year ago

espeically the how can it pass the 16bit crcs I find strange

Definitely strange but I suppose not impossible. Looking at what happened it seems:

Another possibility is the issue started on the Tx side but I find this very unlikely and wouldn't explain the correlation with low link quality.

Is there anything we'd want to try (other than how to recreate)? Switching to the higher coding rate was trying to prevent against interference which I thought was the likely cause here.

jlpoltrack commented 1 year ago

One idea to make the data for crc1 smaller:

Move status after rc1: https://github.com/olliw42/mLRS/blob/main/mLRS/Common/frame_types.h#L106 Move this logic for rc1: https://github.com/olliw42/mLRS/blob/cbb6965bdc7e7762d2de742fac4efd44368eed5a/mLRS/CommonRx/mlrs-rx.cpp#L380-L384 up to the top of the 'process_received_frame' https://github.com/olliw42/mLRS/blob/cbb6965bdc7e7762d2de742fac4efd44368eed5a/mLRS/CommonRx/mlrs-rx.cpp#L371

crc1 would then only be applied to sync word and rc1 (8 bytes instead of 13).

olliw42 commented 1 year ago

concerning such ideas these and other changes may all help to mitigate the impact/effect of interefernce BUT, IMHO, we just cannot allow that interference produces incorrect rc signal values, it only can be "good values" or "reject values", but not "partially valid/invalid values" ergo, whatever improvement may help with reducing the impact of any interfernce is a nice goodie, but not what we need to understand or work out

given the fact that essentially all other rc links don't appear to have that problem, and they all use a similar crc, I just can't believe that it is soo likely that corrupted data can pass the crc check. I also just can't believe that it should make a heck of a difference if the crc1 is testing 8 bytes or 13 bytes. It's a 16 bit crc after all. However, that's of course just my feeling.

Anyway, I guess we should not ignore the possibility that the crc check is indeed good, and that the issue is indeed somewhere else in the code, where we would not expect it currently. In short, what would we conclude if we would take the crc check to be 100% reliable? In some way I find that easier to go with than accepting the the crc16 should be soo bad. ;)

Btw, you correlate the issue with low LQ, but you show two instances where LQ is not low. So, I think, we also cannot make this assumption, i.e., should not take "low LQ" as a hint towards the possible source of the issue.

Btw II, if talking about ways to mitigate, one probably can nearly remove the issue by a filter, which allows larger changes to happen only if observed e.g. two times. This would add delay, but only to big step-wise changes, so would not affect much controlling the vehicle. But I really would consider such things, or different SFs or CRs or frame structures, etc pp, only once we know the issue is not elsewhere in the code :).

For the moment, my line of reasoning will be: what would we conclude if we would take the crc check to be 100% reliable?

not sure this makes sense

olliw42 commented 1 year ago

@jlpoltrack

concerning the crc16, I think it is not hard to proof if or if not it could allow the false rc data to slip through. One just can take the data and calculate the crc, then modify the data as seen, recalculate the crc16, and then see if it's the same. I couldn't do it for the specific cases you have shown, since I haven't the data, but I used few of my own frames, and modified the rc data as you described, and in all (the few) cases I did this the crc16 did not come out the same. So, in addition to the modification of the rc data also the crc16 field would have to be distorted in very specific ways, or other bits in the header, for the false rc data to pass. Maybe you can confirm/disconfirm this by own test calculations.

so, for the moment, I consider it unlikely that it is indeed an issue with the ota transmission channel itself.

This implies that the issue is either on the tx side or the rx side. I somehow suspect the rx side, and currently focus my investigations on this.

In fact, while totally speculative, I look at if it's an issue with ArduPilot. AP doesn't provide a specific CRSFv2 setting, so always does this baudrate thing etc. pp., and I do indeed see "strange" crsf-related messages in MissionPlanner. While I can't say I momentarily could explain in detail what may go worng, it just rasies my suspicision. Further reasons for me to look at this are that it may explain why it happens at low LQ values, and because AP has some rc protocol issues they constantly have to patch (just recently had been again a big one).

I'm trying to reproduce the issue, but wasn't yet successful. Maybe also because I lack some info

Edit:

Another way to test the rx speculation would be to not use the crsf channel but use RC_CHANNELS_OVERRIDE to tell AP the rc data. That is, the question would be, if you would enable RC_CHANNELS_OVERRIDE in mLRS, woudl you then still get these false rc data issues?

jlpoltrack commented 1 year ago

My setup:

Another way to test the rx speculation would be to not use the crsf channel but use RC_CHANNELS_OVERRIDE to tell AP the rc data. That is, the question would be, if you would enable RC_CHANNELS_OVERRIDE in mLRS, woudl you then still get these false rc data issues?

I like this idea but do have one concern - do you know how the failsafe behavior works when using this option? A quick search turned up this topic which doesn't give too much confidence: https://discuss.ardupilot.org/t/detect-rc-failsafe-while-sending-rc-override-messages/89787

Are you thinking the issue might be specifically with CRSF protocol in some way? If so, it's an easy change in the lua to use SBUS for the rc - I'd be more comfortable trying this first.

One other thing I noticed in the log was I was seeing values out of range I should be able to produce. With the current setup my max rc values is ~1994, see CH4 | Yaw:

image

However the log showed values over 2000 (time scale is pretty short as well, I don't think I can input commands this fast):

image

olliw42 commented 1 year ago

MANY thx for the details

good it's also not IOMCU, should give better chances for me to see the same (I'm using a MatekH743, wo IOMCU)

Arduplane 4.4.0 is good to, I can use that also (i.e. ArduCopter 4.4), and I also have SBUS_NI, CRSF yeah, should be 23 not 29, just wanted to double check ;) so I should somehow be able to also get this, let's see

don't know how failsafe and rc_override cooperate, never used that really except for testing. But I know that Brad is using this all time, and I think also VRquaeler is using this, isn't he? Maybe they know.

SBUS could be worthwhile to check too ... it's not exactly the same as it also goes through the RCProtocol library etc. pp, while rc_override really goes quite a different path. On the other hand, sbus also would avoid potential issues with this crsf baudrate negotiation and the pings and all that bi-directional crsf stuff. So, I think, yes, should be an interesting datapoint :)

over 2000 .... oh, that's interesting indeed, need to look at the codes what this could mean ... looks like some overflow somewhere

jlpoltrack commented 1 year ago

Just did a 60 minute flight on 0.3.27 on a different airframe and didn't see this issue (did do the RC calibration before the flight). Will follow-up when I try out the same airframe that had the issue.

Edit: Looks fine on airframe that experienced this before with 0.3.27, will try out the MAVLinkX branch again next.

Edit2: Also had upgraded to Arduplane 4.4.0 Beta 2 on both planes.

olliw42 commented 1 year ago

@jlpoltrack

thx for that

so, it now - somehow - seems to be all ok on both airframes, also the one giving troubles before?

on the troubled airframe, you did i. upgrade to mLRS 0.3.27 ii. upgrade of AP to 4.4.0 beta2 iii. RC calibration ???? did you do that for this frame too ???

it's somewhat uncomfortable that the issue just seems to have gone away without us knowing what it was I can't see anything in 0.3.27 which would make it go away In 4.4.0 beta2 there is the RC bug correction included (fixed a bug in RC input handling on the IOMCU), but it's supposedly only affecting IOMCU. The PR does one change in teh main code but it's hard to believe it would be it. (https://github.com/ArduPilot/ardupilot/pull/23848/files#diff-bf454e0c0473f1a4d6a4ac8b7b431977e7fd0ba83f1d35675fca0380985808d4https://github.com/ArduPilot/ardupilot/pull/23848/files#diff-bf454e0c0473f1a4d6a4ac8b7b431977e7fd0ba83f1d35675fca0380985808d4) So, the primary suspect would be RC calibration...

if you did an RC calibration on this airframe, I guess it would be interesting to revert it, i.e., set the MIN;MAX;TRIM to the old values, and see if the issue comes back

jlpoltrack commented 1 year ago

on the troubled airframe, you did i. upgrade to mLRS 0.3.27 ii. upgrade of AP to 4.4.0 beta2 iii. RC calibration ???? did you do that for this frame too ???

Yes, I did all of these things to the airframe that saw the issue before. I can certainly apply all of the old rcin settings and try again and share what I find before trying out the MAVLInkX branch (where I saw the issue).

olliw42 commented 1 year ago

MAVLInkX branch (where I saw the issue)

ah, maybe it's an issue in here ... I need to check this then :) maybe some buffer out of range writes or so ...

vrquaeler commented 1 year ago

Want to reproduce the issue and have 2 questions:

Thanks.

jlpoltrack commented 1 year ago
  • The issue appeared only on the MavlinkX branch?

Yes, that's correct

  • Which commit you build from?

I am not 100% sure, but I believe this one (I usually build right before flying, so the dates makes sense) https://github.com/olliw42/mLRS/commit/5db214795c497e1aef35fec3a170b129f83a7a12

A couple of quick notes. I haven't tried out the MAVLinkX branch recently and I haven't seen this on 0.3.27 after many hours.

olliw42 commented 1 year ago

it woud be important however to determine if this is really mavlinkX specific or not (I can't see a reason why it could relate to mavlinkX), or if it is < 0.3.27 specific, or dependig on other conditions

I will update the mavlinkX branch ASAP

@vrquaeler many thx that you are willing to trying to reproduce this, much appreciated

as much as I know this here is the only report of this issue so far (I couldn't reproduce it)

jlpoltrack commented 1 year ago

Closing, since can't seem to reproduce this currently.