Bug? | Weird Tx/Rx Disconnect

jlpoltrack commented 1 month ago

Latest firmware on R9M (Tx) and R9MX (Rx), within ~5 minutes of having serial connection get:

Tx reports Rx is disconnected (red LED), Rx appears connected (green LED)
Rx still outputs RC data over CRSF
Lua shows only Tx, no Telemetry elements

Switching to Mavlink instead of MavlinkX seems to make it go away. Issue seems similar to previously reported issue when using GCC 12. No problem with connection when there is not an active serial connection.

olliw42 commented 1 month ago

I think you guys also can check it's this part of the code, simply change this line https://github.com/olliw42/mLRS/blob/main/mLRS/Common/thirdparty/mavlinkx.h#L400 to if (false && ....

jlpoltrack commented 1 month ago

Yea, 10 minutes here with R9MX and looks good.

olliw42 commented 1 month ago

so bounty ... who finds the buffer overwrite

have changed the len fields to uint16 already, to prevent an issue from this ... hasn't helped ... I f.ing just can't see it ...

olliw42 commented 1 month ago

I give up for today ... just can't find the issue

might be we have to disable compression, without it just works for ever

brad112358 commented 1 month ago

I don't see it yet either. Though, the size of the len arguments and variables could obviously have caused some problems with the largest possible messages.

olliw42 commented 1 month ago

pushed some changes to main, with the u16 changes in mavlinkx.h, and teh betafpv & bandit added

the issue still exists it goes away when I do the if (0 && fmavx_config_g.compression_enable ... tweak on the receiver side (not needed on teh tx side)

I did test on Bandit & R9MX, need to see if it exists/goes away for Bandit & Generic900

which raises questions

since compression is enabled only for 19 Hz, are there cases beyond those we tested so far there it does occur. I specifically wonder about 2.4. Ghz at 19 Hz
is this the same as the gcc12 issue, i.e., does it go away also with the same change?

jlpoltrack commented 1 month ago

since compression is enabled only for 19 Hz, are there cases beyond those we tested so far there it does occur. I specifically wonder about 2.4. Ghz at 19 Hz

Seemingly okay on 2.4 | 19 Hz. Testing with DIY E28 F1 Tx and SuperD Rx.

rotorman commented 1 month ago

@rotorman out of desperation, you happen to have any idea on such an issue?

Is the problem only with FrSky gear (which I do not have)? Otherwise, no, I have not experienced mLRS drop the connection for an unknown reason.

jlpoltrack commented 1 month ago

is this the same as the gcc12 issue, i.e., does it go away also with the same change?

GCC 12 with compression disabled on R9MX seems okay...

olliw42 commented 1 month ago

@rotorman out of desperation, you happen to have any idea on such an issue?

Is the problem only with FrSky gear (which I do not have)? Otherwise, no, I have not experienced mLRS drop the connection for an unknown reason.

@rotorman I was actually hoping that you might have some insight into the typical traps when one thinks that a compiler seemingly doesn't do the code one thinks it should do according to the code lines ... the code looks all alright to me and it's not obvious why it would write into a wrong memory location ... yet it does.

olliw42 commented 1 month ago

is this the same as the gcc12 issue, i.e., does it go away also with the same change?

GCC 12 with compression disabled on R9MX seems okay...

so it's possible it's one and the same issue

olliw42 commented 1 month ago

I'm still constantly working on this, and yesterday for a moment I was kind of convinced I finally found the culprit, but sadly I still actually haven't ...

it appears that the issue is _fmavX_payload_compress(). All protections I've added to check for out of bounds read/writes don't trigger ... whuich makes it more misterious.

At some point I had the idea to not compress, but to set the MAVLINKX_FLAGS_IS_COMPRESSED flag on the receiver side, so that the tx side attempts to decompress non-compressed data ... it's a sort of a stress test to the parser&decompressor ... and the result is: CATASTROPHIC! The tx undergoes reboots, and sometimes will just stop ...

ergo, the parser+compressor are NOT stress resistant. I guess that's the way I'm going now, I guess chances are that this issue and the other issues are related.

olliw42 commented 4 weeks ago

@jlpoltrack @brad112358 @tmcadam

so, again me ... and a more desparate me

this issue kind of upholds any further progress with the code (so far we could get around by not using the gcc12, but unfortunately the issue has now creept into the main code)

my setup is still a RM Bandit Micro as tx and R9MX as rx.

as you may have seen, I have pushed few more changes. There was a further array which was too small and thus could have been a point for an out-of-range memory access, corrected it but doesn't help. You also can see that I added catchers to kind of every single access to an array. NONE of them ever triggered or triggers for me.

I like to summarize what seems to be the "facts":

when the issue appears, the tx disconnects but the rx remnains connected
the rx still sends and the tx still receives
however, the sync word (and potentially other data, just checked for the sync word) which is received by the tx is incorrect and remains incorrect further on (and hence the tx refuses all further received frames)
the situation cannot be cured by repowering the tx
the situation can only be cured by repowering the rx

these "facts" make me believe that the issue is in the receiver side code. I'm not sure this conclusion is 100% true however, I can't see a loophole to the argument but me not seeing the loophole might be exactly the reason for not finding the issue.

these appear to be also true (but might be missleading as reagrds to where the issue is located):

disabling mavlinkX or weaker, outcommenting #define MAVLINKX_COMPRESSION in the code for the receiver, makes the issue go away
disabling on the rx side only the compression but leaving on the decompression (by e.g. the if(0 && trick) does not help, the issue remains
the recent changes to the size of some pos field (u16 instead of u8) seem to have helped in the sense that it subjectively feels like the issue happens now less soon ... but it eventually still happens (sometimes I have to wait for long, but over night or over work day always made it happen so far)
I have done tests with the stack-checker enabled on the rx, and there is no stack overflow (the stack remains less than half field)(stack size os 1k)
I did some more little tetest, e.g. changed the size of the working buffer fmavx_in_buf from 300 to 1024, or forced the mavlinkX functions to not be inlined, etc,. None helped.

I can't understand why none of the out-of-range catchers trigger, yet it appears that the compression code causes the problem.

It is easy to see why the tx disconnects, one just needs to add some debug code to store and print out the sync_word received in the isr, which shows that it suddenly changes to incorrect values. This could be an issue in the tx code, but 4. & 5. suggest it's indeed the receiver sending frames with incorrect data. However: to the day I have not managed to see the receiver doing this. That is, wherever I added debug code to print out what should be the sync_word in the transmitted frame, it prints the correct value ... even when the issue has happened.

Is anyone of you exoperienced with code checker like Valgrind or others (just googled this LOL)? You know of some gcc flags we could/should use? (tried -Warray-bounds=2 but just found an msp error, which shouldn't have an effect)

for your info, here is a dump of the deb output on the Tx side, which shows how the received syncword changes when the issue happens. It undergoes a chain of different values until it always ends up in 0x5220. (0x7C85 is the synword generated from the default bind phrase). The entries are (1) sync word as received in the isr, (2) rxFrame.sync_word, (3) txFrame.sync_word. As said, I did not yet manage to add dbg code to the rx side which would show me that the rx is indeed sending frames with these sync words.

>1!ri x7C85 x7C85 x7C851<
>1!ri x7C85 x7C85 x7C851<
>1!ri x7C85 x7C85 x7C851<
>1!ri x7C85 x7C85 x7C851<
>1!ri x7C85 x7C85 x7C851<
>1!ri x7C85 x7C85 x7C851<
>1!r
>1!ri x05E2 x7C85 x7C85
>.1!ri x2C29 x7C85 x7C85
>1!ri x2C29 x7C85 x7C85
>1!ri x776F x7C85 x7C85
>1!ri x0340 x7C85 x7C85
>1!ri x0340 x7C85 x7C85
>1!ri x374B x7C85 x7C85
>1!ri x4043 x7C85 x7C85
>1!ri x4043 x7C85 x7C85
>1!ri x9DBF x7C85 x7C85
>1!ri x0100 x7C85 x7C85
>1!ri x0100 x7C85 x7C85
>1!ri xDB04 x7C85 x7C85
>1!ri xBF17 x7C85 x7C85
>1!ri xBF17 x7C85 x7C85
>1!ri x5DBD x7C85 x7C85
>1!ri x0C40 x7C85 x7C85
>1!ri x0C40 x7C85 x7C85
>1!ri xF93B x7C85 x7C85
>1!ri xEBBA x7C85 x7C85
>.1!ri xEBBA x7C85 x7C85
>1!ri x1763 x7C85 x7C85
>1!ri x0101 x7C85 x7C85
>1!ri x0101 x7C85 x7C85
>1!ri x5220 x7C85 x7C85
>1!ri x5220 x7C85 x7C85
>1!ri x5220 x7C85 x7C85
>1!ri x5220 x7C85 x7C85
>1!ri x5220 x7C85 x7C85

A second example. Shows the sequence is not always teh same, but I find it to alwayse end up ins 0x5220.

>1!ri x7C85 x7C85 x7C851<
>1!ri x7C85 x7C85 x7C851<
>1!ri x7C85 x7C85 x7C851<
>1!ri x7C85 x7C85 x7C851<
>1!ri x7C85 x7C85 x7C851<
>1!ri x7C85 x7C85 x7C851<
>1!r
>.1!ri x7C85 x7C85 x7C851<
>1!ri xFEFF x7C85 x7C85
>1!ri x6FFE x7C85 x7C85
>1!ri x4077 x7C85 x7C85
>1!ri x4077 x7C85 x7C85
>1!ri xAB45 x7C85 x7C85
>1!ri x43A7 x7C85 x7C85
>1!ri x43A7 x7C85 x7C85
>1!ri x5B40 x7C85 x7C85
>1!ri x014E x7C85 x7C85
>1!ri x014E x7C85 x7C85
>1!ri xE405 x7C85 x7C85
>1!ri x59F8 x7C85 x7C85
>1!ri x59F8 x7C85 x7C85
>1!ri xF444 x7C85 x7C85
>1!ri x0C40 x7C85 x7C85
>1!ri x0C40 x7C85 x7C85
>1!ri xA83B x7C85 x7C85
>1!ri x0077 x7C85 x7C85
>1!ri x0077 x7C85 x7C85
>1!ri x0101 x7C85 x7C85
>1!ri x0423 x7C85 x7C85
>1!ri x0423 x7C85 x7C85
>1!ri x6C05 x7C85 x7C85
>1!ri xDFC0 x7C85 x7C85
>1!ri xDFC0 x7C85 x7C85
>1!ri x5220 x7C85 x7C85
>1!ri x5220 x7C85 x7C85
>1!ri x5220 x7C85 x7C85
>1!ri x5220 x7C85 x7C85
>1!ri x5220 x7C85 x7C85
>1!ri x5220 x7C85 x7C85

brad112358 commented 3 weeks ago

Wow. This is just plain nasty. I read your last comment just before I went to sleep last night and I'll be honest; it invaded my dreams. Unfortunately I didn't wake up with a solid hunch (which has happened to me on occasion) Your reasoning and conclusions all look sound to me.

Were you able to fix the decompress code so it doesn't crash when fed garbage (like you did by marking uncompressed messages as compressed)? It occurs to me that it and everything else in the receive path does need to be robust in the face of transmission errors as the various CRC checks can't eliminate absolutely all of these. Considering the evidence that you may have reduced the frequency of the errors makes me wonder if the reason you see these so much less frequently now is because the remaining cases are triggered by undetected transmission errors.

Unfortunately, I don't think Valgrind or any similar tool is applicable to tiny embedded platforms like ours. I have used/enhanced some much lighter weight custom buffer overrun and memory allocation/free debug libraries in the past, but they target mostly dynamically allocated memory and depended on having an MMU to catch access beyond or ahead of the allocated block, so are useless here. In short, I am not aware of any applicable tools.

olliw42 commented 3 weeks ago

I have this suspicion all time long, but I'm getting more and more convinced it's not actually the mavlinkx compression code, but something else which is just exposed by having it included

I've added this now (and some dbg code), to check more closely what it sends grafik and it runs now since > 6 h ... (with the other additions getting the issue could need 1h or so, but 6 h, that's long ...)

olliw42 commented 3 weeks ago

I would not mind if I would find the source of the problem in my dreams ... :D

brad112358 commented 3 weeks ago

Updating the sync word every time makes sense as a workaround, but obviously doesn't explain how it got corrupted in the first place.

You didn't respond about the decompression code. Doesn't this need to be fixed too? Granted, you probably don't have enough transmission errors in your test for bad data to make it past a 16 bit CRC in an hour since you were probably testing on the bench. But at the edge of our range, this could be a real problem. By my calculations, If an average of 1/4 of the radio messages have a couple of bit errors, it would only take about 87 minutes on average to encounter an undetected error which from what you said could be very bad.

olliw42 commented 3 weeks ago

this piece of code does NOT update the sync_word every time ... it just reads it into a variable which later is when used to print it out on dbg

ah, sorry. Yes, the compression catastrophy was a short term thing ... it seems to have went away with the u8 to u16 conversions, especially this line was a clear bug with u8, since len can be 255: https://github.com/olliw42/mLRS/blob/main/mLRS/Common/thirdparty/mavlinkx.h#L894. But I must admit I can't say I fully understood the error mode in detail of what I was describing.

I think that for as long as we kind of can be sure that the compression code does not (i) hangs up, (ii) writes in incorrect memory locations or (iii) yields lengths which make other code to write out of memory, we can accept any nonsense the compression code might do ... and I really had catcher for out-of-range like everywhere ... not a single every triggered

brad112358 commented 3 weeks ago

this piece of code does NOT update the sync_word every time ... it just reads it into a variable which later is when used to print it out on dbg.

Oops. Sorry. So, just another example of how this really is a heisenbug then.

olliw42 commented 3 weeks ago

so, after ca 8 h it now also failed ... in the Tx log I see exactly the same as reported, but in the rx log I see all sync_words which I print out to be perfectly ok ... I still can't manage to see it that the rx starts to send incorrect sync words ... despite the evidence a tx code bug?

I just don't get it

but we really need to find this, it holds up any further progress ...

brad112358 commented 3 weeks ago

How is this possible if the problem is on the receiver end which seems the case given restarting the Tx doesn't fix it. Somehow, the corruption happens in the SX1276 itself or during the SPI transfer?

brad112358 commented 3 weeks ago

Perhaps some sort of timing sensitive issue affecting the driver? The sequence of received incorrect sync words and that it gets stuck on 0x5220 really is strange though.

jlpoltrack commented 3 weeks ago

Perhaps some sort of timing sensitive issue affecting the driver? The sequence of received incorrect sync words and that it gets stuck on 0x5220 really is strange though.

Just a random thought when you mention timing - the R9MX clock config seems to use the (internal) MSI (and not the XO on the board) but at same time - SPI is a synchronous protocol so shouldn't matter (well in my mind) if there's variations in the SPI clock speed.

https://github.com/olliw42/mLRS/blob/main/mLRS/rx-R9MX-l433cb/Core/Src/main.cpp#L130-L141

brad112358 commented 3 weeks ago

Both debug output examples show reaching 0x5250 after 24 rejected messages which exactly matches our 1.25 second connection timeout which makes me wonder if the data we are incorrectly sending may actually be coming from our received messages.

brad112358 commented 3 weeks ago

Have we seen any indication of this issue or the problem with newer GCC versions on anything other than SX1276?

olliw42 commented 3 weeks ago

How is this possible if the problem is on the receiver end which seems the case given restarting the Tx doesn't fix it.

that's the logical contradiction I'm not getting along with all the the time ...

Both debug output examples show reaching 0x5250 after 24 rejected messages which exactly matches our 1.25 second connection timeout

oh, very well observed !! interesting, need to think about this

Have we seen any indication of this issue or the problem with newer GCC versions on anything other than SX1276?

given the similarities we had associated this to the same issue seen earlier with mavlinkX on gcc12. If I recall here it was also especially for sx128x and FLRC. @jlpoltrack correct?

jlpoltrack commented 3 weeks ago

I first noticed on SX126x / G4: https://github.com/olliw42/mLRS/issues/159#issuecomment-2241102917 I think this would've been 31 Hz or FSK.

Also later could reproduce on STM32F1 / SX128x / FLRC. Which indicated that it wasn't necessarily a compression thing...

olliw42 commented 3 weeks ago

just reread the oter thread, there is lots of talk of a tx crash ... so, are the sympthos actually kind of very different ... and the only "similarity" is a presumed relation to mavlinkX?

jlpoltrack commented 3 weeks ago

Yes, MAVlinkX was the common piece. Just highlighting why it was reported as a Tx crash, quote from other thread:

"The same issue appears - the connection looks good between Tx and Rx, however, as soon as the FC has initialized and starts emitting data the Tx crashes (no LEDs) and requires a power cycle. This doesn't seem to affect the Rx side - it just disconnects and shows a flashing red LED. Switching to Mavlink (instead of MavlinkX) shows the usual behavior."

brad112358 commented 3 weeks ago

Sounds like it was crashing pretty fast. I wonder if any of your changes so far might have also made gcc12 better

olliw42 commented 3 weeks ago

THX @jlpoltrack

so, indeed quite different error pattern ...

I too started to wonder if the gcc12 issue is now gone, there had been quite some bugs in the compression code removed, related to the u8 vs u16 thing

and I too now wonder if that is now "just" a sx1276 thing ... I guess we need to do some tests with the other sx hardware, using 19 Hz. Note though, running just few dozen minutes is not sufficient, one basically needs to run each test over night or during day work, like 12h or so ... it's just weired that it appears to work without compression ... I am very sure I did tests with compression outcommented with said long run times also, if sx1276 related, why when only on rx side not also on tx side the "facts" jsut don't piece together

olliw42 commented 3 weeks ago

so, started sx1262, mR900-30 receiver and tx kit, full erase then flashed with main as of now, changed power to min, set to 19 Hz and double checked mavlinkX let's see tomorrow morning ...

olliw42 commented 3 weeks ago

12 h later: still going strong, no indication whatsoever of an issue

EDIT: didn't do anything on this topic today, so had it let running, is now at 20 h and still going strong, I'll take this as proof of fact

I'm going to do this test also for the m24-30 stuff ... but I'm kind of sure to know aht the outcome will be

so, seems to be sx1276 specifc (a cocnluzsion draw already early on in the top!) ... or hardware specific ... or both, sx1276 and hardware

olliw42 commented 3 weeks ago

I went thorugh the thread and collected some info on hardware we got so far:

R9M and R9MX show issue 1.3.00 release looks good 1.3.01 self-compiled already bad

RM Bandit Micro and R9MX show issue "strange fact, seems to be an issue only with the sx1276 (or sx1276 based devices)" "it seems to exist for both the R9MX and Generic900"

rx Generic900 ->syncword becomes 0x85D5 R9MX -> syncword becomes 0x5220

R9M ACCST and R9MM latest pre-built pre-release firmware, all good "I switched to an R9MX receiver and quickly reproduced the problem. I'll try with a locally built version." "I reproduced the problem also on R9MX with a locally built version"

R9M + R9MM (Both STM32F1): Works R9M + R9M (Both STM32F1): Works R9M + R9MX (F1 + L4): Doesn't Work R9M + Generic (F1 + ESP8255): Doesn't Work Bandit + Generic (ESP32 + ESP8255): Doesn't Work

"Seemingly okay on 2.4 | 19 Hz. Testing with DIY E28 F1 Tx and SuperD Rx."

olliw42 commented 3 weeks ago

it was speculated that it could be spi communication with sx1276 the "nice" aspect of that is that it would explain why in the code I can't see the rx to change sync_word but some things don't fit here too

why only rx, not also tx
why not only sporadically, but systematically, one bad communication and all communication wrong hencforth

brad112358 commented 3 weeks ago

why not only sporadically, but systematically, one bad communication and all communication wrong hencforth Good point which I suppose means it could be the sx1276 that is in a bad state. Would be interesting to add a way to reinit only the sx1276 at will to see if that gets it going again.

olliw42 commented 3 weeks ago

I was actually thinking of adding a rescue mode ... but this would require me to be able to detect the failure state on the rx ... which I didn't manage yet

it would have to be strange bad state, since obviously it is still transmitting

olliw42 commented 3 weeks ago

I will also say that I think that we either find that issue in a not too long distance or we are going to face a tough decission

MSP gets along, and also DroneCAN might be close ... and I think a next beta version is overdue anyway ... if we don't want to postpone until infinity we might have to exclude all sx1276 targest from such an release ...

jlpoltrack commented 3 weeks ago

exclude all sx1276 targest from such an release

Has this been reproduced on MAVLink (non-X) + SX127x?

olliw42 commented 3 weeks ago

Has this been reproduced on MAVLink (non-X) + SX127x?

not by me, to me it seems outcommenting compression or disabling mavlinkX avoids the issue disabling mavlinkX is IMHO however not desirable, since our mavlink link robustness depends on it disabling compression would break backwards compatibility IMHO main problem is a bit that this might be just anecdotical ... we would have to do long time tests for kind of each combination to be on the safe side (which I kind of do now for the other targets, each day and night a new combo)

brad112358 commented 3 weeks ago

It has been slow, but, It seems to me that we (you) are still making progress in characterizing and isolating the problem.

I would suggest that we have not yet reached the point where we need to break compatibility or, worse, drop support for a fairly popular hardware platform. Isn't it easy enough to deprecate compression without breaking compatibility?

For the next release, I would suggest disabling compression only on the sending end, leaving the decompressing code in place and do this only for SX127x. If this avoids the problem in long running tests, we can, then recommend upgrading anything using SX127x and even if users upgrade only one end at a time, compatibility will be maintained and they will be no worse off than they are with the current release.

olliw42 commented 3 weeks ago

For the next release, I would suggest disabling compression only on the sending end, leaving the decompressing code in place

I had the same idea as this at least would leave backwards compatibility, but I found this to also show the issue. No viable option hence. IMHO.

olliw42 commented 3 weeks ago

so, did some long time tests for

mR24-30 tx kit & mR24-30 receiver: 14 h -> all good
mR24-30 tx kit & RM RPTD4 receiver: 119 h -> all good

I then wanted to change the tx module to the betafpv 1W micro ... and I just get CATASTROPHIC behavior

the RPTD4 receiver is quickly (couple dozen seconds) driven into a FAIL state (fast red blinking)
too often the connection will not start properly

it seems it's related to when being in the live-update main screen, comes later/rarer when in an other screen, so could suggest an OLED issue

FOLKS: WE HAVE SERIOUS PROBLEMS ALL AROUND

to me it seems that we rather will have to drop all ELRS targets, and the sx127x STM32 targets ... I frankly observe just too many issues. The Bandit and BetaFPV targets I think need to be removed ... it seems we have merged them too prematurely and they didn't had seen the sufficient testing. My feeling is that I hadn't seen the issues before becasue I hadn't really tried 19 Hz, as now.

olliw42 commented 3 weeks ago

as regards the issue itself, I start wondering if it's not an issue with the sx1276 driver which somehow over time crept in. Sometimes it feels to me like we are back before we had the AFC correction from Brad. Is the AFC stuff still working? just brain storming

jlpoltrack commented 3 weeks ago

Since it sounds like compression plays a role (in some way) - will this be removed in the next release (1.4) ?

olliw42 commented 3 weeks ago

Since it sounds like compression plays a role (in some way) - will this be removed in the next release (1.4) ?

I would not like to disable it for targets where it is just working fine.

we could make the breaking transition to disable it for all sx1276 targets ... but even this would not free us from very carefully testing any target (since attrebuting the issue to compressi is also just anecdotical, not based on a solid understanding)

it's just clear in my mind that we are going to face a severe decission in a very near future ... MSP is getting along, INAV 8.0 rc1 is near, and also DroneCAN is getting along, and AP4.6beta is near.

olliw42 / mLRS

Bug? | Weird Tx/Rx Disconnect #221