Open jlpoltrack opened 1 month ago
I think you guys also can check it's this part of the code, simply change this line https://github.com/olliw42/mLRS/blob/main/mLRS/Common/thirdparty/mavlinkx.h#L400 to if (false && ....
Yea, 10 minutes here with R9MX and looks good.
so bounty ... who finds the buffer overwrite
have changed the len fields to uint16 already, to prevent an issue from this ... hasn't helped ... I f.ing just can't see it ...
I give up for today ... just can't find the issue
might be we have to disable compression, without it just works for ever
I don't see it yet either. Though, the size of the len arguments and variables could obviously have caused some problems with the largest possible messages.
pushed some changes to main, with the u16 changes in mavlinkx.h, and the betafpv & bandit added
the issue still exists
it goes away when I do the if (0 && fmavx_config_g.compression_enable ...
tweak on the receiver side (not needed on the tx side)
I did test on Bandit & R9MX, need to see if it exists/goes away for Bandit & Generic900
which raises questions
- since compression is enabled only for 19 Hz, are there cases beyond those we tested so far where it does occur? I specifically wonder about 2.4 GHz at 19 Hz
Seemingly okay on 2.4 | 19 Hz. Testing with DIY E28 F1 Tx and SuperD Rx.
@rotorman out of desperation, you happen to have any idea on such an issue?
Is the problem only with FrSky gear (which I do not have)? Otherwise, no, I have not experienced mLRS drop the connection for an unknown reason.
- is this the same as the gcc12 issue, i.e., does it go away also with the same change?
GCC 12 with compression disabled on R9MX seems okay...
@rotorman I was actually hoping that you might have some insight into the typical traps when one thinks that a compiler seemingly doesn't do the code one thinks it should do according to the code lines ... the code looks all alright to me and it's not obvious why it would write into a wrong memory location ... yet it does.
- is this the same as the gcc12 issue, i.e., does it go away also with the same change?
GCC 12 with compression disabled on R9MX seems okay...
so it's possible it's one and the same issue
I'm still constantly working on this, and yesterday for a moment I was kind of convinced I finally found the culprit, but sadly I still actually haven't ...
it appears that the issue is in _fmavX_payload_compress(). All protections I've added to check for out-of-bounds reads/writes don't trigger ... which makes it more mysterious.
At some point I had the idea to not compress, but to set the MAVLINKX_FLAGS_IS_COMPRESSED flag on the receiver side, so that the tx side attempts to decompress non-compressed data ... it's a sort of a stress test to the parser&decompressor ... and the result is: CATASTROPHIC! The tx undergoes reboots, and sometimes will just stop ...
ergo, the parser+compressor are NOT stress resistant. I guess that's the way I'm going now, I guess chances are that this issue and the other issues are related.
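A minimal sketch of this kind of stress test, for anyone who wants to try it off-target. It does not use the mavlinkX decompressor: `toy_decode` is a made-up run-length decoder and `fuzz_toy_decode` is a hypothetical harness. The idea is only to show the pattern: feed garbage, keep guard bytes behind the output buffer, and check that they are never touched.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Toy run-length decoder (NOT the mavlinkX algorithm): the input is a sequence
// of (count, value) byte pairs. Returns the number of bytes written, or -1 if
// the input is malformed or would overflow the output buffer.
int toy_decode(const uint8_t* in, uint32_t in_len, uint8_t* out, uint16_t out_size)
{
    uint16_t out_len = 0;
    for (uint32_t i = 0; i < in_len; i += 2) {
        if (i + 1 >= in_len) return -1;               // dangling count byte
        uint8_t count = in[i];
        if (count > out_size - out_len) return -1;    // would overflow out[]
        std::memset(out + out_len, in[i + 1], count);
        out_len = static_cast<uint16_t>(out_len + count);
    }
    return out_len;
}

// Feeds pseudo-random garbage to the decoder and checks guard bytes placed
// directly behind the output buffer. Returns true if no guard byte changed.
bool fuzz_toy_decode(uint32_t rounds)
{
    const uint16_t OUT_SIZE = 64;
    const uint16_t GUARD = 16;
    uint8_t in[32];
    uint8_t buf[OUT_SIZE + GUARD];
    uint32_t state = 12345;                           // fixed seed, reproducible

    for (uint32_t r = 0; r < rounds; r++) {
        for (std::size_t k = 0; k < sizeof(in); k++) {
            state = state * 1664525u + 1013904223u;   // simple LCG
            in[k] = static_cast<uint8_t>(state >> 24);
        }
        std::memset(buf + OUT_SIZE, 0xA5, GUARD);     // arm the guard bytes
        int res = toy_decode(in, sizeof(in), buf, OUT_SIZE);
        if (res > OUT_SIZE) return false;             // reported length too large
        for (uint16_t g = 0; g < GUARD; g++) {
            if (buf[OUT_SIZE + g] != 0xA5) return false;
        }
    }
    return true;
}
```

Running the same harness against the real parser and decompressor on a PC would be the actual test; the toy decoder only stands in for it here.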
@jlpoltrack @brad112358 @tmcadam
so, again me ... and a more desperate me
this issue kind of holds up any further progress with the code (so far we could get around it by not using gcc12, but unfortunately the issue has now crept into the main code)
my setup is still a RM Bandit Micro as tx and R9MX as rx.
as you may have seen, I have pushed a few more changes. There was a further array which was too small and thus could have been a point for an out-of-range memory access; I corrected it but it doesn't help. You can also see that I added catchers to kind of every single access to an array. NONE of them ever triggered or triggers for me.
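For readers following along, a catcher of the kind described could look roughly like the sketch below. The names `checked_write`, `oob_count` and `catcher_demo` are hypothetical and not taken from the mLRS code; the sketch only shows the idea of refusing and counting an out-of-range access instead of performing it.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical catcher state: number of out-of-range accesses seen so far.
static uint32_t oob_count = 0;

// Writes value to buf[idx] only if idx is in range; otherwise records the
// fault and leaves memory untouched. N is deduced from the array type.
template <std::size_t N>
bool checked_write(uint8_t (&buf)[N], uint16_t idx, uint8_t value)
{
    if (idx >= N) {
        oob_count++;
        return false;
    }
    buf[idx] = value;
    return true;
}

// Small demo: tries to write 10 bytes into an 8 byte array and returns how
// many writes the catcher rejected (indices 8 and 9).
uint32_t catcher_demo()
{
    uint8_t payload[8] = {0};
    oob_count = 0;
    for (uint16_t i = 0; i < 10; i++) {
        checked_write(payload, i, 0xFF);
    }
    return oob_count;
}
```

A catcher like this only sees accesses that are routed through it. A write through a raw pointer elsewhere, or from an ISR or peripheral, would go unnoticed, which is one way such catchers can stay silent while memory still gets corrupted.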
I like to summarize what seems to be the "facts":
these "facts" make me believe that the issue is in the receiver side code. I'm not sure this conclusion is 100% true however, I can't see a loophole to the argument but me not seeing the loophole might be exactly the reason for not finding the issue.
these appear to be also true (but might be misleading as regards where the issue is located):
- [...] #define MAVLINKX_COMPRESSION in the code for the receiver, makes the issue go away
- [...] (the if(0 && trick) does not help, the issue remains

I can't understand why none of the out-of-range catchers trigger, yet it appears that the compression code causes the problem.
It is easy to see why the tx disconnects, one just needs to add some debug code to store and print out the sync_word received in the isr, which shows that it suddenly changes to incorrect values. This could be an issue in the tx code, but 4. & 5. suggest it's indeed the receiver sending frames with incorrect data. However: to this day I have not managed to see the receiver doing this. That is, wherever I added debug code to print out what should be the sync_word in the transmitted frame, it prints the correct value ... even when the issue has happened.
Is anyone of you experienced with code checkers like Valgrind or others (just googled this LOL)? Do you know of some gcc flags we could/should use? (tried -Warray-bounds=2 but just found an msp error, which shouldn't have an effect)
for your info, here is a dump of the debug output on the Tx side, which shows how the received sync word changes when the issue happens. It undergoes a chain of different values until it always ends up at 0x5220. (0x7C85 is the sync word generated from the default bind phrase.) The entries are (1) sync word as received in the isr, (2) rxFrame.sync_word, (3) txFrame.sync_word. As said, I did not yet manage to add dbg code to the rx side which would show me that the rx is indeed sending frames with these sync words.
>1!ri x7C85 x7C85 x7C851<
>1!ri x7C85 x7C85 x7C851<
>1!ri x7C85 x7C85 x7C851<
>1!ri x7C85 x7C85 x7C851<
>1!ri x7C85 x7C85 x7C851<
>1!ri x7C85 x7C85 x7C851<
>1!r
>1!ri x05E2 x7C85 x7C85
>.1!ri x2C29 x7C85 x7C85
>1!ri x2C29 x7C85 x7C85
>1!ri x776F x7C85 x7C85
>1!ri x0340 x7C85 x7C85
>1!ri x0340 x7C85 x7C85
>1!ri x374B x7C85 x7C85
>1!ri x4043 x7C85 x7C85
>1!ri x4043 x7C85 x7C85
>1!ri x9DBF x7C85 x7C85
>1!ri x0100 x7C85 x7C85
>1!ri x0100 x7C85 x7C85
>1!ri xDB04 x7C85 x7C85
>1!ri xBF17 x7C85 x7C85
>1!ri xBF17 x7C85 x7C85
>1!ri x5DBD x7C85 x7C85
>1!ri x0C40 x7C85 x7C85
>1!ri x0C40 x7C85 x7C85
>1!ri xF93B x7C85 x7C85
>1!ri xEBBA x7C85 x7C85
>.1!ri xEBBA x7C85 x7C85
>1!ri x1763 x7C85 x7C85
>1!ri x0101 x7C85 x7C85
>1!ri x0101 x7C85 x7C85
>1!ri x5220 x7C85 x7C85
>1!ri x5220 x7C85 x7C85
>1!ri x5220 x7C85 x7C85
>1!ri x5220 x7C85 x7C85
>1!ri x5220 x7C85 x7C85
A second example. It shows the sequence is not always the same, but I find it to always end up at 0x5220.
>1!ri x7C85 x7C85 x7C851<
>1!ri x7C85 x7C85 x7C851<
>1!ri x7C85 x7C85 x7C851<
>1!ri x7C85 x7C85 x7C851<
>1!ri x7C85 x7C85 x7C851<
>1!ri x7C85 x7C85 x7C851<
>1!r
>.1!ri x7C85 x7C85 x7C851<
>1!ri xFEFF x7C85 x7C85
>1!ri x6FFE x7C85 x7C85
>1!ri x4077 x7C85 x7C85
>1!ri x4077 x7C85 x7C85
>1!ri xAB45 x7C85 x7C85
>1!ri x43A7 x7C85 x7C85
>1!ri x43A7 x7C85 x7C85
>1!ri x5B40 x7C85 x7C85
>1!ri x014E x7C85 x7C85
>1!ri x014E x7C85 x7C85
>1!ri xE405 x7C85 x7C85
>1!ri x59F8 x7C85 x7C85
>1!ri x59F8 x7C85 x7C85
>1!ri xF444 x7C85 x7C85
>1!ri x0C40 x7C85 x7C85
>1!ri x0C40 x7C85 x7C85
>1!ri xA83B x7C85 x7C85
>1!ri x0077 x7C85 x7C85
>1!ri x0077 x7C85 x7C85
>1!ri x0101 x7C85 x7C85
>1!ri x0423 x7C85 x7C85
>1!ri x0423 x7C85 x7C85
>1!ri x6C05 x7C85 x7C85
>1!ri xDFC0 x7C85 x7C85
>1!ri xDFC0 x7C85 x7C85
>1!ri x5220 x7C85 x7C85
>1!ri x5220 x7C85 x7C85
>1!ri x5220 x7C85 x7C85
>1!ri x5220 x7C85 x7C85
>1!ri x5220 x7C85 x7C85
>1!ri x5220 x7C85 x7C85
Wow. This is just plain nasty. I read your last comment just before I went to sleep last night and I'll be honest; it invaded my dreams. Unfortunately I didn't wake up with a solid hunch (which has happened to me on occasion). Your reasoning and conclusions all look sound to me.
Were you able to fix the decompress code so it doesn't crash when fed garbage (like you did by marking uncompressed messages as compressed)? It occurs to me that it and everything else in the receive path does need to be robust in the face of transmission errors as the various CRC checks can't eliminate absolutely all of these. Considering the evidence that you may have reduced the frequency of the errors makes me wonder if the reason you see these so much less frequently now is because the remaining cases are triggered by undetected transmission errors.
Unfortunately, I don't think Valgrind or any similar tool is applicable to tiny embedded platforms like ours. I have used/enhanced some much lighter weight custom buffer overrun and memory allocation/free debug libraries in the past, but they target mostly dynamically allocated memory and depended on having an MMU to catch access beyond or ahead of the allocated block, so are useless here. In short, I am not aware of any applicable tools.
I've had this suspicion all along, but I'm getting more and more convinced it's not actually the mavlinkx compression code, but something else which is just exposed by having it included
I've added this now (and some dbg code), to check more closely what it sends and it runs now since > 6 h ... (with the other additions getting the issue could need 1h or so, but 6 h, that's long ...)
I would not mind if I would find the source of the problem in my dreams ... :D
Updating the sync word every time makes sense as a workaround, but obviously doesn't explain how it got corrupted in the first place.
You didn't respond about the decompression code. Doesn't this need to be fixed too? Granted, you probably don't have enough transmission errors in your test for bad data to make it past a 16 bit CRC in an hour since you were probably testing on the bench. But at the edge of our range, this could be a real problem. By my calculations, if an average of 1/4 of the radio messages have a couple of bit errors, it would only take about 87 minutes on average to encounter an undetected error, which from what you said could be very bad.
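The estimate can be reproduced with a simple model: a corrupted frame passes a 16 bit CRC with probability 2^-16, so on average 65536 corrupted frames are needed before one slips through. The frame rate behind the 87 minute figure is not stated in the comment; 50 frames per second is an assumption made here because it reproduces that number. The function name is illustrative.

```cpp
#include <cmath>

// Expected minutes until one corrupted frame slips past a CRC of crc_bits
// bits, assuming each corrupted frame passes with probability 2^-crc_bits.
double minutes_to_undetected_error(double frames_per_s,
                                   double corrupted_fraction,
                                   int crc_bits)
{
    double corrupted_per_s = frames_per_s * corrupted_fraction;
    double pass_probability = std::ldexp(1.0, -crc_bits);   // 2^-crc_bits
    double seconds = 1.0 / (corrupted_per_s * pass_probability);
    return seconds / 60.0;
}
```

With 50 frames per second and 1/4 of them corrupted this gives about 87.4 minutes. With 19 frames per second, the mode discussed in this thread, the same model gives about 230 minutes, so the exact figure depends on which rate is assumed.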
this piece of code does NOT update the sync_word every time ... it just reads it into a variable which is later used to print it out on dbg
ah, sorry. Yes, the compression catastrophe was a short-term thing ... it seems to have gone away with the u8 to u16 conversions, especially this line was a clear bug with u8, since len can be 255: https://github.com/olliw42/mLRS/blob/main/mLRS/Common/thirdparty/mavlinkx.h#L894. But I must admit I can't say I fully understood the error mode in detail of what I was describing.
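To illustrate why a u8 length is dangerous once len can reach 255, here is a small sketch. It is not the code at the linked line; the helper names and the 2 byte overhead are made up for the example. It shows the two classic failure modes: a sum that wraps around, and a loop condition that can never become false.

```cpp
#include <cstdint>

// Hypothetical helpers: payload length plus 2 bytes of overhead.
// In u8 the result wraps modulo 256, in u16 it does not.
uint8_t framed_len_u8(uint8_t len)
{
    return static_cast<uint8_t>(len + 2);
}

uint16_t framed_len_u16(uint8_t len)
{
    return static_cast<uint16_t>(len + 2);
}

// Counts the iterations of a loop over indices 0..len inclusive with a u8
// index. For len == 255 the condition i <= len is always true, so the cap is
// the only thing that stops the loop.
uint32_t count_loop_u8(uint8_t len, uint32_t cap)
{
    uint32_t n = 0;
    for (uint8_t i = 0; i <= len && n < cap; i++) {
        n++;
    }
    return n;
}
```

For len = 255 the u8 sum yields 1 instead of 257, and the loop runs until the cap is hit instead of stopping after 256 iterations. Either effect would let later code work with a wrong length.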
I think that for as long as we can kind of be sure that the compression code does not (i) hang up, (ii) write to incorrect memory locations or (iii) yield lengths which make other code write out of memory, we can accept any nonsense the compression code might do ... and I really had catchers for out-of-range like everywhere ... not a single one ever triggered
this piece of code does NOT update the sync_word every time ... it just reads it into a variable which is later used to print it out on dbg.
Oops. Sorry. So, just another example of how this really is a heisenbug then.
so, after ca 8 h it now also failed ... in the Tx log I see exactly the same as reported, but in the rx log I see all sync_words which I print out to be perfectly ok ... I still can't manage to see it that the rx starts to send incorrect sync words ... despite the evidence a tx code bug?
I just don't get it
but we really need to find this, it holds up any further progress ...
How is this possible if the problem is on the receiver end, which seems the case given restarting the Tx doesn't fix it? Somehow, the corruption happens in the SX1276 itself or during the SPI transfer?
Perhaps some sort of timing sensitive issue affecting the driver? The sequence of received incorrect sync words and that it gets stuck on 0x5220 really is strange though.
Just a random thought when you mention timing - the R9MX clock config seems to use the (internal) MSI (and not the XO on the board) but at same time - SPI is a synchronous protocol so shouldn't matter (well in my mind) if there's variations in the SPI clock speed.
https://github.com/olliw42/mLRS/blob/main/mLRS/rx-R9MX-l433cb/Core/Src/main.cpp#L130-L141
Both debug output examples show reaching 0x5220 after 24 rejected messages, which exactly matches our 1.25 second connection timeout. This makes me wonder if the data we are incorrectly sending may actually be coming from our received messages.
Have we seen any indication of this issue or the problem with newer GCC versions on anything other than SX1276?
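The timing observation can be checked with a one-line calculation. The sketch assumes the 19 Hz mode used in these tests and takes the 1.25 second timeout from the comment above; the function name is illustrative.

```cpp
#include <cmath>

// Number of whole frame slots needed to cover timeout_s at frame_rate_hz.
int frames_until_timeout(double frame_rate_hz, double timeout_s)
{
    return static_cast<int>(std::ceil(frame_rate_hz * timeout_s));
}
```

At 19 Hz, 1.25 s corresponds to 23.75 frame periods, so the timeout expires during the 24th consecutive rejected frame.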
How is this possible if the problem is on the receiver end, which seems the case given restarting the Tx doesn't fix it?
that's the logical contradiction I'm not getting along with all the time ...
Both debug output examples show reaching 0x5220 after 24 rejected messages, which exactly matches our 1.25 second connection timeout
oh, very well observed !! interesting, need to think about this
Have we seen any indication of this issue or the problem with newer GCC versions on anything other than SX1276?
given the similarities we had associated this to the same issue seen earlier with mavlinkX on gcc12. If I recall here it was also especially for sx128x and FLRC. @jlpoltrack correct?
I first noticed on SX126x / G4: https://github.com/olliw42/mLRS/issues/159#issuecomment-2241102917 I think this would've been 31 Hz or FSK.
Also later could reproduce on STM32F1 / SX128x / FLRC. Which indicated that it wasn't necessarily a compression thing...
just reread the other thread, there is lots of talk of a tx crash ... so, are the symptoms actually kind of very different ... and the only "similarity" is a presumed relation to mavlinkX?
Yes, MAVlinkX was the common piece. Just highlighting why it was reported as a Tx crash, quote from other thread:
"The same issue appears - the connection looks good between Tx and Rx, however, as soon as the FC has initialized and starts emitting data the Tx crashes (no LEDs) and requires a power cycle. This doesn't seem to affect the Rx side - it just disconnects and shows a flashing red LED. Switching to Mavlink (instead of MavlinkX) shows the usual behavior."
Sounds like it was crashing pretty fast. I wonder if any of your changes so far might have also made gcc12 better
THX @jlpoltrack
so, indeed quite different error pattern ...
I too started to wonder if the gcc12 issue is now gone, there had been quite some bugs in the compression code removed, related to the u8 vs u16 thing
and I too now wonder if that is now "just" a sx1276 thing ... I guess we need to do some tests with the other sx hardware, using 19 Hz. Note though, running just a few dozen minutes is not sufficient, one basically needs to run each test overnight or during a work day, like 12 h or so ... it's just weird that it appears to work without compression ... I am very sure I also did tests with compression outcommented with said long run times. Also, if it's sx1276 related, why only on the rx side and not also on the tx side? The "facts" just don't piece together
so, started sx1262, mR900-30 receiver and tx kit, full erase then flashed with main as of now, changed power to min, set to 19 Hz and double checked mavlinkX. Let's see tomorrow morning ...
12 h later: still going strong, no indication whatsoever of an issue
EDIT: didn't do anything on this topic today, so left it running, it is now at 20 h and still going strong, I'll take this as proof of fact
I'm going to do this test also for the m24-30 stuff ... but I'm kind of sure I know what the outcome will be
so, seems to be sx1276 specific (a conclusion drawn already early on at the top!) ... or hardware specific ... or both, sx1276 and hardware
I went through the thread and collected some info on the hardware we got so far:
- R9M and R9MX show issue: 1.3.00 release looks good, 1.3.01 self-compiled already bad
- RM Bandit Micro and R9MX show issue: "strange fact, seems to be an issue only with the sx1276 (or sx1276 based devices)" "it seems to exist for both the R9MX and Generic900"
- rx Generic900 -> syncword becomes 0x85D5; rx R9MX -> syncword becomes 0x5220
- R9M ACCST and R9MM, latest pre-built pre-release firmware: all good. "I switched to an R9MX receiver and quickly reproduced the problem. I'll try with a locally built version." "I reproduced the problem also on R9MX with a locally built version"
- R9M + R9MM (both STM32F1): works
- R9M + R9M (both STM32F1): works
- R9M + R9MX (F1 + L4): doesn't work
- R9M + Generic (F1 + ESP8285): doesn't work
- Bandit + Generic (ESP32 + ESP8285): doesn't work
- "Seemingly okay on 2.4 | 19 Hz. Testing with DIY E28 F1 Tx and SuperD Rx."
it was speculated that it could be the spi communication with the sx1276
the "nice" aspect of that is that it would explain why in the code I can't see the rx change the sync_word
but some things don't fit here too
why not only sporadically, but systematically, one bad communication and all communication wrong henceforth
Good point, which I suppose means it could be the sx1276 that is in a bad state. Would be interesting to add a way to reinit only the sx1276 at will to see if that gets it going again.
I was actually thinking of adding a rescue mode ... but this would require me to be able to detect the failure state on the rx ... which I didn't manage yet
it would have to be strange bad state, since obviously it is still transmitting
I will also say that I think that we either find that issue in the not too distant future or we are going to face a tough decision
MSP gets along, and also DroneCAN might be close ... and I think a next beta version is overdue anyway ... if we don't want to postpone until infinity we might have to exclude all sx1276 targets from such a release ...
exclude all sx1276 targets from such a release
Has this been reproduced on MAVLink (non-X) + SX127x?
not by me, to me it seems outcommenting compression or disabling mavlinkX avoids the issue
disabling mavlinkX is IMHO however not desirable, since our mavlink link robustness depends on it
disabling compression would break backwards compatibility
IMHO the main problem is that this might be just anecdotal ... we would have to do long-time tests for kind of each combination to be on the safe side (which I kind of do now for the other targets, each day and night a new combo)
It has been slow, but it seems to me that we (you) are still making progress in characterizing and isolating the problem.
I would suggest that we have not yet reached the point where we need to break compatibility or, worse, drop support for a fairly popular hardware platform. Isn't it easy enough to deprecate compression without breaking compatibility?
For the next release, I would suggest disabling compression only on the sending end, leaving the decompressing code in place, and doing this only for SX127x. If this avoids the problem in long running tests, we can then recommend upgrading anything using SX127x, and even if users upgrade only one end at a time, compatibility will be maintained and they will be no worse off than they are with the current release.
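As a rough sketch of what this proposal means in code: the sender decides per target whether to compress, while the receiver keeps decoding whatever the flag in the frame says. All names below (`LinkConfig`, `FLAG_IS_COMPRESSED`, the two functions) are hypothetical and only illustrate the asymmetry; they are not the mLRS configuration or API.

```cpp
#include <cstdint>

// Hypothetical flag bit marking a compressed payload.
constexpr uint8_t FLAG_IS_COMPRESSED = 0x01;

// Hypothetical per-target configuration.
struct LinkConfig {
    bool is_sx127x;            // radio family of this target
    bool compression_wanted;   // what the link mode would normally use
};

// Sending end: compress only if wanted AND the target is not SX127x.
bool should_compress_on_send(const LinkConfig& cfg)
{
    return cfg.compression_wanted && !cfg.is_sx127x;
}

// Receiving end: always honor the flag, so peers running older firmware
// that still compress remain compatible.
bool should_decompress_on_receive(uint8_t flags)
{
    return (flags & FLAG_IS_COMPRESSED) != 0;
}
```

The point of the split is that upgrading one end at a time never produces a frame the other end cannot decode.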
For the next release, I would suggest disabling compression only on the sending end, leaving the decompressing code in place
I had the same idea as this at least would leave backwards compatibility, but I found this to also show the issue. No viable option hence. IMHO.
so, did some long time tests for
I then wanted to change the tx module to the betafpv 1W micro ... and I just get CATASTROPHIC behavior
it seems it's related to when being in the live-update main screen, comes later/rarer when in an other screen, so could suggest an OLED issue
FOLKS: WE HAVE SERIOUS PROBLEMS ALL AROUND
to me it seems that we rather will have to drop all ELRS targets, and the sx127x STM32 targets ... I frankly observe just too many issues. The Bandit and BetaFPV targets I think need to be removed ... it seems we have merged them too prematurely and they hadn't seen sufficient testing. My feeling is that I hadn't seen the issues before because I hadn't really tried 19 Hz, as now.
as regards the issue itself, I start wondering if it's not an issue with the sx1276 driver which somehow over time crept in. Sometimes it feels to me like we are back before we had the AFC correction from Brad. Is the AFC stuff still working? just brain storming
Since it sounds like compression plays a role (in some way) - will this be removed in the next release (1.4) ?
I would not like to disable it for targets where it is just working fine.
we could make the breaking transition to disable it for all sx1276 targets ... but even this would not free us from very carefully testing any target (since attributing the issue to compression is also just anecdotal, not based on a solid understanding)
it's just clear in my mind that we are going to face a severe decision in the very near future ... MSP is getting along, INAV 8.0 rc1 is near, and also DroneCAN is getting along, and AP4.6beta is near.
Latest firmware on R9M (Tx) and R9MX (Rx), within ~5 minutes of having serial connection get:
Switching to Mavlink instead of MavlinkX seems to make it go away. Issue seems similar to previously reported issue when using GCC 12. No problem with connection when there is not an active serial connection.