Bug? | Weird Tx/Rx Disconnect

jlpoltrack commented 1 month ago

Latest firmware on R9M (Tx) and R9MX (Rx); within ~5 minutes of having a serial connection I get:

Tx reports Rx is disconnected (red LED), Rx appears connected (green LED)
Rx still outputs RC data over CRSF
Lua shows only Tx, no Telemetry elements

Switching to Mavlink instead of MavlinkX seems to make it go away. Issue seems similar to previously reported issue when using GCC 12. No problem with connection when there is not an active serial connection.

olliw42 commented 1 month ago

do we know a last good version/commit?

jlpoltrack commented 1 month ago

do we know a last good version/commit?

1.3.00 release looks good.

olliw42 commented 1 month ago

oops ... that's a long way back ...

jlpoltrack commented 1 month ago

Got this disconnect behavior on 1.3.01:

HEAD is now at 15c198f6 v bump 1.3.01 (=dev)

Full hash - 15c198f61602ca500fbb7527fb90dac924c1130e

olliw42 commented 1 month ago

@jlpoltrack don't fully understand

you're saying, "v1.3 (release)" or f12d6803 is ok, but already "v bump 1.3.01 (=dev)" or 15c198f6 is not ??

if so, did you compile in both cases, or did you take the binaries from the folder for "v1.3 (release)"? if you did compile, did you also roll back the submodules (by e.g. running run_setup.py)?

olliw42 commented 1 month ago

@jlpoltrack @brad112358 I seem to get the very same behavior. What I did is:

checked out 15c198f
since I find it easier to do the testing using the RadioMaster Bandit instead of the R9M (has display, and I'm used to its failure behavior now), I added the few lines which are needed to get the RM Bandit to compile and work (platformio.ini, hal, and disp stuff)

The receiver is btw a R9MX.

when I have mavlinkX enabled it "crashes" either at startup or soon after power up. With mavlinkX disabled it all seems fine. (even when mavlinkX is disabled LQ is often below 100%, like 90%-ish low, not sure why that is, feels like before Brad's AFC finding, but that's another story)

Questions I have:

I really can't understand that different behavior now. I'm absolutely sure that I have not updated my STM32Ide, and it says version 1.13.2, and the gcc is 11.3.rel1. ANY idea as to why this can be ??
strange fact, seems to be an issue only with the sx1276 (or sx1276 based devices)
when the issue appears, it's not a crash in the sense that one device would stop working, but it looks as if the downlink connection is gone. Now, this can be because (A) the receiver stops sending or because (B) the tx module stops receiving. I have difficulties working out which case it is. Do you have any insight whether it is (A) or (B)?

olliw42 commented 1 month ago

next finding: I replaced the R9MX by an ELRS GENERIC 900 (not sure what brand it is) ...

it seems that the issue exists when mavlinkX is enabled and there is a serial stream on the receiver, if the serial is disconnected from the FC it appears to be all good
it seems to exist for both the R9MX and Generic900

-> so, if it should be a receiver side thing (case (A)) then it seems to not be a STM32 vs ESP thing, which implies it is not a compiler thing -> on the other hand it's reported for both RM Bandit and R9M, which implies if it is a case (B) thing it also is not a compiler thing -> ... which makes it strange it was working so far for gcc11 and the ESP compiler ...

jlpoltrack commented 1 month ago

Thanks for confirming you see the same. Perhaps one extra piece I can add - when the 'crash' happens, power cycling the Rx seems to resolve it, which I don't quite understand.

when the issue appears, it's not a crash in the sense that one device would stop working, but it looks as if the downlink connection is gone. Now, this can be because (A) the receiver stops sending or because (B) the tx module stops receiving. I have difficulties working out which case it is. Do you have any insight whether it is (A) or (B)?

Have power meter hooked up to Rx, when 'crash' happens Rx still continues to transmit something.

olliw42 commented 1 month ago

when the 'crash' happens, power cycling the Rx seems to resolve it, which I don't quite understand.

same here. My interpretation is that some fields get reset and it can start again - might be totally wrong track though

Have power meter hooked up to Rx, when 'crash' happens Rx still continues to transmit something.

when I use the ELRS generic 900, I do see the same.

so, at least we seem to see the same symptoms ... that's good ... but I just can't make any sense of it yet.

so far I was speculating, as we had the issue "just" with the gcc12, that there is some memory corruption somewhere and that some variable gets overwritten ... the fact that gcc12 did/does arrange the variables differently in memory I considered supporting this view ... but now we see this for two different compilers, and "only" for sx1276 ...

olliw42 commented 1 month ago

Have power meter hooked up to Rx, when 'crash' happens Rx still continues to transmit something.

so, the rx seems to be happy to transmit ... question is thus if (C) the receiver is sending invalid frames so that the tx receives but rejects them (e.g. wrong bind phrase, wrong frequency, wrong header or crc, etc. pp) or is it indeed (B) that the tx stops receiving them.
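To tell case (B) apart from case (C), it can help to count, on the tx side, why each received frame was rejected. The following is only a minimal sketch under stated assumptions: the frame layout, the toy checksum, and all names (`FrameCheck`, `check_frame`, `RxStats`, `kExpectedSyncWord`) are hypothetical and are not taken from the mLRS sources.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical reasons why a received frame might be rejected.
enum class FrameCheck : uint8_t { Ok = 0, TooShort, BadSyncWord, BadCrc };

// Assumed expected sync word (the value quoted in this thread).
constexpr uint16_t kExpectedSyncWord = 0x7C85;

// Placeholder checksum (16-bit byte sum); it only stands in for the real CRC.
uint16_t toy_checksum(const uint8_t* data, size_t len) {
    uint16_t sum = 0;
    for (size_t i = 0; i < len; i++) sum = static_cast<uint16_t>(sum + data[i]);
    return sum;
}

// Assumed layout: [sync lo][sync hi][payload ...][checksum lo][checksum hi]
FrameCheck check_frame(const uint8_t* buf, size_t len) {
    assert(buf != nullptr);
    if (len < 4) return FrameCheck::TooShort;
    const uint16_t sync = static_cast<uint16_t>(buf[0] | (buf[1] << 8));
    if (sync != kExpectedSyncWord) return FrameCheck::BadSyncWord;
    const uint16_t crc = static_cast<uint16_t>(buf[len - 2] | (buf[len - 1] << 8));
    if (crc != toy_checksum(buf, len - 2)) return FrameCheck::BadCrc;
    return FrameCheck::Ok;
}

// One counter per outcome. If no counter moves at all, nothing is being
// received (case B); if a rejection counter climbs, frames arrive but are
// rejected (case C).
struct RxStats {
    uint32_t count[4] = {};
    void record(FrameCheck r) { count[static_cast<uint8_t>(r)]++; }
};
```

Calling `stats.record(check_frame(buf, len))` for every received frame and printing the counters once per second would show which check fails, without printing on every frame.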

since it's not easy to have debug output on the bandit I find that difficult to investigate ... I guess I need to get using the R9M LOL

jlpoltrack commented 1 month ago

since it's not easy to have debug output on the bandit I find that difficult to investigate ... I guess I need to get using the R9M LOL

If you're keen - you can use JRPin5 on Bandit as a UART Output - it is pin 13. https://github.com/ExpressLRS/targets/blob/master/TX/Radiomaster%20Bandit%20Micro.json#L2-L3

olliw42 commented 1 month ago

ah, interesting idea. What SERIAL would I have to use? SERIAL and SERIAL1 are already used up .... can I use SERIAL2?

jlpoltrack commented 1 month ago

can I use SERIAL2?

Yea, e.g. https://github.com/G6EJD/ESP32-Using-Hardware-Serial-Ports/blob/master/ESP32_Using_Serial2.ino#L17

olliw42 commented 1 month ago

but how would I set it up? our lib needs both rx and tx to be specified, but they are on the same pin ... is this no problem?

olliw42 commented 1 month ago

ah, I set RX to -1 ??

jlpoltrack commented 1 month ago

ah, I set RX to -1 ??

That should work, same as done for RC Out on RP4-TD https://github.com/olliw42/mLRS/blob/main/mLRS/Common/hal/esp/rx-hal-radiomaster-rp4td-2400-esp32.h#L27

olliw42 commented 1 month ago

it seems that in the failure mode the rx sends and the tx receives ... not sure yet what makes the tx not accept the packets

tmcadam commented 1 month ago

Just to confuse matters further.

I noticed that by chance I have my R9M and BAYCK NANO PRO 900 at version 1.3.01, so gave them a bench test. I don't seem to see these issues with mavlinkx or mavlink set. I have a telemetry stream running also. Stable connection for well over 10 mins.

olliw42 commented 1 month ago

Just to confuse matters further.

I noticed that by chance I have my R9M and BAYCK NANO PRO 900 at version 1.3.01, so gave them a bench test. I don't seem to see these issues with mavlinkx or mavlink set. I have a telemetry stream running also. Stable connection for well over 10 mins.

I think you would observe the "new" behavior if you would recompile the firmware and reflash ... while I don't understand at all why this would be so, it would fit the pattern ... it just doesn't make sense that v1.3.00 firmware is all good but just one commit later it's all bad ... and I firmly believe it wasn't ... something mysterious must have happened to our build systems

olliw42 commented 1 month ago

one more data point ... the tx receives an incorrect sync word when the issue occurs: it should be 0x7C85 but suddenly becomes 0x85D5 ... does it miss a byte ??

jlpoltrack commented 1 month ago

This sync word from the mLRS OTA? (Not the LoRa sync word?) https://github.com/olliw42/mLRS/blob/c4f7ce779dbe0768a3ab57914e1455c9c1c15f55/mLRS/Common/frame_types.h#L105

olliw42 commented 1 month ago

yes, that's the sync word sent over air, and checked here: https://github.com/olliw42/mLRS/blob/main/mLRS/CommonTx/mlrs-tx.cpp#L317-L319

when using the R9MX it changes from 0x7C85 (correct) to 0x5220 ...
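The "does it miss a byte" question can be checked with a bit of byte arithmetic. This is a sketch under the assumption that the sync word is sent as the first two bytes of the frame in little-endian order (0x85, then 0x7C); that layout is an assumption here, not something confirmed in this thread, and the helper names are made up. Under that assumption, 0x85D5 is what one reads if an extra byte 0xD5 sits in front of the expected 0x85, i.e. the stream is offset by one byte, while 0x5220 shares no byte with 0x7C85 and is not explained by a plain one-byte offset.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Read a 16-bit little-endian value starting at buf[offset].
uint16_t read_u16_le(const uint8_t* buf, size_t offset) {
    assert(buf != nullptr);
    return static_cast<uint16_t>(buf[offset] | (buf[offset + 1] << 8));
}

// True if 'observed' contains a byte of 'expected' in the position a
// one-byte offset of a little-endian stream would put it:
//   extra byte in front:   [x][exp_lo][exp_hi] -> observed = x | (exp_lo << 8)
//   byte missing in front: [exp_hi][y]         -> observed = exp_hi | (y << 8)
bool consistent_with_one_byte_shift(uint16_t expected, uint16_t observed) {
    const uint8_t exp_lo = static_cast<uint8_t>(expected & 0xFF);
    const uint8_t exp_hi = static_cast<uint8_t>(expected >> 8);
    const uint8_t obs_lo = static_cast<uint8_t>(observed & 0xFF);
    const uint8_t obs_hi = static_cast<uint8_t>(observed >> 8);
    return (obs_hi == exp_lo) || (obs_lo == exp_hi);
}
```

With the bytes `{0xD5, 0x85, 0x7C}`, reading at offset 0 gives 0x85D5 and reading at offset 1 gives 0x7C85. A match in this check is only a hint, since a corrupted value can contain the same byte by coincidence.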

olliw42 commented 1 month ago

when one repowers the tx, the rx still sends the incorrect sync word ... -> it seems the issue actually happens in the rx ... it would explain why one has to repower the rx whereas repowering the tx does not cure the issue. Am trying to confirm by adding debug to the rx

olliw42 commented 1 month ago

come on ... when I add a debug line on the R9MX side to print the syncword ... then it just works and works ... if I comment out this line, the issue shows ...

jlpoltrack commented 1 month ago

Syncword is getting optimized out somehow without the print function present? (what a pain)

olliw42 commented 1 month ago

I've put the dbg line in various places ... same behavior

how can we figure out if it's "optimized out"?

olliw42 commented 1 month ago

if it's such a thing, wouldn't it be strange that two different compilers would make the same "mistake"?

olliw42 commented 1 month ago

@rotorman out of desperation, you happen to have any idea on such an issue?

olliw42 commented 1 month ago

this all just does not make any sense

with the debug line dbg.puts("\n>x");dbg.puts(u16toHEX_s(Config.FrameSyncWord));dbg.puts(",");dbg.puts(u16toHEX_s(rxFrame.sync_word)); at the end of do_transmit() it runs and runs
with dbg.puts("\n>x");dbg.puts(u16toHEX_s(Config.FrameSyncWord)); or dbg.puts("\n>x");dbg.puts(u16toHEX_s(rxFrame.sync_word)); the issue soon happens, but the debug output is always the correct 7C85, even though the tx receives 85D5

if the issue depends on what code I run on the receiver, one would think that the tx isn't responsible for making it a 85D5 ...
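Since adding `dbg.puts(...)` calls changes the behavior, one less intrusive option might be to record values into a small RAM buffer and dump them only after the failure has happened. The sketch below is a hypothetical helper (`TraceRing` is not part of mLRS); note that it still changes code size and memory layout, so it could hide the issue just like the print does.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical fixed-size trace buffer that keeps the last N samples.
template <size_t N>
struct TraceRing {
    uint16_t samples[N] = {};
    size_t head = 0;   // next write position
    size_t count = 0;  // number of valid samples, saturates at N

    // Store one sample, overwriting the oldest one when the buffer is full.
    void push(uint16_t v) {
        samples[head] = v;
        head = (head + 1) % N;
        if (count < N) count++;
    }

    // Read back a sample; i = 0 is the oldest retained sample.
    uint16_t at(size_t i) const {
        assert(i < count);
        const size_t start = (head + N - count) % N;
        return samples[(start + i) % N];
    }
};
```

A `TraceRing<32>` filled with the sync word at the end of each transmit could then be printed once, e.g. on a button press or after the link is reported as lost.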

jlpoltrack commented 1 month ago

Just some wild thoughts - could the memory for the OTA get statically defined? Or perhaps some variables can be marked volatile?

olliw42 commented 1 month ago

Just some wild thoughts - could the memory for the OTA get statically defined? Or perhaps some variables can be marked volatile?

statically I don't understand well, volatile yeah but which ones need that?
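As a general rule of thumb, the usual candidates for volatile are variables that are written in interrupt context and read in the main loop (or the other way round). The following is a host-runnable sketch with made-up names (`irq_frame_done`, `irq_status`, `fake_radio_isr`); it does not claim that any particular mLRS variable is missing the qualifier.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical state shared between an interrupt handler and the main loop.
// 'volatile' forces the compiler to re-read these from memory on each access
// instead of keeping a cached copy in a register.
volatile bool irq_frame_done = false;
volatile uint16_t irq_status = 0;

// Stand-in for an interrupt handler; on a host build it is called by hand.
void fake_radio_isr(uint16_t status) {
    irq_status = status;
    irq_frame_done = true;
}

// Main-loop side: returns true once per completed frame and hands out the status.
bool poll_frame_done(uint16_t* status_out) {
    assert(status_out != nullptr);
    if (!irq_frame_done) return false;
    irq_frame_done = false;
    *status_out = irq_status;
    return true;
}
```

volatile does not make the flag/status pair atomic; on a real MCU one would typically also disable the interrupt briefly around such a read.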

am reading this https://gist.github.com/shafik/848ae25ee209f698763cffee272a58f8 I have had the suspicion for a while that the various typecasting which is going on with some structs may be an issue...
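For illustration, the pattern in question is casting a byte buffer to a struct pointer and reading fields through it, which the compiler may treat differently depending on optimization level and alignment. A commonly used alternative is to copy the bytes into a real object with memcpy. In this sketch the `FrameHeader` layout and `parse_header` are hypothetical and do not reflect the actual mLRS frame definition.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical frame header, NOT the real mLRS layout.
struct FrameHeader {
    uint16_t sync_word;
    uint8_t  status;
    uint8_t  payload_len;
};
static_assert(sizeof(FrameHeader) == 4, "sketch assumes no padding");

// Risky pattern (shown as a comment only):
//   const FrameHeader* h = reinterpret_cast<const FrameHeader*>(buf);
//   use(h->sync_word);   // aliasing and alignment are not guaranteed

// Safer pattern: copy the bytes into a properly typed, properly aligned object.
// Compilers usually turn this small fixed-size memcpy into plain loads.
FrameHeader parse_header(const uint8_t* buf) {
    assert(buf != nullptr);
    FrameHeader h;
    std::memcpy(&h, buf, sizeof(h));
    return h;
}
```

Because of the copy, `parse_header` also works when `buf` points to an odd address, which a direct pointer cast does not guarantee on every MCU.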

brad112358 commented 1 month ago

I have, so far, been unable to reproduce this. I have upgraded my R9M ACCST (Older R9M) and R9MM to the latest pre-built pre-release firmware, tx-R9M-f103c8-elrs-bl-v1.3.03-@b302689c.elrs and rx-R9MM-f103rb-elrs-bl-v1.3.03-@b302689c.elrs. It has been running for 30 minutes so far. My non-default parameters are Rx Ch Order = ETAR, Rx Ser Baudrate = 230400, Rx Snd Rc Channel = rc override.

olliw42 commented 1 month ago

yes ... for totally unknown reasons the issue seems to have appeared just recently ... (we assume mavlinkX is enabled, right?)

could you also try by using self-compiled firmwares?

brad112358 commented 1 month ago

Yes, mavlinkX is enabled. Wasn't it observed in V1.3.01 pre-built?

jlpoltrack commented 1 month ago

Wasn't it observed in V1.3.01 pre-built?

I did a git checkout back to 1.3.01 where it appeared to be an issue, but after your note I also see the issue on the latest pre-compiled binaries. Here's my summary at the moment, I usually get a crash between 2 and 5 minutes with an active serial connection.

R9M and R9MX - only change from default is 38400 baud on the Rx

brad112358 commented 1 month ago

What does No mean above? The text in this comment seems to say you do see the issue with pre-compiled 1.3.03, but then you have "No" everywhere in the table.

olliw42 commented 1 month ago

I do see the issue both for v1.3.03 latest and for the code of the first commit of v1.3.01. In effect, when I compile now it seems I always get the issue, irrespective of the code version. I have not tested the pre-compiled firmwares available in the github repo, but from the info gathered by others it seems to me that these pre-compiled files are ok

jlpoltrack commented 1 month ago

What does No mean above? The text in this comment seems to say you do see the issue with pre-compiled 1.3.03, but then you have "No" everywhere in the table.

No means 'No' it doesn't work :)

jlpoltrack commented 1 month ago

To add further confusion, am trying a R9M as an Rx and it seems perfectly fine with a self-compiled build on MavlinkX. Only change on the Rx side is 38400 baud.

brad112358 commented 1 month ago

I switched to an R9MX receiver and quickly reproduced the problem. I'll try with a locally built version.

olliw42 commented 1 month ago

yo, it's totally weird and inconsistent ...

olliw42 commented 1 month ago

the only things which seem to be "clear" (if that word can be used) are that

it appears the issue is receiver-side
it appears to be related to the compression part (if I undefine MAVLINKX_COMPRESSION it seems to work too)

brad112358 commented 1 month ago

Note, I haven't compiled this in a while so I had to mark dronecan_dsdlc.py executable (chmod +x mLRS/modules/dronecan/dronecan_dsdlc/dronecan_dsdlc.py) and install the dronecan python library (python3 -m pip install dronecan) to get run_setup.py to work. @olliw42 I suggest mark this file executable in git with "git update-index"

jlpoltrack commented 1 month ago

An MCU architecture thing?

R9M + R9MM (Both STM32F1): Works
R9M + R9M (Both STM32F1): Works
R9M + R9MX (F1 + L4): Doesn't Work
R9M + Generic (F1 + ESP8285): Doesn't Work
Bandit + Generic (ESP32 + ESP8285): Doesn't Work

olliw42 commented 1 month ago

suggest mark this file executable in git with "git update-index"

what does this mean?

brad112358 commented 1 month ago

I reproduced the problem also on R9MX with a locally built version of git b302689c ( just to exactly match the pre-compiled version where it also failed for me).

olliw42 commented 1 month ago

I believe I have narrowed it down to this part of the code https://github.com/olliw42/mLRS/blob/main/mLRS/Common/thirdparty/mavlinkx.h#L390-L414 when I make it not be used the issue seems to go away ... I was suspecting just about every place in the code, but not this one LOL

brad112358 commented 1 month ago

what does this mean?

Never mind. I didn't realize that the dronecan_dsdlc submodule comes directly from the dronecan project. I'll probably send them a PR to fix the permissions.

brad112358 commented 1 month ago

I believe I have narrowed it down to this part of the code https://github.com/olliw42/mLRS/blob/main/mLRS/Common/thirdparty/mavlinkx.h#L390-L414 when I make it not be used the issue seems to go away ... I was suspecting just about every place in the code, but not this one LOL

Buffer overrun?

olliw42 commented 1 month ago

Buffer overrun?

yes, this has been the suspicion all along, also for the old MavlinkX issue here https://github.com/olliw42/mLRS/issues/159 ... would fit into compiler dependency ... but all my staring at the code hasn't shown it to me ... but I may have stared too much at the wrong part of the code ...

... I have a feeling I see it now ... EDIT: ... no ... still don't get it :(
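When staring at the code does not reveal an overrun, one generic way to hunt for it is to surround a suspect buffer with guard words and check them periodically, together with a bounds-checked write helper. This is only a sketch: `GuardedBuffer`, `safe_write` and the canary value are made up, and nothing here is known to match the code in mavlinkx.h.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Arbitrary guard pattern; any value unlikely to occur by accident works.
constexpr uint32_t kCanary = 0xDEADBEEF;

// Hypothetical buffer wrapped by guard words. A write that runs past either
// end of 'data' is likely to clobber a guard, which intact() then reports.
template <size_t N>
struct GuardedBuffer {
    uint32_t front_guard = kCanary;
    uint8_t  data[N] = {};
    uint32_t back_guard = kCanary;

    bool intact() const { return front_guard == kCanary && back_guard == kCanary; }
};

// Bounds-checked copy into the buffer; refuses the write instead of
// running past the end of 'data'.
template <size_t N>
bool safe_write(GuardedBuffer<N>& b, size_t pos, const uint8_t* src, size_t len) {
    assert(src != nullptr);
    if (pos > N || len > N - pos) return false;
    std::memcpy(b.data + pos, src, len);
    return true;
}
```

Checking `intact()` once per loop and printing a message when it first fails narrows down when the corruption happens. The guards only catch overruns that actually hit them, so a write that lands further away would go unnoticed.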

olliw42 / mLRS

Bug? | Weird Tx/Rx Disconnect #221