Open jlpoltrack opened 1 month ago
do we know a last good version/commit?
1.3.00 release looks good.
ups ... that's a long way back ...
Got this disconnect behavior on 1.3.01:
HEAD is now at 15c198f6 v bump 1.3.01 (=dev)
Full hash - 15c198f61602ca500fbb7527fb90dac924c1130e
@jlpoltrack don't fully understand
you're saying, "v1.3 (release)" or f12d6803 is ok, but already "v bump 1.3.01 (=dev)" or 15c198f6 is not ??
if so, did you compile in both cases, or did you take the binaries from the folder for "v1.3 (release)"? if you did compile, did you also roll back the submodules (by e.g. running run_setup.py)?
@jlpoltrack @brad112358 I seem to get the very same behavior what I did is
The receiver is btw a R9MX.
when I have mavlinkX enabled it "crashes" either at startup or soon after power up. With mavlinkX disabled it all seems fine. (even when mavlinkX is disabled LQ is often below 100%, like 90%-ish low, not sure why that is, feels like before Brad's AFC finding, but that's another story)
Questions I have:
next finding replaced the R9MX by an ELRS GENERIC 900 (not sure what brand it is) ...
-> so, if it should be a receiver side thing (case (A)) then it seems to not be a STM32 vs ESP thing, which implies it is not a compiler thing -> on the other hand it's reported for both RM Bandit and R9M, which implies if it is a case (B) thing it also is not a compiler thing -> ... which makes it strange it was working so far for gcc11 and the ESP compiler ...
Thanks for confirming you see same. Perhaps one extra piece I can add - when the 'crash' happens, power cycling the Rx seems to resolve which I don't quite understand.
- when the issue appears, it's not a crash in the sense that one device would stop working, but it looks as if the downlink connection is gone. Now, this can be because (A) the receiver stops sending or because (B) the tx module stops receiving. I have difficulties working out which case it is. Do you have insight if it's (A) or (B)?
Have power meter hooked up to Rx, when 'crash' happens Rx still continues to transmit something.
when the 'crash' happens, power cycling the Rx seems to resolve which I don't quite understand.
same here. My interpretation is that some fields get reset and it can start again - might be totally wrong track though
Have power meter hooked up to Rx, when 'crash' happens Rx still continues to transmit something.
when I use the ELRS generic 900, I do see the same.
so, at least we seem to see the same symptoms ... that's good ... but I just can't make any sense of it yet.
so far I was speculating, as we had the issue "just" with gcc12, that there is somewhere some memory leak and that some variable gets overwritten ... the fact that gcc12 did/does arrange the variables differently in memory I considered as supporting this view ... but now we see this for two different compilers, and "only" for sx1276 ...
Have power meter hooked up to Rx, when 'crash' happens Rx still continues to transmit something.
so, the rx seems to be happy to transmit ... question is thus if (C) the receiver is sending invalid frames so that the tx receives but rejects them (e.g. wrong bind phrase, wrong frequency, wrong header or crc, etc. pp) or is it indeed (B) that the tx stops receiving them.
since it's not easy to have debug output on the bandit I find that difficult to investigate ... I guess I need to get using the R9M LOL
If you're keen - you can use JRPin5 on Bandit as a UART Output - it is pin 13. https://github.com/ExpressLRS/targets/blob/master/TX/Radiomaster%20Bandit%20Micro.json#L2-L3
ah, interesting idea. what SERIAL would I have to use? SERIAL and SERIAL1 are already used up .... can I use SERIAL2?
can I use SERIAL2?
Yea, e.g. https://github.com/G6EJD/ESP32-Using-Hardware-Serial-Ports/blob/master/ESP32_Using_Serial2.ino#L17
but how would I set it up? our lib needs both rx and tx to be specified, but they are on the same pin ... is this no problem?
ah, I set RX to -1 ??
That should work, same as done for RC Out on RP4-TD https://github.com/olliw42/mLRS/blob/main/mLRS/Common/hal/esp/rx-hal-radiomaster-rp4td-2400-esp32.h#L27
it seems that in the failure mode the rx sends and the tx receives ... not sure yet what makes the tx not accept the packets
Just to confuse matters further.
I noticed that by chance I have my R9M and BAYCK NANO PRO 900 at version 1.3.01, so gave them a bench test. I don't seem to see these issues with mavlinkx or mavlink set. I have a telemetry stream running also. Stable connection for well over 10 mins.
I think you would observe the "new" behavior if you would recompile the firmware and reflash ... while I don't understand at all why this would be so, it would fit the pattern ... it just doesn't make sense that v1.3.00 firmware is all good but just one commit later it's all bad ... and I firmly believe it wasn't ... something mysterious must have happened to our build systems
one more data point ... the tx receives an incorrect sync word when the issue occurs: it should be 0x7C85 but suddenly becomes 0x85D5 ... does it miss a byte ??
This sync word from the mLRS OTA? (Not the LoRa sync word?) https://github.com/olliw42/mLRS/blob/c4f7ce779dbe0768a3ab57914e1455c9c1c15f55/mLRS/Common/frame_types.h#L105
yes, that's the sync word sent over the air, and checked here: https://github.com/olliw42/mLRS/blob/main/mLRS/CommonTx/mlrs-tx.cpp#L317-L319
when using the R9MX it changes from 0x7C85 (correct) to 0x5220 ...
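A quick way to test the "does it miss a byte" idea against the two observed values is sketched below. It assumes (not confirmed in this thread) that the sync word is the first two bytes of the frame and is stored low byte first, as it would be for a packed struct on a little-endian MCU. `classify_sync` and `SyncShift` are hypothetical names for illustration, not mLRS code.

```cpp
#include <cassert>
#include <cstdint>

// Possible relations between the expected and the observed sync word,
// ASSUMING the sync word is stored low byte first at the frame start.
enum class SyncShift { Match, StrayByteBefore, FirstByteLost, Unexplained };

SyncShift classify_sync(uint16_t expected, uint16_t observed) {
    const uint8_t exp_lo = static_cast<uint8_t>(expected & 0xFF);
    const uint8_t exp_hi = static_cast<uint8_t>(expected >> 8);
    const uint8_t obs_lo = static_cast<uint8_t>(observed & 0xFF);
    const uint8_t obs_hi = static_cast<uint8_t>(observed >> 8);

    if (observed == expected) return SyncShift::Match;
    // Bytes read were [stray, exp_lo]: one extra byte sits before the frame.
    if (obs_hi == exp_lo) return SyncShift::StrayByteBefore;
    // Bytes read were [exp_hi, next]: the first frame byte went missing.
    if (obs_lo == exp_hi) return SyncShift::FirstByteLost;
    return SyncShift::Unexplained;
}

// The two values reported in this thread.
void check_reported_values() {
    assert(classify_sync(0x7C85, 0x85D5) == SyncShift::StrayByteBefore);
    assert(classify_sync(0x7C85, 0x5220) == SyncShift::Unexplained);
}
```

Under that byte-order assumption, 0x85D5 looks like a stray 0xD5 in front of the frame rather than a lost byte; with the opposite byte order it would be a lost first byte instead. 0x5220 shares no byte with 0x7C85, so a plain one-byte shift does not explain the R9MX case either way.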
when one repowers the tx, then the rx still sends the incorrect sync word ... -> it seems the issue actually happens in the rx ... it would explain why one has to repower the rx whereas repowering the tx does not cure the issue. am trying to confirm by adding debug to the rx
come on ... when I add a debug line on the R9MX side to print the syncword ... then it just works and works ... if I comment out this line, the issue shows ...
Syncword is getting optimized out somehow without the print function present? (what a pain)
I've put the dbg line in various places ... same behavior
how can we figure out if it's "optimized out"?
if it's such a thing, wouldn't it be strange that two different compilers would make the same "mistake"?
@rotorman out of desperation, you happen to have any idea on such an issue?
this all just does not make any sense
dbg.puts("\n>x");dbg.puts(u16toHEX_s(Config.FrameSyncWord));dbg.puts(",");dbg.puts(u16toHEX_s(rxFrame.sync_word));
at the end of do_transmit() it runs and runs
dbg.puts("\n>x");dbg.puts(u16toHEX_s(Config.FrameSyncWord));
or
dbg.puts("\n>x");dbg.puts(u16toHEX_s(rxFrame.sync_word));
the issue soon happens, but the debug output is always the correct 7C85, even though the tx receives 85D5
if the issue depends on what code I run on the receiver, one would think that the tx isn't responsible for making it a 85D5 ...
Just some wild thoughts - could the memory for the OTA get statically defined? Or perhaps some variables can be marked volatile?
statically I don't understand well, volatile yeah but which ones need that?
am reading this https://gist.github.com/shafik/848ae25ee209f698763cffee272a58f8 ... I have had for a longer while the suspicion that the various typecasting which is going on with some structs may be an issue...
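For reference, a minimal sketch of the pattern the linked gist is about: reading a 16-bit field out of a byte buffer through a cast pointer is undefined behaviour under strict aliasing (and may be misaligned), while going through `memcpy` is well defined. `load_u16` and `store_u16` are hypothetical helpers for illustration; whether any mLRS code path actually hits this is not established here.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Problematic pattern (shown only as a comment):
//   uint16_t v = *reinterpret_cast<const uint16_t*>(bytes);
// The compiler may assume a uint16_t lvalue never aliases a uint8_t array,
// so optimized builds can reorder or drop such accesses.

// Well-defined alternative: copy the bytes into a real uint16_t object.
uint16_t load_u16(const uint8_t* bytes) {
    uint16_t value;
    std::memcpy(&value, bytes, sizeof(value));
    return value;
}

// Matching store, also via memcpy, so any alignment is fine.
void store_u16(uint8_t* bytes, uint16_t value) {
    std::memcpy(bytes, &value, sizeof(value));
}

// Round trip at an odd (unaligned) offset.
void check_round_trip() {
    uint8_t buf[4] = {0, 0, 0, 0};
    store_u16(buf + 1, 0x7C85);
    assert(load_u16(buf + 1) == 0x7C85);
    assert(buf[0] == 0 && buf[3] == 0);  // neighbours untouched
}
```

Compilers typically turn the `memcpy` into a plain load or store, so this should not cost anything at runtime.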
I have, so far, been unable to reproduce this. I have upgraded my R9M ACCST (Older R9M) and R9MM to the latest pre-built pre-release firmware, tx-R9M-f103c8-elrs-bl-v1.3.03-@b302689c.elrs and rx-R9MM-f103rb-elrs-bl-v1.3.03-@b302689c.elrs. It has been running for 30 minutes so far. My non-default parameters are Rx Ch Order = ETAR, Rx Ser Baudrate = 230400, Rx Snd Rc Channel = rc override.
yes ... for totally unknown reasons the issue seems to have appeared just recently ... (we assume mavlinkX is enabled, right?)
could you also try by using self-compiled firmwares?
Yes, mavlinkX is enabled. Wasn't it observed in V1.3.01 pre-built?
I did a git checkout back to 1.3.01 where it appeared to be an issue, but after your note I also see the issue on the latest pre-compiled binaries. Here's my summary at the moment, I usually get a crash between 2 and 5 minutes with an active serial connection.
R9M and R9MX - only change from default is 38400 baud on the Rx
What does No mean above? The text in this comment seems to say you do see the issue with pre-compiled 1.3.03, but then you have "No" everywhere in the table.
I do see the issue for both v1.3.03 latest and with the code of the first commit of v1.3.01. in effect, when I compile now it seems I always get the issue irrespective of what code version. have not tested the pre-compiled firmwares available in the github repo, but from the info gathered by others it seems to me that these pre-compiled files are ok
What does No mean above? The text in this comment seems to say you do see the issue with pre-compiled 1.3.03, but then you have "No" everywhere in the table.
No means 'No' it doesn't work :)
To add further confusion, am trying a R9M as an Rx and it seems perfectly fine with a self-compiled build on MavlinkX. Only change on the Rx side is 38400 baud.
I switched to an R9MX receiver and quickly reproduced the problem. I'll try with a locally built version.
yo, it's totally weird and non-consistent ...
the only things which seem to be "clear" (if that word can be used) is that
Note, I haven't compiled this in a while so I had to mark dronecan_dsdlc.py executable (chmod +x mLRS/modules/dronecan/dronecan_dsdlc/dronecan_dsdlc.py) and install the dronecan python library (python3 -m pip install dronecan) to get run_setup.py to work. @olliw42 I suggest mark this file executable in git with "git update-index"
An MCU architecture thing?
suggest mark this file executable in git with "git update-index"
what does this mean?
I reproduced the problem also on R9MX with a locally built version of git b302689c ( just to exactly match the pre-compiled version where it also failed for me).
I believe to have narrowed it down to this part of the code https://github.com/olliw42/mLRS/blob/main/mLRS/Common/thirdparty/mavlinkx.h#L390-L414 when I make it to not be used the issue seems to go away ... I was suspecting like all places in the code, but not this one LOL
what does this mean?
Never mind. I didn't realize that the dronecan_dsdlc submodule comes directly from the dronecan project. I'll probably send them a PR to fix the permissions.
I believe to have narrowed it down to this part of the code https://github.com/olliw42/mLRS/blob/main/mLRS/Common/thirdparty/mavlinkx.h#L390-L414 when I make it to not be used the issue seems to go away ... I was suspecting like all places in the code, but not this one LOL
Buffer overrun?
yes, this has been the suspicion all along, also for the old MavlinkX issue here https://github.com/olliw42/mLRS/issues/159 ... would fit into compiler dependency ... but all my staring at the code hasn't shown it to me ... but I may have stared too much at the wrong part of the code ...
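One generic way to hunt a suspected overrun without relying on debug prints (which seem to hide the issue here) is to put guard bytes around the buffer and check them periodically. The sketch below is only an illustration of that technique; `GuardedBuffer` and its members are hypothetical and not taken from mavlinkx.h.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Fixed-size byte buffer with guard bytes on both sides.
// All members are uint8_t arrays, so there is no padding between them.
template <size_t N>
struct GuardedBuffer {
    static constexpr uint8_t kGuard = 0xA5;

    uint8_t front[4] = {kGuard, kGuard, kGuard, kGuard};
    uint8_t data[N] = {};
    uint8_t back[4] = {kGuard, kGuard, kGuard, kGuard};

    // Bounds-checked write: refuses anything that would leave data[].
    bool write(size_t offset, const uint8_t* src, size_t len) {
        if (offset > N || len > N - offset) return false;
        std::memcpy(data + offset, src, len);
        return true;
    }

    // True while no code has scribbled over the guard bytes.
    bool intact() const {
        for (size_t i = 0; i < 4; i++) {
            if (front[i] != kGuard || back[i] != kGuard) return false;
        }
        return true;
    }
};

void check_guarded_buffer() {
    GuardedBuffer<8> buf;
    const uint8_t src[16] = {1, 2, 3, 4, 5, 6, 7, 8};
    assert(buf.write(0, src, 8));    // exactly fits
    assert(!buf.write(4, src, 8));   // would run 4 bytes past the end
    assert(buf.intact());
}
```

If `intact()` is checked from the main loop and flips to false, a raw write bypassing `write()` has gone out of bounds, and the state can be latched into a status flag or LED instead of printed.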
... I have a feeling I see it now ... EDIT: ... no ... still don't get it :(
Latest firmware on R9M (Tx) and R9MX (Rx), within ~5 minutes of having serial connection get:
Switching to Mavlink instead of MavlinkX seems to make it go away. Issue seems similar to previously reported issue when using GCC 12. No problem with connection when there is not an active serial connection.