olliw42 / mLRS

2.4 GHz & 915/868 MHz & 433 MHz/70 cm LoRa based telemetry and radio link for remote controlled vehicles
GNU General Public License v3.0
295 stars 67 forks source link

Disable mbridge serial port receiver when transmitter is enabled #65

Closed brad112358 closed 1 year ago

brad112358 commented 1 year ago

This eliminates the spurious received characters I was seeing in my debug output on my R9M With this change I'm no longer seeing lost messages when connecting QGC via bluetooth routed from mbridge with openTX.

olliw42 commented 1 year ago

many thx

interesting ...

could you figure out/understand why only disabling the isr doesn't do it? I mean, it should... shouldn't it? I really would like to understand this better. Is it so that it is always only one spurious byte? Could then imply that it is some byte already in the double buffers.

pl use the respective LL functions, should be e.g. LL_USART_EnableDirectionRx(). I guess we also want to put it into the low level libs

brad112358 commented 1 year ago

I don't see any reason why only disabling the isr would prevent data from entering the USART shift register or Receive Data register. Then, when the interrupt is re-enabled, I'd expect it to fire immediately and the data register would contain data received while the interrupt was disabled. Perhaps a second character from the shift register might also be received, I haven't checked.

When I added my dbg output I wasn't looking for this specific bug and I only output the first unexpected character received after each transmit (The dbg output is lower baud rate so I didn't want to output much). So all I can tell you is that in parse_nextchar() in STATE_IDLE when we were expecting MBRIDGE_STX1, we often received 0xA2. Sometimes we received 0x00. Since the data received was pretty consistent, I guessed that this byte might be from MBRIDGE_CMD_TX_LINK_STATS and when I eliminated those from being transmitted, the spurious 0xA2 bytes went away and I sometimes saw other characters (I think mostly 0x00, but I can't remember). At that point, I realized what the problem was, fixed it, and verified that the state machine now received no unexpected characters.

I don't know exactly what happens when the USART receive data register overflows. I'm not especially interested in investigating further.

BTW, is it possible that this problem also causes the occasional mLRS configurator script crashes and errors like failed to read parameters in this script?

olliw42 commented 1 year ago

I don't see any reason why only disabling the isr would prevent data from entering the USART shift register or Receive Data register. Then, when the interrupt is re-enabled, I'd expect it to fire immediately and the data register would contain data received while the interrupt was disabled.

well, when the isr is re-enabled, also the isr pending bit is cleared and thus the isr does NOT fire opposed to your assumption ... it should fire only when again a new byte is received ...

strange

BTW, is it possible that this problem also causes the occasional mLRS configurator script crashes and errors like failed to read parameters in this script?

I don't see any occasional mLRS configure script crashes anymore (on OTX at least), are you refereing here to some outdated situation? or you are saying you still get them?

besides this, this thought crossed my mind too, that it e.g. may prvent having to reload, or that teh script now also works better on EdgeTx (as said before, the issue should be the same with CRSF)

brad112358 commented 1 year ago

I did still get them, but I'm not sure I'm running the latest script.

OK. Based on your logic, it can't be left over received data which is received when transmit is done. But: It looks like UART_IRQHandler() will be called to indicate transmit complete and it calls UART_RX_CALLBACK_FULL() when the USART_DR is full even if the receive interrupt enable (RXNEIE) is not set. So we will receive and process some or all of what we send. Perhaps all but the last character?

jlpoltrack commented 1 year ago

FYI - I did a quick test of these changes on E28 Dual + EdgeTx and didn't notice any difference with the Lua. 31 / 50 Hz generates a fair amount of errors.

olliw42 commented 1 year ago

But: It looks like UART_IRQHandler() will be called to indicate transmit complete and it calls UART_RX_CALLBACK_FULL() when the USART_DR is full even if the receive interrupt enable (RXNEIE) is not set.

I'm not sure I understand what you are saying: UART_RX_CALLBACK_FULL() is called inside if (usart_sr & FLAG_SR_RXNE) {}, but this is called even though the rx isr is disabled?

In your point of view, should it therefroe resolve the issue if

(b) could be conveniently done by adding to uart_rx_enableisr() the calls LL_USART_ReadReg(UART_UARTx, REG_SR); LL_USART_ReceiveData8(UART_UARTx); before the IT flags are cleared and enabled. If this would solve the issue too it would be a clean solution.

is it possible that all we need to do is to add a LL_USART_ClearFlag_ORE(UART_UARTx) in uart_rx_enableisr()?

brad112358 commented 1 year ago

I'm not sure I understand what you are saying: UART_RX_CALLBACK_FULL() is called inside if (usart_sr & FLAG_SR_RXNE) {}, but this is called even though the rx isr is disabled?

Yes, since UART_IRQHandler() is not checking RXNEIE.

If you see a better way to fix it than the obvious way I've proposed, feel free to ignore this PR and fix it. But, on quick examination, I don't see how your (b) fixes anything and I my solution seems more natural to me than your (a).

brad112358 commented 1 year ago

BTW, I haven't tested this, but I suspect there is actually no need to disable the RXNE interrupt at all if we just disable the receiver (RE) when we transmit.

olliw42 commented 1 year ago

first off, I'd like to understand the underlying issue. I generally do not feel well with doing things I don't understand why. This also always bears the risk that while it may mitigate an issue in a particular case, it may actually create an unexpected issue elsewhere later on. second off, I (we!!) don't want to have hardware related code in the main libs. There unfortunately is some, but it doesn't mean that's good, and should be expanded. It should be reduced, so I try to see how this could be better packed into the hardware files. It is not my intention to ignore the PR, but to improve upon considering (1) and (2). You seem to have a good test setup, and I do not want to recreate one for myself, hence my questions instead of trying myself.

brad112358 commented 1 year ago

You seem to be saying you don't yet fully understand the underlying issue. I didn't see the full root cause when I first submitted the PR, but after you pointed out that we clear the interrupt when we enable it, I looked at the library more carefully and now I do think I see the full root cause. I've tried to explain it, but I don't know what you are missing. I explained with references to specific functions and registers above. Please read that again. Meanwhile, I'll try again in more general terms here:

There are not separate low level ISR routines for transmit and receive. There is only ONE ISR which handles both transmit and receive, the address of which is stored in the interrupt vector table. If I'm not mistaken, the interrupt vector table for our USART points to the UART_IRQHandler() function. This function is called for both receive and transmit. You tried to prevent receive by disabling the interrupt for receive, but you left the interrupt enabled for transmit. You did not actually disable receive in the USART. Thus, when we transmit on a half duplex line, a transmit complete interrupt is generated and the low level ISR sees the a character has also been received and it calls the functions which handle this as I explained previously.

Thus, the root cause is a combination of not disabling the receiver, not disabling the transmit complete interrupt (which we need), and having a single function which handles the interrupt for both transmit and receive (a hardware design fact). Disabling the receiver when we transmit seems the most logical fix to me. Where you put the code to do this is a detail you can decide.

Does this make clear the root cause?

olliw42 commented 1 year ago

it doesn't really explain it in transmit the bytes are reflected to the receiver, so the reciever does ALWAYS receive something, which means that the RXNE flag MUST be set at the time of TC (if a transmission was happening of course). Therefore, there ALWAYS would have to be a spurious byte. Which is not what happens, as you also report. QED

there is a second dangling point, not sure you have noted, there are code fragments there I tried to not having to use the diode for the jrbridge, but whatever I tried so far I couldn't get it to work, but it shouldo if it is as you say. This also makes me suspect that there is a detail still missing. And as third point, I'm using this exact kind of half duplex since ages in the STorM32 project, with no such issues. Also this makes me supsect there is a detail missing.

brad112358 commented 1 year ago

Ignoring what I initially concluded was being received based on my limited dbg output, do you see the bug in the code or not?

On Mon, Apr 3, 2023, 3:18 AM olliw42 @.***> wrote:

it doesn't really explain it in transmit the bytes are reflected to the receiver, so the reciever does ALWAYS receive something, which means that the RXNE flag MUST be set at the time of TC (if a transmission was happening of course). Therefore, there ALWAYS would have to be a spurious byte. Which is not what happens, as you also report. QED

there is a second dangling point, not sure you have noted, there are code fragments there I tried to not having to use the diode for the jrbridge, but whatever I tried so far I couldn't get it to work, but it shouldo if it is as you say. This also makes me suspect that there is a detail still missing. And as third point, I'm using this exact kind of half duplex since ages in the STorM32 project, with no such issues. Also this makes me supsect there is a detail missing.

— Reply to this email directly, view it on GitHub https://github.com/olliw42/mLRS/pull/65#issuecomment-1493884056, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABKQOEPBMQDGCWIF6IJKSNLW7KBUTANCNFSM6AAAAAAWPCMMDA . You are receiving this because you authored the thread.Message ID: @.***>

olliw42 commented 1 year ago

I don't know what I should make of this post I think I raised a valid contradiction, which should have manifested itself from day one if you want to call it a "bug in the code" then your approach is obviously not the correct solution of it we probably should change then the lib

brad112358 commented 1 year ago

Do you agree that disabling rx interrupt is not enough to prevent receiving back bytes we transmit with your existing code?

On Mon, Apr 3, 2023, 3:52 AM olliw42 @.***> wrote:

I don't know what I should make of this post I think I raised a valid contradiction, which should have manifested itself from day one if you want to call it a "bug in the code" then your approach is obviously not the correct solution of it we probably should change then the lib

— Reply to this email directly, view it on GitHub https://github.com/olliw42/mLRS/pull/65#issuecomment-1493929431, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABKQOEPASJRFNRUFKB2YTJTW7KFVRANCNFSM6AAAAAAWPCMMDA . You are receiving this because you authored the thread.Message ID: @.***>

olliw42 commented 1 year ago

LOL the existing code obviously does not prevent back bytes, if one doesn't want ignore what you initially concluded was being received based on your limited dbg output which I don't, I assume that your report that spurios bytes are received is accurate I sure do see the error path you describe, but I don't see how it would consistently explain the observation; given the current information it does not, so either the information is partially incorrect or it is not the only piece in the explanation. Looks like simple interference to me. Not sure it needs so many posts; could we pl move on and not get stuck in trivia. As said, I would want to understand the situation, where "understand" to me does not mean to only have some half-explanations, and based on that I second would want to identify the best/most proper/appropriate change to the code. As it stands currently, none is really settled. If the described error path is the full story then first (a) should work perfectly and second there has to be an explanation for why it's only sporadic. Sadly, I don't hear anything which contributes here. (btw, the proper solution would then be to change the isr)

brad112358 commented 1 year ago

Olli, Please be patient with me. I think I've learned it's best to take things one step at a time in situations like this since if either of us misunderstands something basic, we can waste a lot of time in pointless debate so I wanted to confirm we are on the same page so far.

So we agree that the existing code ends up processing part or probably all of every transmitted message as received data. I think this depends on details of how long it takes after the last bit is transmitted for the receiver to recognize the stop bit and transfer the shift register to the data register and set the not empty flag. I think it's possible that we might not receive the last byte transmitted. Now we need to consider the effect the received bytes may have on the state machine in parse_nextchar(). We don't transmit a MBRIDGE_STX1 byte at the beginning of the transmitted message from the module to the radio, so most received bytes will stay in STATE_IDLE. But, if the last bugus byte we receive happens to be MBRIDGE_STX1 (about 1/256 data bytes, but we could get lucky or unlucky and have a few of these), we will move to STATE_RECEIVE_MBRIDGE_STX2. This is where the first real trouble happens. Now, the first legitimate byte of the next message will be MBRIDGE_STX1, but we will discard it and move back to STATE_IDLE (either because we weren't already in idle or because of a timeout). Thus, we loose a message in the radio to module direction. I suspect might be is what was causing occasional failures to download parameters since the GCS doesn't seem very smart about retry.

This is where I start to run short of understanding what else might be going wrong. I'm not sure I see a likely scenario that explains how we loose messages in the module to radio direction. But I know we do because I see lost messages in this direction also without this fix.

Does the radio side do anything different if it doesn't get a message back after sending one to the module?

Does this get you closer to being comfortable with what we understand and what we don't?

olliw42 commented 1 year ago

fyi, I've pushed updates to the stm32ll-lib with changes to uart, some were looming anyways, but I've also added the isr checks. tested on three mLRS systems, as well as on several STorM32 NT&T systems not sure, maybe that's enough to mitigate the issue

olliw42 commented 1 year ago

thx for the effort at explaining your picture. I however maintain that it doesn't really explain. The race condition you mention cannot be relevant, since there are always two bytes transmitted at least, the RXNE flag thus must have been raised at the time of TC. In fact, according to the data sheet a second received byte should produce an ORE and not change the DR data byte, that is, the - first - spurious byte in fact should always be a 'O'. Also, you figured yourself that it wouldn't explain things in the other direction (which is the more harmfull in terms of param download). Second, it is easy to test experimentally, by simply closing this error path and seeing what happens. The latest code, with the updated stm32ll-lib, should have closed that path, so it will be as simple as that: either the issue is gone with the latest code or it is not. If yes all is good, and if not all talk about this error path wasn't en point. :)

brad112358 commented 1 year ago

Second, it is easy to test experimentally, by simply closing this error path and seeing what happens. The latest code, with the updated stm32ll-lib, should have closed that path, so it will be as simple as that: either the issue is gone with the latest code or it is not. If yes all is good, and if not all talk about this error path wasn't en point.

Olli, You were the one who was concerned about exactly how this bug was causing the symptoms I saw.

In fact, according to the data sheet a second received byte should produce an ORE and not change the DR data byte, that is, the - first - spurious byte in fact should always be a 'O'.

As I've been saying, when we don't agree, it's often a missed fact on one one side or the other. In this case, you seem to have missed that the two sync bytes 'O' and 'W' are not transmitted in the module to radio direction. The first spurious byte I see is indeed the first byte transmitted. I was briefly confused about this myself, but since you wrote the code I didn't expect you might also be confused, so I didn't mention this in my last few comments. I don't know why the two directions don't use the same format.

Strangely these two bytes are inserted into the received byte stream by your OpenTX code on the radio side and then, removed again by the receive state machine. This code in OpenTX seems more than a little strange to me and while it's doesn't matter for this discussion I'd like to know why you did that!

olliw42 commented 1 year ago

Olli, You were the one who was concerned about exactly how this bug was causing the symptoms I saw.

I have no idea how my statent would impart this.

your experimntal findings will be the test, so lets hear your experimental findings

you seem to have missed that the two sync bytes 'O' and 'W' are not transmitted in the module to radio direction

argh, indeed, I have confused this

I don't know why the two directions don't use the same format.

this is a master slave protocol, and the slave wants to be able to sync with the master

Strangely these two bytes are inserted into the received byte stream

the timing/frame detection is done in the isr, this is just a simple method of telling the parser where the frame start is. Use any method you like.

brad112358 commented 1 year ago

Olli, You were the one who was concerned about exactly how this bug was causing the symptoms I saw.

I have no idea how my statent would impart this.

Because you said this:

first off, I'd like to understand the underlying issue. I generally do not feel well with doing things I don't understand why. This also always bears the risk that while it may mitigate an issue in a particular case, it may actually create an unexpected issue elsewhere later on.

And, to be clear, I'd generally prefer to understand why as well, especially when it's my code that's misbehaving. But, in this case, it just seemed so obvious to me that receiving our own data could cause problems that I didn't want to spend time in all of this discussion.

so lets hear your experimental findings

Yes, I should have some time to test your new code today.

brad112358 commented 1 year ago

I've tested your latest code and your changes do indeed fix the problem. With my debug code added, I no longer see any spurious bytes received nor any unexpected state transitions and I'm seeing no lost messages. I'm going to close this PR since it is no longer needed.

Since the bug is now fixed, I don't want to waste your time or mine with much further discussion, but I do want to mention, as feedback, for your consideration, that our views on the best way to implement half duplex do seem to be quite different.

In my opinion, you have effectively added undocumented side-effects when the rx interrupt is disabled (for example, discarding received data) which I think could interfere with other possible use cases (for example, polled reception or briefly disabling the rx interrupt for the purpose of avoiding a race condition). I think this makes your library more complex, harder to understand, and less general. In my opinion, it would be much more clear to simply directly disable the receiver when in transmit state. You could then just leave the interrupts enabled. In short, I think the way you have fixed the problem is less efficient and more complicated than it needs to be.

But I'm very happy I was able to isolate the problem I was observing with lost messages and I'm even happier it is fixed now.

Thanks! And thanks for mLRS and all the time you spend to make it available to everyone!

olliw42 commented 1 year ago

In my opinion, you have effectively added undocumented side-effects when the rx interrupt is disabled

nonsense. The changes I did are all documented - very loudly in fact - in the datasheets. Second, I did these changes not mainly to somehow resolve the issue here, but since they make sense from the general perspective of what one should do when enabling/disabling isrs and only one isr specifically. It is to the contray odd to enable/disable the direction if one wants to enable/disable the isr.

good it also solves the issue. So, you were correct that it was the error path in the isr which caused these spurious bytes. Given I use that code since more than ten years I'm a bit surprised I haven't noticed that before, but better late than never. So many thx for that.

brad112358 commented 1 year ago

It is to the contray odd to enable/disable the direction if one wants to enable/disable the isr.

Who says we want or need to disable the ISR? The need is to switch directions, to enable transmit and disable receive on the UART or reverse. There is no need to disable the interrupt if we disable receive. I'm not saying it should be done in a function that disables the interrupt.

If the above really doesn't make sense to you, we must be thinking about this in very different ways.