nxp-mcuxpresso / mcux-sdk

MCUXpresso SDK
BSD 3-Clause "New" or "Revised" License
301 stars 136 forks source link

FLEXCAN workaround for chip errata 6032 may cause intermittent RX FIFO overflows in an RTOS multithreaded environment. #66

Closed abascom closed 2 years ago

abascom commented 2 years ago

Is your enhancement proposal related to a problem? Please describe. We were updating our project from SDK version 2.4.2 to 2.10.0 when we observed intermittent RX FIFO overflows on FLEXCAN. After much debugging, we were able to narrow it down to the chip errata 6032 workaround function causing this issue. We run a multithreaded RTOS, and whenever a thread wanted to send a frame on CAN, it could disable all FLEXCAN interrupts, then context switch to another task leaving FLEXCAN interrupts disabled, eventually causing the 6 frame RX FIFO to overflow.

Describe the solution you'd like Either an option added to disable the 6032 errata workaround, or to have it implemented in such a way that interrupts cannot be disabled long enough to overflow the RX FIFO. I am willing to contribute a pull request for one of these options if you can provide some guidance on which approach is preferred.

Describe alternatives you've considered We have locally modified the SDK to disable the workaround and confirmed we no longer get intermittent RX FIFO overflows.

Additional context Similar data loss issues also appear to be present in the implementation of LPUART. Whenever transmit/receive functions are called from an RTOS task, LPUART interrupts are disabled/enabled and if a task switch occurs or some higher priority task is running, received data on the LPUART can be lost.

mcuxsusan commented 2 years ago

Hi @abascom, thanks for reaching us and reporting the issue here.

After discussed with developer, I got the feedback as below:

The developer appreciate if you could help test the suggestion and contribute via PR back to us.

abascom commented 2 years ago

We can certainly give that a try.

With our CAN bus utilization, we estimate we could fill the RX FIFO in ~1ms. Does the developer have any data around how long the busy waits in the 6032 errata workaround could take in the worst case?

mcuxsusan commented 2 years ago

@abascom, yes, please have a try. For your question related to the busy waits, I will provide the feedback from developer later, as he is currently on Holiday leave. Appreciate for your patience.

abascom commented 2 years ago

I have tested the suggested change it is has been working well (I've left it running for a couple of days with regular CAN traffic).

I do have another question around the errata workaround. "

  1. Disable all interrupt
  2. Read CAN_DBG1.CFSM and CAN_DBG1.CBN fields.
  3. Check if CFSM value is either BUSIDLE, RXINTERMISSION or TXINTERMISSION. For the later two values, also check if CBN value is 3, to determine the paired conditions RXINTERMISSION bit 3 or TXINTERMISSION bit 3, and processed as described below. 3.1. If CAN_DBG1 fields indicate BUSIDLE, wait N CPU clocks. 3.2. Else if CAN_DBG1 fields indicate either RXINTERMISSION bit 3 or TXINTERMISSION bit 3 wait until CFSM is different from either RXINTERMISSION or TXINTERMISSION. 3.3. Check again CAN_DBG1 fields, if they indicate BUSIDLE, wait for DELAY time.
  4. Write 0x0 into Code field of CS word.
  5. Enable all interrupts.
  6. Write the ID word.
  7. Write the DATA words.
  8. Write 0xC into Code field of CS word. "

I have tested changing step 1 and 5 to disable/enable all interrupts instead of just flexcan interrupts to get rid of context switching. However even doing that, after enabling interrupts in step 5, the OS could context switch and steps 6, 7, and 8 could happen some time much later. Is this fine? Or should the enable all interrupts part actually be moved to the last step?

I am also still curious on the busy wait worst case timing question as soon as the developer is back and able to help answer.

Thanks for your help!

njdnzy commented 2 years ago

hi @abascom, I'm the SDK flexcan driver developer. For the first question, base on my understanding of the problem, the worst case wait time before sending each frame is 2 frame bits of time (if CAN bitrate is 500kbps, the wait time is 4us).

For the second question, it is no need to move enable all interrupts to the last step. For this errata, there is no issue if the CPU guarantees that any MB configured for transmission will not be aborted or deactivated (write 0x0 into CS Code field is an abort operation) in a short interval. If we do not turn off interrupts before doing step 4, we cannot guarantee that this MB has not just been configured, however, once we have performed step 4, neither reconfiguring the MB nor aborting it again will cause an error to be generated, so there is no need to disable the interrupt.

abascom commented 2 years ago

I've entered pull request (https://github.com/NXPmicro/mcux-sdk/pull/72) for this issue.