Laogeodritt commented 2 years ago

Hello,

I've been debugging an issue where some frames are not transmitted upon write(); the affected frames seem random. I think I've tracked down a partial explanation and a fix, but I'm not sure of my root-cause explanation to conclude this is a correct/definitive fix.

Test setup

My test CAN network is:

A Teensy 4.0 is set up to respond to specific frames on the CANbus. It does not transmit any other frames.
Another test device that: a) sends a frame, b) waits for a response (or until a timeout of 50ms); c) waits 1000ms after read()ing a response before sending the next frame.
A sniffer/listener-only device.

Issue description

Sometimes, the Teensy will fail to transmit the response frame (very timing-dependent, especially timing between the read() and write(response) calls—I've seen from 1 out of 30000 frames to 3 out of 1000 frames affected). The frame will not send even after several seconds. The bus is idle during this time, as the other node is waiting for a response.

I've verified that the Teensy receives the frame, calls write(), and that mailboxStatus() shows the frame correctly written to a mailbox:

FIFO Enabled --> Interrupt Disabled
    FIFO Filters in use: 8
    Remaining Mailboxes: 8
        MB8 code: TX_DATA (Transmitting)(Extended Frame)(ID: 0x8D40020)(Payload: 5A D0 95 CA 5F AF D7 EB)
        MB9 code: TX_INACTIVE
        MB10 code: TX_INACTIVE
        MB11 code: TX_INACTIVE
        MB12 code: TX_INACTIVE
        MB13 code: TX_INACTIVE
        MB14 code: TX_INACTIVE
        MB15 code: TX_INACTIVE

That response frame does transmit if I call write() again with a new frame. In this case, BOTH the unsent frame and the new frame get sent. (The // sendDummy; commented line in my example code was an attempt to demonstrate that this is true.)

My test code, which right now is showing around 3-5 frames affected per 1000 (it expects 8 data bytes, and byte 0 to be 0x5a - it's just a leftover from the protocol we're using in our real system): node.cpp.txt

Possible fix?

I suspect this is a race condition of some kind? I've noticed that commenting out the line below seems to 100% fix the issue in testing, both in the test case above and when testing against our real firmware. If I understand it correctly, this line inactivates the mailbox before writing to it (IMXRT1060RM manual 44.7.7.2). However, everywhere writeTxMailbox() is called, the library code has already checked that the mailbox's status is inactive first, so this line doesn't seem necessary (I don't think the controller can externally change a TX mailbox out of INACTIVE state?).

https://github.com/tonton81/FlexCAN_T4/blob/98654c7844408b536a44c3f3693a9c9a7335b21a/FlexCAN_T4.tpp#L430

FCTP_FUNC void FCTP_OPT::writeTxMailbox(uint8_t mb_num, const CAN_message_t &msg) {
  writeIFLAGBit(mb_num);
  uint32_t code = 0;
  volatile uint32_t *mbxAddr = &(*(volatile uint32_t*)(_bus + 0x80 + (mb_num * 0x10)));
  // mbxAddr[0] = FLEXCAN_MB_CS_CODE(FLEXCAN_MB_CODE_TX_INACTIVE); // removed this line
  mbxAddr[1] = (( msg.flags.extended ) ? ( msg.id & FLEXCAN_MB_ID_EXT_MASK ) : FLEXCAN_MB_ID_IDSTD(msg.id));
  if ( msg.flags.remote ) code |= (1UL << 20);
  if ( msg.flags.extended ) code |= (3UL << 21);
  for ( uint8_t i = 0; i < (8 >> 2); i++ ) mbxAddr[2 + i] = (msg.buf[0 + i * 4] << 24) | (msg.buf[1 + i * 4] << 16) | (msg.buf[2 + i * 4] << 8) | msg.buf[3 + i * 4];
  code |= msg.len << 16;
  mbxAddr[0] = code | FLEXCAN_MB_CS_CODE(FLEXCAN_MB_CODE_TX_ONCE);
}

I don't 100% understand why this works / what causes the original problem, but I suspect something like a race condition, if you inactivate the mailbox before the first bit of the Intermission field and then reactivate it right after that first bit. (The first bit of the Intermission field triggers internal arbitration [44.7.3] and move-out [44.7.6.2]). Maybe with this timing, the controller fails to re-trigger internal arbitration when the mailbox reactivates and the bus goes/remains idle (44.7.3 says BusIdle should trigger arbitration)?

Anyway, apologies for the long description—this has been a complicated issue to narrow down. I would appreciate your thoughts on this and whether this fix in the FlexCAN_T4 library makes sense.

tonton81 commented 2 years ago

if you are writibg too fast, check your QUEUE TX size in constructor, default is 16, now if the transmit boxes are full when your shooting frames and the queues fill, surely you will miss frames. I've been blasting isotp payloads without issues and every frame was accounted for. increase the queue and try again. if you are having transmission issues, make sure A) a node acks the trandmits on the same bus and B) ensure proper termination on the bus

tonton81 commented 2 years ago

also as per the datasheet, you must inactivate the mailbox before writing to it, irregardless of state

Laogeodritt commented 2 years ago

Thanks for the quick response.

if you are writibg too fast, check your QUEUE TX size in constructor

I think my test setup already addresses this possibility—in this test setup, I'm only write()ing one frame at a time, or at most 2 frames (if 1 frame is 'stuck' as described, and I write a 2nd one, both will transmit). The TX mailboxes should not all be filled, let alone the queue. Some of my tests rain mailboxStatus() before and after every read/write and only showed one (or two) TX mailboxes ever filled, rest TX_INACTIVE.

I have verified that, specifically when this problem happens, there is only ONE frame in MB8 and the other mailboxes are empty. If no further write() call occurs and nothing happens on the bus (i.e. node 2 waits for the response from the teensy) then this one frame in MB8 never gets sent.

if you are having transmission issues, make sure A) a node acks the trandmits on the same bus and B) ensure proper termination on the bus

Yes, terminations are present at the far ends of the bus (linear topology), and I believe the controllers are correctly ACKing as I'm not getting ACK errors from the ESR register (via FlexCAN_T4::errors()).

also as per the datasheet, you must inactivate the mailbox before writing to it, irregardless of state

I interpreted this differently—how I read it is that if the MB is active (transmission pending), then you need to either abort or inactivate. (i.e. I read the "if" statement in step 2 here as applying to the entire step). Now that you mention it, I do see how that's ambiguous, though—did you have other indications that the "If the MB is active" statement only applies to the abort part, not the inactivate part?

Nonetheless, this was the one thing I tried that 100% resolved the issue in my testing... so if I'm wrong and this is a coincidence, I seem to be back at step 1 again, unfortunately.

(IMXRT1060 manual rev 3, 07/2021)

tonton81 commented 2 years ago

FLEXCANb_MCR(_bus) |= FLEXCAN_MCR_AEN; // TX ABORT FEATURE

I guess since the AEN bit is set then removing your line should sound about right, have you tried just deactivating the AEN bit since that does use inactive state? I don't really see a point for aborting transmits, when they are never guarenteed to be stopped and still send out possible garbage.

msadie commented 2 years ago

Sounds like you might be dealing with this issue addressed in the errata: ERR005829 https://www.nxp.com/docs/en/errata/IMXRT1060CE_B.pdf Out of curiosity, when you write a second frame and both are transmitted, do they appear on the bus in the reverse order?

Laogeodritt commented 2 years ago

@tonton81 Disabling the AEN bit doesn't seem to change the behaviour. It seems like the erratum msadie found applies to what I'm observing.

@msadie Thanks! re: your curiosity question, the two frames appear on the bus in the order that I write()'d them. I've only spot-checked a few dozen instances while debugging this, so if it was intermittently out-of-sequence in your case, I might not have caught similar cases on my end.

The description fits what I've been experiencing, and I tried adding this to writeTxMailbox() (a bit different from what is described in the errata) to test it out, and it does entirely stop the problem:

  if ( FLEXCAN_get_code(FLEXCANb_MBn_CS(_bus, 15)) == FLEXCAN_MB_CODE_TX_INACTIVE ) {
    volatile uint32_t *mbxAddr2 = &(*(volatile uint32_t*)(_bus + 0x80 + (15 * 0x10)));
    mbxAddr2[0] = FLEXCAN_MB_CS_CODE(FLEXCAN_MB_CODE_TX_INACTIVE);
    mbxAddr2[0] = FLEXCAN_MB_CS_CODE(FLEXCAN_MB_CODE_TX_INACTIVE);
  }
}

Not a "clean" fix obviously, but demonstrates it works.

Here I'm not reserving an entire TX mailbox to this de-blocking role, and I'm using mailbox 15 instead of the first available one, which would be 8 since I'm using a queue. I also tried some variations: writing my frame to MB9 and doing the workaround with MB14 and it still worked, so I'm not convinced of the need to reserve a mailbox for the workaround...

While looking around for any discussions on this erratum, I found the discussion between the both of you on the PJRC forum.

As discussed in that thread, I'm also not happy with the erratum saying you must reserve an entire mailbox to fixing the issue. But from my testing above, it seems like you don't have to, just find an unused mailbox to write TX_INACTIVE to twice?

Transceivers came up in the forum discussion—in my case I'm using MCP2551 for my test setup, and a TI ISO1050DUB in our "real" system (although my sniffer is still using a MCP2551 if I have it connected). @tonton81 did you ever investigate further into the transceivers for msadie's problem?

Right now, I'm personally leaning more towards a silicon problem in the specific conditions mentioned in the erratum that NXP's never addressed in their chips, no transceiver involvement—it's a "during this phase of a CAN frame" timing problem, which sounds more CAN controller related than transceiver (which would involve transition times and propagation delays only, I think? Unless a bad transceiver had out-of-spec prop delays).

To me it seems like there might be opportunity to introduce a workaround in the library so others could avoid encountering this (maybe only in some conditions where this bug could come up, e.g., if we can easily check that only one TX MB is TX_DATA and the rest are TX_INACTIVE... though that's a lot of wasted checks if you have to loop through and check each MB's C/S word). Any thoughts in this regard?

I'm also happy to apply this workaround in my own driver code wrapping around FlexCAN_T4 to resolve my current problems with the bug.

If it's relevant, it seems like the Linux kernel opted to implement the workaround as-written in 2014, and it's still in the FlexCAN driver right now (search for ERR005829, it shows up several times in comments).

tonton81 commented 2 years ago

when they say reserve the mailbox, i think we just push the search counter up by one so it ignores the first tx mailbox found,that would be easier than writing code to circumnvent the issue. if you manually write to the first mailbox well that's on you but yeah that wont be restricted like the tx box scan

tonton81 commented 2 years ago

or you just need to set the absolute first tx mailbox as write once (2x)? if so that can be done programatically

tonton81 commented 1 year ago

Double checked and transmit box is confirmed inactive before writing in all callers, so makes no point sending setting inactive before submitting, even though the reference manual steps say to clear it before writing data to it.... Have you had any more issues since disabling that line?

tonton81 commented 1 year ago

Then again, we are just setting/clearing the register to TX_INACTIVE code... mbxAddr[0] = FLEXCAN_MB_CS_CODE(FLEXCAN_MB_CODE_TX_INACTIVE);

So technically we are clearing everything while setting TX_INACTIVE...

Maybe try clearing ONLY the code field and setting it without touching the other fields of that register? The following should leave all the other bits intact while only changing the code field, can you see if this works in your setup? mbxAddr[0] = (mbxAddr[0] & 0xF0FFFFFF) | FLEXCAN_MB_CS_CODE(FLEXCAN_MB_CODE_TX_INACTIVE);

Laogeodritt commented 1 year ago

Hi @tonton81,

Apologies for the long delay since my last message. I had gotten pulled away for other work and expected to come back to investigating this issue within a week or two, but that turned into several months of other issues taking priority. I've recently returned to this project and this issue.

I implemented a solution similar to the one in the errata that appeared to reduce, but not always eliminate, this behaviour (but avoided "using up" a mailbox permanently to do so). Since the solution wasn't 100%, I had wanted to investigate further before reporting back here.

Double checked and transmit box is confirmed inactive before writing in all callers, so makes no point sending setting inactive before submitting, even though the reference manual steps say to clear it before writing data to it....

Yes, this was one of the things that confused me at first too, when I dove into FlexCAN_T4's code... Given the suggested solution in the errata, I suspect what might be happening is that if the bug has triggered and you write a frame to one TX mailbox, then writing the mbxAddr / the code field to a different mailbox circumvents the bug by forcing the CAN peripheral hardware to re-check the TX mailboxes.

I am away for a few days, but I intend to continue investigating this when I get back in the office and will report back with details on my attempted solution + results of your suggestion.

Regards,

@Laogeodritt

Laogeodritt commented 1 year ago

Hello! Very quick update: no change to the error behaviour in my test setup when making this change in writeTxMailbox:

Maybe try clearing ONLY the code field and setting it without touching the other fields of that register? The following should leave all the other bits intact while only changing the code field, can you see if this works in your setup? mbxAddr[0] = (mbxAddr[0] & 0xF0FFFFFF) | FLEXCAN_MB_CS_CODE(FLEXCAN_MB_CODE_TX_INACTIVE);

junknerd commented 9 months ago

I'm seeing a similar issue of occasional dropped frames and suspect failure to transmit. Any updates would be appreciated.

Teensy4.1, Arduino 1.8.19, FlexCAN_T4 at most recent commit on github, and read most of the PJRC thread on this library.

My project is a gateway between two 500k CAN busses and heavily (70-80%) loaded on an OEM vehicle with standard 11-bit ArbIDs. I believe the hardware is solid as I see zero dropped frames using a polling method with entirely default settings below.

Working code using polling method

#include "FlexCAN_T4.h"  // Local copy from github
FlexCAN_T4<CAN1, RX_SIZE_256, TX_SIZE_16> can1;
FlexCAN_T4<CAN2, RX_SIZE_256, TX_SIZE_16> can2;
FlexCAN_T4<CAN3, RX_SIZE_256, TX_SIZE_16> can3;
CAN_message_t rxmsg;

void setup() {
  can1.begin();
  can1.setBaudRate(500000);
  can2.begin();
  can2.setBaudRate(500000);
  can3.begin();
  can3.setBaudRate(500000);
}

void loop() {
  int i;
  int ret;

  // Loop through each bus, rebroadcast as required
  for (i = 0; i<3; i++){
    switch(i){
      case 0: ret = can1.read(rxmsg); break;  // Motor Side
      case 1: ret = can2.read(rxmsg); break;  // Vehicle Side
      case 2: ret = can3.read(rxmsg); break;  // Control
      default: ret = 0;
    }

    // Message on Bus 1 or 2 - rebroadcast it IN ORDER
    if (ret > 0 && i < 2){

      // Rebroadcast on opposite bus
      switch(i){
        case 0: ret = can2.write(rxmsg); break;  // 1 --> 2
        case 1: ret = can1.write(rxmsg); break;  // 2 --> 1
      }

      // Rebroadcast some to Control laptop
      if ((rxmsg.id == 0x122) || (rxmsg.id == 0x132)){
        can3.write(rxmsg); // 1 or 2 --> 3
      }

    // From Command PC
    } else if (ret > 0 && (i==2)){
        /*... Code to update extra variables ...*/
    }
  }
}

I don't expect this to be viable as the project progresses and more features of the Teensy are utilized. To ensure order, I have set up code to use the FIFO + interrupt method, both with onReceive(...) callbacks and directly using ext_output1(...)with no apparent difference, which leads me to suspect failure to transmit rather than failure to receive. From a high level, these errors seem rare - perhaps 1 dropped frame in 1k or 10k frames, but enough to fault out the system I am working with. With bus loads this high, I would expect the system to be more tolerant, but it's not and I'm stuck finding a solution.

FIFO + Interrupt + Sequential code occasionally dropping frames

#include "FlexCAN_T4.h" // Use latest github copy https://forum.pjrc.com/index.php?threads/forwarding-with-flexcan_t4-and-losing-messages.65698/post-265967

// With FIFO and direct callbacks, I don't think RX_SIZE_ matters (don't care if queue overflows), but TX_SIZE definitely does
// https://forum.pjrc.com/index.php?threads/flexcan_t4-flexcan-for-teensy-4.56035/post-279626
// https://forum.pjrc.com/index.php?threads/forwarding-with-flexcan_t4-and-losing-messages.65698/post-265967
FlexCAN_T4<CAN1, RX_SIZE_256, TX_SIZE_256> Can1;
FlexCAN_T4<CAN2, RX_SIZE_256, TX_SIZE_256> Can2; 

void setup(void) {
  Serial.begin(115200); delay(100);

  Can1.begin();
  Can1.setBaudRate(500000);
  Can1.setMaxMB(32);                // 8+1 minimum for FIFOs // https://forum.pjrc.com/index.php?threads/forwarding-with-flexcan_t4-and-losing-messages.65698/post-265909
  Can1.enableFIFO();                // MUST occur before /any/ other settings, or they will be lost
  Can1.enableFIFOInterrupt();
  Can1.setFIFOFilter(ACCEPT_ALL);   // Ensure we pass everything
  //Can1.onReceive(canRx0);         // Receiving any should trigger interrupt (i.e. don't specify a MB number here, becomes Main Handler)
  Can1.mailboxStatus();

  Can2.begin();
  Can2.setBaudRate(500000);
  Can2.setMaxMB(32);
  Can2.enableFIFO();
  Can2.enableFIFOInterrupt();
  Can2.setFIFOFilter(ACCEPT_ALL);
  //Can2.onReceive(canRx1);         // Receiving any should trigger interrupt (i.e. don't specify a MB number here)
  Can2.mailboxStatus();
}

volatile unsigned rxcnt1 = 0;
volatile unsigned rxcnt2 = 0;

// Override CAN interrupt function
void ext_output1(const CAN_message_t &msg){
  CAN_message_t txframe = msg;
  txframe.seq = 1;              // Force sequential transmit - may ONLY use CanX.write(frame) NOT CanX.write(MBxx, frame)

  //TODO: Update library so checking <>.write() and sub-functions pass up int for troubleshooting

  if (msg.bus == 1){
    Can2.write(txframe);        // Write to any available transmit mailbox, Note, sequential frames MUST use this function ONLY.
                        // NOTE: Checking return value is only marginally useful because struct2queueTx() is typed bool for some reason
    ++rxcnt1;
  } else {
    Can1.write(txframe);        // Write to any available transmit mailbox, Note, sequential frames MUST use this function ONLY.
    ++rxcnt2;
  }

  //TODO Consider modifying library so a negative/zero return here halts further operations (i.e. queuing or distribution) and zero/positive continues normally so these functions could layer filtering/consumption "not for me, try the next one"
}

/*
void canRx0(const CAN_message_t &f){
  CAN_message_t txframe = msg;
  txframe.seq = 1;     // Force sequential transmit - may ONLY use CanX.write(frame) NOT CanX.write(MBxx, frame)
  Can1.write(txframe); // Write to any available transmit mailbox, Note, sequential frames MUST use this function ONLY.
  ++rxcnt0;
}

void canRx1(const CAN_message_t &f){
  CAN_message_t txframe = msg;
  txframe.seq = 1;     // Force sequential transmit - may ONLY use CanX.write(frame) NOT CanX.write(MBxx, frame)
  Can0.write(txframe); // Write to any available transmit mailbox, Note, sequential frames MUST use this function ONLY.
  ++rxcnt1;
}*/

void loop() {
  Serial.printf("Rx1 %10u Rx2: %10u\n",rxcnt1, rxcnt2);
  delay(500);
}

I have not tested [Laogeodritt]'s workaround but plan to shortly. I will also be adding some checks to verify the TX queue is not overflowing, checking <>.write() return values along with changing struct2queueTx() to int, and using setClock() to bump up the oscillator and see if those help. Any other suggestions would be greatly appreciated.

Thank you for the amazing library, including the ISO15765 support, and server which is awesome for training and testing.

tonton81 / FlexCAN_T4

Some frames not transmitted upon write() - race condition between mailbox write and bus state? #55

Test setup

Issue description

Possible fix?