Received CAN-FD frame contains flags and DLC in data

lgrguricmileusnic commented 2 weeks ago

Hello,

I modified the hello_can example to use CAN-FD and encountered a bug where fd_data contains the frame flags and the DLC. I am not 100% sure I am using the CAN-FD driver properly, so please forgive me if the error is on my part. The modified code is available on the canfd_bug branch of my fork.

I expanded the data array to 64 bytes of repeating 0xdeadbeef and changed the bitrate profile to CAN_BITRATE_FD_500K_2M. I also set the FDF flag when making the frame and added an #ifdef to easily define different arbitration IDs for two boards.

I set up two CANPico v2 boards as shown:

Output when printing the received frame with DLC 15 (64 bytes) and only the FDF flag set (when the frame counter is 1):

0x123 deadbeef01deadbe8f000000495c6b00deadbeef01deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbe (7035977)

It seems the first 8 bytes (deadbeef01deadbe) are transmitted normally, but then the flags and the DLC are inserted (in this example 0x8f or 0b10001111), after which comes other data I couldn't identify. I confirmed these are the DLC and the flags by setting the IDE flag and changing the DLC to 14, which changes 8f to 9e (in this example the counter has value 0x3c):

0x123 deadbeef3cdeadbe9e0000009f398102deadbeef3cdeadbeef00deadbeef00deadbeef00deadbeef00deadbeef00dead (42023327)

After the unidentified bytes, the first 8 bytes of the data are repeated, as indicated by the first 0xdeadbeef after which comes the frame counter 01.

Is this a genuine bug or an error on my part? I will try to provide additional insight as I familiarise myself with the driver code and controller documentation.

kentindell commented 2 weeks ago

It's quite possibly a bug! I haven't done full testing on this, just enough to send some data and see it arrive, which worked for me fine.

So to summarize: you're seeing the payload being corrupted by what you suspect is some other metadata?

That could come from lots of places in the pipeline from app->driver->hardware->driver->app. So the first thing we need to do is work out if it's the send side or the received side. Do you have a logic analyzer you can put on the bus and we can see what's actually going across it? (Or maybe a bus analyzer, like the PEAK?)

If you don't have a logic analyzer, then one of the very cheap USB 24MHz 8 channel ones (like $20) will do the job: they're Sigrok-compatible, and you can install the open source Pulseview to see the waveform. Pulseview comes with a protocol decoder for CAN FD, so you can see the payload decoded.

Is the bug deterministic? (i.e. each time you run the test, the payload is always the same). And if you change the payload to something else, is it always the same bytes that get corrupted? I'd like to rule out an interrupt or something corrupting the data.

lgrguricmileusnic commented 2 weeks ago

Thank you for the quick response, I will continue working on this and keep you updated with what I find out.

I think I can rule out the app part, as printing the frame using the same print_frame function after creating the frame with can_make_frame and before sending it to the controller prints an uncorrupted frame (timestamp set to 0):

0x123 deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef (0)

I have a LA104 logic analyser from Miniware, which isn't directly sigrok compatible, but can export recorded logic data as a CSV file. I will try to load the recording into Pulseview.

The behaviour is always the same and happens on both boards (both are running the same example with different arbitration IDs). It also happens on every message, so completely deterministic. For now, my best guess is the issue is in the driver part of the pipeline, where some of the registers in the MCP251863 are either getting overwritten on the sending side or misread on the receiving side, maybe due to the longer payloads supported by CAN-FD?

kentindell commented 2 weeks ago

Yes, it will likely be one of those two. If we can work out which then it will help isolate it.

I can probably find it via printf() rather than firing up a SWD debugger, and I'll write a soak test to find it, but it would be good to know which side to look for the problem. It might be down in the SPI drivers (they are now particularly complex to work around the silicon bug, and it could be due to that complexity - though not the bug tripping, that would be non-deterministic).

lgrguricmileusnic commented 2 weeks ago

Hello,

I had issues with the logic analyser I already have, so I order the cheap one you suggested. I had to wait a few days for the logic analyser to arrive. I connected one of the channels to the Tx pin of the sending CANPico. The issue seems to be on the receiving side, as the frames are sent correctly and aren't corrupted on the bus: canfd_frame1

Displayed are the first few deadbeefXX strings, the first one has the counter which is set to 0x0e and the following ones are unmodified. In the serial output of the receiving CANPico, the corrupted part occurs on the 9th byte (after deadbeefCCdeadbe, where CC is the counter value), but that modification is is not present in the frame on the bus.

Pulseview file: frame.sr.zip

kentindell commented 2 weeks ago

Thanks for this! Will go bug hunting now!

lgrguricmileusnic commented 1 week ago

Hello,

I found the bug. In rx_handler, when reading the rest of an FD frame, addr should be incremented by 20 instead of 5, since each address represents 1 byte, not 1 word:

https://github.com/kentindell/canis-can-sdk/blob/canfd/mcp25xxfd/mcp25xxfd.c#L1150

I changed it to this:

    if (len_words > 2U) {
        // Copy out the rest of the FD payload
        uint32_t tmp[16U];
        read_words(spi_interface, addr + 20U, tmp + 2U, len_words - 2U);
        for (size_t i = 2U; i < len_words; i++) {
            frame.fd_data[i] = mcp25xxfd_convert_bytes(tmp[i]);
        }
    }

However, now I am encountering a different bug, where in every seventh message only the first 2 words of the payload are valid and the rest is zeroed out:

0x234 deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef (7458888)
0x234 deadbeef01deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef (7958992)
0x234 deadbeef02deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef (8459107)
0x234 deadbeef03deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef (8959219)
0x234 deadbeef04deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef (9459331)
0x234 deadbeef05deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef (9959441)
0x234 deadbeef06deadbe0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 (10459552)
0x234 deadbeef07deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef (10959660)
0x234 deadbeef08deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef (11459772)
0x234 deadbeef09deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef (11959882)
0x234 deadbeef0adeadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef (12459995)
0x234 deadbeef0bdeadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef (12960107)
0x234 deadbeef0cdeadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef (13460233)
0x234 deadbeef0ddeadbe0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 (13960346)

I am not sure what causes this, maybe the memory gets zeroed out after a certain number of reads? Or maybe the mistake is somewhere in my logic.

FInally, I managed to get the driver to work properly by reading out the whole frame memory at once, regardless of it being a FD frame or a regular CAN frame:

    uint16_t addr = (uint16_t)read_word_crc(spi_interface, C1FIFOUA1) + 0x400U;

    uint32_t r[19U]; // 16 words for the payload and 3 words for the metadata
    read_words(spi_interface, addr, r, 19U);

    // Extract metadata and create frame...

    size_t len_words = (can_frame_get_data_len(&frame) + 3U) >> 2;

    if (len_words < 2U) {
        len_words = 2U;
    }

    for (size_t i = 0U; i < len_words; i++) {
        frame.fd_data[i] = mcp25xxfd_convert_bytes(r[i + 3U]);
    }

This completely fixes the issue, but is optimised only for 64 byte payloads.

kentindell commented 1 week ago

I was looking at exactly this code yesterday. You're right, the address is a byte address, but the chip requires that the bottom 2 bits are 0 and that accesses to the memory are full 32-bit word reads, so it can never be accessed as a byte address.

I'm still not sure the read is succeeding: it would be better to test it with a rolling pattern, not a repeated pattern. The only different in the code is that there is a single transaction. It's possible that somehow this multiple transaction read is failing - and that would suggest something is wrong with the SPI part of it. And we should not discount the possibility of a hardware bug (there have been too many already). I will take another look at the SPI drivers: they may not be resetting the memory read transaction properly.

lgrguricmileusnic commented 1 week ago

I am not sure it requires the bottom 2 bits are 0. I understood it in the sense that they assume they are 0, and just use the 30 most significant bits for addressing the memory. That way the read still suceeds, just at the last address that was word aligned.

If the start address of the transfer object was 0x400, when we try to read the rest of the FD data which is at 0x405, the read would succeed, but it would read from address 0x404, since it ignores the bottom 2 bits. That's why through the 2nd transaction we were getting the flags and the timestamp (R1 and R2), as they would be at 0x404 and 0x408 and then the first 2 words of the data are repeated, which are at 0x40C and 0x410. You can see the first 2 words of the data are repeated in the output, as each frame contains 2 message counters instead of 1 (and the counter should only come after the first deadbeef).

I apologise, I agree the pattern is not the most clear one, it stuck with me from when I first noticed the bug. I will repeat the test with a rolling one.

Regarding single and multiple transaction reads, in the official driver supplied by Microchip for their boards, the function DRV_CANFDSPI_ReceiveMessageGet always does a single transaction read, ignoring the DLC and allowing the user to specify the number of payload bytes they wish to read. I don't think this approach makes the most sense, but just something to note. On the other hand, this way the read is always a single transaction, regardless of the format of the frame (CAN or CAN-FD).

If the multiple transaction approach doesn't succeed, maybe it would make sense to always do a single transaction read, but to determine the number of words to read through the C1FIFOCON1 register (PLSIZE)? And then to also allow the user to configure PLSIZE through can_setup_controller?

kentindell commented 1 week ago

The low level SPI drivers come from the Pico SDK, so there might be a possibility of of a race condition between the Chip Select GPIO being raised and the SPI transaction not being complete, which would corrupt the state machine. Having implemented an SPI machine in Verilog, I can see this is exactly the kind of thing that could trip up the next transaction if they don't reset everything on a falling edge of Chip Select. So I think there's something a bit weird about the transaction size.

Yes, I like the idea of configuring the number of words to read, so that at least the classic CAN features remain fast. It's still a bit of a hack, though, and I would rather the multiple transaction approach worked. I might have to get the logic analyzer out and look at the low-level SPI timing to see if the Chip Select is being raised too quickly. Or there might be a problem with de-asserting and then re-asserting it too quickly.

kentindell commented 4 days ago

I did some more experiments and I now suspect that the SPI nCS end/start time is too short for the controller to spot the start of a second SPI transaction. So I overhauled the way the SPI interface works, which also has the nice side effect of speeding up the handler.

Now there is a new call to start a transaction and read data, but the SPI transaction is held open (i.e. nCS pin is not raised). Then there is a second call to read additional words and close the SPI transaction. This must be called, even if the number of additional words is zero, to close the SPI transaction (i.e. raise nCS).

I updated Hello World to have a new 64-byte payload, and to put the frame count in the last byte. This seems to work.

To make things a little more flexible, there's now a define for the CAN ID (defaults to 0x123 if nothing defined) and also a CMake build option to just receive without transmitting a CAN frame - this makes it a bit easier to set up a 2-board system.

kentindell / canis-can-sdk

Received CAN-FD frame contains flags and DLC in data #11