Open lgrguricmileusnic opened 2 weeks ago
It's quite possibly a bug! I haven't done full testing on this, just enough to send some data and see it arrive, which worked for me fine.
So to summarize: you're seeing the payload being corrupted by what you suspect is some other metadata?
That could come from lots of places in the pipeline from app->driver->hardware->driver->app. So the first thing we need to do is work out if it's the send side or the receive side. Do you have a logic analyzer you can put on the bus so we can see what's actually going across it? (Or maybe a bus analyzer, like the PEAK?)
If you don't have a logic analyzer, then one of the very cheap USB 24MHz 8 channel ones (like $20) will do the job: they're Sigrok-compatible, and you can install the open source Pulseview to see the waveform. Pulseview comes with a protocol decoder for CAN FD, so you can see the payload decoded.
Is the bug deterministic? (i.e. each time you run the test, the payload is always the same). And if you change the payload to something else, is it always the same bytes that get corrupted? I'd like to rule out an interrupt or something corrupting the data.
Thank you for the quick response, I will continue working on this and keep you updated with what I find out.
I think I can rule out the app part, as printing the frame using the same print_frame function after creating the frame with can_make_frame and before sending it to the controller prints an uncorrupted frame (timestamp set to 0):
0x123 deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef (0)
I have a LA104 logic analyser from Miniware, which isn't directly sigrok compatible, but can export recorded logic data as a CSV file. I will try to load the recording into Pulseview.
The behaviour is always the same and happens on both boards (both are running the same example with different arbitration IDs). It also happens on every message, so completely deterministic. For now, my best guess is the issue is in the driver part of the pipeline, where some of the registers in the MCP251863 are either getting overwritten on the sending side or misread on the receiving side, maybe due to the longer payloads supported by CAN-FD?
Yes, it will likely be one of those two. If we can work out which then it will help isolate it.
I can probably find it via printf() rather than firing up a SWD debugger, and I'll write a soak test to find it, but it would be good to know which side to look for the problem. It might be down in the SPI drivers (they are now particularly complex to work around the silicon bug, and it could be due to that complexity - though not the bug tripping, that would be non-deterministic).
Hello,
I had issues with the logic analyser I already have, so I ordered the cheap one you suggested. I had to wait a few days for the logic analyser to arrive. I connected one of the channels to the Tx pin of the sending CANPico. The issue seems to be on the receiving side, as the frames are sent correctly and aren't corrupted on the bus:
Displayed are the first few deadbeefXX strings: the first one has the counter, which is set to 0x0e, and the following ones are unmodified. In the serial output of the receiving CANPico, the corrupted part occurs on the 9th byte (after deadbeefCCdeadbe, where CC is the counter value), but that modification is not present in the frame on the bus.
Pulseview file: frame.sr.zip
Thanks for this! Will go bug hunting now!
Hello,
I found the bug. In rx_handler, when reading the rest of an FD frame, addr should be incremented by 20 instead of 5, since each address represents 1 byte, not 1 word:
https://github.com/kentindell/canis-can-sdk/blob/canfd/mcp25xxfd/mcp25xxfd.c#L1150
I changed it to this:
if (len_words > 2U) {
    // Copy out the rest of the FD payload
    uint32_t tmp[16U];
    read_words(spi_interface, addr + 20U, tmp + 2U, len_words - 2U);
    for (size_t i = 2U; i < len_words; i++) {
        frame.fd_data[i] = mcp25xxfd_convert_bytes(tmp[i]);
    }
}
However, now I am encountering a different bug, where in every seventh message only the first 2 words of the payload are valid and the rest is zeroed out:
0x234 deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef (7458888)
0x234 deadbeef01deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef (7958992)
0x234 deadbeef02deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef (8459107)
0x234 deadbeef03deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef (8959219)
0x234 deadbeef04deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef (9459331)
0x234 deadbeef05deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef (9959441)
0x234 deadbeef06deadbe0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 (10459552)
0x234 deadbeef07deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef (10959660)
0x234 deadbeef08deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef (11459772)
0x234 deadbeef09deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef (11959882)
0x234 deadbeef0adeadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef (12459995)
0x234 deadbeef0bdeadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef (12960107)
0x234 deadbeef0cdeadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef00deadbeef (13460233)
0x234 deadbeef0ddeadbe0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 (13960346)
I am not sure what causes this, maybe the memory gets zeroed out after a certain number of reads? Or maybe the mistake is somewhere in my logic.
Finally, I managed to get the driver to work properly by reading out the whole frame memory at once, regardless of whether it is an FD frame or a regular CAN frame:
uint16_t addr = (uint16_t)read_word_crc(spi_interface, C1FIFOUA1) + 0x400U;
uint32_t r[19U]; // 16 words for the payload and 3 words for the metadata
read_words(spi_interface, addr, r, 19U);
// Extract metadata and create frame...
size_t len_words = (can_frame_get_data_len(&frame) + 3U) >> 2;
if (len_words < 2U) {
    len_words = 2U;
}
for (size_t i = 0U; i < len_words; i++) {
    frame.fd_data[i] = mcp25xxfd_convert_bytes(r[i + 3U]);
}
This completely fixes the issue, but is optimised only for 64 byte payloads.
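The word count in the snippet above rounds the payload length up to whole 32-bit words, with a floor of two words. A minimal sketch of that arithmetic, combined with the standard CAN FD mapping from DLC to payload length, might look like this. The function names dlc_to_len and payload_words are hypothetical and are not part of the SDK.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Standard CAN FD mapping from the 4-bit DLC to the payload length in bytes. */
static size_t dlc_to_len(uint8_t dlc)
{
    static const uint8_t len[16] = {0, 1, 2, 3, 4, 5, 6, 7,
                                    8, 12, 16, 20, 24, 32, 48, 64};
    assert(dlc < 16U);
    return len[dlc];
}

/* Round a byte length up to 32-bit words, never fewer than 2,
 * mirroring the len_words calculation in the snippet above. */
static size_t payload_words(size_t len_bytes)
{
    size_t words = (len_bytes + 3U) >> 2;
    return words < 2U ? 2U : words;
}
```

For DLC 15 this gives 64 bytes and 16 words, which together with the 3 metadata words accounts for the 19-word read.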
I was looking at exactly this code yesterday. You're right, the address is a byte address, but the chip requires that the bottom 2 bits are 0 and that accesses to the memory are full 32-bit word reads, so it can never be accessed as a byte address.
I'm still not sure the read is succeeding: it would be better to test it with a rolling pattern, not a repeated pattern. The only difference in the code is that there is a single transaction. It's possible that somehow this multiple transaction read is failing - and that would suggest something is wrong with the SPI part of it. And we should not discount the possibility of a hardware bug (there have been too many already). I will take another look at the SPI drivers: they may not be resetting the memory read transaction properly.
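As an illustration of the rolling pattern suggested here, the sketch below fills a payload so that every byte is different from its neighbours, which makes a shifted, repeated, or zeroed region stand out immediately. The helper names fill_rolling and first_mismatch are hypothetical and only show the idea.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fill the payload so that byte i holds (seed + i) modulo 256. */
static void fill_rolling(uint8_t *buf, size_t len, uint8_t seed)
{
    assert(buf != NULL);
    for (size_t i = 0U; i < len; i++) {
        buf[i] = (uint8_t)(seed + i);
    }
}

/* Return the index of the first byte that breaks the pattern,
 * or len if the whole buffer matches. */
static size_t first_mismatch(const uint8_t *buf, size_t len, uint8_t seed)
{
    assert(buf != NULL);
    for (size_t i = 0U; i < len; i++) {
        if (buf[i] != (uint8_t)(seed + i)) {
            return i;
        }
    }
    return len;
}
```

On the receiving board, first_mismatch would report the exact byte offset at which a received payload diverges, assuming the sender's seed is known (for example, carried in the frame counter).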
I am not sure it requires the bottom 2 bits to be 0. I understood it in the sense that they are assumed to be 0, and only the 30 most significant bits are used for addressing the memory. That way the read still succeeds, just at the last address that was word aligned.
If the start address of the transfer object was 0x400, when we try to read the rest of the FD data which is at 0x405, the read would succeed, but it would read from address 0x404, since it ignores the bottom 2 bits. That's why through the 2nd transaction we were getting the flags and the timestamp (R1 and R2), as they would be at 0x404 and 0x408 and then the first 2 words of the data are repeated, which are at 0x40C and 0x410. You can see the first 2 words of the data are repeated in the output, as each frame contains 2 message counters instead of 1 (and the counter should only come after the first deadbeef).
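The explanation above can be checked with a small model. The sketch below is not driver code: it models a 19-word receive object starting at byte address 0x400 and assumes, as the comment above argues, that the controller ignores the bottom 2 address bits. model_read_words is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RAM_BASE  0x400U  /* byte address of the first word in the model */
#define RAM_WORDS 19U     /* R0, R1, R2 followed by 16 payload words */

/* Hypothetical model of a word read in which the bottom 2 address bits
 * are ignored (an assumption taken from the discussion, not verified
 * against the datasheet). */
static void model_read_words(const uint32_t *ram, uint16_t addr,
                             uint32_t *dst, size_t n)
{
    assert(addr >= RAM_BASE);
    size_t first = ((size_t)(addr & ~0x3U) - RAM_BASE) / 4U;
    assert(first + n <= RAM_WORDS);
    for (size_t i = 0U; i < n; i++) {
        dst[i] = ram[first + i];
    }
}
```

In this model a read at RAM_BASE + 5 starts at word 1 (R1, the flags and DLC), while a read at RAM_BASE + 20 starts at word 5 (the third payload word), which matches both the corrupted output and the fix.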
I apologise, I agree the pattern is not the most clear one, it stuck with me from when I first noticed the bug. I will repeat the test with a rolling one.
Regarding single and multiple transaction reads, in the official driver supplied by Microchip for their boards, the function DRV_CANFDSPI_ReceiveMessageGet always does a single transaction read, ignoring the DLC and allowing the user to specify the number of payload bytes they wish to read. I don't think this approach makes the most sense, but just something to note. On the other hand, this way the read is always a single transaction, regardless of the format of the frame (CAN or CAN-FD).
If the multiple transaction approach doesn't succeed, maybe it would make sense to always do a single transaction read, but to determine the number of words to read through the C1FIFOCON1 register (PLSIZE)? And then to also allow the user to configure PLSIZE through can_setup_controller?
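A rough sketch of how the read size could be derived from PLSIZE follows. It assumes that PLSIZE occupies bits 31:29 of the FIFO control register and encodes payload sizes of 8, 12, 16, 20, 24, 32, 48 and 64 bytes; both should be checked against the controller datasheet. It also assumes 3 metadata words per receive object, as in the 19-word read earlier in the thread. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed field position: PLSIZE in bits 31:29 of the FIFO control register. */
#define PLSIZE_SHIFT 29U
#define PLSIZE_MASK  0x7U

/* Payload bytes for a 3-bit PLSIZE code (assumed encoding). */
static size_t plsize_code_to_bytes(uint32_t code)
{
    static const uint8_t bytes[8] = {8, 12, 16, 20, 24, 32, 48, 64};
    assert(code <= PLSIZE_MASK);
    return bytes[code];
}

/* Words for one single-transaction read of a receive object:
 * 3 metadata words (ID, flags, timestamp) plus the configured payload. */
static size_t rx_object_words(uint32_t fifocon)
{
    uint32_t code = (fifocon >> PLSIZE_SHIFT) & PLSIZE_MASK;
    return 3U + plsize_code_to_bytes(code) / 4U;
}
```

With the smallest payload size the read shrinks to 5 words, so classic CAN traffic would not pay for a 64-byte transfer on every frame.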
The low level SPI drivers come from the Pico SDK, so there might be a possibility of a race condition between the Chip Select GPIO being raised and the SPI transaction not being complete, which would corrupt the state machine. Having implemented an SPI machine in Verilog, I can see this is exactly the kind of thing that could trip up the next transaction if they don't reset everything on a falling edge of Chip Select. So I think there's something a bit weird about the transaction size.
Yes, I like the idea of configuring the number of words to read, so that at least the classic CAN features remain fast. It's still a bit of a hack, though, and I would rather the multiple transaction approach worked. I might have to get the logic analyzer out and look at the low-level SPI timing to see if the Chip Select is being raised too quickly. Or there might be a problem with de-asserting and then re-asserting it too quickly.
I did some more experiments and I now suspect that the SPI nCS end/start time is too short for the controller to spot the start of a second SPI transaction. So I overhauled the way the SPI interface works, which also has the nice side effect of speeding up the handler.
Now there is a new call to start a transaction and read data, but the SPI transaction is held open (i.e. nCS pin is not raised). Then there is a second call to read additional words and close the SPI transaction. This must be called, even if the number of additional words is zero, to close the SPI transaction (i.e. raise nCS).
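The shape of the two calls described above can be sketched against a mock interface. This is only an illustration of the calling convention: mock_spi_t, read_words_begin and read_words_end are hypothetical names, and the real driver functions may look different.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the SPI interface: a word source plus nCS state. */
typedef struct {
    const uint32_t *words;  /* simulated controller memory */
    size_t pos;             /* next word to clock out */
    bool selected;          /* true while nCS is held low */
    unsigned transactions;  /* number of nCS falling edges seen */
} mock_spi_t;

/* Start a transaction, read n words, and leave nCS asserted. */
static void read_words_begin(mock_spi_t *spi, size_t start, uint32_t *dst, size_t n)
{
    assert(!spi->selected);
    spi->selected = true;
    spi->transactions++;
    spi->pos = start;
    for (size_t i = 0U; i < n; i++) {
        dst[i] = spi->words[spi->pos++];
    }
}

/* Read n more words (n may be zero) and always release nCS. */
static void read_words_end(mock_spi_t *spi, uint32_t *dst, size_t n)
{
    assert(spi->selected);
    for (size_t i = 0U; i < n; i++) {
        dst[i] = spi->words[spi->pos++];
    }
    spi->selected = false;
}
```

A handler would call read_words_begin for the metadata and the first two data words, decode the length, then call read_words_end for the remainder, passing zero words for a short frame so that nCS is still released.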
I updated Hello World to have a new 64-byte payload, and to put the frame count in the last byte. This seems to work.
To make things a little more flexible, there's now a define for the CAN ID (defaults to 0x123 if nothing defined) and also a CMake build option to just receive without transmitting a CAN frame - this makes it a bit easier to set up a 2-board system.
Hello,
I modified the hello_can example to use CAN-FD and encountered a bug where fd_data contains the frame flags and the DLC. I am not 100% sure I am using the CAN-FD driver properly, so please forgive me if the error is on my part. The modified code is available on the canfd_bug branch of my fork.
I expanded the data array to 64 bytes of repeating 0xdeadbeef and changed the bitrate profile to CAN_BITRATE_FD_500K_2M. I also set the FDF flag when making the frame and added an #ifdef to easily define different arbitration IDs for two boards.
I set up two CANPico v2 boards as shown:
Output when printing the received frame with DLC 15 (64 bytes) and only the FDF flag set (when the frame counter is 1):
It seems the first 8 bytes (deadbeef01deadbe) are transmitted normally, but then the flags and the DLC are inserted (in this example 0x8f or 0b10001111), after which comes other data I couldn't identify. I confirmed these are the DLC and the flags by setting the IDE flag and changing the DLC to 14, which changes 8f to 9e (in this example the counter has value 0x3c):
After the unidentified bytes, the first 8 bytes of the data are repeated, as indicated by the first 0xdeadbeef after which comes the frame counter 01.
Is this a genuine bug or an error on my part? I will try to provide additional insight as I familiarise myself with the driver code and controller documentation.
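The two observed values can be decoded with a short sketch. The bit positions used here are inferred only from the values reported above (0x8f with FDF set and DLC 15, 0x9e with FDF and IDE set and DLC 14) and should be confirmed against the controller documentation. rx_flags_t and decode_flags_byte are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint8_t dlc;  /* bits 3:0 */
    bool ide;     /* bit 4, inferred from 0x8f changing to 0x9e when IDE was set */
    bool fdf;     /* bit 7, set in both observed values */
} rx_flags_t;

/* Split the observed byte into the DLC and the two flags identified above. */
static rx_flags_t decode_flags_byte(uint32_t byte)
{
    assert(byte <= 0xFFU);
    rx_flags_t f;
    f.dlc = (uint8_t)(byte & 0x0FU);
    f.ide = (byte & 0x10U) != 0U;
    f.fdf = (byte & 0x80U) != 0U;
    return f;
}
```

Decoding 0x8f this way yields DLC 15 with only FDF set, and 0x9e yields DLC 14 with both IDE and FDF set, consistent with the experiment described.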