raspberrypi / linux

Kernel source tree for Raspberry Pi-provided kernel builds. Issues unrelated to the linux kernel should be posted on the community forum at https://forums.raspberrypi.com/

mcp251xfd spi0.0 can0: Transmit Event FIFO buffer not empty 2-CH CAN FD HAT Rev2.1 #5083

Open DavidBoJ opened 2 years ago

DavidBoJ commented 2 years ago

Describe the bug

In my application, /var/log/syslog fills up with:

mcp251xfd spi0.0 can0: Transmit Event FIFO buffer not empty

After a while the disk is full and the system can crash. Is there any way I can disable the logging from the CAN bus?

I have two slaves on my network to which it is difficult to establish a connection. With two other slaves it seems to work (but I didn't check the logs). I have controlled the two slaves with a Pican2 in the past, but not with Bullseye. The physical network is 1.2 m long. Any suggestions, @marckleinebudde https://github.com/marckleinebudde ?

Steps to reproduce the behaviour

It is difficult to reproduce the exact same result every time, but the setup is: two slaves (Nanotec motor drivers [CL4-E-2-12-5VDI] with node IDs 1 and 2) and an application that initializes them with an SDO. The configuration of the HAT is shown under the system description.

Device (s)

Raspberry Pi 4 Mod. B

System

2-CH CAN FD HAT Rev2.1

ip -d link show dev can0
4: can0: <NOARP,UP,LOWER_UP,ECHO> mtu 16 qdisc pfifo_fast state UP mode DEFAULT group default qlen 10
    link/can promiscuity 0 minmtu 0 maxmtu 0
    can state ERROR-ACTIVE (berr-counter tx 0 rx 0) restart-ms 0
    bitrate 250000 sample-point 0.875
    tq 25 prop-seg 69 phase-seg1 70 phase-seg2 20 sjw 1
    mcp251xfd: tseg1 2..256 tseg2 1..128 sjw 1..128 brp 1..256 brp-inc 1
    mcp251xfd: dtseg1 1..32 dtseg2 1..16 dsjw 1..16 dbrp 1..256 dbrp-inc 1
    clock 40000000 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
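The bit timing reported above can be cross-checked with a little arithmetic. This is only a sketch: the variable names are made up for illustration, and the inputs are the values printed by ip -d link show (tq in nanoseconds, clock in Hz).

```python
# Values copied from the `ip -d link show dev can0` output above.
clock_hz = 40_000_000
tq_ns = 25
prop_seg = 69
phase_seg1 = 70
phase_seg2 = 20
sync_seg = 1  # the synchronization segment is always one time quantum

# One bit consists of sync + propagation + phase1 + phase2 time quanta.
tq_per_bit = sync_seg + prop_seg + phase_seg1 + phase_seg2
bit_time_ns = tq_per_bit * tq_ns
bitrate = 1_000_000_000 // bit_time_ns

# The sample point sits at the end of phase segment 1.
sample_point = (sync_seg + prop_seg + phase_seg1) / tq_per_bit

# Bit-rate prescaler: how many oscillator cycles make up one time quantum.
brp = tq_ns * clock_hz // 1_000_000_000

print(tq_per_bit, bitrate, sample_point, brp)
```

The result (160 time quanta per bit, 250000 bit/s, sample point 0.875, prescaler 1) agrees with what the interface reports, so the configured timing at least looks self-consistent.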

uname -a
Linux cilix-19 5.15.32-v7l+ #1538 SMP Thu Mar 31 19:39:41 BST 2022 armv7l GNU/Linux

On a Raspberry Pi 4.

pi@cilix-19:~ $ cat /etc/rpi-issue
Raspberry Pi reference 2022-04-04
Generated using pi-gen, https://github.com/RPi-Distro/pi-gen, 226b479f8d32919c9fe36dd5b4c20c02682f8180, stage2

pi@cilix-19:~ $ vcgencmd version
Mar 24 2022 13:19:26
Copyright (c) 2012 Broadcom
version e5a963efa66a1974127860b42e913d2374139ff5 (clean) (release) (start)

Logs

No response

Additional context

No response

marckleinebudde commented 2 years ago

How often do you see this event?

mcp251xfd spi0.0 can0: Transmit Event FIFO buffer not empty

The driver throws this message if the chip doesn't behave as the driver expects. It's unclear if this is a bug in the driver or in the chip. It doesn't happen that often (during my testing), the driver recovers, and I haven't had time to debug this issue. Can you describe your use case? Maybe it sheds some light on the problem.

You can make the driver silent by changing the netdev_info() to a netdev_dbg():

--- a/drivers/net/can/spi/mcp251xfd/mcp251xfd-tef.c
+++ b/drivers/net/can/spi/mcp251xfd/mcp251xfd-tef.c
@@ -72,7 +72,7 @@ mcp251xfd_handle_tefif_recover(const struct mcp251xfd_priv *priv, const u32 seq)
                return -ENOBUFS;
        }

-       netdev_info(priv->ndev,
+       netdev_dbg(priv->ndev,
                    "Transmit Event FIFO buffer %s. (seq=0x%08x, tef_tail=0x%08x, tef_head=0x%08x, tx_head=0x%08x).\n",
                    tef_sta & MCP251XFD_REG_TEFSTA_TEFFIF ?
                    "full" : tef_sta & MCP251XFD_REG_TEFSTA_TEFNEIF ?
DavidBoJ commented 2 years ago

I have problems as soon as I connect two slaves (CANopen devices that do not support FD) with node IDs 1 and 2 to my network. I think it works with only one slave. I also had two other slaves which worked reasonably stably (however, I did not do any long-term tests). In other words, a specially crafted network seems to cause the error. I do not exclude that one of the slaves is faulty. My (CODESYS) application tries to initialize the slaves by sending/receiving a SDO and when that fails it tries again and again and it very rarely gets over the initialization. Maybe CODESYS does not access the driver properly or maybe somehow bypasses it? It is the first time I use Bullseye and this 2-CH CAN FD HAT. I have two identical CAN FD HATs, and I have problems with both. If they are defective, then either a production batch error has caused it, or the faulty network has damaged the chip or corrupted the driver. Tomorrow I will set up a Python test between ch0 and ch1 to see whether the driver is still valid, and I will try to switch it on and off several times. By the way, I didn't follow the Waveshare instruction to install the bcm2835 library, since the bcm2835 library is already part of Bullseye.
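For the Python test between the two channels, a raw SocketCAN socket from the standard library may be enough. The following is only a sketch under several assumptions: the interface names can0 and can1, the CAN ID 0x123, and the helper names are all illustrative; it needs Linux with both interfaces up and wired to each other on a terminated bus; and it has not been run on this HAT.

```python
import socket
import struct

# Layout of a classic struct can_frame:
# 32-bit CAN ID, 8-bit data length, 3 padding bytes, 8 data bytes.
CAN_FRAME_FMT = "=IB3x8s"
CAN_FRAME_SIZE = struct.calcsize(CAN_FRAME_FMT)


def build_frame(can_id: int, data: bytes) -> bytes:
    """Pack a classic CAN frame (up to 8 data bytes)."""
    if len(data) > 8:
        raise ValueError("classic CAN frames carry at most 8 data bytes")
    return struct.pack(CAN_FRAME_FMT, can_id, len(data), data.ljust(8, b"\x00"))


def parse_frame(frame: bytes):
    """Unpack a classic CAN frame into (can_id, payload)."""
    can_id, length, data = struct.unpack(CAN_FRAME_FMT, frame)
    return can_id, data[:length]


def loopback_once(tx_if="can0", rx_if="can1", can_id=0x123,
                  payload=b"\x01\x02\x03", timeout=1.0) -> bool:
    """Send one frame on tx_if and check that rx_if receives it unchanged.

    Raises a timeout error if nothing arrives within `timeout` seconds.
    """
    with socket.socket(socket.AF_CAN, socket.SOCK_RAW, socket.CAN_RAW) as tx, \
         socket.socket(socket.AF_CAN, socket.SOCK_RAW, socket.CAN_RAW) as rx:
        tx.bind((tx_if,))
        rx.bind((rx_if,))
        rx.settimeout(timeout)
        tx.send(build_frame(can_id, payload))
        return parse_frame(rx.recv(CAN_FRAME_SIZE)) == (can_id, payload)
```

Calling loopback_once() in a loop while switching the motor drivers on and off would show whether frames get lost between the two channels without involving CODESYS at all.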

marckleinebudde commented 2 years ago

The bcm2835 library is not needed by the kernel driver for the mcp251xfd.

How often do you get the mcp251xfd spi0.0 can0: Transmit Event FIFO buffer not empty events?

My (CODESYS) application tries to initialize the slaves by sending/receiving a SDO and when that fails it tries again and again and it very rarely gets over the initialization.

What exactly happens when the init fails? Is there a timeout? Does the bus go into bus-off? Can you send me the output of candump -l any,0~0,#FFFFFFFF when the application fails?

Maybe CODESYS does not access the driver properly or maybe somehow bypasses it?

Do you know if CODESYS uses the regular can0 network interface?

DavidBoJ commented 2 years ago

The first thing I did was to simplify my CODESYS application so that only SDO initialization takes place, and PDO rx/tx is only possible after the initialization. All other code was removed. Next, I stopped the application before the flash got full. A closer look at /var/log/syslog gave the following:

Jul 6 13:29:13 cilix-19 kernel: [ 465.249474] Disabling IRQ #82
Jul 6 13:30:57 cilix-19 kernel: [ 569.069795] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:30:57 cilix-19 kernel: [ 569.070135] mcp251xfd spi0.0 can0: CRC write command format error.
Jul 6 13:30:57 cilix-19 kernel: [ 569.179548] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:30:57 cilix-19 kernel: [ 569.289559] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:30:57 cilix-19 kernel: [ 569.399558] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:30:57 cilix-19 kernel: [ 569.509334] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:30:57 cilix-19 kernel: [ 569.619364] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:30:57 cilix-19 kernel: [ 569.729539] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:30:57 cilix-19 kernel: [ 569.839605] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:30:58 cilix-19 kernel: [ 569.949563] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:30:58 cilix-19 kernel: [ 570.059586] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:30:58 cilix-19 kernel: [ 570.169766] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:30:58 cilix-19 kernel: [ 570.279505] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:30:58 cilix-19 kernel: [ 570.389495] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:30:58 cilix-19 kernel: [ 570.499618] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:30:58 cilix-19 kernel: [ 570.609629] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:30:58 cilix-19 kernel: [ 570.719642] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:30:58 cilix-19 kernel: [ 570.829666] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:30:59 cilix-19 kernel: [ 570.939639] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:30:59 cilix-19 kernel: [ 571.049636] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:30:59 cilix-19 kernel: [ 571.159558] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:30:59 cilix-19 kernel: [ 571.269237] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:30:59 cilix-19 kernel: [ 571.379625] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:30:59 cilix-19 kernel: [ 571.489790] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:30:59 cilix-19 kernel: [ 571.599422] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:30:59 cilix-19 kernel: [ 571.719633] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:30:59 cilix-19 kernel: [ 571.829616] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:31:00 cilix-19 kernel: [ 571.939431] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:31:00 cilix-19 kernel: [ 572.049609] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:31:00 cilix-19 kernel: [ 572.159617] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:31:00 cilix-19 kernel: [ 572.269666] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:31:00 cilix-19 kernel: [ 572.379428] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:31:00 cilix-19 kernel: [ 572.489630] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:31:00 cilix-19 kernel: [ 572.599608] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:31:02 cilix-19 kernel: [ 573.929815] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:31:02 cilix-19 kernel: [ 574.039658] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:31:02 cilix-19 kernel: [ 574.149618] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:31:02 cilix-19 kernel: [ 574.259624] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:31:03 cilix-19 kernel: [ 575.479874] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:31:03 cilix-19 kernel: [ 575.589693] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:31:03 cilix-19 kernel: [ 575.699641] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:31:03 cilix-19 kernel: [ 575.809891] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:31:05 cilix-19 kernel: [ 577.238413] mcp251xfd spi0.0 can0: CRC write command format error.
Jul 6 13:31:06 cilix-19 kernel: [ 578.139924] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:31:06 cilix-19 kernel: [ 578.249926] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:31:06 cilix-19 kernel: [ 578.468439] mcp251xfd spi0.0 can0: Transmit Event FIFO buffer not empty. (seq=0x00000134, tef_tail=0x0000013c, tef_head=0x0000013d, tx_head=0x0000013d).
Jul 6 13:31:06 cilix-19 kernel: [ 578.468677] mcp251xfd spi0.0 can0: CRC write command format error.
Jul 6 13:31:06 cilix-19 kernel: [ 578.469061] mcp251xfd spi0.0 can0: Transmit Event FIFO buffer not empty. (seq=0x00000134, tef_tail=0x0000013c, tef_head=0x0000013d, tx_head=0x0000013d).
Jul 6 13:31:06 cilix-19 kernel: [ 578.469330] mcp251xfd spi0.0 can0: Transmit Event FIFO buffer not empty. (seq=0x00000134, tef_tail=0x0000013c, tef_head=0x0000013d, tx_head=0x0000013d).
Jul 6 13:31:06 cilix-19 kernel: [ 578.469608] mcp251xfd spi0.0 can0: Transmit Event FIFO buffer not empty. (seq=0x00000134, tef_tail=0x0000013c, tef_head=0x0000013d, tx_head=0x0000013d).
Jul 6 13:31:06 cilix-19 kernel: [ 578.469876] mcp251xfd spi0.0 can0: Transmit Event FIFO buffer not empty. (seq=0x00000134, tef_tail=0x0000013c, tef_head=0x0000013d, tx_head=0x0000013d).
Jul 6 13:31:06 cilix-19 kernel: [ 578.470142] mcp251xfd spi0.0 can0: Transmit Event FIFO buffer not empty. (seq=0x00000134, tef_tail=0x0000013c, tef_head=0x0000013d, tx_head=0x0000013d).

From here on it goes fast and the flash fills up. It seems that the first "RX-0: FIFO overflow." message (Jul 6 13:30:57, [ 569.069795]) creates a domino effect.

When I do a warm reset in CODESYS (it clears all variables and stops the application) I get:

Message from syslogd@cilix-19 at Jul 6 13:21:54 ...
kernel:[ 25.987276] Disabling IRQ #82
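To put numbers on how often each message occurs, the log can be summarized with a short script. This is a rough sketch: the regular expression is an assumption based only on the line format shown above, the function name is made up, and it only matches messages from a named interface such as spi0.0 can0.

```python
import re
from collections import Counter

# Matches e.g. "... kernel: [ 569.069795] mcp251xfd spi0.0 can0: RX-0: FIFO overflow."
LINE_RE = re.compile(
    r"kernel: \[\s*(?P<ts>\d+\.\d+)\] mcp251xfd \S+ \S+: (?P<msg>.*)"
)


def summarize(lines):
    """Count mcp251xfd messages and remember when each kind first appeared."""
    counts = Counter()
    first_seen = {}
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue
        # Drop the variable part in parentheses so repeats group together.
        key = match.group("msg").split(" (")[0].rstrip(".")
        counts[key] += 1
        first_seen.setdefault(key, float(match.group("ts")))
    return counts, first_seen


# Three lines copied from the excerpt above.
excerpt = [
    "Jul 6 13:30:57 cilix-19 kernel: [ 569.069795] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.",
    "Jul 6 13:30:57 cilix-19 kernel: [ 569.070135] mcp251xfd spi0.0 can0: CRC write command format error.",
    "Jul 6 13:31:06 cilix-19 kernel: [ 578.468439] mcp251xfd spi0.0 can0: Transmit Event FIFO buffer not empty. "
    "(seq=0x00000134, tef_tail=0x0000013c, tef_head=0x0000013d, tx_head=0x0000013d).",
]
counts, first_seen = summarize(excerpt)
for message, count in counts.most_common():
    print(f"{count:6d}  first at {first_seen[message]:.6f}  {message}")
```

Feeding it the lines of /var/log/syslog instead of the excerpt gives a count per message type and the kernel timestamp at which each one first showed up.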

Is that acceptable? I have.

cat /proc/interrupts
CPU0 CPU1 CPU2 CPU3
26: 0 0 0 0 GICv2 29 Level arch_timer
27: 201651 123206 44138 7212 GICv2 30 Level arch_timer
30: 0 0 0 0 GICv2 107 Level fe004000.txp
31: 440 0 0 0 GICv2 65 Level fe00b880.mailbox
34: 6770 0 0 0 GICv2 153 Level uart-pl011
35: 0 0 0 0 GICv2 150 Level fe204000.spi
36: 0 0 0 0 GICv2 125 Level fe215080.spi
37: 0 0 0 0 GICv2 129 Level vc4 hvs
40: 342 0 0 0 GICv2 114 Level DMA IRQ
42: 8 0 0 0 GICv2 116 Level DMA IRQ
43: 0 0 0 0 GICv2 117 Level DMA IRQ
44: 0 0 0 0 GICv2 118 Level DMA IRQ
45: 0 0 0 0 GICv2 119 Level DMA IRQ
47: 0 0 0 0 GICv2 141 Level vc4 crtc
48: 0 0 0 0 GICv2 142 Level vc4 crtc, vc4 crtc
49: 0 0 0 0 GICv2 133 Level vc4 crtc
50: 0 0 0 0 GICv2 138 Level vc4 crtc
51: 0 0 0 0 interrupt-controller@7ef00100 0 Edge vc4 hdmi cec tx
52: 0 0 0 0 interrupt-controller@7ef00100 1 Edge vc4 hdmi cec rx
55: 0 0 0 0 interrupt-controller@7ef00100 4 Edge vc4 hdmi hpd connected
56: 0 0 0 0 interrupt-controller@7ef00100 5 Edge vc4 hdmi hpd disconnected
57: 0 0 0 0 interrupt-controller@7ef00100 8 Edge vc4 hdmi cec tx
58: 0 0 0 0 interrupt-controller@7ef00100 7 Edge vc4 hdmi cec rx
61: 0 0 0 0 interrupt-controller@7ef00100 10 Edge vc4 hdmi hpd connected
62: 0 0 0 0 interrupt-controller@7ef00100 11 Edge vc4 hdmi hpd disconnected
63: 73 0 0 0 GICv2 66 Level VCHIQ doorbell
64: 11201 0 0 0 GICv2 158 Level mmc1, mmc0
65: 0 0 0 0 GICv2 48 Level arm-pmu
66: 0 0 0 0 GICv2 49 Level arm-pmu
67: 0 0 0 0 GICv2 50 Level arm-pmu
68: 0 0 0 0 GICv2 51 Level arm-pmu
71: 843 0 0 0 GICv2 189 Level eth0
72: 31 0 0 0 GICv2 190 Level eth0
78: 0 0 0 0 GICv2 106 Level v3d
79: 0 0 0 0 GICv2 175 Level PCIe PME
80: 38 0 0 0 BRCM STB PCIe MSI 524288 Edge xhci_hcd
82: 100001 0 0 0 pinctrl-bcm2835 25 Level spi0.0
IPI0: 0 0 0 0 CPU wakeup interrupts
IPI1: 0 0 0 0 Timer broadcast interrupts
IPI2: 174 158 197 164 Rescheduling interrupts
IPI3: 3947 122297 217123 215392 Function call interrupts
IPI4: 0 0 0 0 CPU stop interrupts
IPI5: 726 135 186 132 IRQ work interrupts
IPI6: 0 0 0 0 completion interrupts
Err: 0

You requested the result of "candump -l any,0~0,#FFFFFFFF" here it is: less candump-2022-07-06_163011.log

(1657121412.229819) can0 20000004#0008000000007F00
(1657121414.649739) can0 20000004#0040000000005F00
(1657121417.292012) can0 20000004#0001000000000000
(1657121417.401347) can0 20000004#0001000000000000
(1657121418.722186) can0 20000004#0001000000000000
(1657121418.831834) can0 20000004#0001000000000000
(1657121418.941893) can0 20000004#0001000000000000
(1657121419.051251) can0 20000004#0001000000000000
(1657121419.161551) can0 20000004#0001000000000000
(1657121419.271232) can0 20000004#0001000000000000
(1657121419.381527) can0 20000004#0001000000000000
(1657121419.491764) can0 20000004#0001000000000000
(1657121419.601958) can0 20000004#0001000000000000
(1657121419.711429) can0 20000004#0001000000000000

I still think that a specially crafted network creates the fault, and I do not rule out that my Nanotec motor driver is faulty.

marckleinebudde commented 2 years ago
Jul 6 13:29:13 cilix-19 kernel: [ 465.249474] Disabling IRQ #82
Jul 6 13:30:57 cilix-19 kernel: [ 569.069795] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:30:57 cilix-19 kernel: [ 569.070135] mcp251xfd spi0.0 can0: CRC write command format error.
Jul 6 13:30:57 cilix-19 kernel: [ 569.179548] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.

[...]

Jul 6 13:31:03 cilix-19 kernel: [ 575.809891] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:31:05 cilix-19 kernel: [ 577.238413] mcp251xfd spi0.0 can0: CRC write command format error.
Jul 6 13:31:06 cilix-19 kernel: [ 578.139924] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:31:06 cilix-19 kernel: [ 578.249926] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 6 13:31:06 cilix-19 kernel: [ 578.468439] mcp251xfd spi0.0 can0: Transmit Event FIFO buffer not empty. (seq=0x00000134, tef_tail=0x0000013c, tef_head=0x0000013d, tx_head=0x0000013d).
Jul 6 13:31:06 cilix-19 kernel: [ 578.468677] mcp251xfd spi0.0 can0: CRC write command format error.
Jul 6 13:31:06 cilix-19 kernel: [ 578.469061] mcp251xfd spi0.0 can0: Transmit Event FIFO buffer not empty. (seq=0x00000134, tef_tail=0x0000013c, tef_head=0x0000013d, tx_head=0x0000013d).
Jul 6 13:31:06 cilix-19 kernel: [ 578.469330] mcp251xfd spi0.0 can0: Transmit Event FIFO buffer not empty. (seq=0x00000134, tef_tail=0x0000013c, tef_head=0x0000013d, tx_head=0x0000013d).
Jul 6 13:31:06 cilix-19 kernel: [ 578.469608] mcp251xfd spi0.0 can0: Transmit Event FIFO buffer not empty. (seq=0x00000134, tef_tail=0x0000013c, tef_head=0x0000013d, tx_head=0x0000013d).
Jul 6 13:31:06 cilix-19 kernel: [ 578.469876] mcp251xfd spi0.0 can0: Transmit Event FIFO buffer not empty. (seq=0x00000134, tef_tail=0x0000013c, tef_head=0x0000013d, tx_head=0x0000013d).
Jul 6 13:31:06 cilix-19 kernel: [ 578.470142] mcp251xfd spi0.0 can0: Transmit Event FIFO buffer not empty. (seq=0x00000134, tef_tail=0x0000013c, tef_head=0x0000013d, tx_head=0x0000013d).

From here on it goes fast and the flash fills up. It seems that the first "RX-0: FIFO overflow." message (Jul 6 13:30:57, [ 569.069795]) creates a domino effect.

Yes, that's at least the first message from the driver itself. But the message directly before this is more important:

Jul 6 13:29:13 cilix-19 kernel: [ 465.249474] Disabling IRQ #82

From /proc/interrupts we see that IRQ 82 is...

82: 100001 0 0 0 pinctrl-bcm2835 25 Level spi0.0

...the interrupt line between the MCP2518FD chip and the raspi. That's not good.
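The lookup from an IRQ number to its owner can be scripted if it has to be repeated on several systems. A minimal sketch follows; the function name and return format are made up for illustration, and it assumes the usual /proc/interrupts layout of one counter column per CPU followed by a free-text description.

```python
def find_irq(proc_interrupts: str, irq: int):
    """Return per-CPU counts and the description for one IRQ, or None."""
    lines = proc_interrupts.strip().splitlines()
    n_cpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        if label.strip() != str(irq):
            continue
        fields = rest.split()
        return {
            "counts": [int(value) for value in fields[:n_cpus]],
            "description": " ".join(fields[n_cpus:]),
        }
    return None


# Lines taken from the /proc/interrupts output in this thread.
sample = """CPU0 CPU1 CPU2 CPU3
80: 38 0 0 0 BRCM STB PCIe MSI 524288 Edge xhci_hcd
82: 100001 0 0 0 pinctrl-bcm2835 25 Level spi0.0
Err: 0
"""
print(find_irq(sample, 82))
```

On a live system the input would be the content of /proc/interrupts. As far as I know, a count of about 100000 next to a "Disabling IRQ" message fits the kernel's spurious interrupt detection, which evaluates a line after roughly that many interrupts, but treat that as a hint rather than a diagnosis.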

When I do a warm reset in CODESYS (it clears all variables and stops the application) I get Message from syslogd@cilix-19 at Jul 6 13:21:54 ...

kernel:[ 25.987276] Disabling IRQ #82

Is that acceptable?

No, see above.

I have.

cat /proc/interrupts
CPU0 CPU1 CPU2 CPU3
26: 0 0 0 0 GICv2 29 Level arch_timer
27: 201651 123206 44138 7212 GICv2 30 Level arch_timer
30: 0 0 0 0 GICv2 107 Level fe004000.txp
31: 440 0 0 0 GICv2 65 Level fe00b880.mailbox
34: 6770 0 0 0 GICv2 153 Level uart-pl011
35: 0 0 0 0 GICv2 150 Level fe204000.spi
36: 0 0 0 0 GICv2 125 Level fe215080.spi
37: 0 0 0 0 GICv2 129 Level vc4 hvs
40: 342 0 0 0 GICv2 114 Level DMA IRQ
42: 8 0 0 0 GICv2 116 Level DMA IRQ
43: 0 0 0 0 GICv2 117 Level DMA IRQ
44: 0 0 0 0 GICv2 118 Level DMA IRQ
45: 0 0 0 0 GICv2 119 Level DMA IRQ
47: 0 0 0 0 GICv2 141 Level vc4 crtc
48: 0 0 0 0 GICv2 142 Level vc4 crtc, vc4 crtc
49: 0 0 0 0 GICv2 133 Level vc4 crtc
50: 0 0 0 0 GICv2 138 Level vc4 crtc
51: 0 0 0 0 interrupt-controller@7ef00100 0 Edge vc4 hdmi cec tx
52: 0 0 0 0 interrupt-controller@7ef00100 1 Edge vc4 hdmi cec rx
55: 0 0 0 0 interrupt-controller@7ef00100 4 Edge vc4 hdmi hpd connected
56: 0 0 0 0 interrupt-controller@7ef00100 5 Edge vc4 hdmi hpd disconnected
57: 0 0 0 0 interrupt-controller@7ef00100 8 Edge vc4 hdmi cec tx
58: 0 0 0 0 interrupt-controller@7ef00100 7 Edge vc4 hdmi cec rx
61: 0 0 0 0 interrupt-controller@7ef00100 10 Edge vc4 hdmi hpd connected
62: 0 0 0 0 interrupt-controller@7ef00100 11 Edge vc4 hdmi hpd disconnected
63: 73 0 0 0 GICv2 66 Level VCHIQ doorbell
64: 11201 0 0 0 GICv2 158 Level mmc1, mmc0
65: 0 0 0 0 GICv2 48 Level arm-pmu
66: 0 0 0 0 GICv2 49 Level arm-pmu
67: 0 0 0 0 GICv2 50 Level arm-pmu
68: 0 0 0 0 GICv2 51 Level arm-pmu
71: 843 0 0 0 GICv2 189 Level eth0
72: 31 0 0 0 GICv2 190 Level eth0
78: 0 0 0 0 GICv2 106 Level v3d
79: 0 0 0 0 GICv2 175 Level PCIe PME
80: 38 0 0 0 BRCM STB PCIe MSI 524288 Edge xhci_hcd
82: 100001 0 0 0 pinctrl-bcm2835 25 Level spi0.0
IPI0: 0 0 0 0 CPU wakeup interrupts
IPI1: 0 0 0 0 Timer broadcast interrupts
IPI2: 174 158 197 164 Rescheduling interrupts
IPI3: 3947 122297 217123 215392 Function call interrupts
IPI4: 0 0 0 0 CPU stop interrupts
IPI5: 726 135 186 132 IRQ work interrupts
IPI6: 0 0 0 0 completion interrupts
Err: 0

You requested the result of "candump -l any,0~0,#FFFFFFFF" here it is: less candump-2022-07-06_163011.log

(1657121412.229819) can0 20000004#0008000000007F00
(1657121414.649739) can0 20000004#0040000000005F00
(1657121417.292012) can0 20000004#0001000000000000
(1657121417.401347) can0 20000004#0001000000000000
(1657121418.722186) can0 20000004#0001000000000000
(1657121418.831834) can0 20000004#0001000000000000
(1657121418.941893) can0 20000004#0001000000000000
(1657121419.051251) can0 20000004#0001000000000000
(1657121419.161551) can0 20000004#0001000000000000
(1657121419.271232) can0 20000004#0001000000000000
(1657121419.381527) can0 20000004#0001000000000000
(1657121419.491764) can0 20000004#0001000000000000
(1657121419.601958) can0 20000004#0001000000000000
(1657121419.711429) can0 20000004#0001000000000000

Ok, there are some error messages from the controller, but I forgot to give you the command line to let candump decode the error message, sorry. Try this one instead:

candump any,0~0,#FFFFFFFF -exdtA

I still think that a specially crafted network creates the fault, and I do not rule out that my Nanotec motor driver is faulty.

There are some CRC write errors in the log:

Jul 6 13:31:06 cilix-19 kernel: [ 578.468677] mcp251xfd spi0.0 can0: CRC write command format error.

That means the SPI message from the raspi to the mcp2518fd got corrupted somehow. Is it possible that your motor driver creates EMI and destroys the SPI message? Are you using a shared power supply for the raspi and the motors?

Marc

DavidBoJ commented 2 years ago

My system is very simple and is described in my first post. I have only one HAT, the CAN bus controller, and I am only using can0. I have updated my first post so the firmware version can be seen. I am aware of the EMI problems; the power supply to the Pi is not the same as the one to the motor drivers, and the motors are of course not yet energized. I swear that the following worked before I installed CODESYS and did the CANopen initialization:

pi@cilix-19:~ $ sudo ip link set can1 up type can bitrate 250000
Cannot find device "can1"

Something must have corrupted the CAN bus driver. I see the following options:

1) My image has not been stable from the start. (Is there a way to verify the image or the installed driver?)
2) My CODESYS application, before I simplified it, corrupted the image.
3) CODESYS has caused it.
4) My CAN bus network has generated faulty signals which somehow caused it.
5) The heavy logging, filling up the flash until the system crashes, combined with the above could cause it.
6) The hardware CAN bus chip is faulty.

Does my config.txt look alright? I have added the start of the syslog, so you can see what happens before the disabling.

config.txt syslog.txt

candump any,0~0,#FFFFFFFF -exdtA gave the following result:

(1657180670.772316) can0 20000004#0008000000007F00 R
(1657180671.433302) can0 20000004#0040000000005B00 R
(1657180671.881853) can0 20000004#0020000000008300 R
(1657180672.433488) can0 20000004#0008000000007A00 R
(1657180672.873334) can0 20000004#0040000000005900 R
(1657180675.634347) can0 20000004#0001000000000000 R
(1657180675.744045) can0 20000004#0001000000000000 R
(1657180675.853950) can0 20000004#0001000000000000 R
(1657180675.963906) can0 20000004#0001000000000000 R
(1657180676.074172) can0 20000004#0001000000000000 R
(1657180676.183875) can0 20000004#0001000000000000 R
(1657180676.294112) can0 20000004#0001000000000000 R
(1657180676.403747) can0 20000004#0001000000000000 R
(1657180676.513908) can0 20000004#0001000000000000 R
(1657180676.623605) can0 20000004#0001000000000000 R
(1657180676.733859) can0 20000004#0001000000000000 R
(1657180676.843880) can0 20000004#0001000000000000 R
(1657180676.954108) can0 20000004#0001000000000000 R
(1657180677.064234) can0 20000004#0001000000000000 R
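Since the output is still the raw hex form, the error frames can also be decoded by hand. The sketch below is an illustration, not a replacement for candump's own decoding: the bit values are the ones I remember from linux/can/error.h and should be checked against that header, the function and label names are made up, and reading data[6] and data[7] as the TX and RX error counters is an assumption about what the driver puts there.

```python
CAN_ERR_FLAG = 0x20000000  # marks a frame as an error message frame

# Error classes encoded in the low bits of the CAN ID.
ERR_CLASSES = {
    0x001: "tx-timeout",
    0x002: "lost-arbitration",
    0x004: "controller-problem",
    0x008: "protocol-violation",
    0x010: "transceiver-status",
    0x020: "no-ack",
    0x040: "bus-off",
    0x080: "bus-error",
    0x100: "restarted",
}

# Controller status bits in data[1], valid for the controller-problem class.
CTRL_STATUS = {
    0x01: "rx-overflow",
    0x02: "tx-overflow",
    0x04: "rx-warning",
    0x08: "tx-warning",
    0x10: "rx-passive",
    0x20: "tx-passive",
    0x40: "error-active",
}


def decode_error_frame(text: str) -> dict:
    """Decode one 'ID#DATA' token as printed by candump."""
    id_hex, _, data_hex = text.partition("#")
    can_id = int(id_hex, 16)
    if not can_id & CAN_ERR_FLAG:
        raise ValueError("not an error frame")
    data = bytes.fromhex(data_hex)
    classes = [name for bit, name in ERR_CLASSES.items() if can_id & bit]
    controller = []
    if can_id & 0x004:
        controller = [name for bit, name in CTRL_STATUS.items() if data[1] & bit]
    return {
        "classes": classes,
        "controller": controller,
        "tx_err": data[6],  # assumed: TX error counter
        "rx_err": data[7],  # assumed: RX error counter
    }


for token in ("20000004#0008000000007F00",
              "20000004#0020000000008300",
              "20000004#0001000000000000"):
    print(token, decode_error_frame(token))
```

Read this way, the first frames would report the TX error warning and error-passive levels with a TX counter above 96 and 128 respectively, and the repeating frame would be an RX overflow, which matches the "RX-0: FIFO overflow" lines in the kernel log.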

marckleinebudde commented 2 years ago

Can you try to disable CODESYS altogether and/or flash a new µSD card with a fresh system?

Jul  7 08:44:57 cilix-19 kernel: [   20.976582] can: controller area network core
Jul  7 08:44:57 cilix-19 kernel: [   20.976667] NET: Registered PF_CAN protocol family
Jul  7 08:44:57 cilix-19 kernel: [   20.986872] can: raw protocol
Jul  7 08:44:58 cilix-19 kernel: [   22.046365] IPv6: ADDRCONF(NETDEV_CHANGE): can0: link becomes ready
Jul  7 08:44:59 cilix-19 kernel: [   22.261509] mcp251xfd spi0.0 can0: CRC read error at address 0x001c (length=4, data=00 00 00 00, CRC=0x0000) retrying.
Jul  7 08:44:59 cilix-19 kernel: [   22.261608] mcp251xfd spi0.0 can0: CRC write command format error.
Jul  7 08:44:59 cilix-19 kernel: [   22.361504] mcp251xfd spi0.0 can0: CRC read error at address 0x001c (length=4, data=00 00 00 00, CRC=0x0000) retrying.
Jul  7 08:44:59 cilix-19 kernel: [   22.361598] mcp251xfd spi0.0 can0: CRC write command format error.

From this log we see that the SPI controller doesn't read anything from the mcp2518fd controller, as the data and CRC are 00. Please make sure that no other component touches the chip select and the MISO/MOSI pins.

DavidBoJ commented 2 years ago

pi@cilix-19:~ $ uname -a Linux cilix-19 5.15.32-v7l+ #1538 SMP Thu Mar 31 19:39:41 BST 2022 armv7l GNU/Linux

pi@cilix-19:~ $ cat /etc/rpi-issue Raspberry Pi reference 2022-04-04 Generated using pi-gen, https://github.com/RPi-Distro/pi-gen, 226b479f8d32919c9fe36dd5b4c20c02682f8180, stage2

pi@cilix-19:~ $ vcgencmd version Mar 24 2022 13:19:26 Copyright (c) 2012 Broadcom version e5a963efa66a1974127860b42e913d2374139ff5 (clean) (release) (start)

I did apt-get upgrade, sudo apt-get --with-new-pkgs upgrade and apt-get install can-utils. Here is what I get in syslog:

Jul 7 12:29:49 cilix-19 systemd[1]: Starting Permit User Sessions...
Jul 7 12:29:49 cilix-19 systemd[1]: Finished Save/Restore Sound Card State.
Jul 7 12:29:49 cilix-19 kernel: [ 7.795561] spi_master spi0: will run message pump with realtime priority
Jul 7 12:29:49 cilix-19 systemd[1]: Started /etc/rc.local Compatibility.
Jul 7 12:29:49 cilix-19 systemd[1]: Finished Permit User Sessions.
Jul 7 12:29:49 cilix-19 kernel: [ 7.817746] mcp251xfd spi0.0 can0: MCP2518FD rev0.0 (-RX_INT -MAB_NO_WARN +CRC_REG +CRC_RX +CRC_TX +ECC -HD c:40.00MHz m:20.00MHz r:17.00MHz e:16.66MHz) successfully initialized.
Jul 7 12:29:49 cilix-19 kernel: [ 7.818144] spi_master spi1: will run message pump with realtime priority
Jul 7 12:29:49 cilix-19 systemd[1]: Started User Login Management.
Jul 7 12:29:49 cilix-19 systemd[1]: Condition check resulted in Manage Sound Card State (restore and store) being skipped.
Jul 7 12:29:49 cilix-19 systemd[1]: Reached target Sound Card.
Jul 7 12:29:49 cilix-19 systemd[1]: Started Getty on tty1.
Jul 7 12:29:49 cilix-19 systemd[1]: Reached target Login Prompts.
Jul 7 12:29:49 cilix-19 systemd[1]: Starting Load/Save RF Kill Switch Status...
Jul 7 12:29:49 cilix-19 kernel: [ 7.855262] mcp251xfd spi1.0 (unnamed net_device) (uninitialized): Failed to detect MCP251xFD (osc=0x00000000).
Jul 7 12:29:49 cilix-19 kernel: [ 7.861660] brcmfmac: brcmf_cfg80211_set_power_mgmt: power save enabled

Clearly my HAT is faulty, especially can1. That is surprising, since I have never used can1: nothing has been connected to it, and I have no application using can1. Am I wrong, or do you see something?

I had another device (always buy 2 when you need one), and from dmesg:

[ 6.538390] Registered IR keymap rc-cec
[ 6.561491] CAN device driver interface
[ 6.575168] spi_master spi0: will run message pump with realtime priority
[ 6.604233] rc rc0: vc4 as /devices/platform/soc/fef00700.hdmi/rc/rc0
[ 6.604836] input: vc4 as /devices/platform/soc/fef00700.hdmi/rc/rc0/input1
[ 6.645419] mcp251xfd spi0.0 can0: MCP2518FD rev0.0 (-RX_INT -MAB_NO_WARN +CRC_REG +CRC_RX +CRC_TX +ECC -HD c:40.00MHz m:20.00MHz r:17.00MHz e:16.66MHz) successfully initialized.
[ 6.646480] spi_master spi1: will run message pump with realtime priority
[ 6.795480] vc4-drm gpu: bound fe400000.hvs (ops vc4_hvs_ops [vc4])
[ 6.798587] Registered IR keymap rc-cec
[ 6.838292] rc rc0: vc4 as /devices/platform/soc/fef00700.hdmi/rc/rc0
[ 6.895713] input: vc4 as /devices/platform/soc/fef00700.hdmi/rc/rc0/input2
[ 6.945243] mcp251xfd spi1.0 can1: MCP2518FD rev0.0 (-RX_INT -MAB_NO_WARN +CRC_REG +CRC_RX +CRC_TX +ECC -HD c:40.00MHz m:20.00MHz r:17.00MHz e:16.66MHz) successfully initialized.

So it seems that the last is working but not the first. However I have one more test to do, because I am not fully convinced.

DavidBoJ commented 2 years ago

I waited a while, then I changed the controller back to the first one, and now that one is also working, as seen from dmesg:

[ 6.714808] CAN device driver interface
[ 6.754550] spi_master spi0: will run message pump with realtime priority
[ 6.812408] random: crng init done
[ 6.812430] random: 7 urandom warning(s) missed due to ratelimiting
[ 6.881806] mcp251xfd spi0.0 can0: MCP2518FD rev0.0 (-RX_INT -MAB_NO_WARN +CRC_REG +CRC_RX +CRC_TX +ECC -HD c:40.00MHz m:20.00MHz r:17.00MHz e:16.66MHz) successfully initialized.
[ 6.893529] spi_master spi1: will run message pump with realtime priority
[ 6.924229] mcp251xfd spi1.0 can1: MCP2518FD rev0.0 (-RX_INT -MAB_NO_WARN +CRC_REG +CRC_RX +CRC_TX +ECC -HD c:40.00MHz m:20.00MHz r:17.00MHz e:16.66MHz) successfully initialized.

What we have is this:

1) Something happens which makes the CAN controller faulty.
2) The faulty state is remembered, so no reboot or power off/on changes that state, and can1 is not registered.
3) Replace the CAN controller with a new one.
4) The new system works with the new CAN bus controller.
5) Switch back to the first CAN bus controller.
6) The system works again with the first CAN bus controller.

Can you explain that?

marckleinebudde commented 2 years ago

Do you use the same SD card? First let's get both CAN interfaces detected properly, then do some tests between can0 and can1, and finally add CODESYS.

To test between can0 and can1, connect both, make sure the bus is terminated, then:

canfdtest -v can0

and on another terminal:

canfdtest -vg can1

That should run without problems. Use Ctrl+c to abort after 1 hour or so.

Another test would be:

cansequence -rv can1

On another terminal (Edit: fixed interface name):

cangen can0 -Di -L1 -I2 -p10 -g 1

The -g parameter specifies the gap between CAN frames. You can decrease the number (i.e. to -g 0.1 or even -g 0) to increase the load. If you restart the cangen process, the receiver cansequence will print a single error message, that's OK. But there should be no other errors.

DavidBoJ commented 2 years ago

I created as described a new image on a new SD card. The first CANbus controller gave the disabling interrupt message so I concluded it indeed is faulty. Next I started the tests with the good controller. The first test with canfdtest worked fine. The next test I changed your command to (since I suppose I am not going to use a virtual network):

pi@cilix-19:~ $ pi@cilix-19:~ $ cangen can0 -Di -L1 -I2 -p10 -g 1

I get no errors, and none on the receiving interface either, only "sequence wrap around" .. I suppose everything is alright. Next, I want to see if my troublesome network kills the controller, and before I install CODESYS I just want to connect my network. My motor driver will only send BOOT messages I suppose. I have no application handling the messages, so I assume the controller's message buffer will be full, so maybe I will get some error messages, maybe also CRC error messages, but nothing should be destroyed. I fear the problem is caused when I switch on the motor drivers so I will do that multiple times.

marckleinebudde commented 2 years ago

I created as described a new image on a new SD card. The first CANbus controller gave the disabling interrupt message so I concluded it indeed is faulty.

...or the config.txt is not correct. I think we'll soon have a proper waveshare overlay file for the rev2.1 boards, too.

Next I started the tests with the good controller. The first test with canfdtest worked fine. The next test I changed your command to (since I suppose I am not going to use a virtual network):

Doh! Right, I've fixed that.

pi@cilix-19:~ $ pi@cilix-19:~ $ cangen can0 -Di -L1 -I2 -p10 -g 1

I get no errors, and none on the receiving interface either, only "sequence wrap around" ..

Fine - sequence wrap around comes every 256 rx'ed CAN messages.
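The wrap-around follows from the counter being 8 bits wide. A receiver-side check can be sketched as follows; this illustrates the idea only, it is not how cansequence is implemented, and the function name and return format are made up.

```python
def check_sequence(counters):
    """Check a stream of 8-bit sequence counters for gaps.

    Returns (wraps, errors), where wraps counts how often the counter
    passed 0xFF and errors lists (index, expected, received) tuples.
    """
    expected = None  # accept whatever the first frame carries
    wraps = 0
    errors = []
    for index, value in enumerate(counters):
        if expected is not None and value != expected:
            errors.append((index, expected, value))
        if value == 0xFF:
            wraps += 1
        # Resynchronize on the received value, modulo 256.
        expected = (value + 1) & 0xFF
    return wraps, errors


# 256 frames in order, then 0, 1 and a jump to 3 (frame 2 is missing).
wraps, errors = check_sequence(list(range(256)) + [0, 1, 3])
print(wraps, errors)
```

A clean run produces only wrap-arounds, one per 256 frames, and an empty error list. Restarting the sender shows up as a single entry in the error list, which fits the single error message mentioned earlier.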

I suppose everything is alright. Next, I want to see if my troublesome network kills the controller, and before I install CODESYS I just want to connect my network. My motor driver will only send BOOT messages I suppose. I have no application handling the messages, so I assume the controller's message buffer will be full, so maybe I will get some error messages, maybe also CRC error messages,

You should not get any CRC error messages from the driver in the kernel log. Maybe when you connect the CAN bus....

but nothing should be destroyed. I fear the problem is caused when I switch on the motor drivers so I will do that multiple times.

Ok - On my desk I have a setup that doesn't like it when I plug one of my CAN-USB adapters into the USB port; it results in CRC errors in the SPI communication.

Anyhow - If the driver in your setup reproducibly goes into the "Transmit Event FIFO buffer not empty" loop, we can think of a workaround. For proper debugging we need 2 CAN interfaces on the same bus; the 2nd one is for sniffing the bus.

DavidBoJ commented 2 years ago

I switched my motor driver on and off several times and my network doesn't seem to create any errors. I do not get "Transmit Event ...", and I didn't get any CRC errors either. You can see my config.txt in one of the previous posts. I will move on with CODESYS but without any application. Could I use can1 for sniffing? I have no other interface at the moment. I will order a proper USB CAN bus interface tomorrow.

DavidBoJ commented 2 years ago

I know we are using bcm2835, so I am concerned about these warnings seen with dmesg. I do not know all these modules; are they in any way related to CAN?

[ 4.663823] snd_bcm2835: module is from the staging directory, the quality is unknown, you have been warned.
[ 4.677381] videodev: Linux video capture interface: v2.00
[ 4.684988] bcm2835_vc_sm_cma_probe: Videocore shared memory driver
[ 4.687470] [vc_sm_connected_init]: installed successfully
[ 4.693615] bcm2835_audio bcm2835_audio: card created with 8 channels
[ 4.743658] bcm2835_mmal_vchiq: module is from the staging directory, the quality is unknown, you have been warned.
[ 4.753904] bcm2835_mmal_vchiq: module is from the staging directory, the quality is unknown, you have been warned.
[ 4.755651] bcm2835_codec: module is from the staging directory, the quality is unknown, you have been warned.
[ 4.763710] bcm2835_mmal_vchiq: module is from the staging directory, the quality is unknown, you have been warned.
[ 4.765499] bcm2835_isp: module is from the staging directory, the quality is unknown, you have been warned.
[ 4.774756] bcm2835-codec bcm2835-codec: Device registered as /dev/video10
[ 4.774808] bcm2835-codec bcm2835-codec: Loaded V4L2 decode
[ 4.788810] bcm2835-codec bcm2835-codec: Device registered as /dev/video11
[ 4.788861] bcm2835-codec bcm2835-codec: Loaded V4L2 encode
[ 4.799079] bcm2835_v4l2: module is from the staging directory, the quality is unknown, you have been warned.

marckleinebudde commented 2 years ago

All unrelated to CAN. Should be no problem.

DavidBoJ commented 2 years ago

I have now installed my very simple application (no code, no GUI). In dmesg I have:

[ 18.650204] can: controller area network core
[ 18.650280] NET: Registered PF_CAN protocol family
[ 18.658379] can: raw protocol
[ 19.131987] IPv6: ADDRCONF(NETDEV_CHANGE): can0: link becomes ready
[ 19.839699] mcp251xfd spi0.0 can0: CRC write command format error.
[ 31.832449] cam-dummy-reg: disabling

I have no CANopen SYNC enabled.

The Pi, which is the master, has real problems receiving a BOOTUP message from the slaves, but eventually node 2 got into OPERATIONAL mode. For node 1 the master gets a timeout during the initialization with the SDOs it writes to the node. The two nodes are identical except for their ID (same SDOs and PDOs).

candump any,0~0,#FFFFFFFF -exdtA
(2022-07-07 18:38:59.790332) can0 RX - - 20000004 [8] 00 08 00 00 00 00 60 00 ERRORFRAME controller-problem{tx-error-warning} error-counter-tx-rx{{96}{0}}
(2022-07-07 18:38:59.790340) can0 RX - - 20000004 [8] 00 20 00 00 00 00 80 00 ERRORFRAME controller-problem{tx-error-passive} error-counter-tx-rx{{128}{0}}
(2022-07-07 18:38:59.790344) can0 RX - - 20000004 [8] 00 08 00 00 00 00 7F 00 ERRORFRAME controller-problem{tx-error-warning} error-counter-tx-rx{{127}{0}}
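For readers who want to check the candump decoding by hand, here is a rough sketch of how a SocketCAN error frame of class 0x00000004 (controller problem) maps to the text shown above. The constants follow the layout in the kernel's linux/can/error.h as I understand it; the function name `decode_error_frame` and the returned dictionary are hypothetical, and the example assumes the last two payload bytes of the first frame are 60 00, which matches the reported counters 96 and 0.

```python
# Bits in the CAN ID of an error frame (linux/can/error.h).
CAN_ERR_FLAG = 0x20000000  # marks the frame as an error frame
CAN_ERR_CRTL = 0x00000004  # controller problem, details in data[1]

# Controller status bits carried in data[1].
CRTL_BITS = {
    0x01: "rx-overflow",
    0x02: "tx-overflow",
    0x04: "rx-error-warning",
    0x08: "tx-error-warning",
    0x10: "rx-error-passive",
    0x20: "tx-error-passive",
    0x40: "recovered-to-error-active",
}


def decode_error_frame(can_id, data):
    """Decode the controller status and error counters of an error frame."""
    if not can_id & CAN_ERR_FLAG:
        raise ValueError("not an error frame")
    states = []
    if can_id & CAN_ERR_CRTL:
        states = [name for bit, name in sorted(CRTL_BITS.items()) if data[1] & bit]
    # data[6] and data[7] carry the TX and RX error counters.
    return {"controller": states, "tx_errors": data[6], "rx_errors": data[7]}


print(decode_error_frame(0x20000004, bytes.fromhex("0008000000006000")))
```

A TX counter that climbs to 128 while the RX counter stays at 0 usually means the transmitter gets no acknowledgement, which points at wiring, termination or bit timing, not at the receiving side.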

However, after a long time trying to boot node 1 I get the following (unfortunately I didn't have any candump running):

Jul 7 18:23:33 cilix-19 bthelper[837]: Changing power off succeeded
Jul 7 18:23:33 cilix-19 bthelper[648]: Changing power on succeeded
Jul 7 18:23:33 cilix-19 kernel: [ 19.839699] mcp251xfd spi0.0 can0: CRC write command format error.
Jul 7 18:23:50 cilix-19 systemd-timesyncd[735]: Initial synchronization to time server 152.115.59.245:123 (0.debian.pool.ntp.org).
Jul 7 18:23:53 cilix-19 dhcpcd[737]: eth0: no IPv6 Routers available
Jul 7 18:23:59 cilix-19 kernel: [ 31.832449] cam-dummy-reg: disabling
Jul 7 18:24:03 cilix-19 systemd[1]: systemd-fsckd.service: Succeeded.
Jul 7 18:24:12 cilix-19 systemd[1]: systemd-hostnamed.service: Succeeded.
Jul 7 18:25:39 cilix-19 systemd[1]: Started Session 3 of user pi.
Jul 7 18:29:12 cilix-19 kernel: [ 344.405975] IPv6: ADDRCONF(NETDEV_CHANGE): can1: link becomes ready
Jul 7 18:31:52 cilix-19 systemd[1]: Started Session 4 of user pi.
Jul 7 18:35:27 cilix-19 kernel: [ 719.634487] mcp251xfd spi0.0 can0: CRC write command format error.
Jul 7 18:35:27 cilix-19 kernel: [ 719.635126] mcp251xfd spi0.0 can0: Transmit Event FIFO buffer not empty. (seq=0x00000000, tef_tail=0x00000004, tef_head=0x00000005, tx_head=0x00000005).
Jul 7 18:35:27 cilix-19 kernel: [ 719.635218] mcp251xfd spi0.0 can0: Transmit Event FIFO buffer not empty. (seq=0x00000000, tef_tail=0x00000004, tef_head=0x00000005, tx_head=0x00000005).
Jul 7 18:35:27 cilix-19 kernel: [ 719.635307] mcp251xfd spi0.0 can0: Transmit Event FIFO buffer not empty. (seq=0x00000000, tef_tail=0x00000004, tef_head=0x00000005, tx_head=0x00000005).
Jul 7 18:35:27 cilix-19 kernel: [ 719.635396] mcp251xfd spi0.0 can0: Transmit Event FIFO buffer not empty. (seq=0x00000000, tef_tail=0x00000004, tef_head=0x00000005, tx_head=0x00000005).
Jul 7 18:35:27 cilix-19 kernel: [ 719.635485] mcp251xfd spi0.0 can0: Transmit Event FIFO buffer not empty. (seq=0x00000000, tef_tail=0x00000004, tef_head=0x00000005, tx_head=0x00000005).
Jul 7 18:35:27 cilix-19 kernel: [ 719.635574] mcp251xfd spi0.0 can0: Transmit Event FIFO buffer not empty. (seq=0x00000000, tef_tail=0x00000004, tef_head=0x00000005, tx_head=0x00000005).
Jul 7 18:35:27 cilix-19 kernel: [ 719.635736] mcp251xfd spi0.0 can0: Transmit Event FIFO buffer not empty. (seq=0x00000000, tef_tail=0x00000004, tef_head=0x00000005, tx_head=0x00000005).
Jul 7 18:35:27 cilix-19 kernel: [ 719.635825] mcp251xfd spi0.0 can0: Transmit Event FIFO buffer not empty. (seq=0x00000000, tef_tail=0x00000004, tef_head=0x00000005, tx_head=0x00000005).

And so on. However, my controller doesn't seem to be faulty. I stopped CODESYS, deleted the logs, and I can still start a can1 network. It seems that CODESYS's repeated attempts to initialize a CAN node bring it into a state which eventually generates so many "Transmit Event FIFO" messages that the Pi stops functioning normally, and maybe even destroys the controller in some cases? That can make it difficult to debug, so it would be good if the logging could somehow be limited. Tomorrow I will do your test with cangen to be sure it is still working.
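As an aid for reading a flooded log (not a fix for the flood itself, and not something the driver provides), a short sketch like the following can collapse runs of identical kernel messages into one line with a count. The helper names `message_key` and `collapse_repeats` are made up for this example, and the sample lines are shortened copies of the syslog entries above.

```python
import re
from itertools import groupby

# Matches a kernel timestamp such as "[ 719.635126] " inside a syslog line.
KERNEL_TS = re.compile(r"\[\s*\d+\.\d+\]\s*")


def message_key(line):
    """Return the text after the kernel timestamp, or the whole line if none."""
    parts = KERNEL_TS.split(line, maxsplit=1)
    return parts[1].strip() if len(parts) == 2 else line.strip()


def collapse_repeats(lines):
    """Collapse consecutive identical messages into (message, count) pairs."""
    return [(msg, sum(1 for _ in run)) for msg, run in groupby(map(message_key, lines))]


sample = [
    "Jul 7 18:35:27 cilix-19 kernel: [ 719.634487] mcp251xfd spi0.0 can0: CRC write command format error.",
    "Jul 7 18:35:27 cilix-19 kernel: [ 719.635126] mcp251xfd spi0.0 can0: Transmit Event FIFO buffer not empty.",
    "Jul 7 18:35:27 cilix-19 kernel: [ 719.635218] mcp251xfd spi0.0 can0: Transmit Event FIFO buffer not empty.",
    "Jul 7 18:35:27 cilix-19 kernel: [ 719.635307] mcp251xfd spi0.0 can0: Transmit Event FIFO buffer not empty.",
]
for message, count in collapse_repeats(sample):
    print(f"{count:5d} x {message}")
```

This only summarizes what has already been written; it does not reduce what the kernel logs or how much disk space the syslog uses.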

marckleinebudde commented 2 years ago

Does this happen during boot?

[ 18.650204] can: controller area network core
[ 18.650280] NET: Registered PF_CAN protocol family
[ 18.658379] can: raw protocol
[ 19.131987] IPv6: ADDRCONF(NETDEV_CHANGE): can0: link becomes ready
[ 19.839699] mcp251xfd spi0.0 can0: CRC write command format error.
[ 31.832449] cam-dummy-reg: disabling

Can you try this config.txt?

DavidBoJ commented 2 years ago

Yes, but remember I now have CODESYS installed, and my application sets up can0 automatically at boot. I do not know if that has an impact on what dmesg shows. Interesting config.txt file; I will try that tomorrow.

DavidBoJ commented 2 years ago

My controller is still working; I tested it with cangen. However, I have another system, a Waveshare carrier board with a CM4 and isolated CAN 2.0B: https://www.waveshare.com/wiki/Template:Compute_Module_4_PoE_4G_Module_Spec Here is a snippet of its /boot/config.txt:

dtparam=i2c_arm=on
dtparam=i2s=on
dtparam=spi=on
# CAN bus settings
dtoverlay=mcp2515-can0,oscillator=16000000,interrupt=25
dtoverlay=spi-bcm2835-overlay

It works right out of the box. But it is not Bullseye.

pi@raspberrypi:/var/log$ uname -a
Linux raspberrypi 5.4.51-v7l+ #1327 SMP Thu Jul 23 11:04:39 BST 2020 armv7l GNU/Linux

What do you think? Is mcp2515 more tolerant/robust than mcp251xfd? Is it Bullseye with its new socketCAN which causes the problem?

marckleinebudde commented 2 years ago

Please post your complete config.txt.

It works right out of the box. But it is not Bullseye.

Please post the error message.

What do you think? Is mcp2515 more tolerant/robust than mcp251xfd?

No

Is it Bullseye with its new socketCAN which causes the problem?

No

DavidBoJ commented 2 years ago

Here it is. I renamed the file so we don't get confused with the mcp251xfd: config_cm4.txt

What error messages? There are no errors in the syslog related to mcp2515 or can0, and CODESYS with my simplified application works.

marckleinebudde commented 2 years ago

Okay - now try my config.txt from https://github.com/raspberrypi/linux/issues/5083#issuecomment-1178190229

DavidBoJ commented 2 years ago

I want to point out that the carrier board has good lightning and ESD protection, I think better than the CAN FD HAT, and I am still concerned about faulty signals from the motor drivers.

In syslog I have:

Jul 8 11:17:36 cilix-19 kernel: [ 663.705295] mcp251xfd spi0.0 can0: CRC write command format error.
Jul 8 11:17:36 cilix-19 kernel: [ 663.803021] mcp251xfd spi0.0 can0: CRC read error at address 0x0744 (length=28, data=00 00 00 08 00 00 00 00 81 00 00 00 08 00 00 00 77 e7 e4 4e 00 00 00 00 00 00 00 00, CRC=0x0000) retrying.
Jul 8 11:17:36 cilix-19 kernel: [ 663.805184] mcp251xfd spi0.0 can0: CRC write command format error.
Jul 8 11:17:36 cilix-19 kernel: [ 664.003097] mcp251xfd spi0.0 can0: CRC read error at address 0x0698 (length=200, data=81 00 00 00 08 00 00 00 ef 9a 53 4f 00 00 00 08 00 00 00 00 81 00 00 00 08 00 00 00 84 f1 53 4f 00 00 00 08 00 00 00 00 81 00 00 00 08 00 00 00 4f 9f 54 4f 00 00 00 08 00 00 00 00 81 00 00 00, CRC=0x0000) retrying.
Jul 8 11:17:36 cilix-19 kernel: [ 664.005392] mcp251xfd spi0.0 can0: CRC write command format error.
Jul 8 11:17:36 cilix-19 kernel: [ 664.104171] mcp251xfd spi0.0 can0: CRC write command format error.
Jul 8 11:17:37 cilix-19 kernel: [ 664.303960] mcp251xfd spi0.0 can0: CRC write command format error.
Jul 8 11:17:37 cilix-19 kernel: [ 664.367022] mcp251xfd spi0.0 can0: RX-0: FIFO overflow.
Jul 8 11:17:37 cilix-19 kernel: [ 664.403547] mcp251xfd spi0.0 can0: CRC read error at address 0x04e0 (length=252, data=81 00 00 00 08 00 00 00 a0 71 52 50 00 00 00 08 00 00 00 00 81 00 00 00 08 00 00 00 35 c8 52 50 00 00 00 08 00 00 00 00 81 00 00 00 08 00 00 00 00 76 53 50 00 00 00 08 00 00 00 00 81 00 00 00, CRC=0x0000) retrying.
Jul 8 11:17:37 cilix-19 kernel: [ 664.404910] mcp251xfd spi0.0 can0: CRC write command format error.
Jul 8 11:17:37 cilix-19 kernel: [ 664.549939] mcp251xfd spi0.0 can0: RX-0: FIFO overflow
... ...
Jul 8 11:20:06 cilix-19 kernel: [ 813.814254] mcp251xfd spi0.0 can0: Transmit Event FIFO buffer not empty. (seq=0x00000002, tef_tail=0x0000000a, tef_head=0x0000000d, tx_head=0x0000000d).
Jul 8 11:20:06 cilix-19 kernel: [ 813.814365] mcp251xfd spi0.0 can0: Transmit Event FIFO buffer not empty. (seq=0x00000002, tef_tail=0x0000000a, tef_head=0x0000000d, tx_head=0x0000000d).
Jul 8 11:20:06 cilix-19 kernel: [ 813.814477] mcp251xfd spi0.0 can0: Transmit Event FIFO buffer not empty. (seq=0x00000002, tef_tail=0x0000000a, tef_head=0x0000000d, tx_head=0x0000000d).
Jul 8 11:20:06 cilix-19 kernel: [ 813.814588] mcp251xfd spi0.0 can0: Transmit Event FIFO buffer not empty. (seq=0x00000002, tef_tail=0x0000000a, tef_head=0x0000000d, tx_head=0x0000000d).
Jul 8 11:20:06 cilix-19 kernel: [ 813.814700] mcp251xfd spi0.0 can0: Transmit Event FIFO buffer not empty. (seq=0x00000002, tef_tail=0x0000000a, tef_head=0x0000000d, tx_head=0x0000000d).
Jul 8 11:20:06 cilix-19 kernel: [ 813.814811] mcp251xfd spi0.0 can0: Transmit Event FIFO buffer not empty. (seq=0x00000002, tef_tail=0x0000000a, tef_head=0x0000000d, tx_head=0x0000000d).
Jul 8 11:20:06 cilix-19 kernel: [ 813.814923] mcp251xfd spi0.0 can0: Transmit Event FIFO buffer not empty. (seq=0x00000002, tef_tail=0x0000000a, tef_head=0x0000000d, tx_head=0x0000000d).
Jul 8 11:20:06 cilix-19 kernel: [ 813.815034] mcp251xfd spi0.0 can0: Transmit Event FIFO buffer not empty. (seq=0x00000002, tef_tail=0x0000000a, tef_head=0x0000000d, tx_head=0x0000000d).

and so on

marckleinebudde commented 2 years ago

Which config.txt have you used for that?

DavidBoJ commented 2 years ago

Your suggested config.txt on Pi 4

marckleinebudde commented 2 years ago

Ok. Next try: Please use this in the config.txt:

dtoverlay=spi1-1cs-overlay,cs0_spidev=false
dtoverlay=mcp251xfd,spi0-0,interrupt=25,speed=10000000
dtoverlay=mcp251xfd,spi1-0,interrupt=24,speed=10000000

Please send your boot log, including the

MCP2518FD rev0.0 (-RX_INT -MAB_NO_WARN +CRC_REG +CRC_RX +CRC_TX +ECC -HD c:40.00MHz m:20.00MHz r:17.00MHz e:16.66MHz) successfully initialized.

line.

Do you have a scope? Can you measure the frequency of the SPI-CLK line?

DavidBoJ commented 2 years ago

I have not tried out your last proposal (speed=10000000) because I discovered that your config.txt file didn't allow me to enable a can1 network. I get:

pi@cilix-19:~ $ sudo ip link set can1 up type can bitrate 250000
Cannot find device "can1"

So I reverted to the original setup. Is it possible that the mcp251xfd driver somehow expects something related to FD, even though I set up the network without FD on? The motor drivers do not support FD. I think I will modify my carrier board so it runs Bullseye with the mcp2515; then I will do a cangen test against the Pi 4, which uses the mcp251xfd. What about that?

DavidBoJ commented 2 years ago

I have now flashed my carrier board with Bullseye. I do not have CODESYS installed, so it is as simple as possible. Then I connected my Pi 4 with the FD HAT to the carrier board over the CAN bus. Next I tried to do a canfdtest:

Pi 4: canfdtest -v can0
Carrier board: canfdtest -vg can0

It didn't work; I got a lot of NNNNNNN... I used sudo ip link set can0 up type can bitrate 250000 to set up the network on both the Pi and the carrier board. Conclusion:

For the time being, the 2-CH CAN FD HAT Rev2.1 with the present driver cannot communicate with a device with an MCP2515. Can it actually comply with CAN 2.0B?

marckleinebudde commented 2 years ago

FYI: See https://lore.kernel.org/all/9024B39B-CCDA-4E10-9A4E-70A4335F6304@baggywrinkle.co.uk/ for a discussion on setting the sjw to 50% of phase-seg2.
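To make the sjw suggestion concrete, here is a small worked sketch using the bit-timing values from the `ip -d link show dev can0` output at the top of this issue (tq 25, prop-seg 69, phase-seg1 70, phase-seg2 20, sjw 1). The function name `bit_timing` is hypothetical, and the 50% rule is taken from the linked discussion, not from the driver.

```python
def bit_timing(tq_ns, prop_seg, phase_seg1, phase_seg2):
    """Derive bitrate, sample point and a 50%-of-phase-seg2 sjw from CAN bit timing."""
    sync_seg = 1  # the sync segment is always one time quantum
    tq_per_bit = sync_seg + prop_seg + phase_seg1 + phase_seg2
    bitrate = round(1e9 / (tq_ns * tq_per_bit))
    sample_point = (sync_seg + prop_seg + phase_seg1) / tq_per_bit
    sjw = max(1, phase_seg2 // 2)  # half of phase-seg2, but never below 1
    return bitrate, sample_point, sjw


# Values reported for can0 in this issue.
print(bit_timing(tq_ns=25, prop_seg=69, phase_seg1=70, phase_seg2=20))
```

With these numbers one bit is 160 time quanta of 25 ns, which gives 250000 bit/s and a sample point of 0.875, matching the reported settings. Half of phase-seg2 would be an sjw of 10 instead of the reported 1, which allows much more resynchronization per bit when the nodes' oscillators differ.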