mcp251xfd: candump not receiving messages immediately

vdh-robothania commented 1 day ago

Describe the bug

I am using a Waveshare 2-CH CAN FD HAT Rev2.1 with a Raspberry Pi 5 running Raspberry Pi OS. I have connected the CAN hat to a microcontroller that echoes anything back that is sent on the CAN bus.

The issue I am experiencing, is that messages aren't being received by candump (and socket read() in C) immediately. They are only received after several minutes or after sending multiple messages repeatedly.

I believe the messages might be stuck in a buffer/queue somewhere, as the hardware timestamps (see candump output below) indicates that messages are received at the expected time (but still not output until later).

Steps to reproduce the behaviour

Add the following to /boot/firmware/config.txt:

dtparam=spi=on
dtoverlay=spi1-3cs
dtoverlay=mcp251xfd,spi0-0,interrupt=25
dtoverlay=mcp251xfd,spi1-0,interrupt=24

Enable can0 interface: sudo ip link set can0 up type can bitrate 1000000 dbitrate 2000000 berr-reporting on fd on

Send FDCAN messages using cansend: cansend can0 234##31122334455667788

The following is the output of candump -xHt z can0 when executing cansend 8 times:

(000.000000)  can0  TX B E  234  [08]  11 22 33 44 55 66 77 88 <─┬─ output after cansend executed 2 times
(001.676943)  can0  TX B E  234  [08]  11 22 33 44 55 66 77 88 <─┘ 
(003.222309)  can0  TX B E  234  [08]  11 22 33 44 55 66 77 88 <─┬─ output after cansend executed 4 times
(004.647527)  can0  TX B E  234  [08]  11 22 33 44 55 66 77 88 <─┘ 
(006.132512)  can0  TX B E  234  [08]  11 22 33 44 55 66 77 88 <─┬─ output after cansend executed 6 times
(007.573627)  can0  TX B E  234  [08]  11 22 33 44 55 66 77 88 <─┘ 
(000.000084)  can0  RX B -  123  [08]  11 22 33 44 55 66 77 88 <─┬─ output after cansend executed 8 times
(001.677026)  can0  RX B -  123  [08]  11 22 33 44 55 66 77 88 <─┤ 
(003.222393)  can0  RX B -  123  [08]  11 22 33 44 55 66 77 88 <─┤ 
(004.647610)  can0  RX B -  123  [08]  11 22 33 44 55 66 77 88 <─┤ 
(006.132595)  can0  RX B -  123  [08]  11 22 33 44 55 66 77 88 <─┤ 
(007.573710)  can0  RX B -  123  [08]  11 22 33 44 55 66 77 88 <─┤ 
(009.042939)  can0  TX B E  234  [08]  11 22 33 44 55 66 77 88 <─┤ 
(009.043022)  can0  RX B -  123  [08]  11 22 33 44 55 66 77 88 <─┤ 
(010.332677)  can0  TX B E  234  [08]  11 22 33 44 55 66 77 88 <─┤ 
(010.332761)  can0  RX B -  123  [08]  11 22 33 44 55 66 77 88 <─┘

The device with ID 0x234 is the Raspberry Pi, while the device with ID 0x123 is the microcontroller. I would expect to receive one line with TX followed by one line with RX for each time cansend is executed, but this is not case.

Device (s)

Raspberry Pi 5

System

$ cat /etc/rpi-issue
Raspberry Pi reference 2024-07-04
Generated using pi-gen, https://github.com/RPi-Distro/pi-gen, 48efb5fc5485fafdc9de8ad481eb5c09e1182656, stage4

$ vcgencmd version
2024/09/23 14:02:56
Copyright (c) 2012 Broadcom
version 26826259 (release) (embedded)

$ uname -a
Linux rp5 6.6.54-v8-16k+ #1801 SMP PREEMPT Fri Oct  4 16:57:16 BST 2024 aarch64 GNU/Linux

Logs

$ sudo dmesg | grep -E "can|spi|mcp2"
[    3.298649] spi_master spi0: will run message pump with realtime priority
[    3.303718] mcp251xfd spi0.0 can0: MCP2518FD rev0.0 (-RX_INT -PLL -MAB_NO_WARN +CRC_REG +CRC_RX +CRC_TX +ECC -HD o:40.00MHz c:40.00MHz m:20.00MHz rs:17.00MHz es:16.66MHz rf:17.00MHz ef:16.66MHz) successfully initialized.
[    3.303954] spi_master spi1: will run message pump with realtime priority
[    3.315472] mcp251xfd spi1.0 can1: MCP2518FD rev0.0 (-RX_INT -PLL -MAB_NO_WARN +CRC_REG +CRC_RX +CRC_TX +ECC -HD o:40.00MHz c:40.00MHz m:20.00MHz rs:17.00MHz es:16.66MHz rf:17.00MHz ef:16.66MHz) successfully initialized.
[  231.657637] can: controller area network core
[  231.662426] can: raw protocol

$ ip -details link show can0
3: can0: <NOARP,UP,LOWER_UP,ECHO> mtu 72 qdisc pfifo_fast state UP mode DEFAULT group default qlen 10
    link/can  promiscuity 0  allmulti 0 minmtu 0 maxmtu 0
    can <BERR-REPORTING,FD> state ERROR-WARNING (berr-counter tx 0 rx 102) restart-ms 0
          bitrate 1000000 sample-point 0.750
          tq 25 prop-seg 14 phase-seg1 15 phase-seg2 10 sjw 5 brp 1
          mcp251xfd: tseg1 2..256 tseg2 1..128 sjw 1..128 brp 1..256 brp_inc 1
          dbitrate 2000000 dsample-point 0.750
          dtq 25 dprop-seg 7 dphase-seg1 7 dphase-seg2 5 dsjw 2 dbrp 1
          mcp251xfd: dtseg1 1..32 dtseg2 1..16 dsjw 1..16 dbrp 1..256 dbrp_inc 1
          clock 40000000 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535 tso_max_size 65536 tso_max_segs 65535 gro_max_size 65536 parentbus spi parentdev spi0.0

Additional context

No response

vdh-robothania commented 1 day ago

The issue is not present when using kernel 6.6.31. I haven't tried higher versions as I am experiencing the issue described in #6406

pelwell commented 1 day ago

Try after sudo rpi-update (but back up any important data first).

vdh-robothania commented 1 day ago

The issue is still present after doing sudo rpi-update (it's using hash 73e104f32e69e17450c443ebdb8bc493c849c38d) and rebooting

pelwell commented 1 day ago

Thanks - it's probably another aspect of the recent batch of changes that led to neighbour issue #6406.

Do any messages appear in the kernel log at the same time as that delay?

vdh-robothania commented 1 day ago

Using sudo dmesg -w, no messages appear at the same time as the delay

pelwell commented 1 day ago

OK, so it's probably not the same as this related problem: https://github.com/raspberrypi/linux/issues/6406#issuecomment-2402149659

marckleinebudde commented 1 day ago

Please compile https://github.com/linux-can/can-utils and run sudo mcp251xfd-dump spi0.0 after each cansend and send me the output. Also include the output of grep spi /proc/interrupts after each cansend.

pelwell commented 1 day ago

Compilation is really easy:

git clone https://github.com/linux-can/can-utils.git
make -C can-utils
sudo make -C can-utils install

vdh-robothania commented 1 day ago

output.zip

Here you go

marckleinebudde commented 1 day ago

Have you configured RX or TX IRQ coalescing? Show me the output of ethtool -g can0; ethtool -c can0;.

vdh-robothania commented 1 day ago

Ring parameters for can0:
Pre-set maximums:
RX:             24
RX Mini:        n/a
RX Jumbo:       n/a
TX:             8
Current hardware settings:
RX:             20
RX Mini:        n/a
RX Jumbo:       n/a
TX:             4
RX Buf Len:             n/a
CQE Size:               n/a
TX Push:        off
TCP data split: n/a
Coalesce parameters for can0:
Adaptive RX: n/a  TX: n/a
stats-block-usecs: n/a
sample-interval: n/a
pkt-rate-low: n/a
pkt-rate-high: n/a

rx-usecs: n/a
rx-frames: n/a
rx-usecs-irq: 0
rx-frames-irq: 8

tx-usecs: n/a
tx-frames: n/a
tx-usecs-irq: 0
tx-frames-irq: 2

rx-usecs-low: n/a
rx-frame-low: n/a
tx-usecs-low: n/a
tx-frame-low: n/a

rx-usecs-high: n/a
rx-frame-high: n/a
tx-usecs-high: n/a
tx-frame-high: n/a

CQE mode RX: n/a  TX: n/a

marckleinebudde commented 1 day ago

The system behaves as configured. You get an RX-IRQ after 8 RX'ed frames:

rx-frames-irq: 8

You probably want to configure a timeout for the RX- and TX-IRQ, e.g. 10ms: ethtool -C can0 rx-usecs-irq 10000 rx-frames-irq 8 tx-usecs-irq 10000 tx-frames-irq 2.

vdh-robothania commented 1 day ago

Thanks, that seems to be the issue. The values of tx-frames-irq and rx-frames-irq is set to 1 at boot, but automatically changes to 2 and 8 when executing sudo ip link set can0 up type can bitrate 1000000 dbitrate 2000000 berr-reporting on fd on.

I can set them back to 1 by setting the interface down, followed by sudo ethtool -C can0 rx-frames-irq 1 tx-frames-irq 1 and sudo ip link set can0 up type can bitrate 1000000 dbitrate 2000000 berr-reporting on fd on.

I just find it strange that the values stay as they are the second time I enable the interface, while they change the first time.

marckleinebudde commented 1 day ago

There is probably some 3rd program setting these values. You can add the following to the driver to generate a dump when someone sets the coalescing parameters. The stack strace should contain the program who called this.

--- a/drivers/net/can/spi/mcp251xfd/mcp251xfd-ethtool.c
+++ b/drivers/net/can/spi/mcp251xfd/mcp251xfd-ethtool.c
@@ -96,6 +96,8 @@ static int mcp251xfd_ring_set_coalesce(struct net_device *ndev,
         };
         struct can_ram_layout layout;

+        dump_stack();
+
         can_ram_get_layout(&layout, &mcp251xfd_ram_config, &ring, ec, fd_mode);

         if ((layout.rx_coalesce != priv->rx_obj_num_coalesce_irq ||

pelwell commented 1 day ago

The issue is not present when using kernel 6.6.31

Is that consistent with it being a configuration issue? Do older kernels ignore the configuration?

marckleinebudde commented 1 day ago

@pelwell Thanks for pointing this out. The coalescing configuration is supported since v5.18. There were some changes to the coalescing configuration code in the v6.6.x stable tree. I have to take a deeper look.

vdh-robothania commented 23 hours ago

There is probably some 3rd program setting these values. You can add the following to the driver to generate a dump when someone sets the coalescing parameters. The stack strace should contain the program who called this.
--- a/drivers/net/can/spi/mcp251xfd/mcp251xfd-ethtool.c
+++ b/drivers/net/can/spi/mcp251xfd/mcp251xfd-ethtool.c
@@ -96,6 +96,8 @@ static int mcp251xfd_ring_set_coalesce(struct net_device *ndev,
         };
         struct can_ram_layout layout;

+        dump_stack();
+
         can_ram_get_layout(&layout, &mcp251xfd_ram_config, &ring, ec, fd_mode);

         if ((layout.rx_coalesce != priv->rx_obj_num_coalesce_irq ||

I have added the dump_stack() to the driver as you said and reinstalled the kernel. This time using kernel 6.12, where the issue is still present.

$ uname -a
Linux rp5 6.12.0-rc2-v8-16k+ #2 SMP PREEMPT Wed Oct  9 17:56:11 CEST 2024 aarch64 GNU/Linux

$ sudo ethtool -c can0 | grep frames-irq
rx-frames-irq: 1
tx-frames-irq: 1

$ sudo ip link set can0 up type can bitrate 1000000 dbitrate 2000000 berr-reporting on fd on

$ sudo ethtool -c can0 | grep frames-irq
rx-frames-irq: 8
tx-frames-irq: 2

$ sudo ip link set can0 down

$ sudo ethtool -C can0 rx-frames-irq 1 tx-frames-irq 1

After executing the ethtool command I get the following output in dmesg:

[  330.983658] CPU: 1 UID: 0 PID: 2186 Comm: ethtool Tainted: G         C         6.12.0-rc2-v8-16k+ #2
[  330.983671] Tainted: [C]=CRAP
[  330.983673] Hardware name: Raspberry Pi 5 Model B Rev 1.0 (DT)
[  330.983675] Call trace:
[  330.983677]  dump_backtrace.part.0+0xe0/0x100
[  330.983686]  show_stack+0x20/0x40
[  330.983689]  dump_stack_lvl+0x60/0x80
[  330.983696]  dump_stack+0x18/0x28
[  330.983700]  mcp251xfd_ring_set_coalesce+0x7c/0x17a8 [mcp251xfd]
[  330.983712]  __ethnl_set_coalesce.isra.0+0x490/0x568
[  330.983718]  ethnl_set_coalesce+0x44/0xb8
[  330.983721]  ethnl_default_set_doit+0xc8/0x1b0
[  330.983726]  genl_family_rcv_msg_doit+0xe0/0x160
[  330.983732]  genl_rcv_msg+0x228/0x2b8
[  330.983735]  netlink_rcv_skb+0x68/0x140
[  330.983740]  genl_rcv+0x40/0x60
[  330.983746]  netlink_unicast+0x320/0x390
[  330.983751]  netlink_sendmsg+0x198/0x400
[  330.983756]  __sock_sendmsg+0x64/0xc0
[  330.983762]  __sys_sendto+0x10c/0x178
[  330.983767]  __arm64_sys_sendto+0x30/0x48
[  330.983771]  invoke_syscall+0x50/0x120
[  330.983777]  el0_svc_common.constprop.0+0x48/0xf0
[  330.983782]  do_el0_svc+0x24/0x38
[  330.983787]  el0_svc+0x34/0xe0
[  330.983792]  el0t_64_sync_handler+0x120/0x138
[  330.983797]  el0t_64_sync+0x190/0x198

$ sudo ip link set can0 up type can bitrate 1000000 dbitrate 2000000 berr-reporting on fd on

$ sudo ethtool -c can0 | grep frames-irq
rx-frames-irq: 1
tx-frames-irq: 1

So it seems that when calling sudo ip link set can0 up ... initially the tx-frames-irq and rx-frames-irq values get set without calling the mcp251xfd_ring_set_coalesce() function. Is it expected behavior that calling sudo ip link ... should set the tx-frames-irq and rx-frames-irq values on newer kernels?

marckleinebudde commented 22 hours ago

Thanks for testing, especially mainline.

If you see a single backtrace in the demsg output and only after the manual ethtool then there seems to be a problem with the kernel. I'll look into it, hopefully I find some time on Friday.

pelwell commented 22 hours ago

Thanks for helping with this, Marc.

raspberrypi / linux