linklayer / cantact

Drivers and Command Line Interface for CANtact tools
http://cantact.io
MIT License
46 stars 12 forks source link

"gs_usb 1-1.3:1.0 can0: Unexpected out of range echo id -256" and "can-isotp: isotp_sendmsg: can_send_ret -105" #19

Closed brainstorm closed 2 years ago

brainstorm commented 3 years ago

After stressing CANTACT PRO a bit with a variant of this example but in a loop (I might post a code snippet/reproducer a bit later on):

https://github.com/marcelbuesing/socketcan-isotp/blob/master/examples/uds.rs

I've noticed the following messages in the kernel's dmesg and the interface itself stopped responding until I ran ifconfig can0 down && ifconfig can0 up again (cantact lights still on, unlike #18):

[   47.392807] IPv6: ADDRCONF(NETDEV_CHANGE): can0: link becomes ready
[38995.821664] can: controller area network core
[38995.821882] NET: Registered protocol family 29
[38995.901358] can: isotp protocol
[39852.713297] can: raw protocol
[473518.169804] gs_usb 1-1.3:1.0 can0: Unexpected out of range echo id -256
[473584.371852] gs_usb 1-1.3:1.0 can0: Unexpected out of range echo id -256
[473810.879016] gs_usb 1-1.3:1.0 can0: Unexpected out of range echo id -256
[473941.703153] gs_usb 1-1.3:1.0 can0: Unexpected out of range echo id -256
[474007.845229] gs_usb 1-1.3:1.0 can0: Unexpected out of range echo id -256
[474290.224146] gs_usb 1-1.3:1.0 can0: Unexpected out of range echo id -256
[474513.851197] gs_usb 1-1.3:1.0 can0: Unexpected out of range echo id -256
[474648.135451] gs_usb 1-1.3:1.0 can0: Unexpected out of range echo id -256
[474889.043020] gs_usb 1-1.3:1.0 can0: Unexpected out of range echo id -256
[474978.185816] gs_usb 1-1.3:1.0 can0: Unexpected out of range echo id -256
[475533.573321] gs_usb 1-1.3:1.0 can0: Unexpected out of range echo id -256
[475554.653991] gs_usb 1-1.3:1.0 can0: Unexpected out of range echo id -256
[475610.035753] gs_usb 1-1.3:1.0 can0: Unexpected out of range echo id -256
[475678.947905] gs_usb 1-1.3:1.0 can0: Unexpected out of range echo id -256
[475750.750761] gs_usb 1-1.3:1.0 can0: Unexpected out of range echo id -256
[476028.188951] gs_usb 1-1.3:1.0 can0: Unexpected out of range echo id -256
[476589.216530] gs_usb 1-1.3:1.0 can0: Unexpected out of range echo id -256
[476596.816784] gs_usb 1-1.3:1.0 can0: Unexpected out of range echo id -256
[476756.341799] gs_usb 1-1.3:1.0 can0: Unexpected out of range echo id -256
[476977.998761] gs_usb 1-1.3:1.0 can0: Unexpected out of range echo id -256
[477472.744351] gs_usb 1-1.3:1.0 can0: Unexpected out of range echo id -256
[477562.907198] gs_usb 1-1.3:1.0 can0: Unexpected out of range echo id -256
[477701.451518] gs_usb 1-1.3:1.0 can0: Unexpected out of range echo id -256
[477890.817515] gs_usb 1-1.3:1.0 can0: Unexpected out of range echo id -256
[477897.757707] gs_usb 1-1.3:1.0 can0: Unexpected out of range echo id -256
[478027.081779] gs_usb 1-1.3:1.0 can0: Unexpected out of range echo id -256
[478593.949573] gs_usb 1-1.3:1.0 can0: Unexpected out of range echo id -256
[478784.846110] gs_usb 1-1.3:1.0 can0: Unexpected out of range echo id -256
[478870.638308] gs_usb 1-1.3:1.0 can0: Unexpected out of range echo id -256
[479166.887624] gs_usb 1-1.3:1.0 can0: Unexpected out of range echo id -256
[479349.933427] gs_usb 1-1.3:1.0 can0: Unexpected out of range echo id -256
[479479.337485] gs_usb 1-1.3:1.0 can0: Unexpected out of range echo id -256
[479566.140243] gs_usb 1-1.3:1.0 can0: Unexpected out of range echo id -256
[479965.132743] gs_usb 1-1.3:1.0 can0: Unexpected out of range echo id -256
[479965.892780] gs_usb 1-1.3:1.0 can0: Unexpected out of range echo id -256
[480173.439339] gs_usb 1-1.3:1.0 can0: Unexpected out of range echo id -256
[483751.450487] can-isotp: isotp_sendmsg: can_send_ret -105
[486722.405717] gs_usb 1-1.3:1.0 can0: Unexpected out of range echo id -256
[486856.759953] gs_usb 1-1.3:1.0 can0: Unexpected out of range echo id -256

I'm running on a RPi1 with kernel:

Linux raspberrypi 5.10.17+ #1421 Thu May 27 13:58:02 BST 2021 armv6l GNU/Linux

Did you experience this in any of your tests with ISOTP payloads, @ericevenchick? @hartkopp Have you seen that -105 ret code and what could be the underlying issue?

And @HubertD or @marckleinebudde, have you seen those "Unexpected out of range echo id -256" in any other device(s) you've tested with your gs_usb lkm?

brainstorm commented 3 years ago

I can now consistently reproduce it... the symptom seems to be an out-of-order message and the whole can0 interface becomes unresponsive (see the amount of dropped packets):

132070108-6ddbdff0-9abf-4b33-9f84-20af1e1d35b7

So I guess that something about the URB handling in the USB firmware stack is off according to the message the gs_usb kernel driver gives?

Linux raspberrypi 5.10.17+ #1421 Thu May 27 13:58:02 BST 2021 armv6l GNU/Linux

After that happens, I can try sending any arbitrary CAN (ISOTP) message that used to work right moments before, but it doesn't show up in the CAN viewer... until ifconfig can0 down && ifconfig can0 up, that is.

hartkopp commented 3 years ago

The reason for this Unexpected out of range echo id -256 results from a weird entry inside a structure of a received USB message (urb buffer), as you already pointed out: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/net/can/usb/gs_usb.c?h=linux-5.10.y#n351

Can you post the output of ip -det link show can0 which is more informative than ifconfig can0 ?

Looks like something in the gs_usb firmware is out of sync. The isotp error is just a follow up error.

brainstorm commented 3 years ago

Thanks a ton for the response, @hartkopp, here's the ip output:

4: can0: <NOARP,UP,LOWER_UP,ECHO> mtu 16 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
    link/can  promiscuity 0 minmtu 0 maxmtu 0
    can state ERROR-ACTIVE restart-ms 0
      bitrate 500000 sample-point 0.875
      tq 125 prop-seg 6 phase-seg1 7 phase-seg2 2 sjw 1
      gs_usb: tseg1 1..16 tseg2 1..8 sjw 1..4 brp 1..1024 brp-inc 1
      clock 24000000numtxqueues 1 gso_max_size 65536 gso_max_segs 65535

It seems like @ericevenchick does have his own fork of gs_usb in https://github.com/linklayer/gs_usb_fd (presumably for CAN_FD), but I'm using the stock RaspberryPi_OS one instead... I hope that his firmware does not assume any special mods on the gs_usb kernel driver side.

In any case, I'd be happy to capture that faulty URB for you... what's the easiest way to debug kmods nowadays? Something similar to the following approach or are there easier debugging facilities in 2021?:

https://medium.com/@navaneethrvce/debugging-your-linux-kernel-module-21bf8a8728ba

hartkopp commented 3 years ago
4: can0: <NOARP,UP,LOWER_UP,ECHO> mtu 16 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
    link/can  promiscuity 0 minmtu 0 maxmtu 0
    can state ERROR-ACTIVE restart-ms 0
    bitrate 500000 sample-point 0.875
    tq 125 prop-seg 6 phase-seg1 7 phase-seg2 2 sjw 1
    gs_usb: tseg1 1..16 tseg2 1..8 sjw 1..4 brp 1..1024 brp-inc 1
    clock 24000000numtxqueues 1 gso_max_size 65536 gso_max_segs 65535

That looks good so far. Is this before or after the CAN traffic was stuck?

It seems like @ericevenchick does have his own fork of gs_usb in https://github.com/linklayer/gs_usb_fd (presumably for CAN_FD), but I'm using the stock RaspberryPi_OS one instead... I hope that his firmware does not assume any special mods on the gs_usb kernel driver side.

Hm, when using he same USB ID the firmware should be compatible to the older driver.

In any case, I'd be happy to capture that faulty URB for you... what's the easiest way to debug kmods nowadays? Something similar to this approach or are there easier debugging facilities in 2021?:

Haha - in fact I have no clue about USB, sorry.

This is probably really something @ericevenchick should take a look at ;-)

brainstorm commented 3 years ago

That looks good so far. Is this before or after the CAN traffic was stuck?

My previous message was before the interface failed. The following is issued after it fails, but there are no changes in the ip output although the can0 interface just stops responding to new packets:

$ ip -det link show can0
4: can0: <NOARP,UP,LOWER_UP,ECHO> mtu 16 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
    link/can  promiscuity 0 minmtu 0 maxmtu 0
    can state ERROR-ACTIVE restart-ms 0
          bitrate 500000 sample-point 0.875
          tq 125 prop-seg 6 phase-seg1 7 phase-seg2 2 sjw 1
          gs_usb: tseg1 1..16 tseg2 1..8 sjw 1..4 brp 1..1024 brp-inc 1
          clock 24000000numtxqueues 1 gso_max_size 65536 gso_max_segs 65535

This is probably really something @ericevenchick should take a look at ;-)

I soooo much agree ;) ... @ericevenchick ?

marckleinebudde commented 3 years ago

To debug usb, build a kernel with CONFIG_USB_MON enabled. Install tshark and capture from interface usbmonX, where X is some number. Depends probably on the port you're using. You can then analyze the dump with wireshark on you PC and post it here.

brainstorm commented 3 years ago

Hey, thanks @marckleinebudde, that's precisely what I started to do yesterday (captured a .pcap locally via usbmon and tcpdump). But admittedly, I'll have to read up a bit more on URBs to make sense of the ~100MB of USB capture session data and narrow down what I'm looking for exactly:

Screen Shot 2021-09-10 at 10 04 26 am

I guess that I'd also need to compile the gs_usb kmod with printks on the buffer for that URB structure when the error happens. Please do let me know if there are better shortcuts! ;)

brainstorm commented 3 years ago

Ok, I went ahead and started doing some hexdumps on the gs_usb_fd kernel module like so:

[1088072.338761] usbcore: registered new interface driver gs_usb_fd
[1088073.174292] gs_usb_fd HEALTHY URB sample: 7ddbb615: ff ff ff ff ee 6c fe 8c 08 00 00 00 00 00 c0 c1  .....l..........
[1088073.174330] gs_usb_fd HEALTHY URB sample: 35191e67: 00 ff 00 ff 00 00 00 00 00 00 00 00 00 00 00 00  ................
[1088073.174351] gs_usb_fd HEALTHY URB sample: caf4b257: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[1088073.174368] gs_usb_fd HEALTHY URB sample: fa43d5d7: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[1088073.174384] gs_usb_fd HEALTHY URB sample: e6dd00ee: 00 00 00 00 00 00 00 00 00 00 00 00              ............
[1088073.177148] IPv6: ADDRCONF(NETDEV_CHANGE): can0: link becomes ready
[1088132.885770] gs_usb_fd 1-1.3:1.0 can0: Unexpected out of range echo id -256
[1088132.896248] gs_usb_fd FAILED URB: 5eb8d0d3: 00 ff ff ff ee 6c fe 8c 08 00 00 00 00 00 c0 c1  .....l..........
[1088132.896319] gs_usb_fd FAILED URB: 0fa53d41: 00 ff 00 ff 00 00 00 00 00 00 00 00 00 00 00 00  ................
[1088132.896347] gs_usb_fd FAILED URB: 0ef9a330: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[1088132.896373] gs_usb_fd FAILED URB: 80dabb9b: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[1088132.896398] gs_usb_fd FAILED URB: 71edc276: 00 00 00 00 00 00 00 00 00 00 00 00              ............

Those are changes on @ericevenchick's gs_usb_fd LKM fork, which is practically the same from the stock kernel, but just with some extra fields for CAN_FD, so I did a couple of simple printk-like changes and enabled debug:

root@raspberrypi:/home/pi/gs_usb_fd# git diff
diff --git a/Makefile b/Makefile
index 81fddd3..2430f04 100644
--- a/Makefile
+++ b/Makefile
@@ -3,9 +3,12 @@ obj-m := gs_usb_fd.o
 KDIR  := /lib/modules/$(shell uname -r)/build
 PWD   := $(shell pwd)
 LINDENT := /usr/src/kernels/$(shell uname -r)/scripts/Lindent
+CFLAGS += -g -DDEBUG
+ccflags-y += $(CFLAGS)
+CC += $(CFLAGS)

 default:
-       $(MAKE) -C $(KDIR) M=$(PWD) modules
+       $(MAKE) -C $(KDIR) M=$(PWD) modules EXTRA_CFLAGS="$(CFLAGS)"

 clean:
        $(MAKE) -C $(KDIR) M=$(PWD) clean
diff --git a/gs_usb_fd.c b/gs_usb_fd.c
index bb69477..1f12214 100644
--- a/gs_usb_fd.c
+++ b/gs_usb_fd.c
@@ -333,6 +333,8 @@ static void gs_usb_receive_bulk_callback(struct urb *urb)
                return;

        if (hf->echo_id == -1) {        /* normal rx */
+               // TODO: Match this dump with the type of URB transaction that fails further below, to compare apples with apples
+               DO_ONCE(print_hex_dump_bytes, "gs_usb_fd HEALTHY URB sample: ", DUMP_PREFIX_ADDRESS, urb->transfer_buffer, urb->transfer_buffer_length);
                if (hf->flags & GS_CAN_FLAG_FD) {
                        skb = alloc_canfd_skb(dev->netdev, &cfd);
                        if (!skb)
@@ -369,6 +371,7 @@ static void gs_usb_receive_bulk_callback(struct urb *urb)
                        netdev_err(netdev,
                                   "Unexpected out of range echo id %d\n",
                                   hf->echo_id);
+                       print_hex_dump_bytes("gs_usb_fd FAILED URB: ", DUMP_PREFIX_ADDRESS, urb->transfer_buffer, urb->transfer_buffer_length);
                        goto resubmit_urb;
                }

So, as seen above in the dmesg output, the TX buffer itself is identical between "healthy" and "failed" URBs. I'll compare the rest of the URB struct field(s) and see what's going on... I wish there was some sort of "pretty_print" in the kernel's URB code to make this a bit more userfriendly... let me know if you have a shortcut for that!

brainstorm commented 3 years ago

@marckleinebudde, happy to share the compressed ~12MB .pcap offline with you if needed. Could you please reach me at?:

brainstorm at nopcode dot org

For the dump below, I went for the hf (host frame) struct field data:

diff --git a/gs_usb_fd.c b/gs_usb_fd.c
index bb69477..8741830 100644
--- a/gs_usb_fd.c
+++ b/gs_usb_fd.c
@@ -333,6 +333,10 @@ static void gs_usb_receive_bulk_callback(struct urb *urb)
                return;

        if (hf->echo_id == -1) {        /* normal rx */
+               // TODO: Match this dump with the data type(s) that fail further below, to compare like with like
+               DO_ONCE(print_hex_dump_bytes, "gs_usb_fd HEALTHY host frame data: ", DUMP_PREFIX_NONE, hf->data, 4*64);
                if (hf->flags & GS_CAN_FLAG_FD) {
                        skb = alloc_canfd_skb(dev->netdev, &cfd);
                        if (!skb)
@@ -369,6 +373,14 @@ static void gs_usb_receive_bulk_callback(struct urb *urb)
                        netdev_err(netdev,
                                   "Unexpected out of range echo id %d\n",
                                   hf->echo_id);
+                       //print_hex_dump_bytes("gs_usb_fd FAILED URB: ", DUMP_PREFIX_ADDRESS, urb->transfer_buffer, urb->transfer_buffer_length);
+                       print_hex_dump_bytes("gs_usb_fd FAILED host frame data: ", DUMP_PREFIX_NONE, hf->data, 4*64);
                        goto resubmit_urb;
                }

Which yields the following hex dumps:

pi@raspberrypi:~ $ dmesg
[ 1975.380061] usbcore: deregistering interface driver gs_usb_fd
[ 1975.786829] gs_usb_fd 1-1.3:1.0: Configuring for 2 interfaces
[ 1975.805114] usbcore: registered new interface driver gs_usb_fd
[ 1976.652583] gs_usb_fd HEALTHY host frame data: 00 00 c0 c1 00 ff 00 ff 00 00 00 00 00 00 00 00  ................
[ 1976.652603] gs_usb_fd HEALTHY host frame data: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1976.652623] gs_usb_fd HEALTHY host frame data: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1976.652638] gs_usb_fd HEALTHY host frame data: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1976.652653] gs_usb_fd HEALTHY host frame data: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1976.652667] gs_usb_fd HEALTHY host frame data: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1976.652680] gs_usb_fd HEALTHY host frame data: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1976.652695] gs_usb_fd HEALTHY host frame data: 00 00 00 00 ff ff ff ff ee 6c fe 8c 08 00 00 00  .........l......
[ 1976.652707] gs_usb_fd HEALTHY host frame data: 00 00 c0 c1 00 ff 00 ff 00 00 00 00 00 00 00 00  ................
[ 1976.652721] gs_usb_fd HEALTHY host frame data: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1976.652737] gs_usb_fd HEALTHY host frame data: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1976.652752] gs_usb_fd HEALTHY host frame data: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1976.652765] gs_usb_fd HEALTHY host frame data: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1976.652780] gs_usb_fd HEALTHY host frame data: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1976.652792] gs_usb_fd HEALTHY host frame data: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1976.652807] gs_usb_fd HEALTHY host frame data: 00 00 00 00 80 0e 00 00 00 00 00 00 00 00 00 00  ................
[ 1976.655893] IPv6: ADDRCONF(NETDEV_CHANGE): can0: link becomes ready
[ 2152.808549] gs_usb_fd 1-1.3:1.0 can0: Unexpected out of range echo id -256
[ 2152.808632] gs_usb_fd FAILED host frame data: 00 00 c0 c1 00 ff 00 ff 00 00 00 00 00 00 00 00  ................
[ 2152.808653] gs_usb_fd FAILED host frame data: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 2152.808671] gs_usb_fd FAILED host frame data: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 2152.808691] gs_usb_fd FAILED host frame data: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 2152.808710] gs_usb_fd FAILED host frame data: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 2152.808727] gs_usb_fd FAILED host frame data: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 2152.808745] gs_usb_fd FAILED host frame data: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 2152.808762] gs_usb_fd FAILED host frame data: 00 00 00 00 00 00 00 00 f9 ee da 98 05 00 00 00  ................
[ 2152.808780] gs_usb_fd FAILED host frame data: 04 27 02 50 61 00 00 00 00 00 00 00 00 00 00 00  .'.Pa...........
[ 2152.808798] gs_usb_fd FAILED host frame data: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 2152.808818] gs_usb_fd FAILED host frame data: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 2152.808836] gs_usb_fd FAILED host frame data: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 2152.808853] gs_usb_fd FAILED host frame data: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 2152.808871] gs_usb_fd FAILED host frame data: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 2152.808888] gs_usb_fd FAILED host frame data: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 2152.808906] gs_usb_fd FAILED host frame data: 00 00 00 00 00 00 00 00 f9 ee da 98 05 00 00 00  ................

But then again I'd need a bit more insight into the bytes from hr->data... also determining how the Cantact Pro firmware side comes into play with all this.... @ericevenchick, help? :_)

Cheers!

marckleinebudde commented 3 years ago

I guess that I'd also need to compile the gs_usb kmod with printks on the buffer for that URB structure when the error happens. Please do let me know if there are better shortcuts! ;)

The interesting stuff is in the "Leftover Capture Data", which is the actual data that the gs-usb driver receives.

marckleinebudde commented 3 years ago

Which driver are you using? The mainline one or the gs_usb_fd?

Looking at the mainline gs-ubs driver, there is the struct gs_host_frame:

struct gs_host_frame {
    u32 echo_id;
    __le32 can_id;

    u8 can_dlc;
    u8 channel;
    u8 flags;
    u8 reserved;

    u8 data[8];
} __packed;

the gs_usb_fd driver has a similar struct. The "Leftover Capture Data" of the wireshark dump is the struct gs_host_frame. The driver you have modified only prints the struct gs_host_frame::data, which is the CAN data. What's interesting here is the hf->echo_id (see: https://github.com/torvalds/linux/blob/master/drivers/net/can/usb/gs_usb.c#L353). The commented out part is what we need :)

brainstorm commented 3 years ago

Which driver are you using? The mainline one or the gs_usb_fd?

As I mentioned in https://github.com/linklayer/cantact/issues/19#issuecomment-916591001, I'm using Eric's CAN_FD fork, but it really doesn't have that many differences vs the mainline one... just using it because it already had the Makefile & co ready to go right after cloning it, but for the purposes of this debugging it shouldn't matter, I reckon? Happy to change it to mainline if you prefer, no worries ;)

What's interesting here is the hf->echo_id

Yeah, printed that before offline and it always is -256 when it fails with Unexpected out of range echo id -256 (predictably) and -1 otherwise, that's it.

It doesn't go wild with other values or anything like that, AFAICT... one thing I'm doing that might be slightly unusual is that I have "extended CAN IDs":

https://embedclogic.com/can-protocol-protocol-to-broadcast-message-on-a-network/standard-can-vs-extended-can/

Do you think that gs_usb could have issues with that? I'll send the data you requested over email soon, cheers!

marckleinebudde commented 3 years ago

I got the pcap file, let's extract the raw data from it:

tshark -r cantact-filtered.pcap -Y usb.capdata -T fields -e usb.capdata > cantact-filtered.raw

Now look for the -256 in the first 32 bit. -256 is 0xffffff00 and as a Little Endian byte stream it's 00ffffff. So look for lines starting with 00ffffff.

grep ^00ffffff cantact-filtered.raw

All those lines look like this:

00ffffffee6cfe8c080000000000c0c100ff00ff

You can now use

usb.capdata == 00:ff:ff:ff:ee:6c:fe:8c:08:00:00:00:00:00:c0:c1:00:ff:00:ff

as a filter for wireshark and look at the USB frames around it. I suppose the firmware has a bug sending these frames. Next steps would be to look at the firmware.

marckleinebudde commented 3 years ago

It doesn't go wild with other values or anything like that, AFAICT... one thing I'm doing that might be slightly unusual is that I have "extended CAN IDs":

I don't think it's a problem with extended CAN IDs, the problem is in the first 32 bit within the struct gs_host_frame, which is the u32 echo_id.

brainstorm commented 3 years ago

Ok, I've started looking at the firmware side after reproducing the same issue with a SocketCAN C snippet, just to discard it wasn't me doing something funky with Rust's socketcan-isotp.

I've published this repo with a reproducer: https://github.com/brainstorm/cantact-pro-fw-kaboom-reproducer

Which contains a usbmon USB traffic .PCAP, should @ericevenchick or anybody else wish to explore the root cause of this issue further.

And I've also started looking into the Cantact Pro firmware: https://github.com/linklayer/cantact-pro-fw/issues/3

While also finding out that without implementing DMA, there can be bad CAN side effects with the Cantact Pro's MCU, but I'm very much still learning:

https://community.nxp.com/t5/MCUXpresso-SDK/Basic-CAN-sniffer-monitor/m-p/1046188

... I might grab another CAN dongle that allows me to work on the main project though X"D

moechr commented 3 years ago

Hi there, we had the same issue. The problem comes from an erroneous behaviour of the µC. See errata sheet ES_LPC546xx 3.22 USB.15: USB high-speed device in endpoint TX data corruption.

Adding a dummy byte in the gs_host_frame struct in the linux driver solved the problem.

struct gs_host_frame {
    u32 echo_id;
    u32 can_id;

    u8 can_dlc;
    u8 channel;
    u8 flags;
    u8 reserved;

    u8 data[64];

    /* Workaround for "USB high-speed device in endpoint TX data corruption"
     * on NXP LPC546xx controllers (Errata sheet LPC546xx / USB.15)
     * This corruption occurs when the following conditions are met:
     * -> A TX (IN) transfer happens after a RX (OUT) transfer.
     * -> The RX (OUT) transfer length is 4 + N*16 (N>=0) bytes.
     * The current workaround is to add a dummy byte to the outgoing data to
     * break the second condition.
     */
    u8 errata_dummy;
} __packed;
brainstorm commented 3 years ago

Nice, good find @moechr, I'll test it with my setup... do you think it's easy to fix on the firmware side instead?

EDIT: from your mentioned errata:

Screen Shot 2021-09-22 at 8 25 24 pm

And it works, it doesn't hang anymore after a few ISOTP packets, thanks for the workaround! :)

moechr commented 3 years ago

the fix is easy - just transfer a number of bytes which is "violating" the rule mentioned in the errata as shown in my upper comment.

Instead of the workaround another µC-series could be used ;)

I think the problem also occurs with Rx transfer length 4+N*8.

brainstorm commented 3 years ago

The -256 error message on dmesg is gone so thanks for that @moechr, but it still occasionally still hangs for me after your patch, here are some details:

$ pgrep cancan | xargs -I % sudo cat /proc/%/stack
[<0>] __skb_wait_for_more_packets+0x140/0x180
[<0>] __skb_recv_datagram+0x84/0xd4
[<0>] skb_recv_datagram+0x48/0x68
[<0>] isotp_recvmsg+0x4c/0x204 [can_isotp]
[<0>] sock_recvmsg+0x50/0x54
[<0>] sock_read_iter+0xa4/0xf8
[<0>] vfs_read+0x304/0x338
[<0>] ksys_read+0xc0/0xd8
[<0>] sys_read+0x18/0x1c
[<0>] __entry_text_start+0x8/0x8
[<0>] 0xbee503f8

After this kernel backtrace/symptom, I have to bring down and up the can0 interface if I want to send/receive further packets at all.

I think the problem also occurs with Rx transfer length 4+N*8.

Did you address this one too somehow?

marckleinebudde commented 2 years ago

The "-256 error message" problem should be fixed with the GS_CAN_FEATURE_REQ_USB_QUIRK_LPC546XX quirk that will hit Linux mainline soon.

brainstorm commented 2 years ago

Faantastic, closing then, thanks everyone ;)