Closed brainstorm closed 2 years ago
I can now consistently reproduce it... the symptom seems to be an out-of-order message and the whole can0
interface becomes unresponsive (see the amount of dropped packets):
So I guess that something about the URB handling in the USB firmware stack is off according to the message the gs_usb
kernel driver gives?
Linux raspberrypi 5.10.17+ #1421 Thu May 27 13:58:02 BST 2021 armv6l GNU/Linux
After that happens, I can try sending any arbitrary CAN (ISOTP) message that used to work right moments before, but it doesn't show up in the CAN viewer... until ifconfig can0 down && ifconfig can0 up
, that is.
The reason for this Unexpected out of range echo id -256
results from a weird entry inside a structure of a received USB message (urb buffer), as you already pointed out:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/net/can/usb/gs_usb.c?h=linux-5.10.y#n351
Can you post the output of ip -det link show can0
which is more informative than ifconfig can0
?
Looks like something in the gs_usb firmware is out of sync. The isotp error is just a follow up error.
Thanks a ton for the response, @hartkopp, here's the ip output:
4: can0: <NOARP,UP,LOWER_UP,ECHO> mtu 16 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
link/can promiscuity 0 minmtu 0 maxmtu 0
can state ERROR-ACTIVE restart-ms 0
bitrate 500000 sample-point 0.875
tq 125 prop-seg 6 phase-seg1 7 phase-seg2 2 sjw 1
gs_usb: tseg1 1..16 tseg2 1..8 sjw 1..4 brp 1..1024 brp-inc 1
clock 24000000numtxqueues 1 gso_max_size 65536 gso_max_segs 65535
It seems like @ericevenchick does have his own fork of gs_usb
in https://github.com/linklayer/gs_usb_fd (presumably for CAN_FD), but I'm using the stock RaspberryPi_OS one instead... I hope that his firmware does not assume any special mods on the gs_usb
kernel driver side.
In any case, I'd be happy to capture that faulty URB for you... what's the easiest way to debug kmods nowadays? Something similar to the following approach or are there easier debugging facilities in 2021?:
https://medium.com/@navaneethrvce/debugging-your-linux-kernel-module-21bf8a8728ba
4: can0: <NOARP,UP,LOWER_UP,ECHO> mtu 16 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000 link/can promiscuity 0 minmtu 0 maxmtu 0 can state ERROR-ACTIVE restart-ms 0 bitrate 500000 sample-point 0.875 tq 125 prop-seg 6 phase-seg1 7 phase-seg2 2 sjw 1 gs_usb: tseg1 1..16 tseg2 1..8 sjw 1..4 brp 1..1024 brp-inc 1 clock 24000000numtxqueues 1 gso_max_size 65536 gso_max_segs 65535
That looks good so far. Is this before or after the CAN traffic was stuck?
It seems like @ericevenchick does have his own fork of
gs_usb
in https://github.com/linklayer/gs_usb_fd (presumably for CAN_FD), but I'm using the stock RaspberryPi_OS one instead... I hope that his firmware does not assume any special mods on thegs_usb
kernel driver side.
Hm, when using he same USB ID the firmware should be compatible to the older driver.
In any case, I'd be happy to capture that faulty URB for you... what's the easiest way to debug kmods nowadays? Something similar to this approach or are there easier debugging facilities in 2021?:
Haha - in fact I have no clue about USB, sorry.
This is probably really something @ericevenchick should take a look at ;-)
That looks good so far. Is this before or after the CAN traffic was stuck?
My previous message was before the interface failed. The following is issued after it fails, but there are no changes in the ip
output although the can0
interface just stops responding to new packets:
$ ip -det link show can0
4: can0: <NOARP,UP,LOWER_UP,ECHO> mtu 16 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
link/can promiscuity 0 minmtu 0 maxmtu 0
can state ERROR-ACTIVE restart-ms 0
bitrate 500000 sample-point 0.875
tq 125 prop-seg 6 phase-seg1 7 phase-seg2 2 sjw 1
gs_usb: tseg1 1..16 tseg2 1..8 sjw 1..4 brp 1..1024 brp-inc 1
clock 24000000numtxqueues 1 gso_max_size 65536 gso_max_segs 65535
This is probably really something @ericevenchick should take a look at ;-)
I soooo much agree ;) ... @ericevenchick ?
To debug usb, build a kernel with CONFIG_USB_MON
enabled. Install tshark
and capture from interface usbmonX
, where X is some number. Depends probably on the port you're using. You can then analyze the dump with wireshark
on you PC and post it here.
Hey, thanks @marckleinebudde, that's precisely what I started to do yesterday (captured a .pcap locally via usbmon and tcpdump). But admittedly, I'll have to read up a bit more on URBs to make sense of the ~100MB of USB capture session data and narrow down what I'm looking for exactly:
I guess that I'd also need to compile the gs_usb
kmod with printk
s on the buffer for that URB structure when the error happens. Please do let me know if there are better shortcuts! ;)
Ok, I went ahead and started doing some hexdumps on the gs_usb_fd
kernel module like so:
[1088072.338761] usbcore: registered new interface driver gs_usb_fd
[1088073.174292] gs_usb_fd HEALTHY URB sample: 7ddbb615: ff ff ff ff ee 6c fe 8c 08 00 00 00 00 00 c0 c1 .....l..........
[1088073.174330] gs_usb_fd HEALTHY URB sample: 35191e67: 00 ff 00 ff 00 00 00 00 00 00 00 00 00 00 00 00 ................
[1088073.174351] gs_usb_fd HEALTHY URB sample: caf4b257: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[1088073.174368] gs_usb_fd HEALTHY URB sample: fa43d5d7: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[1088073.174384] gs_usb_fd HEALTHY URB sample: e6dd00ee: 00 00 00 00 00 00 00 00 00 00 00 00 ............
[1088073.177148] IPv6: ADDRCONF(NETDEV_CHANGE): can0: link becomes ready
[1088132.885770] gs_usb_fd 1-1.3:1.0 can0: Unexpected out of range echo id -256
[1088132.896248] gs_usb_fd FAILED URB: 5eb8d0d3: 00 ff ff ff ee 6c fe 8c 08 00 00 00 00 00 c0 c1 .....l..........
[1088132.896319] gs_usb_fd FAILED URB: 0fa53d41: 00 ff 00 ff 00 00 00 00 00 00 00 00 00 00 00 00 ................
[1088132.896347] gs_usb_fd FAILED URB: 0ef9a330: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[1088132.896373] gs_usb_fd FAILED URB: 80dabb9b: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[1088132.896398] gs_usb_fd FAILED URB: 71edc276: 00 00 00 00 00 00 00 00 00 00 00 00 ............
Those are changes on @ericevenchick's gs_usb_fd LKM fork, which is practically the same from the stock kernel, but just with some extra fields for CAN_FD, so I did a couple of simple printk-like changes and enabled debug:
root@raspberrypi:/home/pi/gs_usb_fd# git diff
diff --git a/Makefile b/Makefile
index 81fddd3..2430f04 100644
--- a/Makefile
+++ b/Makefile
@@ -3,9 +3,12 @@ obj-m := gs_usb_fd.o
KDIR := /lib/modules/$(shell uname -r)/build
PWD := $(shell pwd)
LINDENT := /usr/src/kernels/$(shell uname -r)/scripts/Lindent
+CFLAGS += -g -DDEBUG
+ccflags-y += $(CFLAGS)
+CC += $(CFLAGS)
default:
- $(MAKE) -C $(KDIR) M=$(PWD) modules
+ $(MAKE) -C $(KDIR) M=$(PWD) modules EXTRA_CFLAGS="$(CFLAGS)"
clean:
$(MAKE) -C $(KDIR) M=$(PWD) clean
diff --git a/gs_usb_fd.c b/gs_usb_fd.c
index bb69477..1f12214 100644
--- a/gs_usb_fd.c
+++ b/gs_usb_fd.c
@@ -333,6 +333,8 @@ static void gs_usb_receive_bulk_callback(struct urb *urb)
return;
if (hf->echo_id == -1) { /* normal rx */
+ // TODO: Match this dump with the type of URB transaction that fails further below, to compare apples with apples
+ DO_ONCE(print_hex_dump_bytes, "gs_usb_fd HEALTHY URB sample: ", DUMP_PREFIX_ADDRESS, urb->transfer_buffer, urb->transfer_buffer_length);
if (hf->flags & GS_CAN_FLAG_FD) {
skb = alloc_canfd_skb(dev->netdev, &cfd);
if (!skb)
@@ -369,6 +371,7 @@ static void gs_usb_receive_bulk_callback(struct urb *urb)
netdev_err(netdev,
"Unexpected out of range echo id %d\n",
hf->echo_id);
+ print_hex_dump_bytes("gs_usb_fd FAILED URB: ", DUMP_PREFIX_ADDRESS, urb->transfer_buffer, urb->transfer_buffer_length);
goto resubmit_urb;
}
So, as seen above in the dmesg
output, the TX buffer itself is identical between "healthy" and "failed" URBs. I'll compare the rest of the URB struct field(s) and see what's going on... I wish there was some sort of "pretty_print" in the kernel's URB code to make this a bit more userfriendly... let me know if you have a shortcut for that!
@marckleinebudde, happy to share the compressed ~12MB .pcap offline with you if needed. Could you please reach me at?:
brainstorm at nopcode dot org
For the dump below, I went for the hf
(host frame) struct field data
:
diff --git a/gs_usb_fd.c b/gs_usb_fd.c
index bb69477..8741830 100644
--- a/gs_usb_fd.c
+++ b/gs_usb_fd.c
@@ -333,6 +333,10 @@ static void gs_usb_receive_bulk_callback(struct urb *urb)
return;
if (hf->echo_id == -1) { /* normal rx */
+ // TODO: Match this dump with the data type(s) that fail further below, to compare like with like
+ DO_ONCE(print_hex_dump_bytes, "gs_usb_fd HEALTHY host frame data: ", DUMP_PREFIX_NONE, hf->data, 4*64);
if (hf->flags & GS_CAN_FLAG_FD) {
skb = alloc_canfd_skb(dev->netdev, &cfd);
if (!skb)
@@ -369,6 +373,14 @@ static void gs_usb_receive_bulk_callback(struct urb *urb)
netdev_err(netdev,
"Unexpected out of range echo id %d\n",
hf->echo_id);
+ //print_hex_dump_bytes("gs_usb_fd FAILED URB: ", DUMP_PREFIX_ADDRESS, urb->transfer_buffer, urb->transfer_buffer_length);
+ print_hex_dump_bytes("gs_usb_fd FAILED host frame data: ", DUMP_PREFIX_NONE, hf->data, 4*64);
goto resubmit_urb;
}
Which yields the following hex dumps:
pi@raspberrypi:~ $ dmesg
[ 1975.380061] usbcore: deregistering interface driver gs_usb_fd
[ 1975.786829] gs_usb_fd 1-1.3:1.0: Configuring for 2 interfaces
[ 1975.805114] usbcore: registered new interface driver gs_usb_fd
[ 1976.652583] gs_usb_fd HEALTHY host frame data: 00 00 c0 c1 00 ff 00 ff 00 00 00 00 00 00 00 00 ................
[ 1976.652603] gs_usb_fd HEALTHY host frame data: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[ 1976.652623] gs_usb_fd HEALTHY host frame data: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[ 1976.652638] gs_usb_fd HEALTHY host frame data: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[ 1976.652653] gs_usb_fd HEALTHY host frame data: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[ 1976.652667] gs_usb_fd HEALTHY host frame data: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[ 1976.652680] gs_usb_fd HEALTHY host frame data: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[ 1976.652695] gs_usb_fd HEALTHY host frame data: 00 00 00 00 ff ff ff ff ee 6c fe 8c 08 00 00 00 .........l......
[ 1976.652707] gs_usb_fd HEALTHY host frame data: 00 00 c0 c1 00 ff 00 ff 00 00 00 00 00 00 00 00 ................
[ 1976.652721] gs_usb_fd HEALTHY host frame data: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[ 1976.652737] gs_usb_fd HEALTHY host frame data: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[ 1976.652752] gs_usb_fd HEALTHY host frame data: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[ 1976.652765] gs_usb_fd HEALTHY host frame data: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[ 1976.652780] gs_usb_fd HEALTHY host frame data: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[ 1976.652792] gs_usb_fd HEALTHY host frame data: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[ 1976.652807] gs_usb_fd HEALTHY host frame data: 00 00 00 00 80 0e 00 00 00 00 00 00 00 00 00 00 ................
[ 1976.655893] IPv6: ADDRCONF(NETDEV_CHANGE): can0: link becomes ready
[ 2152.808549] gs_usb_fd 1-1.3:1.0 can0: Unexpected out of range echo id -256
[ 2152.808632] gs_usb_fd FAILED host frame data: 00 00 c0 c1 00 ff 00 ff 00 00 00 00 00 00 00 00 ................
[ 2152.808653] gs_usb_fd FAILED host frame data: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[ 2152.808671] gs_usb_fd FAILED host frame data: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[ 2152.808691] gs_usb_fd FAILED host frame data: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[ 2152.808710] gs_usb_fd FAILED host frame data: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[ 2152.808727] gs_usb_fd FAILED host frame data: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[ 2152.808745] gs_usb_fd FAILED host frame data: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[ 2152.808762] gs_usb_fd FAILED host frame data: 00 00 00 00 00 00 00 00 f9 ee da 98 05 00 00 00 ................
[ 2152.808780] gs_usb_fd FAILED host frame data: 04 27 02 50 61 00 00 00 00 00 00 00 00 00 00 00 .'.Pa...........
[ 2152.808798] gs_usb_fd FAILED host frame data: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[ 2152.808818] gs_usb_fd FAILED host frame data: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[ 2152.808836] gs_usb_fd FAILED host frame data: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[ 2152.808853] gs_usb_fd FAILED host frame data: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[ 2152.808871] gs_usb_fd FAILED host frame data: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[ 2152.808888] gs_usb_fd FAILED host frame data: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[ 2152.808906] gs_usb_fd FAILED host frame data: 00 00 00 00 00 00 00 00 f9 ee da 98 05 00 00 00 ................
But then again I'd need a bit more insight into the bytes from hr->data
... also determining how the Cantact Pro firmware side comes into play with all this.... @ericevenchick, help? :_)
Cheers!
I guess that I'd also need to compile the gs_usb kmod with printks on the buffer for that URB structure when the error happens. Please do let me know if there are better shortcuts! ;)
The interesting stuff is in the "Leftover Capture Data", which is the actual data that the gs-usb driver receives.
Which driver are you using? The mainline one or the gs_usb_fd
?
Looking at the mainline gs-ubs driver, there is the struct gs_host_frame
:
struct gs_host_frame {
u32 echo_id;
__le32 can_id;
u8 can_dlc;
u8 channel;
u8 flags;
u8 reserved;
u8 data[8];
} __packed;
the gs_usb_fd
driver has a similar struct. The "Leftover Capture Data" of the wireshark dump is the struct gs_host_frame
. The driver you have modified only prints the struct gs_host_frame::data
, which is the CAN data. What's interesting here is the hf->echo_id
(see: https://github.com/torvalds/linux/blob/master/drivers/net/can/usb/gs_usb.c#L353). The commented out part is what we need :)
Which driver are you using? The mainline one or the gs_usb_fd?
As I mentioned in https://github.com/linklayer/cantact/issues/19#issuecomment-916591001, I'm using Eric's CAN_FD fork, but it really doesn't have that many differences vs the mainline one... just using it because it already had the Makefile & co ready to go right after cloning it, but for the purposes of this debugging it shouldn't matter, I reckon? Happy to change it to mainline if you prefer, no worries ;)
What's interesting here is the hf->echo_id
Yeah, printed that before offline and it always is -256
when it fails with Unexpected out of range echo id -256
(predictably) and -1
otherwise, that's it.
It doesn't go wild with other values or anything like that, AFAICT... one thing I'm doing that might be slightly unusual is that I have "extended CAN IDs":
Do you think that gs_usb
could have issues with that? I'll send the data you requested over email soon, cheers!
I got the pcap file, let's extract the raw data from it:
tshark -r cantact-filtered.pcap -Y usb.capdata -T fields -e usb.capdata > cantact-filtered.raw
Now look for the -256
in the first 32 bit. -256
is 0xffffff00
and as a Little Endian byte stream it's 00ffffff
. So look for lines starting with 00ffffff
.
grep ^00ffffff cantact-filtered.raw
All those lines look like this:
00ffffffee6cfe8c080000000000c0c100ff00ff
You can now use
usb.capdata == 00:ff:ff:ff:ee:6c:fe:8c:08:00:00:00:00:00:c0:c1:00:ff:00:ff
as a filter for wireshark and look at the USB frames around it. I suppose the firmware has a bug sending these frames. Next steps would be to look at the firmware.
It doesn't go wild with other values or anything like that, AFAICT... one thing I'm doing that might be slightly unusual is that I have "extended CAN IDs":
I don't think it's a problem with extended CAN IDs, the problem is in the first 32 bit within the struct gs_host_frame
, which is the u32 echo_id
.
Ok, I've started looking at the firmware side after reproducing the same issue with a SocketCAN C snippet, just to discard it wasn't me doing something funky with Rust's socketcan-isotp.
I've published this repo with a reproducer: https://github.com/brainstorm/cantact-pro-fw-kaboom-reproducer
Which contains a usbmon USB traffic .PCAP, should @ericevenchick or anybody else wish to explore the root cause of this issue further.
And I've also started looking into the Cantact Pro firmware: https://github.com/linklayer/cantact-pro-fw/issues/3
While also finding out that without implementing DMA, there can be bad CAN side effects with the Cantact Pro's MCU, but I'm very much still learning:
https://community.nxp.com/t5/MCUXpresso-SDK/Basic-CAN-sniffer-monitor/m-p/1046188
... I might grab another CAN dongle that allows me to work on the main project though X"D
Hi there, we had the same issue. The problem comes from an erroneous behaviour of the µC. See errata sheet ES_LPC546xx 3.22 USB.15: USB high-speed device in endpoint TX data corruption.
Adding a dummy byte in the gs_host_frame struct in the linux driver solved the problem.
struct gs_host_frame {
u32 echo_id;
u32 can_id;
u8 can_dlc;
u8 channel;
u8 flags;
u8 reserved;
u8 data[64];
/* Workaround for "USB high-speed device in endpoint TX data corruption"
* on NXP LPC546xx controllers (Errata sheet LPC546xx / USB.15)
* This corruption occurs when the following conditions are met:
* -> A TX (IN) transfer happens after a RX (OUT) transfer.
* -> The RX (OUT) transfer length is 4 + N*16 (N>=0) bytes.
* The current workaround is to add a dummy byte to the outgoing data to
* break the second condition.
*/
u8 errata_dummy;
} __packed;
Nice, good find @moechr, I'll test it with my setup... do you think it's easy to fix on the firmware side instead?
EDIT: from your mentioned errata:
And it works, it doesn't hang anymore after a few ISOTP packets, thanks for the workaround! :)
the fix is easy - just transfer a number of bytes which is "violating" the rule mentioned in the errata as shown in my upper comment.
Instead of the workaround another µC-series could be used ;)
I think the problem also occurs with Rx transfer length 4+N*8.
The -256 error message on dmesg is gone so thanks for that @moechr, but it still occasionally still hangs for me after your patch, here are some details:
$ pgrep cancan | xargs -I % sudo cat /proc/%/stack
[<0>] __skb_wait_for_more_packets+0x140/0x180
[<0>] __skb_recv_datagram+0x84/0xd4
[<0>] skb_recv_datagram+0x48/0x68
[<0>] isotp_recvmsg+0x4c/0x204 [can_isotp]
[<0>] sock_recvmsg+0x50/0x54
[<0>] sock_read_iter+0xa4/0xf8
[<0>] vfs_read+0x304/0x338
[<0>] ksys_read+0xc0/0xd8
[<0>] sys_read+0x18/0x1c
[<0>] __entry_text_start+0x8/0x8
[<0>] 0xbee503f8
After this kernel backtrace/symptom, I have to bring down and up the can0
interface if I want to send/receive further packets at all.
I think the problem also occurs with Rx transfer length 4+N*8.
Did you address this one too somehow?
The "-256 error message" problem should be fixed with the GS_CAN_FEATURE_REQ_USB_QUIRK_LPC546XX
quirk that will hit Linux mainline soon.
Faantastic, closing then, thanks everyone ;)
After stressing CANTACT PRO a bit with a variant of this example but in a loop (I might post a code snippet/reproducer a bit later on):
https://github.com/marcelbuesing/socketcan-isotp/blob/master/examples/uds.rs
I've noticed the following messages in the kernel's dmesg and the interface itself stopped responding until I ran
ifconfig can0 down && ifconfig can0 up
again (cantact lights still on, unlike #18):I'm running on a RPi1 with kernel:
Linux raspberrypi 5.10.17+ #1421 Thu May 27 13:58:02 BST 2021 armv6l GNU/Linux
Did you experience this in any of your tests with ISOTP payloads, @ericevenchick? @hartkopp Have you seen that
-105
ret code and what could be the underlying issue?And @HubertD or @marckleinebudde, have you seen those "Unexpected out of range echo id -256" in any other device(s) you've tested with your
gs_usb
lkm?