raspberrypi / linux

Kernel source tree for Raspberry Pi-provided kernel builds. Issues unrelated to the linux kernel should be posted on the community forum at https://forums.raspberrypi.com/
Other
11.2k stars 5.03k forks source link

[rpi zero 2w] Disconnect/connect USB devices on the fly leads to unrecoverable error in USB driver #6247

Open qrp73 opened 5 months ago

qrp73 commented 5 months ago

Steps to reproduce: 1) Use Raspberry Pi Zero 2W device with latest raspi os bookworm 64 Linux rpi2w 6.6.31+rpt-rpi-v8 raspberrypi/bookworm-feedback#1 SMP PREEMPT Debian 1:6.6.31-1+rpt1 (2024-05-29) aarch64 GNU/Linux 2) Run dmesg -w to monitor kernel errors 3) Disconnect all devices from USB port 4) Connect USB keyboard 5) Disconnect USB keyboard 6) Repeat steps 4 & 5 in a loop until this error in dmesg output:

[  431.506128] usb 1-1: device descriptor read/64, error -71

UPDATED NOTE: connect/disconnect should be long enough to detect new connection, but not enough to read device info (see details below).

After that the system is unable to detect USB device connections anymore. Only reboot can fix it. But it still can accept ssh connection through WiFi.

It may take 10-30 connect/disconnect loops to catch the issue.

Device: Raspberry Pi Zero 2W Rev 1.0 Revision : 902120

$ cat /etc/rpi-issue
Raspberry Pi reference 2024-03-15
Generated using pi-gen, https://github.com/RPi-Distro/pi-gen, f19ee211ddafcae300827f953d143de92a5c6624, stage2

$ vcgencmd version
May 24 2024 15:31:28 
Copyright (c) 2012 Broadcom
version 4942b7633c0ff1af1ee95a51a33b56a9dae47529 (clean) (release) (start)

$ uname -a
Linux rpi2w 6.6.31+rpt-rpi-v8 raspberrypi/bookworm-feedback#1 SMP PREEMPT Debian 1:6.6.31-1+rpt1 (2024-05-29) aarch64 GNU/Linux

The log:

[  427.362039] Indeed it is in host mode hprt0 = 00041501
[  428.178199] Indeed it is in host mode hprt0 = 00021d01
[  428.506226] Indeed it is in host mode hprt0 = 00021d01
[  428.694165] usb 1-1: new full-speed USB device number 14 using dwc_otg
[  430.358619] Indeed it is in host mode hprt0 = 00021501
[  431.178152] Indeed it is in host mode hprt0 = 00021d01
[  431.506128] usb 1-1: device descriptor read/64, error -71
[  433.680703] Indeed it is in host mode hprt0 = 00021501
[  434.498169] Indeed it is in host mode hprt0 = 00021d01
[  434.826084] usb 1-1: device descriptor read/64, error -71
[  434.934131] usb usb1-port1: attempt power cycle
lurch commented 4 months ago
  1. Is there a reason why you need to constantly unplug and replug your USB keyboard 30 times? :wink:
  2. This sound like a Linux kernel issue. Could you transfer it please @P33M ?
qrp73 commented 4 months ago
  1. Is there a reason why you need to constantly unplug and replug your USB keyboard 30 times? 😉

Yes, this is very mandatory feature to be able connect and disconnect USB devices on the fly. This is a typical use-case for USB port.

Note that the issue happens with any kind of USB device. I'm used USB keyboard just as example for demonstration as a typical USB device. But the same issue happens with any other kind of USB devices, like USB mouse, USB sound card, USB flash drives, etc.

First I notice, that USB port stops to work after some period of usage. It happens when I worked with different USB devices, so I started investigation what is going on and I found that USB driver randomly going into unrecoverable error at USB device connect/disconnect time. And this issue can be easily reproduced with simple connect/disconnect USB device in a loop.

2. This sound like a [Linux kernel](https://github.com/raspberrypi/linux) issue. Could you transfer it please @P33M ?

I don't know how to transfer it between different repositories. How can I do it?

lurch commented 4 months ago

Thank you for the additional background details :+1:

I don't know how to transfer it between different repositories. How can I do it?

You can't - that's why I asked @P33M (who's one of our kernel developers) to transfer it :slightly_smiling_face:

pelwell commented 4 months ago

And does adding dtoverlay=dwc2,dr_mode=host make it more reliable? Note that I would expect a loss of performance with that configuration.

qrp73 commented 4 months ago

And does adding dtoverlay=dwc2,dr_mode=host make it more reliable? Note that I would expect a loss of performance with that configuration.

No, it don't helps. It still happens with this line in config.txt:

[   16.746657] systemd[691]: memfd_create() called without MFD_EXEC or MFD_NOEXEC_SEAL set
[   49.530105] usb 1-1: new full-speed USB device number 2 using dwc2
[   52.386155] usb 1-1: new full-speed USB device number 3 using dwc2
[   54.986123] usb 1-1: new full-speed USB device number 4 using dwc2
[   57.414126] usb 1-1: new full-speed USB device number 5 using dwc2
[   59.862129] usb 1-1: new full-speed USB device number 6 using dwc2
[   62.214135] usb 1-1: new full-speed USB device number 7 using dwc2
[   64.410126] usb 1-1: new full-speed USB device number 8 using dwc2
[   66.618129] usb 1-1: new full-speed USB device number 9 using dwc2
[   69.058149] usb 1-1: new full-speed USB device number 10 using dwc2
[   71.198127] usb 1-1: new full-speed USB device number 11 using dwc2
[   73.294130] usb 1-1: new full-speed USB device number 12 using dwc2
[   75.358170] usb 1-1: new full-speed USB device number 13 using dwc2
[   77.546156] usb 1-1: new full-speed USB device number 14 using dwc2
[   79.766176] usb 1-1: new full-speed USB device number 15 using dwc2
[   81.786202] usb 1-1: new full-speed USB device number 16 using dwc2
[   83.986221] usb 1-1: new full-speed USB device number 17 using dwc2
[   85.970239] usb 1-1: new full-speed USB device number 18 using dwc2
[   86.994249] usb 1-1: device descriptor read/64, error -71
[   87.102293] dwc2 3f980000.usb: Not connected
[   87.102336] dwc2 3f980000.usb: Not connected
[   87.102356] dwc2 3f980000.usb: Not connected
[   88.838260] usb 1-1: new full-speed USB device number 19 using dwc2
[   90.882299] usb 1-1: new full-speed USB device number 20 using dwc2
[   93.346285] usb 1-1: new full-speed USB device number 21 using dwc2
[   95.762293] usb 1-1: new full-speed USB device number 22 using dwc2
[  155.657248] usb 1-1: new full-speed USB device number 23 using dwc2
[  171.160539] usb 1-1: device descriptor read/64, error -110
[  186.776034] usb 1-1: device descriptor read/64, error -110
[  187.072011] usb 1-1: new full-speed USB device number 24 using dwc2
[  195.303805] usb 1-1: new full-speed USB device number 25 using dwc2
[  200.411900] dwc2 3f980000.usb: dwc2_hc_chhltd_intr_dma: Channel 3 - ChHltd set, but reason is unknown
[  200.411932] dwc2 3f980000.usb: hcint 0x00000002, intsts 0x04000029
[  200.411973] dwc2 3f980000.usb: dwc2_hc_chhltd_intr_dma: Channel 4 - ChHltd set, but reason is unknown
[  200.411994] dwc2 3f980000.usb: hcint 0x00000002, intsts 0x04000029
[  200.412029] dwc2 3f980000.usb: dwc2_hc_chhltd_intr_dma: Channel 5 - ChHltd set, but reason is unknown
[  200.412049] dwc2 3f980000.usb: hcint 0x00000002, intsts 0x04000029
[  201.059703] usb 1-1: new low-speed USB device number 26 using dwc2
[  203.024764] dwc2 3f980000.usb: dwc2_hc_chhltd_intr_dma: Channel 7 - ChHltd set, but reason is unknown
[  203.024795] dwc2 3f980000.usb: hcint 0x00000002, intsts 0x04000029
[  203.024834] dwc2 3f980000.usb: dwc2_hc_chhltd_intr_dma: Channel 0 - ChHltd set, but reason is unknown
[  203.024855] dwc2 3f980000.usb: hcint 0x00000002, intsts 0x04000029
[  203.027749] dwc2 3f980000.usb: Not connected
[  203.027794] dwc2 3f980000.usb: Not connected
[  204.115703] usb usb1-port1: attempt power cycle

But your line in config.txt eliminates other errors, like this one (this is the log taken without dtoverlay=dwc2,dr_mode=host):

[   17.326440] systemd[692]: memfd_create() called without MFD_EXEC or MFD_NOEXEC_SEAL set
[  107.927526] Indeed it is in host mode hprt0 = 00041d01
[  108.115436] usb 1-1: new low-speed USB device number 2 using dwc_otg

[  108.802692] ERROR::dwc_otg_hcd_urb_enqueue:501: Not connected

[  108.802714] ERROR::dwc_otg_hcd_urb_enqueue:501: Not connected

[  111.647392] Indeed it is in host mode hprt0 = 00021501
[  111.775384] Indeed it is in host mode hprt0 = 00021501
[  112.043387] Indeed it is in host mode hprt0 = 00021501
[  112.311382] Indeed it is in host mode hprt0 = 00021501
[  112.579378] Indeed it is in host mode hprt0 = 00021d01
[  115.119385] Indeed it is in host mode hprt0 = 00021d01
[  115.307322] usb 1-1: new full-speed USB device number 4 using dwc_otg

[  117.201496] ERROR::dwc_otg_hcd_urb_enqueue:501: Not connected

[  117.201533] ERROR::dwc_otg_hcd_urb_enqueue:501: Not connected

But I found more details about the issue and why it may be hard to reproduce on different systems.

It appears that my USB OTG cable has a tricky bad contacts, when I plug device tightly to it, it appears that the device is connected for a very short time and then disconnected. And it appears that this issue of my USB OTG helps to catch the issue with unrecoverable error state in the driver.

Notice difference in the log taken with a good USB OTG with reliable contacts:

[   17.462401] systemd[694]: memfd_create() called without MFD_EXEC or MFD_NOEXEC_SEAL set
[  256.726770] Indeed it is in host mode hprt0 = 00041901
[  256.914666] usb 1-1: new low-speed USB device number 2 using dwc_otg
[  256.915587] Indeed it is in host mode hprt0 = 00041901
[  257.128826] usb 1-1: New USB device found, idVendor=1c4f, idProduct=0026, bcdDevice= 1.10
[  257.128854] usb 1-1: New USB device strings: Mfr=1, Product=2, SerialNumber=0
[  257.128863] usb 1-1: Product: USB Keyboard
[  257.128870] usb 1-1: Manufacturer: SIGMACHIP
[  257.134261] input: SIGMACHIP USB Keyboard as /devices/platform/soc/3f980000.usb/usb1/1-1/1-1:1.0/0003:1C4F:0026.0001/input/input2
[  257.195408] hid-generic 0003:1C4F:0026.0001: input,hidraw0: USB HID v1.10 Keyboard [SIGMACHIP USB Keyboard] on usb-3f980000.usb-1/input0
[  257.200558] input: SIGMACHIP USB Keyboard Consumer Control as /devices/platform/soc/3f980000.usb/usb1/1-1/1-1:1.1/0003:1C4F:0026.0002/input/input3
[  257.259078] input: SIGMACHIP USB Keyboard System Control as /devices/platform/soc/3f980000.usb/usb1/1-1/1-1:1.1/0003:1C4F:0026.0002/input/input4
[  257.259382] hid-generic 0003:1C4F:0026.0002: input,hidraw1: USB HID v1.10 Device [SIGMACHIP USB Keyboard] on usb-3f980000.usb-1/input1
[  258.566753] usb 1-1: USB disconnect, device number 2
[  259.702693] Indeed it is in host mode hprt0 = 00041901
[  259.890653] usb 1-1: new low-speed USB device number 3 using dwc_otg
[  259.891469] Indeed it is in host mode hprt0 = 00041901
[  260.104853] usb 1-1: New USB device found, idVendor=1c4f, idProduct=0026, bcdDevice= 1.10
[  260.104874] usb 1-1: New USB device strings: Mfr=1, Product=2, SerialNumber=0
[  260.104883] usb 1-1: Product: USB Keyboard
[  260.104890] usb 1-1: Manufacturer: SIGMACHIP
[  260.111813] input: SIGMACHIP USB Keyboard as /devices/platform/soc/3f980000.usb/usb1/1-1/1-1:1.0/0003:1C4F:0026.0003/input/input5
[  260.171326] hid-generic 0003:1C4F:0026.0003: input,hidraw0: USB HID v1.10 Keyboard [SIGMACHIP USB Keyboard] on usb-3f980000.usb-1/input0
[  260.176030] input: SIGMACHIP USB Keyboard Consumer Control as /devices/platform/soc/3f980000.usb/usb1/1-1/1-1:1.1/0003:1C4F:0026.0004/input/input6
[  260.235083] input: SIGMACHIP USB Keyboard System Control as /devices/platform/soc/3f980000.usb/usb1/1-1/1-1:1.1/0003:1C4F:0026.0004/input/input7
[  260.235360] hid-generic 0003:1C4F:0026.0004: input,hidraw1: USB HID v1.10 Device [SIGMACHIP USB Keyboard] on usb-3f980000.usb-1/input1
[  261.046730] usb 1-1: USB disconnect, device number 3
[  263.174719] Indeed it is in host mode hprt0 = 00041901
[  263.362642] usb 1-1: new low-speed USB device number 4 using dwc_otg
[  263.363536] Indeed it is in host mode hprt0 = 00041901
[  263.576770] usb 1-1: New USB device found, idVendor=1c4f, idProduct=0026, bcdDevice= 1.10
[  263.576790] usb 1-1: New USB device strings: Mfr=1, Product=2, SerialNumber=0
[  263.576799] usb 1-1: Product: USB Keyboard
[  263.576807] usb 1-1: Manufacturer: SIGMACHIP
[  263.588330] input: SIGMACHIP USB Keyboard as /devices/platform/soc/3f980000.usb/usb1/1-1/1-1:1.0/0003:1C4F:0026.0005/input/input8
[  263.651344] hid-generic 0003:1C4F:0026.0005: input,hidraw0: USB HID v1.10 Keyboard [SIGMACHIP USB Keyboard] on usb-3f980000.usb-1/input0
[  263.656171] input: SIGMACHIP USB Keyboard Consumer Control as /devices/platform/soc/3f980000.usb/usb1/1-1/1-1:1.1/0003:1C4F:0026.0006/input/input9
[  263.715061] input: SIGMACHIP USB Keyboard System Control as /devices/platform/soc/3f980000.usb/usb1/1-1/1-1:1.1/0003:1C4F:0026.0006/input/input10
[  263.715707] hid-generic 0003:1C4F:0026.0006: input,hidraw1: USB HID v1.10 Device [SIGMACHIP USB Keyboard] on usb-3f980000.usb-1/input1
[  267.990700] usb 1-1: USB disconnect, device number 4

As you can see, with USB OTG cable which has not reliable contacts, it detects USB connection, but does not have time to read information about the device, because it immediately disconnected in very short time. And when this happens many times, the driver is going into unrecoverable state... Once it happens, any further attempt to connect device with a good USB OTG cable don't helps - the driver remains in unrecoverable error state and stops to handle USB devices.

So, I think there is some issue in the driver which happens when it detects a new connection, but does not have time to read data about the new device because it quickly disconnects.

qrp73 commented 4 months ago

Some new finding. I'm able to reproduce the issue on a good USB OTG cable, but its' much more hard. It requires to connect/disconnect USB very shortly (connect it for half a second) to allow connection detect, but not allow read information about device (you can visially check it with dmesg -w).

With a good USB OTG cable it takes for about 117 connect/disconnect cycles to catch the issue (first 30-40 of which was with a long enough time to read device info):

[ 1469.821421] Indeed it is in host mode hprt0 = 00041901
[ 1471.176546] Indeed it is in host mode hprt0 = 00041901
[ 1471.996533] Indeed it is in host mode hprt0 = 00041901
[ 1472.324507] usb 1-1: new low-speed USB device number 112 using dwc_otg

[ 1472.324567] ERROR::dwc_otg_hcd_urb_enqueue:501: Not connected

[ 1472.324598] ERROR::dwc_otg_hcd_urb_enqueue:501: Not connected

[ 1472.324623] ERROR::dwc_otg_hcd_urb_enqueue:501: Not connected

[ 1473.656537] Indeed it is in host mode hprt0 = 00041901
[ 1475.144531] Indeed it is in host mode hprt0 = 00041901
[ 1476.144486] Indeed it is in host mode hprt0 = 00041901
[ 1476.332453] usb 1-1: new low-speed USB device number 115 using dwc_otg
[ 1476.333268] Indeed it is in host mode hprt0 = 00041901
[ 1477.152471] Indeed it is in host mode hprt0 = 00041901

[ 1477.480470] ERROR::dwc_otg_hcd_urb_enqueue:501: Not connected

[ 1477.688478] ERROR::dwc_otg_hcd_urb_enqueue:501: Not connected

[ 1478.240472] Indeed it is in host mode hprt0 = 00041901
[ 1479.132470] Indeed it is in host mode hprt0 = 00041901
[ 1479.460441] usb 1-1: new low-speed USB device number 116 using dwc_otg
[ 1479.461250] Indeed it is in host mode hprt0 = 00041901
[ 1480.280503] Indeed it is in host mode hprt0 = 00041901

[ 1480.628473] ERROR::dwc_otg_hcd_urb_enqueue:501: Not connected

[ 1480.628506] ERROR::dwc_otg_hcd_urb_enqueue:501: Not connected

[ 1480.628531] ERROR::dwc_otg_hcd_urb_enqueue:501: Not connected

[ 1480.628589] usb usb1-port1: attempt power cycle

after that USB driver going into unrecoverable error and stops to detect any further USB device connections. It just don't react on USB connections after that (device is not detected and no new log messages at all).

So the issue is definitely is related to not enough time to read device info when device connection happens.

qrp73 commented 4 months ago

Can you please fix it? This is important issue, because it leads to random unrecoverable error in USB drivers which makes any USB device unavailable until system reboot.

HectorPulido commented 6 days ago

I have the exact same issue, also usb 1-1: device descriptor read/64, error -71

lategoodbye commented 5 days ago

Hi, in order to investigate further this needs more specific information:

lategoodbye commented 2 days ago

I was able to reproduce this issue with Mainline 6.12 on a Raspberry Pi 3A+ and the real issue seems the power cycle for the root port. It seems that hub control isn't tested very often because most of the Pis have a builtin hub.

The following branch contains the relevant fix ( usb: dwc2: hcd: Fix GetPortStatus & SetPortFeature ): https://github.com/lategoodbye/linux-dev/commits/v6.12-dwc_hcd_fixes/

The dwc_otg driver seems to suffer from the same pattern.

I'm still a newbie to the dwc2 driver, so a review or at least test feedback is welcome.

popcornmix commented 1 day ago

@qrp73 are you able to build a kernel with @lategoodbye's patch? @P33M does the patch make sense to you?

P33M commented 19 hours ago

I guess so. Root port behaviours with the wacky OTG session negotiation aren't really tested as Pi board tend to force modes one way or the other - and usually the Zero boards are used with just the data socket providing power.

lategoodbye commented 19 hours ago

One of the reasons why I'm publishing these patches here is that the Mainline DWC2 maintainer from Synopsys wasn't active on linux-usb since July.

lategoodbye commented 14 hours ago

I added another fix ( usb: dwc2: Fix HCD port connection race ), which is relevant to @qrp73's use case.

popcornmix commented 13 hours ago

@qrp73 (and others) if you are willing to switch to a newer kernel (6.12) with lategoodbye's proposed patches you can run:

sudo rpi--update pulls/6496

Wait about 45mins for the build to complete.