raspberrypi / linux

Kernel source tree for Raspberry Pi-provided kernel builds. Issues unrelated to the linux kernel should be posted on the community forum at https://forums.raspberrypi.com/

Kernel Issues on Idle RPi4s #3919

Closed - josefcub closed this issue 3 years ago

josefcub commented 3 years ago

Hello!

I've got a small network of Raspberry Pi 4 4GB models, all running the 2020-08-20 64-bit beta "Lite" image and doing various jobs, from forming a Kubernetes cluster to supporting the cluster and a small private network attached to it. Since I started running the 64-bit image, I've noticed that my Pi4s that are -not- doing a lot (specifically my sandbox, the Pi running DNS/DHCP, and the repository mirror and docker registry Pi) will occasionally go off the rails and produce this:

Oct 16 19:53:58 backup kernel: [1154789.177691] xhci_hcd 0000:01:00.0: Ring expansion failed
Oct 16 19:53:58 backup kernel: [1154789.177766] cdc_acm 1-1.3:1.0: acm_port_activate - usb_submit_urb(ctrl irq) failed
Oct 16 19:53:58 backup led-control[32074]: /usr/local/bin/led-control: line 43: /dev/ttyACM0: Cannot allocate memory
Oct 16 19:53:58 backup kernel: [1154789.182799] raspberrypi-exp-gpio soc:firmware:gpio: Failed to set GPIO 6 state (-12 86)
Oct 16 19:53:58 backup kernel: [1154789.184201] raspberrypi-exp-gpio soc:firmware:gpio: Failed to set GPIO 6 state (-12 86)
Oct 16 19:53:58 backup kernel: [1154789.184332] raspberrypi-exp-gpio soc:firmware:gpio: Failed to set GPIO 4 state (-12 84)
Oct 16 19:53:59 backup kernel: [1154789.226302] raspberrypi-exp-gpio soc:firmware:gpio: Failed to set GPIO 6 state (-12 86)
Oct 16 19:53:59 backup kernel: [1154789.227686] raspberrypi-exp-gpio soc:firmware:gpio: Failed to set GPIO 6 state (-12 86)
Oct 16 19:53:59 backup kernel: [1154789.227810] raspberrypi-exp-gpio soc:firmware:gpio: Failed to set GPIO 4 state (-12 84)
Oct 16 19:53:59 backup kernel: [1154789.241425] cpu cpu0: dev_pm_opp_set_rate: failed to find current OPP for freq 18446744073709551604 (-34)

Those messages repeat many times per second until I reboot the affected Pi.

A reboot does clear up the issue for anywhere from 8 to 36 hours (aside from leaving hundreds of megabytes of logs if I haven't paid attention in a while). Is this a known issue?

Of all of the Pis in the network and cluster, the three Pis that do this with regularity are the ones that are the most idle. The database server, the NFS server, and Kubernetes masters and workers are constantly busy, and very rarely do this (and haven't in a month or more since I stood Kubernetes up). It's the only difference between the Pis that I can find, as they are all nearly identical (differing only in SSD size).

All Pis are up to date on software and firmware at this time, as I boot directly from SSD on all of them.

Swapping the Pis around (a Kubernetes worker Pi for my sandbox Pi, for instance) eventually results in the new sandbox Pi exhibiting the same behavior.

The Pis ran the 32-bit Raspbian image doing the same things without issue.

Would any of you have an idea of what's going on here?

Hardware and kernel information:

Raspberry Pi 4B, 4GB

$ cat /proc/version
Linux version 5.4.51-v8+ (dom@buildbot) (gcc version 5.4.0 20160609 (Ubuntu/Linaro 5.4.0-6ubuntu1~16.04.9)) #1333 SMP PREEMPT Mon Aug 10 16:58:35 BST 2020

$ cat /etc/debian_version
10.6

$ lspci
00:00.0 PCI bridge: Broadcom Limited Device 2711 (rev 10)
01:00.0 USB controller: VIA Technologies, Inc. VL805 USB 3.0 Host Controller (rev 01)

$ lsusb
Bus 002 Device 002: ID 174c:55aa ASMedia Technology Inc. Name: ASM1051E SATA 6Gb/s bridge, ASM1053E SATA 6Gb/s bridge, ASM1153 SATA 3Gb/s bridge, ASM1153E SATA 6Gb/s bridge
Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 001 Device 003: ID 2047:03df Texas Instruments
Bus 001 Device 002: ID 2109:3431 VIA Labs, Inc. Hub
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub

$ vcgencmd bootloader_version
Sep  3 2020 13:11:43
version c305221a6d7e532693cc7ff57fddfc8649def167 (release)
timestamp 1599135103

I can provide as much information as you require, just let me know what you need.

Thank you!

lurch commented 3 years ago

pinging @P33M as this looks like it might be a USB issue. What does the led-control script do?

josefcub commented 3 years ago

> pinging @P33M as this looks like it might be a USB issue. What does the led-control script do?

It's this: https://pastebin.com/DT0c3c5R

I cobbled together a quick pair of scripts to control the USB RGB LED. The first writes a hex string to /dev/ttyACM0 once a second. I do recognize that I could probably throw a second sleep into the main loop (among many other optimizations), though the system load caused by this script is quite minimal. All of the Pis, both affected and not, run it.

The second script merely alters the contents of /tmp/led_color according to the system load.
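
For illustration only, a minimal sketch of what that load-to-color mapping could look like; the colors and threshold here are hypothetical, and the real scripts are at the pastebin link above:

#!/bin/bash
# Hypothetical sketch: map the 1-minute load average to a color
# for the led-control loop to pick up from /tmp/led_color.
while true; do
  load=$(cut -d' ' -f1 /proc/loadavg)
  # Dim green when mostly idle, dim red when busy (threshold arbitrary).
  if awk -v l="$load" 'BEGIN { exit !(l < 1.0) }'; then
    echo "#002200" > /tmp/led_color
  else
    echo "#220000" > /tmp/led_color
  fi
  sleep 1
done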

P33M commented 3 years ago

Hm. Why is xhci trying to expand the ring for an endpoint that gets poked with minimal data?

Can you dump out the ring sizes for the device in question? Preferably when the ring expansion has failed at least once.

Navigate (as root) to /sys/kernel/debug/usb/xhci/0000:01:00.0/devices/[DeviceN]/, where DeviceN is the slot ID (not the USB address) of your device. 01 will be the built-in USB2 hub, so from your lsusb output, 03 is probably the ACM device. Note that if you unplug/replug devices after boot, the 1:1 relationship between USB address and slot ID breaks.

For each of the epNN subdirectories, run wc -l trbs. I would expect them all to be 512 lines in size.
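
For convenience, the whole check can be done in one command; a sketch, assuming slot ID 03 as above:

$ sudo sh -c 'wc -l /sys/kernel/debug/usb/xhci/0000:01:00.0/devices/03/ep*/trbs'

The sh -c wrapper matters: debugfs is readable only by root, so the glob has to be expanded by a root shell rather than by your own.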

josefcub commented 3 years ago

Here you go:

my sandbox:

512 ep00/trbs
16384 ep02/trbs
512 ep03/trbs
512 ep04/trbs

my dns server:

512 ep00/trbs
32768 ep02/trbs
512 ep03/trbs
512 ep04/trbs

kubernetes master:

512 ep00/trbs
4096 ep02/trbs
512 ep03/trbs
512 ep04/trbs

one of the workers:

512 ep00/trbs
8192 ep02/trbs
512 ep03/trbs
512 ep04/trbs

I verified their lsusb output was identical to my sandbox (since I don't generally plug anything else into them, and they're all running the same basic image, they all had the same lsusb output).

P33M commented 3 years ago

OK, that's a big smoking gun right there. There should be no need for 32k of TRBs in the ep ring of any device.

Can you post the full output of sudo lsusb -v -d 2047: ? Is there a simple test script that I can use on an equivalent ACM device?

josefcub commented 3 years ago

Bus 001 Device 003: ID 2047:03df Texas Instruments
Device Descriptor:
  bLength                18
  bDescriptorType         1
  bcdUSB               2.00
  bDeviceClass            2 Communications
  bDeviceSubClass         0
  bDeviceProtocol         0
  bMaxPacketSize0         8
  idVendor           0x2047 Texas Instruments
  idProduct          0x03df
  bcdDevice            2.00
  iManufacturer           1 Compulab LTD
  iProduct                2 fit_StatUSB
  iSerial                 3 D9D0385131000900
  bNumConfigurations      1
  Configuration Descriptor:
    bLength                 9
    bDescriptorType         2
    wTotalLength       0x0043
    bNumInterfaces          2
    bConfigurationValue     1
    iConfiguration          4 MSP430 USB
    bmAttributes         0x80
      (Bus Powered)
    MaxPower              100mA
    Interface Descriptor:
      bLength                 9
      bDescriptorType         4
      bInterfaceNumber        0
      bAlternateSetting       0
      bNumEndpoints           1
      bInterfaceClass         2 Communications
      bInterfaceSubClass      2 Abstract (modem)
      bInterfaceProtocol      1 AT-commands (v.25ter)
      iInterface              5 Compulab fit_StatUSB
      CDC Header:
        bcdCDC               1.10
      CDC Call Management:
        bmCapabilities       0x00
        bDataInterface          1
      CDC ACM:
        bmCapabilities       0x02
          line coding and serial state
      CDC Union:
        bMasterInterface        0
        bSlaveInterface         1
      Endpoint Descriptor:
        bLength                 7
        bDescriptorType         5
        bEndpointAddress     0x81  EP 1 IN
        bmAttributes            3
          Transfer Type            Interrupt
          Synch Type               None
          Usage Type               Data
        wMaxPacketSize     0x0040  1x 64 bytes
        bInterval             255
    Interface Descriptor:
      bLength                 9
      bDescriptorType         4
      bInterfaceNumber        1
      bAlternateSetting       0
      bNumEndpoints           2
      bInterfaceClass        10 CDC Data
      bInterfaceSubClass      0
      bInterfaceProtocol      0
      iInterface              0
      Endpoint Descriptor:
        bLength                 7
        bDescriptorType         5
        bEndpointAddress     0x02  EP 2 OUT
        bmAttributes            2
          Transfer Type            Bulk
          Synch Type               None
          Usage Type               Data
        wMaxPacketSize     0x0040  1x 64 bytes
        bInterval             255
      Endpoint Descriptor:
        bLength                 7
        bDescriptorType         5
        bEndpointAddress     0x82  EP 2 IN
        bmAttributes            2
          Transfer Type            Bulk
          Synch Type               None
          Usage Type               Data
        wMaxPacketSize     0x0040  1x 64 bytes
        bInterval             255
can't get device qualifier: Resource temporarily unavailable
can't get debug descriptor: Resource temporarily unavailable
Device Status:     0x0000
  (Bus Powered)

I'll need a little time to come up with a stub script that does the same thing, and I'll post it here once I get it to produce the same behavior.

josefcub commented 3 years ago

I've watched the example script long enough to see my sandbox go from 16384 to 32768, so I've decided to pass this script on for replication.

#!/bin/bash

# Device node for the fit_StatUSB RGB LED (a CDC ACM serial device)
ACMPort="ttyACM0"

# Hex RGB color string the LED expects
WHITE="#222222"

function led_init () {

  # Put the port in raw mode with all echo turned off, so nothing
  # the device sends back gets re-echoed to it
  stty -F /dev/$ACMPort 115200 raw -echo -echoe -echok -echoctl -echoke

}

# Initialize LED
led_init

# Write the color string once a second, forever
while true; do

  echo $WHITE >/dev/$ACMPort
  sleep 1

done

Results are best obtained on a nearly idle Pi (my most common victim just runs BIND9 and DHCPD; the second most common just sshd and outgoing Ansible). It may take up to 3 days for the errors to trigger, but I've seen a Pi crater as soon as 8 hours after the last reboot for this problem.

Also, thank you for taking a look and being so quick!

P33M commented 3 years ago

Ah. I can reproduce this on regular Raspbian with a Pi Zero pretending to be a CDC ACM device in about 3 minutes if I speed up the loop. Crucially, after a certain amount of time it looks like the dequeue pointer isn't being updated, so each time the enqueue pointer wraps around to the dequeue pointer position, the ring undergoes an expansion.

Now to see if it's a Linux bug or something to do with our xhci controller...
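
For reference, a sped-up reproducer is essentially the reporter's loop with the sleep shortened; illustrative only, device path assumed to be the same as in the script above:

# Hammer the ACM endpoint ~10 times per second instead of once.
while true; do
  echo '#222222' > /dev/ttyACM0
  sleep 0.1
done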

P33M commented 3 years ago

The issue is specific to the VL805 and independent of platform (fails in the same manner on PC). When the first link TRB in the transfer ring for the CDC interrupt IN endpoint is encountered, the dequeue pointer in the endpoint context forever ends up pointing to it.

P33M commented 3 years ago

@pelwell this issue is not specific to 64-bit - can it be moved to the linux repo please? For some reason I don't have the option to move it.

pelwell commented 3 years ago

I would if I could - I haven't got past pleb status on this repo.

pelwell commented 3 years ago

@ghollingworth ?

P33M commented 3 years ago

The issue is quickly reproducible with the test case because of a coincidence of factors.

Naively avoiding the link TRB by double-incrementing the TR dequeue pointer if the driver would otherwise stop on it does fix the interrupt endpoint, but breaks the bulk IN endpoint in some other manner.

P33M commented 3 years ago

rpi-update firmware has a potential fix for this issue. Please test and report back.
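
For anyone following along, the test sequence looks roughly like this (slot ID 03 assumed, as earlier in the thread):

$ sudo rpi-update    # install the latest pre-release firmware and kernel
$ sudo reboot

Then, after the Pi has been up for a while:

$ sudo sh -c 'wc -l /sys/kernel/debug/usb/xhci/0000:01:00.0/devices/03/ep*/trbs'

All four rings should stay at 512 lines each.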

josefcub commented 3 years ago

After eleven hours of testing with the new rpi-update firmware, I am happy to report that:

Tue 27 Oct 2020 06:37:00 PM CDT
  512 /sys/kernel/debug/usb/xhci/0000:01:00.0/devices/03/ep00/trbs
  512 /sys/kernel/debug/usb/xhci/0000:01:00.0/devices/03/ep02/trbs
  512 /sys/kernel/debug/usb/xhci/0000:01:00.0/devices/03/ep03/trbs
  512 /sys/kernel/debug/usb/xhci/0000:01:00.0/devices/03/ep04/trbs
 2048 total

everything seems to be working correctly. Thank you very much!

FossPrime commented 3 years ago

Update 2

This board was probably water-damaged or physically damaged. Perhaps glue around the WiFi chip, or water near the SD card... who knows. It was used in an outdoor project. It's now a hood ornament for my parts cabinet.

Original

Device: Raspberry Pi 4 Model B Rev 1.2
OS: Raspbian lite armhf and Raspbian desktop armhf

On first boot I wasn't getting UART output. I turned it on with enable_uart=1, but that floods my serial console with these messages. I'll try setting GPIOs 6 and 4 to np (no pull); see the config.txt sketch after the log below. Ubuntu 20.04 server arm64 had UART on by default and did not experience the problem.

[    1.330583] raspberrypi-exp-gpio soc:firmware:gpio: Failed to set GPIO 4 state (0 ffffffff)
[    1.385313] mmc0: error -110 whilst initialising SD card
[    1.391797] raspberrypi-exp-gpio soc:firmware:gpio: Failed to set GPIO 6 state (0 ffffffff)
[    1.402180] raspberrypi-exp-gpio soc:firmware:gpio: Failed to set GPIO 6 state (0 ffffffff)
[    1.411293] raspberrypi-exp-gpio soc:firmware:gpio: Failed to set GPIO 4 state (0 ffffffff)
[    1.470772] raspberrypi-exp-gpio soc:firmware:gpio: Failed to set GPIO 6 state (0 ffffffff)
[  213.735018] raspberrypi-exp-gpio soc:firmware:gpio: Failed to set GPIO 6 state (0 ffffffff)
[  214.811252] raspberrypi-exp-gpio soc:firmware:gpio: Failed to set GPIO 6 state (0 ffffffff)
[  214.820272] raspberrypi-exp-gpio soc:firmware:gpio: Failed to set GPIO 4 state (0 ffffffff)
[  214.869327] raspberrypi-exp-gpio soc:firmware:gpio: Failed to set GPIO 6 state (0 ffffffff)
[  214.879636] raspberrypi-exp-gpio soc:firmware:gpio: Failed to set GPIO 6 state (0 ffffffff)
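
For reference, the config.txt lines that attempt would involve look roughly like this; the gpio= pull settings are an untested guess at quieting the messages, not a confirmed fix:

# /boot/config.txt
enable_uart=1
# leave GPIOs 4 and 6 with no pull (np = no pull)
gpio=4,6=np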

Update 1

Turns out it may be a corrupt Sandisk Ultra Plus U1 A1 16GB card.

[raspi-config dialog: "No wireless interface found"]
pickert commented 3 years ago

Very interesting, as I've got an RPi 4B 1GB that has been throwing these ever since the WiFi and BT suddenly failed after a Raspbian version bump. Can these come from faulty flashes, or only HW damage? I always chalked it up to some "permanent" failure due to a bad firmware/rpi-update fiasco or something.. =\

[    1.187791] raspberrypi-exp-gpio soc:firmware:gpio: Failed to set GPIO 2 state (0 ffffffff)
[    1.218822] raspberrypi-exp-gpio soc:firmware:gpio: Failed to set GPIO 6 state (0 ffffffff)
[    1.252569] raspberrypi-exp-gpio soc:firmware:gpio: Failed to set GPIO 4 state (0 ffffffff)

lurch commented 3 years ago

> the WiFi and BT suddenly failed after a raspbian version bump.. Can these be from faulty flashes or only HW damage?

The easiest way to test is to use Raspberry Pi Imager to write the latest version of Raspberry Pi OS onto a fresh SD card and see if you still get the same errors.

pickert commented 3 years ago

I have, many times. Over time I've wondered about possible VL805 EEPROM corruption. Any thoughts?


lurch commented 3 years ago

I've been told that EEPROM corruption is extremely unlikely; and even if you did have a corrupted EEPROM, I think you'd see a more extreme failure mode (e.g. a complete failure to boot) than just a few errors in a logfile :wink:

pelwell commented 3 years ago

What @lurch said. Plus, don't get confused by the various EEPROMs that are (or aren't) on a Pi 4: most Pi 4s sold so far have two EEPROMs - one containing the bootloader (previously kept in a file called bootcode.bin), and one containing the firmware for the VL805 USB controller. The most recent revision of the Pi 4 (and the CM4 and Pi 400) no longer has an EEPROM for the VL805 - its firmware is part of the main EEPROM and is loaded into RAM by the bootloader. There are signature checks for both firmwares, and the effect of corruption of either firmware is likely to be catastrophic.
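
For anyone wanting to check their own board, both firmware versions can be inspected from Raspberry Pi OS; a sketch (exact output fields vary with the rpi-eeprom-update version):

$ sudo rpi-eeprom-update       # reports bootloader (and, on applicable revisions, VL805) firmware status
$ vcgencmd bootloader_version  # prints the bootloader build date and commit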

Headcrabed commented 10 months ago

Hello @P33M, could this quirk get upstreamed?