raspberrypi / linux

Kernel source tree for Raspberry Pi-provided kernel builds. Issues unrelated to the linux kernel should be posted on the community forum at https://forums.raspberrypi.com/
Other
11.15k stars 5k forks source link

smsc95xx 1-1.1:1.0 eth0: Failed to read reg index 0x00000114: -110 #3088

Open GeorgeSapkin opened 5 years ago

GeorgeSapkin commented 5 years ago

Describe the bug Network connectivity drops out at random on CM3 with a custom carrier board with LAN9512. Device becomes completely unreachable over any network going through that chip (e.g. Ethernet, LTE).

To reproduce Unknown.

Expected behaviour Network connectivity doesn't randomly drop out.

Actual behaviour Network connectivity drops out.

System

Logs

Jul 18 20:53:50 data kernel: [277401.663895] smsc95xx 1-1.1:1.0 eth0: Failed to read reg index 0x00000114: -110
Jul 18 20:53:50 data kernel: [277401.663906] smsc95xx 1-1.1:1.0 eth0: Error reading MII_ACCESS
Jul 18 20:53:50 data kernel: [277401.663916] smsc95xx 1-1.1:1.0 eth0: MII is busy in smsc95xx_mdio_read
Jul 18 20:53:50 data kernel: [277401.663926] smsc95xx 1-1.1:1.0 eth0: Failed to read MII_BMSR
Jul 18 20:54:56 data kernel: [277466.864971] WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:461 dev_watchdog+0x294/0x298
Jul 18 20:54:56 data kernel: [277466.864977] NETDEV WATCHDOG: eth0 (smsc95xx): transmit queue 0 timed out
Jul 18 20:54:56 data kernel: [277466.864982] Modules linked in: tun cdc_ether cdc_acm rtc_ds1307 brcmfmac brcmutil sha256_generic cfg80211 rfkill raspberrypi_hwmon hwmon bcm2835_codec(C) bcm2835_v4l2(C) v4l2_mem2mem bcm2835_mmal_vchiq(C) v4l2_common videobuf2_dma_contig videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videobuf2_common videodev i2c_bcm2835 media vc_sm_cma(C) uio_pdrv_genirq uio fixed i2c_dev ip_table
s x_tables ipv6
Jul 18 20:54:56 data kernel: [277466.865151] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G         C        4.19.42-v7+ #1219
Jul 18 20:54:56 data kernel: [277466.865154] Hardware name: BCM2835
Jul 18 20:54:56 data kernel: [277466.865173] [<80111eac>] (unwind_backtrace) from [<8010d430>] (show_stack+0x20/0x24)
Jul 18 20:54:56 data kernel: [277466.865184] [<8010d430>] (show_stack) from [<8080ec80>] (dump_stack+0xd4/0x118)
Jul 18 20:54:56 data kernel: [277466.865196] [<8080ec80>] (dump_stack) from [<80120904>] (__warn+0x104/0x11c)
Jul 18 20:54:56 data kernel: [277466.865204] [<80120904>] (__warn) from [<80120974>] (warn_slowpath_fmt+0x58/0x74)
Jul 18 20:54:56 data kernel: [277466.865212] [<80120974>] (warn_slowpath_fmt) from [<807355b8>] (dev_watchdog+0x294/0x298)
Jul 18 20:54:56 data kernel: [277466.865223] [<807355b8>] (dev_watchdog) from [<801965a0>] (call_timer_fn+0x3c/0x198)
Jul 18 20:54:56 data kernel: [277466.865233] [<801965a0>] (call_timer_fn) from [<801967e8>] (expire_timers+0xec/0x14c)
Jul 18 20:54:56 data kernel: [277466.865242] [<801967e8>] (expire_timers) from [<80196900>] (run_timer_softirq+0xb8/0x1ec)
Jul 18 20:54:56 data kernel: [277466.865251] [<80196900>] (run_timer_softirq) from [<80102388>] (__do_softirq+0x190/0x3f0)
Jul 18 20:54:56 data kernel: [277466.865260] [<80102388>] (__do_softirq) from [<801268dc>] (irq_exit+0xfc/0x120)
Jul 18 20:54:56 data kernel: [277466.865271] [<801268dc>] (irq_exit) from [<8017ef24>] (__handle_domain_irq+0x70/0xc4)
Jul 18 20:54:56 data kernel: [277466.865281] [<8017ef24>] (__handle_domain_irq) from [<801021b4>] (bcm2836_arm_irqchip_handle_irq+0x60/0xa4)
Jul 18 20:54:56 data kernel: [277466.865288] [<801021b4>] (bcm2836_arm_irqchip_handle_irq) from [<801019bc>] (__irq_svc+0x5c/0x7c)
Jul 18 20:54:56 data kernel: [277466.865292] Exception stack(0x80d01ea0 to 0x80d01ee8)
Jul 18 20:54:56 data kernel: [277466.865299] 1ea0: bc56838c 00000040 00000000 00000000 bc568380 bc568380 00000040 bc56838c
Jul 18 20:54:56 data kernel: [277466.865305] 1ec0: 00000001 be5ff9c0 80c63a38 80d01efc 80d01f00 80d01ef0 8069fc9c 8082bdf0
Jul 18 20:54:56 data kernel: [277466.865309] 1ee0: a0000013 ffffffff
Jul 18 20:54:56 data kernel: [277466.865318] [<801019bc>] (__irq_svc) from [<8082bdf0>] (_raw_read_lock+0x18/0x38)
Jul 18 20:54:56 data kernel: [277466.865332] [<8082bdf0>] (_raw_read_lock) from [<8069fc9c>] (led_trigger_event+0x30/0x8c)
Jul 18 20:54:56 data kernel: [277466.865341] [<8069fc9c>] (led_trigger_event) from [<806a1664>] (ledtrig_cpu+0xd4/0xf0)
Jul 18 20:54:56 data kernel: [277466.865350] [<806a1664>] (ledtrig_cpu) from [<80109a78>] (arch_cpu_idle_exit+0x1c/0x20)
Jul 18 20:54:56 data kernel: [277466.865361] [<80109a78>] (arch_cpu_idle_exit) from [<801524d0>] (do_idle+0x100/0x17c)
Jul 18 20:54:56 data kernel: [277466.865371] [<801524d0>] (do_idle) from [<8015280c>] (cpu_startup_entry+0x28/0x2c)
Jul 18 20:54:56 data kernel: [277466.865379] [<8015280c>] (cpu_startup_entry) from [<808253e4>] (rest_init+0xbc/0xc0)
Jul 18 20:54:56 data kernel: [277466.865389] [<808253e4>] (rest_init) from [<80c00fb0>] (start_kernel+0x484/0x4b4)

Additional context I've already checked issues #60, #149, #552 and #746 but none of them really contain anything useful besides mentioning that the issue is already fixed.

nRESET pin on LAN9512 is pulled high and not connected to a GPIO pin.

lategoodbye commented 5 years ago

In order to exclude a possible hardware issue with your custom carrier board, it would be helpful to reproduce this issue on a Raspberry Pi board.

GeorgeSapkin commented 5 years ago

I cannot reproduce it with the official I/O board since it doesn't have Ethernet. And reproducing this on a full Pi doesn't seem relevant (different hardware). What else can I do to debug this?

lategoodbye commented 5 years ago

What's the difference between your custom IO board and a Raspberry Pi 3 regarding SMSC95xx wiring? How about providing your device tree source?

ckamas commented 5 years ago

I too am seeing this on a CM3 with an SMSC95xx. I too have looked at the other bug reports and was not able to find a solution to this issue.

I am confused about what in the device tree source you are looking for?

Jul 22 16:21:58 RevPi11734 kernel: [ 0.000000] Linux version 4.9.76-rt60-v7+ (admin@kunbus.de) (gcc version 4.8.3 20140303 (prerelease) (crosstool-NG linaro-1.13.1+bzr2650 - Linaro GCC 2014.03) ) #1 SMP PREEMPT RT Tue, 17 Jul 2018 14:20:12 +0200 ... Jul 18 12:21:52 RevPi11734 kernel: [ 3614.344611] smsc95xx 1-1.1:1.0 eth0: Failed to read reg index 0x00000114: -110 Jul 18 12:21:52 RevPi11734 kernel: [ 3614.344620] smsc95xx 1-1.1:1.0 eth0: Error reading MII_ACCESS Jul 18 12:21:52 RevPi11734 kernel: [ 3614.344626] smsc95xx 1-1.1:1.0 eth0: MII is busy in smsc95xx_mdio_read Jul 18 12:21:52 RevPi11734 kernel: [ 3614.344632] smsc95xx 1-1.1:1.0 eth0: Failed to read MII_BMSR ... Jul 18 23:19:11 RevPi11734 kernel: [43053.607537] smsc95xx 1-1.5.1:1.0 eth1: Failed to read reg index 0x00000114: -110 Jul 18 23:19:11 RevPi11734 kernel: [43053.607547] smsc95xx 1-1.5.1:1.0 eth1: Error reading MII_ACCESS Jul 18 23:19:11 RevPi11734 kernel: [43053.607553] smsc95xx 1-1.5.1:1.0 eth1: MII is busy in smsc95xx_mdio_read Jul 18 23:19:11 RevPi11734 kernel: [43053.607559] smsc95xx 1-1.5.1:1.0 eth1: Failed to read MII_BMSR

pelwell commented 5 years ago

-110 is a timeout, but I don't think I can help much more than that. Raspberry Pi boards don't exhibit this problem, and the rare and random nature of the errors makes it look much more like a hardware issue than software (unless it is a very subtle configuration error).

ckamas commented 5 years ago

I found a quick test for this issue using iperf. I have a custom board PI CM3L with dual ethernet. I can use iperf on one as a master and one as a slave. See here: https://serverfault.com/questions/127636/force-local-ip-traffic-to-an-external-interface/861465#861465

Using this the kernel/driver from Rasbian with mods by Kunbus errors out very quickly (<15 mins) with the timeout error as described above, but the version I built with Yocto has been running for several days now with no errors.

The good Yocto version of the kernel is: Linux raspberrypi-cm3 4.19.57 #1 SMP Thu Jul 11 23:40:23 UTC 2019 armv7l armv7l armv7l GNU/Linux

The Rasbian/Kunbus version is: Linux RevPi8592 4.9.76-rt60-v7+ #1 SMP PREEMPT RT Tue, 17 Jul 2018 14:20:12 +0200 armv7l GNU/Linux

GeorgeSapkin commented 5 years ago

We've checked the wiring on our board. We don't have full schematic for RPi3B (AFAIK they are not open-sourced), so we're basing this on Pi1B 2.1 schematic. The difference is that nRESET pin on LAN9512 (we don't need 4 USB ports so we're not using LAN9514) is not wired to any GPIO pin and is pulled high instead. We haven't changed anything in the Devicetree regarding this pin (i.e. LAN_RUN) either. This seems in accordance with LAN9512 specs, Table 2-3. However, according to @pelwell this is an issue. Is this the case and are we misreading LAN9512 specs?

lategoodbye commented 5 years ago

In order to answer your question, we need to know how do you provide the 25 MHz clock source (e.g. via GPIO of BCM2835).

GeorgeSapkin commented 5 years ago

XI and XO pins of LAN9512 are wired same as in RPiB 2.1 schematic to an external 25 MHz clock source. I'm going to try to connect an unused GPIO pin from CM3 to nRESET on LAN9512, configure it as LAN_RUN in DT and see if I can still reproduce the issue.

GeorgeSapkin commented 5 years ago

We've made a few test boards with nRESET connected to GPIO pin 13 and configured that pin as LAN_RUN in dt-blob.bin. The issue still occurs. Interestingly, when this happens, Wi-Fi also becomes unresponsive. Wi-Fi chip is connected over SDIO so it doesn't go through LAN9512.

We're currently investigating whether this issue could be related to crosstalk from LED traces.

xyklex commented 5 years ago

@GeorgeSapkin I'm having this same issue, I'm building my self the Raspbian operating system with pi-gen with some custom configurations, my last build was by Oct 2, my entire setup is the RPi connected to a ST32 board (it has a serial connection and a storage drive inside it), for testing there is no sensors/gpio connections to the ST32 boards, and for production it has several others devices connected to it.

Testing case works ok, no issues, I am able to mount the storage into linux and all the applications that interact with it just works.

On the other side, production does not works, when I plug in the ST32 board to RPi it gives me this:

[  163.934268] usb 1-1.2: new full-speed USB device number 5 using dwc_otg
[  164.067904] usb 1-1.2: New USB device found, idVendor=0483, idProduct=374b, bcdDevice= 1.00
[  164.067923] usb 1-1.2: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[  164.067932] usb 1-1.2: Product: STM32 STLink
[  164.067942] usb 1-1.2: Manufacturer: STMicroelectronics
[  164.067953] usb 1-1.2: SerialNumber: 066CFF383930434B43152720
[  164.075944] usb-storage 1-1.2:1.1: USB Mass Storage device detected
[  164.082453] scsi host0: usb-storage 1-1.2:1.1
[  164.142860] usbcore: registered new interface driver uas
[  164.163874] cdc_acm 1-1.2:1.2: ttyACM0: USB ACM device
[  164.167005] usbcore: registered new interface driver cdc_acm
[  164.167017] cdc_acm: USB Abstract Control Model driver for USB modems and ISDN adapters
[  165.124578] ERROR::assign_and_init_hc:1408: assign_and_init_hc: Failed to allocate memory to handle non-dword aligned buffer case

[  170.714285] cdc_acm 1-1.2:1.2: failed to set dtr/rts
[  175.195608] scsi 0:0:0:0: Direct-Access     MBED     microcontroller  1.0  PQ: 0 ANSI: 2
[  175.197179] sd 0:0:0:0: [sda] 4168 512-byte logical blocks: (2.13 MB/2.04 MiB)
[  175.197653] sd 0:0:0:0: [sda] Write Protect is off
[  175.197667] sd 0:0:0:0: [sda] Mode Sense: 03 00 00 00
[  175.198153] sd 0:0:0:0: [sda] No Caching mode page found
[  175.198164] sd 0:0:0:0: [sda] Assuming drive cache: write through
[  175.234518] sd 0:0:0:0: Attached scsi generic sg0 type 0
[  175.286560] sd 0:0:0:0: [sda] Attached SCSI removable disk
[  200.074285] cdc_acm 1-1.2:1.2: failed to set dtr/rts
[  205.114283] cdc_acm 1-1.2:1.2: failed to set dtr/rts
[  212.074314] smsc95xx 1-1.1:1.0 eth0: Failed to read reg index 0x00000114: -110
[  212.074331] smsc95xx 1-1.1:1.0 eth0: Error reading MII_ACCESS
[  212.074341] smsc95xx 1-1.1:1.0 eth0: MII is busy in smsc95xx_mdio_read
[  212.074351] smsc95xx 1-1.1:1.0 eth0: Failed to read MII_BMSR

Two ST32 boards has the same firmware, RPi is the same, I tested different power sources (thinking it was that because prod ST32 board has other devices connected).

I have a previous RPi image version that is working on prod though, and it is almost one year old.

This is my report of the situation, I will see if I can rollback to another kernel version and see how it goes.

GeorgeSapkin commented 5 years ago

Looks like it might be related to https://github.com/raspberrypi/linux/issues/2924 We do have multiple Ethernet adapters: 1 from LAN9512 and 2 from the modem.

xyklex commented 5 years ago

@GeorgeSapkin thanks for point me out that issue, my solution right now was to compile a 4.14.107 kernel and modules and that solved the issue I mentioned and another issue I had with the HDMI screen being yellow when the kernel starts. But for now this was my workaround.

GeorgeSapkin commented 5 years ago

@xyklex I'm glad to hear it solved your issue. We're stress-testing 4.14.x branch as well to see if it's an improvement.

EloyCrespo commented 4 years ago

@GeorgeSapkin could you tell me if you solved the problem with version 4.14.x?

wickywaka commented 2 years ago

@ckamas We are facing this or very similar issue on our kunbus revpi as well. Did you manage to find a solution?