raspberrypi / firmware

This repository contains pre-compiled binaries of the current Raspberry Pi kernel and modules, userspace libraries, and bootloader/GPU firmware.
5.13k stars 1.68k forks source link

Ethernet locks up when bridged with wifi #673

Closed dickontoo closed 7 years ago

dickontoo commented 7 years ago

On a Pi 3, the ethernet will randomly lock up when bridged with the wifi interface. This takes the form of:

smsc95xx 1-1.1:1.0 eth0: kevent 0 may have been dropped

after which the wired ethernet stops working. Wifi continues as usual; devices associated with the Pi when in AP mode can ping it and each other, but (obviously) no packets are forwarded to the wired network. This severely limits its use as a wifi AP.

kevent 0 would appear to be EVENT_TX_HALT, which is triggered when the interrupt handler has too much work to do, and hands the processing off to the kworker thread. For some reason, although the worker thread seems to be executing correctly, the condition isn't cleared, probably in the hardware. I've no idea why this bug seems to be tickled by bridging it with the wifi. I've now reached the end of my kernel knowledge.

There's a thread here on the forums.

Thanks.

Minims commented 7 years ago

I've just received my Rpi 3 to make an AP Wifi with bridging, and after 1 hour, my eth0 stops working. I've exactly the same logs. It's seems it only appears when bridging.

brcmfmac: brcmf_proto_bcdc_hdrpull: wlan0: non-BCDC packet received, flags 0x3e smsc95xx 1-1.1:1.0 eth0: kevent 0 may have been dropped

anthem commented 7 years ago

I've set up just this type of config with hostapd and eth0 bridged to wlan0 on two of my RPi 3+ boards. Interestingly, one displays this issue, and one does not, when updated to the same version of Raspbian PIXEL with the same firmware.

I'd be happy to provide any information desired that would help debug the issue, as I'd like to set up more RPi APs but this problem is very limiting.

One difference that might matter here is my board that locks up reports CPU Revision (in /proc/cpuinfo) a02082 (mfg by Sony), whereas the one that works reports a22082 (mfg by Embest). I might have another Sony board and definitely have more Embest ones, so if I get a chance I'll try this config on another board of each type and report back.

Update: No dice there. I swapped my problematic Sony board with an Embest using the same SD card, and the problem is still present.

Other interesting findings though:

Again, if any other information can be of help with this, I'm happy to oblige. I have a system where this recurs reliably.

RobinMcCorkell commented 7 years ago

I have this same issue, on Arch Linux ARM running Linux 4.4.39-1-ARCH. I'm bridging Ethernet with the onboard WLAN, along with another USB WLAN adapter. I'm running all of the usual stability cmdline.txt options, but none prevent the issue.

I have developed a workaround, which works when run manually but the automated script has yet to prove its worth (get back to me tomorrow, when it almost certainly will have happened again). The idea is to reset the network adapter by unbinding it from the smsc95xx driver then rebinding it. This resets the failed state and allows the adapter to continue working - it also re-adds itself to the bridge, and any network connections aren't dropped!

Here is my script:

#!/bin/bash

device=$1
driver=$2

uptime=($(< /proc/uptime))
start_time=${uptime[0]}
start_time=${start_time%.*}

while read -r _ ltime _ usbid _; do
        ltime=${ltime%.*]}
        if (( ltime > start_time )); then
                echo "Monitored device $device encountered an error, resetting" >&2
                echo $usbid > /sys/bus/usb/drivers/$driver/unbind
                echo $usbid > /sys/bus/usb/drivers/$driver/bind
        fi
done < <(dmesg -w | grep --line-buffered "${driver} [^ ]* ${device}: kevent [[:digit:]]* may have been dropped")

Run as a systemd service, passing in the interface name and the driver (smsc95xx in this case). As previously mentioned, this hasn't been tested yet, but the underlying principle is sound.

dickontoo commented 7 years ago

Interesting, thanks.

It seems to need an

ifconfig eth0 up

at the appropriate point (based on fiddling a bit this afternoon), but I, too, am waiting for the usual lockup.

RobinMcCorkell commented 7 years ago

That script doesn't work perfectly, since the dmesg parsing is a bit broken. I've rewritten it in Perl (with it's far more substantial regex parsing) and this appears to work:

#!/usr/bin/env perl

use strict;
use warnings;

my ($device, $driver) = @ARGV;

die unless open(UPTIME, "<", "/proc/uptime");
my $uptime = <UPTIME>;
die unless $uptime =~ /^([\d.]+) ([\d.]+)$/;
my $start_time = $1;

die unless open(DMESG, "/usr/bin/dmesg -w |");
while (<DMESG>) {
        if ($_ =~ /\[ *([\d.]+)\] $driver ([^ ]+) $device: kevent \d+ may have been dropped/) {
                my $dmesg_time = $1;
                my $usbid = $2;
                if ($dmesg_time > $start_time) {
                        print STDERR "Monitored device $device encountered an error, resetting\n";
                        die unless open(UNBIND, ">", "/sys/bus/usb/drivers/$driver/unbind");
                        print UNBIND "$usbid\n";
                        close UNBIND;
                        die unless open(BIND, ">", "/sys/bus/usb/drivers/$driver/bind");
                        print BIND "$usbid\n";
                        close BIND;
                }
        }
}
jamsch0 commented 7 years ago

Also experiencing this on a RPi 2 board running Arch Linux ARM 4.4.41-1-ARCH, bridging ethernet to a USB WLAN adapter.

@Xenopathic How is your bridge etc configured? I tried running your perl script as a systemd service and the device (un)binds correctly, but afterwards the bridge doesn't exist and eth0 is left down with no master set. (Sorry if that's a newbie question) For reference, my bridge is setup via netctl with this profile:

Description="Ethernet Bridge"
Interface=br0
Connection=bridge
BindsToInterfaces=(eth0)

# Ensure br0 is given MAC address of eth0
IP=no
ExecUpPost="ip link set dev br0 address $(cat /sys/class/net/eth0/address); IP=dhcp; ip_set"
ExecDownPre="IP=dhcp"
RobinMcCorkell commented 7 years ago

I have my bridge set up with systemd-networkd, so I guess the reason it gets auto-added back into the bridge is because networkd notices a 'new interface' and so triggers its configuration for it, which assigns it into the bridge. I'm surprised you are finding issues with your bridge disappearing though - the script only un/binds the Ethernet adapter from its driver, so the bridge should remain. Then when the adapter comes back, you just need to make sure it is re-added to the bridge (unless I am wrong about systemd-networkd doing this, and Linux just does that automatically)

On a side note, I can confirm that my script works properly, and has in fact saved my device at least once so far!

jamsch0 commented 7 years ago

Rather than try and mess around with netctl I tried using systemd-networkd and everything works well, so I'll be using that from now on. Thanks again for your workaround script!

JamesH65 commented 7 years ago

Hi guys, I'm going to be looking in to this issue at Pi Towers. Does anyone have any recommended setup instructions for setting up a bridge using the latest Raspbian release to save me figuring it out? I want to hard connect two Pi's to make a network (need to keep off office network), one being an AP, then hammer it to try and replicate.

James

dickontoo commented 7 years ago

The instructions for normal Debian work well. Basically, just install brutils and hostapd, configure hostapd correctly (eg.

interface=wlan0
bssid=02:27:xx:xx:xx:90
bridge=br0
ctrl_interface=/var/run/hostapd
ctrl_interface_group=0

wds_sta=1
driver=nl80211

ssid=myssid
country_code=GB
ieee80211d=1
hw_mode=g
channel=6
auth_algs=3
ignore_broadcast_ssid=0
wmm_enabled=1
wmm_ac_bk_cwmin=4
wmm_ac_bk_cwmax=10
wmm_ac_bk_aifs=7
wmm_ac_bk_txop_limit=0
wmm_ac_bk_acm=0
wmm_ac_be_aifs=3
wmm_ac_be_cwmin=4
wmm_ac_be_cwmax=10
wmm_ac_be_txop_limit=0
wmm_ac_be_acm=0
wmm_ac_vi_aifs=2
wmm_ac_vi_cwmin=3
wmm_ac_vi_cwmax=4
wmm_ac_vi_txop_limit=94
wmm_ac_vi_acm=0
wmm_ac_vo_aifs=2
wmm_ac_vo_cwmin=2
wmm_ac_vo_cwmax=3
wmm_ac_vo_txop_limit=47
wmm_ac_vo_acm=0

ieee80211n=1
eap_server=0
own_ip_addr=127.0.0.1

is fine; no crypto as it's just a test, and you may want to use a less-busy channel), configure the bridge as usual (eg.

iface eth0 inet static
        address 0.0.0.0

auto wlan0
iface wlan0 inet static
        address 0.0.0.0
        pre-up ifconfig wlan0 hw ether 02:bb:xx:xx:xx:90

auto br0
iface br0 inet static
        pre-up brctl addbr br0
        [...]
        bridge_ports eth0
        bridge_stp on
        bridge_fd 1
        bridge_waitport 1

or similar), reboot, associate something with the AP and wait for it to crash. Unfortunately this may take a couple of weeks, or possibly seconds. When it does go, a reboot is likely to immediately trigger it again, which is what indicates to me that there's some packet its choking on that's being resent by whatever station due to a lack of reply.

My MacBook Pro seems to trigger it more frequently than another Pi (Pi1 with an Edimax USB dongle), but that's just my observation and may be wrong; I have no stats.

Good luck, and thanks. I could do with a proper fix for this.

JamesH65 commented 7 years ago

Thanks - will get to it on Monday.

On 27 January 2017 at 16:05, Dickon Hood notifications@github.com wrote:

The instructions for normal Debian work well. Basically, just install brutils and hostapd, configure hostapd correctly (eg.

interface=wlan0 bssid=02:27:xx:xx:xx:90 bridge=br0 ctrl_interface=/var/run/hostapd ctrl_interface_group=0

wds_sta=1 driver=nl80211

ssid=myssid country_code=GB ieee80211d=1 hw_mode=g channel=6 auth_algs=3 ignore_broadcast_ssid=0 wmm_enabled=1 wmm_ac_bk_cwmin=4 wmm_ac_bk_cwmax=10 wmm_ac_bk_aifs=7 wmm_ac_bk_txop_limit=0 wmm_ac_bk_acm=0 wmm_ac_be_aifs=3 wmm_ac_be_cwmin=4 wmm_ac_be_cwmax=10 wmm_ac_be_txop_limit=0 wmm_ac_be_acm=0 wmm_ac_vi_aifs=2 wmm_ac_vi_cwmin=3 wmm_ac_vi_cwmax=4 wmm_ac_vi_txop_limit=94 wmm_ac_vi_acm=0 wmm_ac_vo_aifs=2 wmm_ac_vo_cwmin=2 wmm_ac_vo_cwmax=3 wmm_ac_vo_txop_limit=47 wmm_ac_vo_acm=0

ieee80211n=1 eap_server=0 own_ip_addr=127.0.0.1

is fine; no crypto as it's just a test, and you may want to use a less-busy channel), configure the bridge as usual (eg.

iface eth0 inet static address 0.0.0.0

auto wlan0 iface wlan0 inet static address 0.0.0.0 pre-up ifconfig wlan0 hw ether 02:bb:xx:xx:xx:90

auto br0 iface br0 inet static pre-up brctl addbr br0 [...] bridge_ports eth0 bridge_stp on bridge_fd 1 bridge_waitport 1

or similar), reboot, associate something with the AP and wait for it to crash. Unfortunately this may take a couple of weeks, or possibly seconds. When it does go, a reboot is likely to immediately trigger it again, which is what indicates to me that there's some packet its choking on that's being resent by whatever station due to a lack of reply.

My MacBook Pro seems to trigger it more frequently than another Pi (Pi1 with an Edimax USB dongle), but that's just my observation and may be wrong; I have no stats.

Good luck, and thanks. I could do with a proper fix for this.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_raspberrypi_firmware_issues_673-23issuecomment-2D275701687&d=DwMFaQ&c=DpyQ_ftY536pf7wCBQXXU58xADDRY77THQzJu1OmzOo&r=w09_2ePv8G3zRjoV19Wm1Q6rI7CDlOns4PuRv2hHkek&m=z37M8yfyMAPI27Drls23gHFtZ1JHlm_2RQcuH4zdwvs&s=6-5pJ21hTYRCulKxsICv26lQvMBl5bd2wrV1QodWpWA&e=, or mute the thread https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_notifications_unsubscribe-2Dauth_ADqrHRVBTQJpgx-5FaJgnW76jEQAp63Tzbks5rWhWvgaJpZM4KaDH1&d=DwMFaQ&c=DpyQ_ftY536pf7wCBQXXU58xADDRY77THQzJu1OmzOo&r=w09_2ePv8G3zRjoV19Wm1Q6rI7CDlOns4PuRv2hHkek&m=z37M8yfyMAPI27Drls23gHFtZ1JHlm_2RQcuH4zdwvs&s=Hk_GJqnh0xvOIcG4QZ3jGyGQ8SU6f_l5C1f3hoOxZ0Q&e= .

-- James Hughes Principal Software Engineer, Raspberry Pi (Trading) Ltd

JamesH65 commented 7 years ago

So, looking at this - would like a bit more information on setup to make sure I am testing the right thing.

I have two Pi3's. I an going to connect them together using a direct ethernet cable. One will act as a Wifi AP, the other will be the device I want to connect to from an external device. So, external device->WifiAPPi->destinationPi.

Bits I need confirmation on - the Pi in the middle, the AP, will also be a static IP DHCP server, so the Pi at the end can grab a IP address from it.

I was intending to put a load of data files on the pi at the end, and the external device (probably another Pi, but irrelevant) and just transfer stuff back and forth to hammer the WiFi and ethernet connections.

Does that simulate the scenario that is causing the problem?

The instructions above are for the Pi in the middle (the AP)

Sorry about the questions, this sort of networking is not something I am familiar with, so quite a bit to learn before I can really start testing.

anthem commented 7 years ago

That sounds about right, JamesH65. At least for my setup.

In my environment I have two Pi3's that run only hostapd and both exhibit the problem. I've been able to work around it 90% of the time with a separate script that simply runs ifdown/if up when seeing the kevent message.

I'm really not certain if pairing or traffic loads affect the behavior. I have about a dozen wifi hosts that share 3-4 APs of different implementations transferring light to moderate internet traffic, and I see the problem at least 2-3 times a week.

JamesH65 commented 7 years ago

OK, been running a simple scp of small files back and forth test for the last 20 hours (60MB transfered), with no issues. I wonder if there is anything in particular that people might think sets this off?

dickontoo commented 7 years ago

Association / dissociation. Otherwise, just hang in there; mine's currently having issues roughly once every two days or so, with 1-3 devices associated with it (greenhouse Pi, Android mobile, OS X laptop).

Intermittent bugs are annoying...

JamesH65 commented 7 years ago

This comment in the usbnet.c file doesn't fill me with confidence...also seen in all the other places where the error message can be logged.

`/ some work can't be done in tasklets, so we use keventd

dickontoo commented 7 years ago

The question is, what does everything else that uses that do when it fails?

I really need to learn more about the kernel internals...

JamesH65 commented 7 years ago

I'm having a look at said kernel internals, but they are not easy to follow, especially without being able to replicate the issues at will.

I suspect this is an interaction with the ethernet driver where is doesn't return quickly enough from 'something'...

On 2 February 2017 at 14:04, Dickon Hood notifications@github.com wrote:

The question is, what does everything else that uses that do when it fails?

I really need to learn more about the kernel internals...

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/raspberrypi/firmware/issues/673#issuecomment-276965423, or mute the thread https://github.com/notifications/unsubscribe-auth/ADqrHRQEoJcP0XOipJYzur5VOzW5eSoWks5rYeJSgaJpZM4KaDH1 .

-- James Hughes Principal Software Engineer, Raspberry Pi (Trading) Ltd

JamesH65 commented 7 years ago

Hmm, backtracking from the error message through the code, this appears to be an problem emanating from the USB network subsystem (usbnet.c), it's making called down to the usb core (urb.c and hcd.c) system which are returning -EPIPE. Does anyone seeing the problem have anything else plugged in to the USB on the devices that stop working or are you just using the ethernet port?

My test code is still running fine, so finding it through testing is not going to be easy.

dickontoo commented 7 years ago

I occasionally have a keyboard plugged in, but ATM nothing. Still happens.

Noltari commented 7 years ago

It takes less than one hour of simply bridging ethernet to WiFi on the Raspberry Pi 3 when running latest LEDE, and with a single WiFi client connected. https://git.lede-project.org/?p=source.git;a=summary

I've purchased an usb to ethernet adapter in order to do some more testing and will report back my results.

JamesH65 commented 7 years ago

I'm still fighting with this. My test rig hasn't shown the error (4 days of running), but when I built a custom kernel with more debugging, it showed up almost immediately. Here are some thoughts/analyses for posterity.

The kevent dropped messages originates in usbnet.c/usbnet_defer_kevent. The schedule_work function is failing, which is caused, according to my reading of the docs/code, by the kevent already being in the queue. The operation being requested is EVENT_TX_HALT, which I presume is an effort to abort the current transfer for some reason. The function that SHOULD be called as a result of the workqueue kevent is usbnet_deferred_kevent.

The cause of the call to usbnet_defer_kevent is from the tx_complete function, as a result of urb->status member being -EPIPE.

tx_complete is called from the dcd.c:__usb_hcd_giveback_urb function, and is called with interrupts saved/off (local_irq_save()).

So the questions are:

  1. Why is there a repeated request to schedule_work that fails 1a. Is this due to multiple quick fails needing to be handled, or a previous fail not being dealt with fast enough?
  2. What causes the original EPIPE error that result in the cascade failure described above?
  3. This does not appear to happen with other USB dongles, so does this imply the usb driver for the onboard ethernet is the issue? Or does it happen with other devices, but rarely?
  4. Why does bridging make it more common?

I'll keep investigating, concentrating on 2 at the moment.

On 8 February 2017 at 10:34, Álvaro Fernández Rojas < notifications@github.com> wrote:

It takes less than one hour of simply bridging ethernet to WiFi on the Raspberry Pi 3 when running latest LEDE, and with a single WiFi client connected. https://git.lede-project.org/?p=source.git;a=summary https://urldefense.proofpoint.com/v2/url?u=https-3A__git.lede-2Dproject.org_-3Fp-3Dsource.git-3Ba-3Dsummary&d=DwMFaQ&c=DpyQ_ftY536pf7wCBQXXU58xADDRY77THQzJu1OmzOo&r=w09_2ePv8G3zRjoV19Wm1Q6rI7CDlOns4PuRv2hHkek&m=XTSMJ-n_Vn3py9PxdTG4gsLXiX4ilJwemQV4GVmVYdc&s=XlgMnp8T7rC1CpgmfJ_1HRDpZ6hyGdXijVLfqaRbvZg&e=

I've purchased an usb to ethernet adapter in order to do some more testing and will report back my results.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_raspberrypi_firmware_issues_673-23issuecomment-2D278291289&d=DwMFaQ&c=DpyQ_ftY536pf7wCBQXXU58xADDRY77THQzJu1OmzOo&r=w09_2ePv8G3zRjoV19Wm1Q6rI7CDlOns4PuRv2hHkek&m=XTSMJ-n_Vn3py9PxdTG4gsLXiX4ilJwemQV4GVmVYdc&s=-kqVu3I1Ea2kmvgHOPlthgOSekcLrjbMq6Z3VrE3qfc&e=, or mute the thread https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_notifications_unsubscribe-2Dauth_ADqrHY6QbX69ASDLN1CGX24k3GyLDZ4pks5raZopgaJpZM4KaDH1&d=DwMFaQ&c=DpyQ_ftY536pf7wCBQXXU58xADDRY77THQzJu1OmzOo&r=w09_2ePv8G3zRjoV19Wm1Q6rI7CDlOns4PuRv2hHkek&m=XTSMJ-n_Vn3py9PxdTG4gsLXiX4ilJwemQV4GVmVYdc&s=tpX-F-gQGq2geuN-nHTt-sQFtq23LF0Irvq2hiRbRdE&e= .

-- James Hughes Principal Software Engineer, Raspberry Pi (Trading) Ltd

JamesH65 commented 7 years ago

OK, here is a instrumented log, that I need to put somewhere safe, seems to show what is going on. The bolded area shows that two EPIPE errors occur next to each other without the first being handled by the RX'd lines. I'll try and figure out why that might happen.

[ 1002.205864] -EPIPE error in drivers/usb/host/dwc_otg/dwc_otg_hcd_linux.c _complete 303 [ 1002.205905] -EPIPE error drivers/net/usb/usbnet.c tx_complete 1222 [ 1002.206366] RX'd a EVENT_TX_HALT drivers/net/usb/usbnet.c usbnet_deferred_kevent 1105 [ 1003.205909] -EPIPE error in drivers/usb/host/dwc_otg/dwc_otg_hcd_linux.c _complete 303 [ 1003.205949] -EPIPE error drivers/net/usb/usbnet.c tx_complete 1222 [ 1003.206037] RX'd a EVENT_TX_HALT drivers/net/usb/usbnet.c usbnet_deferred_kevent 1105 [ 1006.215722] -EPIPE error in drivers/usb/host/dwc_otg/dwc_otg_hcd_linux.c _complete 303 [ 1006.215773] -EPIPE error drivers/net/usb/usbnet.c tx_complete 1222 [ 1006.215859] RX'd a EVENT_TX_HALT drivers/net/usb/usbnet.c usbnet_deferred_kevent 1105 [ 1006.219218] -EPIPE error in drivers/usb/host/dwc_otg/dwc_otg_hcd_linux.c _complete 303 [ 1006.219267] -EPIPE error in drivers/usb/host/dwc_otg/dwc_otg_hcd_linux.c _complete 303 [ 1006.219339] -EPIPE error in drivers/usb/host/dwc_otg/dwc_otg_hcd_linux.c _complete 303 [ 1006.219355] -EPIPE error drivers/net/usb/usbnet.c tx_complete 1222 [ 1006.219375] -EPIPE error drivers/net/usb/usbnet.c tx_complete 1222 [ 1006.219380] schedule_work failed kevent 0 [ 1006.219400] smsc95xx 1-1.1:1.0 eth0: kevent 0 may have been dropped [ 1006.219412] -EPIPE error drivers/net/usb/usbnet.c tx_complete 1222 [ 1006.219419] schedule_work failed kevent 0 [ 1006.219433] smsc95xx 1-1.1:1.0 eth0: kevent 0 may have been dropped [ 1006.219620] RX'd a EVENT_TX_HALT drivers/net/usb/usbnet.c usbnet_deferred_kevent 1105 [ 1007.219353] -EPIPE error in drivers/usb/host/dwc_otg/dwc_otg_hcd_linux.c _complete 303 [ 1007.219382] -EPIPE error drivers/net/usb/usbnet.c tx_complete 1222 [ 1007.219593] RX'd a EVENT_TX_HALT drivers/net/usb/usbnet.c usbnet_deferred_kevent 1105 [ 1008.219292] -EPIPE error in drivers/usb/host/dwc_otg/dwc_otg_hcd_linux.c _complete 303 [ 1008.219341] -EPIPE error drivers/net/usb/usbnet.c tx_complete 1222 [ 1008.219482] RX'd a EVENT_TX_HALT drivers/net/usb/usbnet.c usbnet_deferred_kevent 1105 [ 1011.220647] -EPIPE error in drivers/usb/host/dwc_otg/dwc_otg_hcd_linux.c _complete 303 [ 1011.220712] -EPIPE error drivers/net/usb/usbnet.c tx_complete 1222 [ 1011.220896] RX'd a EVENT_TX_HALT drivers/net/usb/usbnet.c usbnet_deferred_kevent 1105 [ 1012.219368] -EPIPE error in drivers/usb/host/dwc_otg/dwc_otg_hcd_linux.c _complete 303 [ 1012.219395] -EPIPE error drivers/net/usb/usbnet.c tx_complete 1222 [ 1012.219549] RX'd a EVENT_TX_HALT drivers/net/usb/usbnet.c usbnet_deferred_kevent 1105 [ 1013.219369] -EPIPE error in drivers/usb/host/dwc_otg/dwc_otg_hcd_linux.c _complete 303 [ 1013.219398] -EPIPE error drivers/net/usb/usbnet.c tx_complete 1222<

Noltari commented 7 years ago

I purchased this USB->Ethernet adapter from amazonbasics, which is an ASIX 88179. As you can see in the booltog smsc95xx generates problems almost instantly. Then I switched the ethernet cable from the smsc95xx eth to the ax88179 one and I haven't had a single problem since yesterday, which makes me think that the issue comes from the smsc95xx driver...

JamesH65 commented 7 years ago

I'm slowing backtracking this through the kernel.

There is clearly as issue with the DWC detecting multiple EPIPE errors, and the final result is one is sitting in the system workqueue when another arrives. This causes the second to be dropped, and I suspect means that certain condition are not reset correctly. I've checked timing, there appears to be no major differences in a run that works and a run that doesn't which is a bit odd. So I am not sure why the item in the workqueue is not being processed in time.

Currently looking in to the source of the origin DWC EPIPE error. This is caused by a stall condition, so far I have tracked the source back to line 2141 in /drivers/usb/host/dwc_otg_hcd_intr.c

JamesH65 commented 7 years ago

Note 1 : right before almost every stall detected in dwc_otg_hcd_intr.c:2141, there is a message

brcmfmac: brcmf_proto_bcdc_hdrpull: wlan0: non-BCDC packet received, flags 0x7e

This message comes from the /drivers/net/wireless/brcm80211/brcmfmac/bcdc.c: file, line 300.

My suspicion is that whilst his is probably not the root cause of the issue (the kevent dropped has been seen on Pi's prior to v3) it is probably exacerbating it. Some investigation should probably be done to see what this error actually means.

Note 2: Couple of possible hacks might get this working.

In netusb.c, it uses the system workqueue. My suspicion is that this work queue is very busy at the point the failure occurs. So, when it tries to schedule work, I could, if the schedule fail, flush the workqueue and resubmit. However, this takes an indeterminate amount of time, and the schedule call really needs to be done quickly. Alternatively, I could define a new workqueue that is only used in netusb.c, this is guaranteed to not be busy. This would add an extra thread to the system.

Both of these avoid trying to fix the real issue, which I suspect is too many stalls being found in the communications system. Get rid of those, and the probability of having the stall clash is massively reduced.

JamesH65 commented 7 years ago

If I were to publish a test kernel build with one of the above 'fixes' in it, would anyone be able to test it? It would need a rpi-update to a particular git hash (that should update modules, if I can find the right hash), then copying the test kernel to the boot partition.

dickontoo commented 7 years ago

Yup. Mine's doing duty as a media player and wifi AP, so as long as those two functions are fine, I'm happy to test.

JamesH65 commented 7 years ago

I'm having a bit of a rethink on whether my idea above would actually work, I'll try and update later today, but have anther job to do so might be next week.

Noltari commented 7 years ago

@JamesH65 I did some tests with dwc2 yesterday and nothing changes, so I don't think it's related to dwc_otg or brcmfmac drivers, but to the smsc95xx driver itself.

JamesH65 commented 7 years ago

I suspect you are right, but there is clearly an issue with multiple stall conditions not being handled effectively at the usbnet level, which is where the error exhibits. This is likely caused by the smsc95xx driver, although finding out specifically why is not easy.

JamesH65 commented 7 years ago

No real progress, but wanting to store my analysis of the call stack (manually determined rather than debugger) that results in this issue.

dwc_otg_hcd_irq()                in   dwc_otg_hcd_linux.c:965
dwc_otg_hdc_handle_intr()        in   dwc_org_hcd_intr.c:76
dwc_otg_hcd_handle_hc_intr()     in   dwc_org_hcd_intr.c:631
dwc_otg_hcd_handle_hc_n_intr     in   dwc_org_hcd_intr.c:2635
handle_hc_chhltd_intr()          in   dwc_org_hcd_intr.c:2251
handle_hc_chhltd_intr_dma()      in   dwc_org_hcd_intr.c:2088                

At this point the code does some reg reads to get the particular interrupt detected, in our fail case this is a stall

handle_hc_stall_intr()           in  dwc_org_hcd_intr.c:1378

The pipe type at this stage is  UE_CONTROL

__complete()                     in  dwc_otg_hdc_linux.c:273
usb_hcd_giveback_urb()           in  hcd.c:1812
__usb_hcd_giveback_urb()         in  hcd.c:1722

[local_irq_save done at this point]

tx_complete()                    in usbnet.c:1213
usbnet_defer_kevent()            in usbnet.c:452
schedule_work ----->>>> fail dues to work item already on the queue

[local_irq_restore]
JamesH65 commented 7 years ago

Just a comment on the brcmfmac: brcmf_proto_bcdc_hdrpull: wlan0: non-BCDC packet receive, flags 0x7e message which I been tracking down in case it is relevant.

This is part of the BRCM CDC driver for the Wifi, its trying to extract the BRCM specific protocol header from the data stream, and finding that the version number is incorrect. Since this error often appears immediately after a stall detection (note, the stall detection happens much more frequently that the kevent dropped error - the kevent only happens when two stall detects arrive, the second before the first is processed) I suspect this is actually caused by the stall. Interesting that the data is always the same - the flags are set to 0x7e which is part of the header - it should be 0x2x where 2 is the protocol version number.

JamesH65 commented 7 years ago

Further comment on the brcmfmac: brcmf_proto_bcdc_hdrpull: wlan0: non-BCDC packet receive, flags 0x7e

This error ONLY occurs when the dhcpcd daemon is running, and is clearly related to DCHP requests being made. This is also the packet that appears to cause USB stalls, but only if there are setup issues with the bridging - once I got the setup correct, the stalls seems to have gone away. Which also means the only way of replicating the stalls and hence the kevent 0 issue has gone....I'll keep looking.

JamesH65 commented 7 years ago

Just out of interest, to the people seeing the kevent issue/lockup while bridging - do you have dhcpcd running (I don't think you should if the Pi is bridging and is also an access point and therefore providing the DCHP service itself)?

dickontoo commented 7 years ago

I don't -- I have another machine on the network (my router) which provides network services of that sort. DHCP requests will be relayed via the Pi's wifi interface for clients at that side of the house, though.

The flags I typically see are 0x56, and I usually see a few of them on boot (whether or not the interface crashes).

ATM, I'm seeing the interface drop out without any non-BCDC errors being logged. DHCP may be an issue, though: there was one time I had horrendous problems when I had a dozen or so friends round (with their mobile phones) and we tried to watch something. Crashed every 10-20 minutes or so until most of them left.

prof7bit commented 7 years ago

dhcpcd is the DHCP client, it seems to be running by default on an fresh raspian installation. It seems to me it will automatically try to configure all interfaces that are set to inet manual, I can't find anything in my dhcpcd.conf that tells it to do anything with any specific interfaces, yet it will try to assign addresses to all my interfaces, so I assume this is the default behavior, I cannot find anything in its documentation where this behavior is explained in detail (it only mentions something about eth0 being the default interface for dhcpcd). I might also be wrong about the "inet manual" assumption, maybe someone else can explain how dhcpcd is supposed to work.

I'm now going to try to configure all my interfaces static (in /etc/network/interfaces) and disable dhcpcd entirely and see whether this will make any difference.

JamesH65 commented 7 years ago

Yes, the client. Try disabling entirely, you'll need to remove some entries related to it from the systemd folder.

e.g.

check

/etc/systemd/system/multi-user.target.wants

and remove the dhcpcd entries. On reboot, none of the interfaces (wlan0, eth0) should be assigned IP addresses, this should be handled by the bridge (br0). Assuming your setup is similar to mine.

Would really like to know if it makes the issues go away!

prof7bit commented 7 years ago

I stopped and disabled it with systemd and now there is no dhcpcd process running anymore. I set all interfaces to inet manual except br0 which I configured inet static and supplied it valid ip address, mask and gateway. No dhcp related process is visible in htop but the message brcmf_proto_bcdc_hdrpull: wlan0: non-BCDC packet received, flags 0x56 still appears every few minutes and after 5 minutes my wlan bridge stops working again.

So it seems to me that the dhcp client has nothing to do with it.

JamesH65 commented 7 years ago

Hmm, interesting. I don't get any hdrpull issues when I have the DHCP client disabled/not running. There's clearly some difference in your setup over mine.

How many devices do you have talking to the AP? Mine is a very simply setup, Pi set up as AP with bridging, with one Pi attached to ethernet port and one to the wlan, and I copy stuff back and froth from the Pi attached wirelessly to the one attached ethernetly. With dhcp off, it ran all last night (18 hours) with no reports, as soon as I turned it on, I was getting errors with 5 minutes. Could be coincidence, but further test would indicate not.

Slippery little issue this one is.

dickontoo commented 7 years ago

Damn, sorry; my mistake. I always use dhclient when using dhcp, and missed the 'c'.

No, I'm not running either dhcpd, dhcpcd, or dhclient on my Pi exhibiting the problem. As I have had absolutely zero luck with uSD cards, the Pi in question has an NFS root, which the kernel configures on boot with a one-time dhcp lease it never renews. (The Perl recovery script posted above runs from a chroot on the uSD card).

I got thoroughly pissed off with systemd failures (it managed to kill sshd for reasons I still don't understand, because the logging subsystem crashed and filled the logs(!) with errors about it failing to restart), so blew it away and replaced it with the tried-and-tested sysvinit (which everything else I run uses and seems to still work fine).

prof7bit commented 7 years ago

I also tried using "inet dhcp" on br0, this will invoke dhclient instead of dhcpcd but it also makes no difference.

JamesH65 commented 7 years ago

Weirder and weirder.

Ran for 1h15m with nothing in dmesg, then started up dhcpcd and within 1 second (yes, 1s) was getting deferred kevents, and after about 5 minutes, a kevent dropped which kills eth0.

So it seems I have a solid way of replicating that gives the same symptoms are what you guys are seeing. I am suspecting some sort of network dhcp packet perhaps related to dhcp clients (not necessarily on the AP itself) is causing the issue, but still to confirm. I also have a vague suspicion this packet needs to come in over the Wifi which might indicate an issue with the Wifi driver rather than the ethernet.

prof7bit commented 7 years ago

How many devices do you have talking to the AP? Mine is a very simply setup, Pi set up as AP with bridging, with one Pi attached to ethernet port and one to the wlan

seems to be basically the same as mine:

cat /etc/hostapd/hostapd.conf 
# Bridge-Betrieb
bridge=br0
#
# Schnittstelle und Treiber
interface=wlan0
#driver=nl80211

# WLAN-Konfiguration
ssid=[redacted]
channel=13
hw_mode=g
ieee80211n=1
ieee80211d=1
country_code=DE
wmm_enabled=1

# WLAN-Verschlüsselung
auth_algs=1
wpa=2
wpa_key_mgmt=WPA-PSK
rsn_pairwise=CCMP
wpa_passphrase=[redacted]

beacon_int=100
cat /etc/network/interfaces
# interfaces(5) file used by ifup(8) and ifdown(8)

# Please note that this file is written to be used with dhcpcd
# For static IP, consult /etc/dhcpcd.conf and 'man dhcpcd.conf'

# Include files from /etc/network/interfaces.d:
source-directory /etc/network/interfaces.d

auto lo
iface lo inet loopback

auto eth0
iface eth0 inet manual

auto wlan0
iface wlan0 inet  manual
#    wpa-conf /etc/wpa_supplicant/wpa_supplicant.conf

allow-hotplug wlan1
iface wlan1 inet manual
    wpa-conf /etc/wpa_supplicant/wpa_supplicant.conf

auto br0
iface br0 inet dhcp
    bridge_ports eth0 wlan0
    bridge_fd 0
    bridge_stp no
    bridge_hw b8:27:eb:d8:1f:72

and dhcpcd has been disabled via systemctl.

I'm setting the br0 HW address manually to the same as eth0 because it always wants to use the MAC from wlan0 and this causes duplicate entries in my fritzbox router. Currently I only have one client using the pi as access point and this is an android phone from LG.

It all works fine when I blacklist the driver for the built in wlan and plug in an external rt2800usb wifi adapter which then will become wlan0 instead of the built in one so I don't need to change any other configuration files.

JamesH65 commented 7 years ago

Are you guys using IPv4 or IPv6?

I've captured a wireshark trace at the point of error on wlan0, and its getting IPv6 router solicitations on what I thought was an entirely IPv4 network.. Am investigating, again.

dickontoo commented 7 years ago

Both. I have a /64 announced to the network, and hosts SLAAC as usual. No DHCPv6 that I'm aware of.

raspiuser123 commented 7 years ago

I have the same problem.

Did one of you tried to compile the brcmfmac driver by yourself?

My Raspberry sometimes run for few days, sometimes one interface stops after few minutes. :-(

JamesH65 commented 7 years ago

It gets built during a kernel build which is what I am doing when adding debug logging to try and diagnose the issue.

On 27 February 2017 at 10:28, raspiuser123 notifications@github.com wrote:

I have the same problem.

Did one of you tried to compile the brcmfmac driver by yourself?

My Raspberry sometimes run for few days, sometimes one interface stops after few minutes. :-(

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_raspberrypi_firmware_issues_673-23issuecomment-2D282683146&d=DwMCaQ&c=DpyQ_ftY536pf7wCBQXXU58xADDRY77THQzJu1OmzOo&r=w09_2ePv8G3zRjoV19Wm1Q6rI7CDlOns4PuRv2hHkek&m=VmISTH8h29c4ehNn_96FGgRQc2vOoiswq1A7H0Ansf8&s=DvjqPnsPY9glHU2FlyiBWx1fOStJztw08cp6yhfd8mk&e=, or mute the thread https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_notifications_unsubscribe-2Dauth_ADqrHXa5b7D9f2VaSYrPexNgFmT40xPRks5rgqVGgaJpZM4KaDH1&d=DwMCaQ&c=DpyQ_ftY536pf7wCBQXXU58xADDRY77THQzJu1OmzOo&r=w09_2ePv8G3zRjoV19Wm1Q6rI7CDlOns4PuRv2hHkek&m=VmISTH8h29c4ehNn_96FGgRQc2vOoiswq1A7H0Ansf8&s=eK2B72ObhTDMBUT6DUY9A_oaXtwX3QwmC3EP7OYOXEo&e= .

-- James Hughes Principal Software Engineer, Raspberry Pi (Trading) Ltd

raspiuser123 commented 7 years ago

And how can this problem be solved?

JamesH65 commented 7 years ago

By investigation - find the problem, fix the problem.

I've been looking at it for 2 weeks so far, have made progress, but still not at a root cause.

On 27 February 2017 at 10:37, raspiuser123 notifications@github.com wrote:

And how can this problem be solved?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_raspberrypi_firmware_issues_673-23issuecomment-2D282685048&d=DwMCaQ&c=DpyQ_ftY536pf7wCBQXXU58xADDRY77THQzJu1OmzOo&r=w09_2ePv8G3zRjoV19Wm1Q6rI7CDlOns4PuRv2hHkek&m=W5DKfp7kahR2VOAnZdbCRyYy97C1TUB6GOmRZ9J7byI&s=5p0_ugwqQrlUH8r805AcQTQr1-f29r2zoL-WctsI_c0&e=, or mute the thread https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_notifications_unsubscribe-2Dauth_ADqrHe-2DY9z2qUNuLFT1G3K4lO5W6R1fjks5rgqdXgaJpZM4KaDH1&d=DwMCaQ&c=DpyQ_ftY536pf7wCBQXXU58xADDRY77THQzJu1OmzOo&r=w09_2ePv8G3zRjoV19Wm1Q6rI7CDlOns4PuRv2hHkek&m=W5DKfp7kahR2VOAnZdbCRyYy97C1TUB6GOmRZ9J7byI&s=8j0YOMcOiOXReWjnnqXAN8lyQ-vdtjkrRyrdUV4Ftu8&e= .

-- James Hughes Principal Software Engineer, Raspberry Pi (Trading) Ltd