raspberrypi / firmware

This repository contains pre-compiled binaries of the current Raspberry Pi kernel and modules, userspace libraries, and bootloader/GPU firmware.

Ethernet locks up when bridged with wifi #673

Closed dickontoo closed 7 years ago

dickontoo commented 7 years ago

On a Pi 3, the ethernet will randomly lock up when bridged with the wifi interface. This takes the form of:

smsc95xx 1-1.1:1.0 eth0: kevent 0 may have been dropped

after which the wired ethernet stops working. Wifi continues as usual; devices associated with the Pi when in AP mode can ping it and each other, but (obviously) no packets are forwarded to the wired network. This severely limits its use as a wifi AP.

kevent 0 would appear to be EVENT_TX_HALT, which is triggered when the interrupt handler has too much work to do, and hands the processing off to the kworker thread. For some reason, although the worker thread seems to be executing correctly, the condition isn't cleared, probably in the hardware. I've no idea why this bug seems to be tickled by bridging it with the wifi. I've now reached the end of my kernel knowledge.
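
To make the mechanism concrete, here is a toy model of how that log line can arise. It is a sketch based on a reading of usbnet_defer_kevent() in drivers/net/usb/usbnet.c, not the driver code itself: the class and method names are made up for illustration, and the only detail taken from this report is that event bit 0 is EVENT_TX_HALT. The assumption is that the message is printed when the worker is already queued at the moment another event is deferred.

```python
EVENT_TX_HALT = 0  # bit number; matches the "kevent 0" in the log line


class ToyUsbnet:
    """Hypothetical stand-in for the usbnet device state (not kernel code)."""

    def __init__(self):
        self.flags = 0             # pending event bits
        self.work_pending = False  # is the worker already queued?
        self.log = []

    def schedule_work(self):
        # Returns False when the work item is already queued.
        if self.work_pending:
            return False
        self.work_pending = True
        return True

    def defer_kevent(self, event):
        # The flag is always set; the warning only reports that the
        # worker could not be queued a second time.
        self.flags |= 1 << event
        if not self.schedule_work():
            self.log.append(f"kevent {event} may have been dropped")

    def run_worker(self):
        # The worker picks up every flagged event and clears the flags.
        self.work_pending = False
        handled = [bit for bit in range(8) if self.flags & (1 << bit)]
        self.flags = 0
        return handled


dev = ToyUsbnet()
dev.defer_kevent(EVENT_TX_HALT)  # queued normally, nothing logged
dev.defer_kevent(EVENT_TX_HALT)  # worker still queued, warning logged
print(dev.log)
```

In this model the flag bit is still set when the warning is logged, which is presumably why the kernel message says "may have been dropped" rather than "was dropped". It does not explain why the condition never clears on the real hardware.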

There's a thread here on the forums.

Thanks.

raspiuser123 commented 7 years ago

OK, thanks. Then I will use another wifi device in the meantime.

premysljordak commented 7 years ago

I have exactly the same problem. Humbly waiting for a solution. Thank you!

raspiuser123 commented 7 years ago

I reduced the MTU of eth0 to 1400 (wlan0 has an MTU of 1500) and now eth0 runs stable. ;-) At the moment there are no more "smsc95xx 1-1.1:1.0 eth0: kevent 0 may have been dropped" entries.

premysljordak commented 7 years ago

Hey raspiuser123. Thank you for your advice. It has been working for me for 2+ hours now. I added

iface eth0 inet manual
    post-up ifconfig eth0 mtu 1492

to /etc/network/interfaces

premysljordak commented 7 years ago

Well, after a night with an Android phone connected and no problems, my wife used a Windows 7 notebook and now I am back to the dropped kevent :(

prof7bit commented 7 years ago

I have noticed something I do not understand: I have played with the MTU settings and no matter whether I set the MTU of eth0 or wlan0 or both to 1492 I still can ping my mobile wlan device from the ethernet side through this bridge with a 1500 packet:

ping -s 1472 -M do android-bernd.fritz.box
PING android-bernd.fritz.box (10.0.1.3) 1472(1500) bytes of data.
1480 bytes from android-bernd.fritz.box (10.0.1.3): icmp_seq=1 ttl=64 time=196 ms

(It is not accidentally attached to the fritzbox; it's attached to the Raspi access point.)

Shouldn't I receive "Frag needed and DF set (mtu = 1492)" when the bridge has a smaller MTU? How can it still forward packets larger than its own MTU? Might ignoring the MTU and forwarding oversized packets be related to the problem?
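
For reference, the 1472(1500) figure in the ping output above is the ICMP payload plus headers. A small sketch of that arithmetic, assuming plain IPv4 without options (20-byte header) and the 8-byte ICMP echo header; the function names are illustrative only:

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
ICMP_HEADER = 8   # bytes, ICMP echo request header


def ip_packet_size(ping_payload):
    """Size of the IP packet produced by `ping -s <ping_payload>`."""
    return ping_payload + ICMP_HEADER + IPV4_HEADER


def fits(ping_payload, mtu):
    """True if the resulting packet fits the MTU without fragmentation."""
    return ip_packet_size(ping_payload) <= mtu


def largest_payload(mtu):
    """Largest -s value that still fits the given MTU."""
    return mtu - ICMP_HEADER - IPV4_HEADER


print(ip_packet_size(1472), fits(1472, 1500), fits(1472, 1492))
print(largest_payload(1492))
```

With -s 1472 the resulting IP packet is exactly 1500 bytes, so it fits a 1500-byte MTU but is 8 bytes too large for 1492; the largest payload that fits 1492 is 1464.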

Minims commented 7 years ago

I've tried setting the MTU to 1456 and 1400 on eth0, but I still have the same problem in less than 1 hour.

dickontoo commented 7 years ago

Try setting it on the bridge device, not the component interfaces?

Minims commented 7 years ago

Same issue on br0. Interfaces not reachable after 2 hours :-/

JamesH65 commented 7 years ago

I presume you are seeing the kevent dropped message in dmesg?

Minims commented 7 years ago

Yes always this error

Noltari commented 7 years ago

On LEDE, changing the MTU has no effect at all. After bridging ethernet and wifi, "kevent may have been dropped" errors trigger after a single speed test with a single computer connected via WiFi to the RPi3.

Abraham1220 commented 7 years ago

Set up a new RPi3 (Raspbian Lite) with bridge, static IP, dhcpcd disabled and networking enabled. As soon as I connect one WLAN client I get "kevent 0 may have been dropped" in syslog and dmesg and lose LAN availability. Is there an old version which works? Does anyone know? As I need the Pi in a "production environment" I cannot test a lot with it ...

dickontoo commented 7 years ago

This bug has been present in kernel images and firmware since the launch of the Pi3; to the best of my knowledge there have been no images which have proved completely stable. I think I've tried most of them in the intervening 12 months or so.

That said, the Perl script above does a reasonable job if you don't mind losing connectivity for a few seconds each time. Otherwise, you'll have to wait for JamesH65 to find and fix it (or do so yourself, of course...).

At present, the Pi3 is not suitable for use as a Wifi AP in production environments using the in-built hardware. You may find a USB ethernet adaptor works better.

JamesH65 commented 7 years ago

I'm open to help on this one! I also believe it has been there for some time, but is more prevalent on the wireless models.

Minims commented 7 years ago

@dickontoo

At present, the Pi3 is not suitable for use as a Wifi AP in production environments using the in-built hardware.

You mean in bridge mode, because I use it as AP with dnsmasq to provide dhcp and it works well. The problem only appears with bridge mode for me.

Minims commented 7 years ago

@JamesH65 are there any bridge options we can use to prevent this failure?

dickontoo commented 7 years ago

If you're not bridging, you're not a true access point: APs media-convert between wifi and wired networks. A dnsmasq setup of the sort I think you're talking about doesn't do this: it'll set up one network on the wifi and route to the wired, at which point it's a router (or hotspot), not an AP. A true AP will treat both wifi and wired as the same network, which allows you to have multiple APs on the same network and allows transparent roaming between them, for example. See the Wikipedia entry for details.

Yes, the Pi3 will be fine in such configurations. But that isn't a wifi AP.

ATM, I have an old 3Com AP at the front of the house (where I don't use wifi all that much), and a Pi3 being troublesome at the back (where I do). I can roam between them, carrying the same IP addresses with me, more or less entirely transparently (there's a bit of time where the client hunts for another suitable BSSID, but connections stay up). I don't believe this is possible with a setup you've described.

Minims commented 7 years ago

You're right, I can't do such a thing; that's why I want to bridge wlan0 & eth0. Sorry for the confusion over the AP definition :-).

Abraham1220 commented 7 years ago

@JamesH65, I am happy to assist and support you in identifying the root cause. I am not familiar with debugging or development in the Linux environment, so I rely on your expertise. But if I can do something for you, please let me know.

kr/m

JamesH65 commented 7 years ago

I've been back on this for the last day, but am having trouble replicating with the 4.9 kernel. Has anyone above tried the 4.9 kernel, and still sees the issue?

4.9 kernel has quite a few differences in the Broadcom wireless driver, including some in the areas I have been looking at. It's quite possible one of these changes has fixed the issue. I'm not seeing any wireshark errors at all that I was previously getting regularly.

dickontoo commented 7 years ago

There's a 4.9? I'm only seeing 4.4.50.

pelwell commented 7 years ago

If you run sudo rpi-update you will get a 4.9 kernel, but Raspbian is still on 4.4.

dickontoo commented 7 years ago

I thought we weren't supposed to do that any more...

pelwell commented 7 years ago

We advise people who don't know what they are doing, or don't understand the risks, that randomly updating to the latest kernel may not be in their best interests, but we rely on some people trying it to improve our test coverage. On this occasion we have a good reason to ask you to upgrade. It's up to you if you want to take the risk, but I've been running 4.9 happily for a while now, and it won't be long before Raspbian makes the switch.

Noltari commented 7 years ago

@JamesH65 nope, still happens on 4.9 with LEDE.

dickontoo commented 7 years ago

Not much of a risk for me. I note that rpi-update doesn't regenerate the initramfs, though...

pelwell commented 7 years ago

I note that rpi-update doesn't regenerate the initramfs, though...

That's correct. We are considering some initramfs-related changes though - it might be worth creating an issue about it.

JamesH65 commented 7 years ago

OK, still seems to be there according to Alvaro. I do have a band-aid fix that I think might make the problem a lot rarer, but it cures the symptom, not the cause.

Here's a patch for the 4.9 branch (might work on older ones too):

diff --git a/drivers/net/usb/usbnet.c b/drivers/net/usb/usbnet.c
index d5071e3..e4da362 100644
--- a/drivers/net/usb/usbnet.c
+++ b/drivers/net/usb/usbnet.c
@@ -457,8 +457,11 @@ static enum skb_state defer_bh(struct usbnet *dev, struct sk_buff *skb,
  */
 void usbnet_defer_kevent (struct usbnet *dev, int work)
 {

JamesH65 commented 7 years ago

Here is a kernel image with the band aid, needs 4.9 modules. Would appreciate if someone could test on a system they know exhibits the problem.

https://www.dropbox.com/s/8687t0b8uhi2pjg/kernel7.img?dl=0

TIA

dickontoo commented 7 years ago

I've finally got the thing booted -- it's 4.9.17-v7+, whereas rpi-update installs 4.9.19-v7+ so the modules aren't in the right place in the initrd -- and will let you know.

JamesH65 commented 7 years ago

I'll try and update my tree tomorrow to make it closer to the rpi-update version, although it's only two days old so I'm not sure where the version difference has come from.

dickontoo commented 7 years ago

0632 this morning:

[Mar31 06:32] nfs: server 172.29.23.1 not responding, timed out
[ +16.399958] calling stall drivers/usb/host/dwc_otg/dwc_otg_hcdintr.c handle
[ +0.000286] Have a usbnet deferred kevent 0

(roughly 50 of those pairs over the course of three minutes; 0635:)

[ +0.018653] smsc95xx 1-1.1:1.0 eth0: kevent 0 may have been dropped

and the Perl script above kicks in.

JamesH65 commented 7 years ago

Hmm. Did it last longer than normal? I've been running overnight, I get the deferred statement, but no drops.

dickontoo commented 7 years ago

Until I rebooted it yesterday (to run your new kernel) it'd been running fine for a couple of weeks.

JamesH65 commented 7 years ago

OK. I'll keep digging. It seems my mechanism to prod the error to happen is no longer prodding which does make things difficult to track down.

JamesH65 commented 7 years ago

Can I ask those that are seeing the issue (I can no longer replicate it) to give me some idea of what sort of traffic is going over the bridge? Previously I have been sending a load of data from a wireless connected device, via the AP/bridge, to a slave device on the ethernet. What set the issue off during the stress test was running DHCPCD on the AP; this gave the dropped packet every time. Now it doesn't. So I need some more detailed information on what sort of traffic is going over the wireless to find another way of replicating the issue. Meanwhile I am going to revert to earlier kernel versions to try and get my old mechanism back.

dickontoo commented 7 years ago

In my case, the usual background noise (dhcp, arp, (m)DNS, IPv6 neighbour discovery, etc.) with the usual sort of HTTP(S) nonsense Android and Apple devices cause, plus a lot of mosh (UDP), a lot of NFS when watching TV, some ssh, and some raw TCP connections that just spit out environmental data (temperatures and whatnot). I've been unable to spot any actual trigger packet.

Of course, as is normal when trying to debug sporadic issues, it's been running fine all weekend and it hasn't dropped since I reported on Friday.

JamesH65 commented 7 years ago

Wild stab in the dark, given the experiments done with DHCPCD above, but if people put

denyinterfaces eth0
denyinterfaces wlan0

at the bottom of their /etc/dhcpcd.conf file, does that make any difference?

JamesH65 commented 7 years ago

Just a quick placeholder for the setup I am using on my access point, so I don't lose the details. This setup does, infrequently, give kevent dropped errors, whilst being slightly stressed by continual scp's over the bridge. Host Pi->AP Pi->SlavePi->AP Pi->Host Pi

/etc/network/interfaces

# interfaces(5) file used by ifup(8) and ifdown(8)

# Please note that this file is written to be used with dhcpcd
# For static IP, consult /etc/dhcpcd.conf and 'man dhcpcd.conf'

# Include files from /etc/network/interfaces.d:
source-directory /etc/network/interfaces.d

auto lo
iface lo inet loopback

auto eth0
iface eth0 inet manual

auto wlan0
iface wlan0 inet manual

allow-hotplug wlan1
iface wlan1 inet manual
    wpa-conf /etc/wpa_supplicant/wpa_supplicant.conf

auto br0
iface br0 inet static
address 192.168.0.1
netmask 255.255.255.0
bridge_ports eth0 wlan0

/etc/dnsmasq.conf

interface=br0
dhcp-range=192.168.0.2,192.168.0.19,255.255.255.0,5000h

/etc/dhcpcd.conf

# A sample configuration for dhcpcd.
# See dhcpcd.conf(5) for details.

# Allow users of this group to interact with dhcpcd via the control socket.
#controlgroup wheel

# Inform the DHCP server of our hostname for DDNS.
hostname

# Use the hardware address of the interface for the Client ID.
clientid
# or
# Use the same DUID + IAID as set in DHCPv6 for DHCPv4 ClientID as per RFC4361.
#duid

# Persist interface configuration when dhcpcd exits.
persistent

# Rapid commit support.
# Safe to enable by default because it requires the equivalent option set
# on the server to actually work.
option rapid_commit

# A list of options to request from the DHCP server.
option domain_name_servers, domain_name, domain_search, host_name
option classless_static_routes
# Most distributions have NTP support.
option ntp_servers
# Respect the network MTU.
# Some interface drivers reset when changing the MTU so disabled by default.
#option interface_mtu

# A ServerID is required by RFC2131.
require dhcp_server_identifier

# Generate Stable Private IPv6 Addresses instead of hardware based ones
slaac private

# A hook script is provided to lookup the hostname if not set by the DHCP
# server, but it should not be run by default.
nohook lookup-hostname

denyinterfaces wlan0
denyinterfaces eth0

/etc/hostapd/hostapd.conf

interface=wlan0
bridge=br0
ssid=TestDoNotUse2
hw_mode=g
channel=7
wmm_enabled=0
macaddr_acl=0
auth_algs=1
ignore_broadcast_ssid=0
wpa=2
wpa_passphrase=JamesTest
wpa_key_mgmt=WPA-PSK
wpa_pairwise=TKIP
rsn_pairwise=CCMP
ieee80211n=1

JamesH65 commented 7 years ago

One thing I have noticed, the brcmfmac: brcmf_proto_bcdc_hdrpull: wlan0: non-BCDC packet received, flags 0x7e error happens at a very consistent frequency, of between 64 and 65 seconds. Not sure what network traffic happens at that frequency though. Going to stick wireshark on for more investigation.

EDIT TO ADD: This message is caused by the dhcpcd client. Stop it and the message goes away; restart it and it comes back. Reading back through the messages above does imply this. Sorry about the noise.
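
For anyone who wants to check the spacing of a recurring message without eyeballing dmesg, the timestamps can be diffed. A minimal sketch, assuming each line carries a bracketed seconds-since-boot timestamp such as "[ 202.136415]" (the delta format shown by dmesg -d is not handled); the function name is illustrative, not part of any tool:

```python
import re

# Matches "[ 202.136415]" or "[417544.442828]" anywhere in the line,
# so syslog-prefixed kernel lines work too.
TIMESTAMP = re.compile(r"\[\s*(\d+\.\d+)\]")


def message_intervals(lines, needle):
    """Seconds between consecutive log lines that contain `needle`."""
    stamps = []
    for line in lines:
        if needle not in line:
            continue
        match = TIMESTAMP.search(line)
        if match:
            stamps.append(float(match.group(1)))
    # Pair each timestamp with its successor and take the difference.
    return [round(later - earlier, 6) for earlier, later in zip(stamps, stamps[1:])]
```

Feeding it the saved output of dmesg with needle set to "non-BCDC packet received" returns one interval per consecutive pair of matching lines.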

RobinMcCorkell commented 7 years ago

Since getting to the root cause seems to be difficult, perhaps a good first step would be to enable the driver to recover from a dropped kevent? At the moment the driver entirely stops working, but it seems to me that what should really happen is the faulty packet/packets get dropped, but the driver recovers and continues processing as normal. Perhaps implementing that is a smaller task than trying to debug the root cause of the dropped kevents.

JamesH65 commented 7 years ago

I've already tried a better mechanism to handle the kevents at a high level: moving to a dedicated workqueue rather than the system queue, which can cause delays when the stalls happen too close together. I suspect it will improve the situation, but testers found it didn't fix it completely. I am by no means an expert in this stuff, I'm learning it as I go along, so I would have no more idea how to fix it in the driver than how to fix the root issue.

There is an issue in the linux tree that I think is the same with some valid comments that I will look in to.

https://github.com/raspberrypi/linux/issues/1342#issuecomment-292040671

Abraham1220 commented 7 years ago

@JamesH65, I have now tested kernel 4.9.20-v7+ with my production environment. Immediately after I activated the hostapd WLAN bridge my LAN adapter lost its connection (there wasn't yet a client linked to the WLAN). With the old kernel the LAN connection was only lost when a client connected to the WLAN. As already mentioned, I now also get an additional message regarding brcmfmac in dmesg and syslog. See the log below.

FYI, I have some ICMP ping and UDP traffic on the LAN port running all the time. This seems to bring up the issue very fast.

Apr 9 09:58:17 asterix kernel: [ 202.136415] br0: port 2(wlan0) entered blocking state
Apr 9 09:58:17 asterix kernel: [ 202.136426] br0: port 2(wlan0) entered disabled state
Apr 9 09:58:17 asterix kernel: [ 202.137220] device wlan0 entered promiscuous mode
Apr 9 09:58:18 asterix kernel: [ 202.461555] IPv6: ADDRCONF(NETDEV_CHANGE): wlan0: link becomes ready
Apr 9 09:58:18 asterix kernel: [ 202.461663] br0: port 2(wlan0) entered blocking state
Apr 9 09:58:18 asterix kernel: [ 202.461667] br0: port 2(wlan0) entered forwarding state
Apr 9 09:58:20 asterix avahi-daemon[665]: Joining mDNS multicast group on interface wlan0.IPv6 with address fe80::ba27:ebff:feea:9ee.
Apr 9 09:58:20 asterix avahi-daemon[665]: New relevant interface wlan0.IPv6 for mDNS.
Apr 9 09:58:20 asterix avahi-daemon[665]: Registering new address record for fe80::ba27:ebff:feea:9ee on wlan0.*.
Apr 9 09:58:21 asterix ntpd[772]: Listen normally on 6 wlan0 fe80::ba27:ebff:feea:9ee UDP 123
Apr 9 09:58:21 asterix ntpd[772]: peers refreshed
Apr 9 09:58:22 asterix kernel: [ 206.710049] smsc95xx 1-1.1:1.0 eth0: kevent 0 may have been dropped
Apr 9 09:58:35 asterix kernel: [ 219.925775] smsc95xx 1-1.1:1.0 eth0: kevent 0 may have been dropped
Apr 9 09:58:36 asterix kernel: [ 220.965834] smsc95xx 1-1.1:1.0 eth0: kevent 0 may have been dropped
Apr 9 09:58:55 asterix hostapd: wlan0: STA 10:08:b1:e2:a6:77 IEEE 802.11: associated
Apr 9 09:58:55 asterix hostapd: wlan0: STA 10:08:b1:e2:a6:77 RADIUS: starting accounting session 58E9E99A-00000000
Apr 9 09:58:55 asterix hostapd: wlan0: STA 10:08:b1:e2:a6:77 WPA: pairwise key handshake completed (RSN)
Apr 9 09:59:07 asterix kernel: [ 251.848820] smsc95xx 1-1.1:1.0 eth0: kevent 0 may have been dropped
Apr 9 09:59:07 asterix kernel: [ 251.848890] smsc95xx 1-1.1:1.0 eth0: kevent 0 may have been dropped
Apr 9 09:59:07 asterix kernel: [ 251.848924] smsc95xx 1-1.1:1.0 eth0: kevent 0 may have been dropped
Apr 9 09:59:09 asterix kernel: [ 253.928965] brcmfmac: brcmf_proto_bcdc_hdrpull: wlan0: non-BCDC packet received, flags 0x46
Apr 9 09:59:20 asterix kernel: [ 264.645803] smsc95xx 1-1.1:1.0 eth0: kevent 0 may have been dropped
Apr 9 09:59:20 asterix kernel: [ 264.965788] smsc95xx 1-1.1:1.0 eth0: kevent 0 may have been dropped

RobinMcCorkell commented 7 years ago

I've started noticing these issues even without a kevent 0 may have been dropped message. I still get those messages (from which the Perl script can recover the Ethernet) but I've now at least twice seen the Ethernet adapter lock up in exactly the same way, but without any enlightening messages in dmesg.

Kernel 4.9.17-1-ARCH (Arch Linux). These are the things I see immediately prior to the adapter cutting out:

[417544.442828] brcmfmac: brcmf_proto_bcdc_hdrpull: wlxb827eb57290e: non-BCDC packet received, flags 0x36
[417669.786558] brcmfmac: brcmf_proto_bcdc_hdrpull: wlxb827eb57290e: non-BCDC packet received, flags 0x36
[417794.724138] brcmfmac: brcmf_proto_bcdc_hdrpull: wlxb827eb57290e: non-BCDC packet received, flags 0x36
[417919.868076] brcmfmac: brcmf_proto_bcdc_hdrpull: wlxb827eb57290e: non-BCDC packet received, flags 0x36
[418044.408549] brcmfmac: brcmf_proto_bcdc_hdrpull: wlxb827eb57290e: non-BCDC packet received, flags 0x36
[418170.109179] brcmfmac: brcmf_proto_bcdc_hdrpull: wlxb827eb57290e: non-BCDC packet received, flags 0x36
[418295.069812] brcmfmac: brcmf_proto_bcdc_hdrpull: wlxb827eb57290e: non-BCDC packet received, flags 0x36
[418419.710732] brcmfmac: brcmf_proto_bcdc_hdrpull: wlxb827eb57290e: non-BCDC packet received, flags 0x36
[418544.911030] brcmfmac: brcmf_proto_bcdc_hdrpull: wlxb827eb57290e: non-BCDC packet received, flags 0x36

A driver rebind fixes the issue. I will investigate other ways of detecting dropped Ethernet, in order to create a more generic script that can also recover from this issue.
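
As a starting point for such a script, the detection half for the logged case could look like the sketch below. It only recognises the "kevent N may have been dropped" line in the form it appears in this thread, so it would not catch the silent lockups described above; the function name and the returned tuple layout are illustrative choices, not an existing tool.

```python
import re

# Example of the line being matched (from this thread):
#   smsc95xx 1-1.1:1.0 eth0: kevent 0 may have been dropped
KEVENT_DROPPED = re.compile(
    r"(?P<driver>\S+) (?P<usb_if>\d[\d.:-]*) (?P<netdev>\S+): "
    r"kevent (?P<event>\d+) may have been dropped"
)


def find_dropped_kevents(lines):
    """Return (driver, usb interface, netdev, event number) for each hit."""
    hits = []
    for line in lines:
        match = KEVENT_DROPPED.search(line)
        if match:
            hits.append(
                (
                    match.group("driver"),
                    match.group("usb_if"),
                    match.group("netdev"),
                    int(match.group("event")),
                )
            )
    return hits
```

The USB interface string in the result (1-1.1:1.0 in the logs here) identifies which device the message came from, which a recovery step would need to know.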

JamesH65 commented 7 years ago

I'm currently trying to track down that bcdc message, started to make progress, slowly! There's an internal message sent where the first four bytes seem to get corrupted. Not sure if connected to this issue but there is some correlation.

JamesH65 commented 7 years ago

OK, some progress to report. After much digging and logging we have found the cause of the BCDC error message. There is a fault in the smsc95xx driver, and possibly the brcm wireless driver as well, in its handling of cloned skb buffers. When bridged, a broadcast packet is cloned in the bridge layer and the packets are passed on to the ethernet AND the wireless driver for sending out. A cloned buffer has different sk_buff structures, but both point to the same physical data. If a driver does not handle this correctly and writes to the buffer, it changes the same physical data that is also referenced by the clone in any other driver. Two drivers both using the same data buffer is bad news if one has altered the data in any way, which is what is happening here. The smsc driver was changing the data, which was wrecking the formatting in the buffer being used in the wireless driver. Thanks to @6by9 we have a two line fix which makes the BCDC error go away. We still need to check the wireless driver for similar behaviour.

What I don't know is if this also fixes this issue (I have no way to replicate at the moment), although I am hopeful. Would someone be amenable to trying the fix? If so, would a patch be sufficient to test or would you need new kernel and modules? I'm going to post the fix as a patch tomorrow when I get back in to work, but if more is required please let me know.

JamesH65 commented 7 years ago

Worth noting that cloned buffers have an internal flag that the drivers can check to see if they are cloned, so if they need to change the buffer it can be uncloned, which effectively makes a new copy of the buffer. This is what the smsc driver was not doing. The same mistake may also be present in other places in that driver, and in the wireless driver. This sort of error has the ability to cause chaos in lots of different areas, so I'm going to have a good look around these drivers for any areas where this sort of thing might happen. We have a couple of issues outstanding where WiFi stops working. This could be the culprit.
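
The aliasing problem described in the last two comments can be shown with a toy model. This is a sketch of the idea only, not kernel code: ToySkb, buggy_tx and fixed_tx are hypothetical names, a Python bytearray stands in for the shared packet data, and overwriting the first four bytes is a simplification of whatever header change the real driver makes.

```python
class ToySkb:
    """Toy stand-in for an sk_buff: a header object plus a data buffer."""

    def __init__(self, data, cloned=False):
        self.data = data      # bytearray, possibly shared with a clone
        self.cloned = cloned  # the "is this buffer shared?" flag

    def clone(self):
        # New header object, same underlying bytearray (no copy made).
        self.cloned = True
        return ToySkb(self.data, cloned=True)

    def unclone(self):
        # Take a private copy of the data before writing to it.
        if self.cloned:
            self.data = bytearray(self.data)
            self.cloned = False


def buggy_tx(skb):
    # Writes into the data without checking whether it is shared.
    skb.data[0:4] = b"\xde\xad\xbe\xef"


def fixed_tx(skb):
    # Checks the flag and copies first, so other users are unaffected.
    skb.unclone()
    skb.data[0:4] = b"\xde\xad\xbe\xef"


wifi_skb = ToySkb(bytearray(b"broadcast frame"))
eth_skb = wifi_skb.clone()
buggy_tx(eth_skb)
print(bytes(wifi_skb.data))  # the wifi side sees the ethernet driver's write

wifi_skb = ToySkb(bytearray(b"broadcast frame"))
eth_skb = wifi_skb.clone()
fixed_tx(eth_skb)
print(bytes(wifi_skb.data))  # unchanged this time
```

In the buggy case the write made on behalf of the ethernet side shows up in the buffer the wireless side is about to send, which is the kind of corruption described above; with the copy taken first, the wireless side's data is untouched.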

RobinMcCorkell commented 7 years ago

@JamesH65 Nice! I'm actually in the position to help test a patch, so if you share the diff (against 4.9 please) then I'll test it and report back. My device seems to be hitting the issue fairly frequently, so with a bit of luck we can quickly verify if the patch is insufficient, or if it is plausible that it fixed the problem.

dickontoo commented 7 years ago

I currently don't have a suitable ARM cross environment or Pi sources, so binaries would be better for me. Very happy to test, though.