raspberrypi / firmware

This repository contains pre-compiled binaries of the current Raspberry Pi kernel and modules, userspace libraries, and bootloader/GPU firmware.

Ethernet locks up when bridged with wifi #673

Closed dickontoo closed 7 years ago

dickontoo commented 8 years ago

On a Pi 3, the ethernet will randomly lock up when bridged with the wifi interface. This takes the form of:

smsc95xx 1-1.1:1.0 eth0: kevent 0 may have been dropped

after which the wired ethernet stops working. Wifi continues as usual; devices associated with the Pi when in AP mode can ping it and each other, but (obviously) no packets are forwarded to the wired network. This severely limits its use as a wifi AP.

kevent 0 would appear to be EVENT_TX_HALT, which is triggered when the interrupt handler has too much work to do, and hands the processing off to the kworker thread. For some reason, although the worker thread seems to be executing correctly, the condition isn't cleared, probably in the hardware. I've no idea why this bug seems to be tickled by bridging it with the wifi. I've now reached the end of my kernel knowledge.
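
A minimal sketch for tallying these messages from saved `dmesg` output, so dropped kevents can be spotted among other log noise. The event-number table is an assumption based on `include/linux/usb/usbnet.h` and should be checked against the kernel source in use; `count_dropped_kevents` is a hypothetical helper name, not part of any tool mentioned in this thread.

```python
import re
from collections import Counter

# Assumed mapping from usbnet event numbers to names (see usbnet.h);
# verify against your own kernel tree before relying on it.
USBNET_EVENTS = {
    0: "EVENT_TX_HALT",
    1: "EVENT_RX_HALT",
    2: "EVENT_RX_MEMORY",
    3: "EVENT_STS_SPLIT",
    4: "EVENT_LINK_RESET",
}

# Matches e.g. "smsc95xx 1-1.1:1.0 eth0: kevent 0 may have been dropped"
KEVENT_RE = re.compile(r"(\S+): kevent (\d+) may have been dropped")


def count_dropped_kevents(log_text):
    """Return a Counter keyed by (interface, event name) for each dropped-kevent line."""
    counts = Counter()
    for match in KEVENT_RE.finditer(log_text):
        iface, number = match.group(1), int(match.group(2))
        name = USBNET_EVENTS.get(number, "unknown event %d" % number)
        counts[(iface, name)] += 1
    return counts


if __name__ == "__main__":
    sample = "smsc95xx 1-1.1:1.0 eth0: kevent 0 may have been dropped\n"
    for (iface, name), n in count_dropped_kevents(sample).items():
        print("%s: %s dropped %d time(s)" % (iface, name, n))
```

On a Pi, the text could come from `dmesg` redirected to a file and read back in.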

There's a thread here on the forums.

Thanks.

Noltari commented 7 years ago

@JamesH65 I only need a patch to try the fix :)

JamesH65 commented 7 years ago

Here is a git diff of the proposed fix. It has some debugging in it, and there is a proper skb function for uncloning, so this is not the final version.

diff --git a/drivers/net/usb/smsc95xx.c b/drivers/net/usb/smsc95xx.c
index df60c98..82f618c 100644
--- a/drivers/net/usb/smsc95xx.c
+++ b/drivers/net/usb/smsc95xx.c
@@ -2076,6 +2076,13 @@ static struct sk_buff *smsc95xx_tx_fixup(struct usbnet *dev,
                        return NULL;
        }

+       if (skb_cloned(skb))
+       {
+               printk(KERN_ERR "Found a cloned skb");
+               if (pskb_expand_head(skb, 8, 0, GFP_ATOMIC))
+                              return NULL;
+       }
+
        if (csum) {
                if (skb->len <= 45) {
                        /* workaround - hardware tx checksum does not work

JamesH65 commented 7 years ago

I think this diff or something very similar will work against various versions down to 4.4, since this area hasn't changed.

dickontoo commented 7 years ago

Is there a simple set of instructions on compiling a Foundation kernel anywhere?

pelwell commented 7 years ago

https://www.raspberrypi.org/documentation/linux/kernel/building.md

dickontoo commented 7 years ago

Ta

dickontoo commented 7 years ago

It's running, but it's awfully chatty. I get pages and pages of 'Found a cloned skb' on boot. I'm going to remove that and try again, as I'm not going to be able to spot any dropped kevents in that noise:

root@tellypi:~# dmesg | grep 'Found a cloned skb' | wc -l
3652
root@tellypi:~# uptime
 10:11:32 up 5 min,  2 users,  load average: 0.05, 0.28, 0.17
root@tellypi:~#

dickontoo commented 7 years ago

Booted, with no non-BCDC packets reported for the first time in months, so definitely progress.

JamesH65 commented 7 years ago

Odd that there are so many; I was getting one every 10 seconds or so, if that. What is your network topology, how many attached devices, etc.?

dickontoo commented 7 years ago

ATM, two clients using the Pi as an AP (another Pi in the greenhouse, and my Apple laptop), then a bunch of hosts attached via an Ethernet switch (another Apple laptop, desktop (Debian), router (Debian; serving / to the affected Pi), Z-Wave controller (Raspbian), prehistoric 3Com AP, solar PV inverter). There's also my Android phone, which roams between the two APs depending on where I am in the house.

I expect the NFS root will have something to do with it.

JamesH65 commented 7 years ago

OK, so that's a lot of devices, which makes it rather unpredictable what might be travelling over the network at any one time, but it's clearly going to be quite a bit, including lots of broadcasts (the trigger message in my tests), which does explain the number of cloned skbs you are encountering. Is the phone a Samsung, BTW? Some people are reporting major wireless issues with Samsung phones (https://github.com/raspberrypi/linux/issues/1342).

Does anyone have any reports yet?

dickontoo commented 7 years ago

There's certainly a fair chunk of ARP on the network; I have a public-facing /28 of IPv4, which gets regular scans from the usual suspects, and quite a few of those addresses are only occasionally used. Plus when the laptop sleeps, the mosh clients running on it will cause the servers to trigger ARPs.

The phone is a 2012 Samsung Galaxy Relay S, running Cyanogenmod. I haven't been seeing the issues in that bug.

husseinj commented 7 years ago

I'm also running into this issue in a project / production environment that I really need to get working asap. Running Raspbian with 4.4.50. My Pi is running a bridge, samba (USB Storage) and dnsmasq. Ethernet is into a Sonos Speaker, WiFi with an old Android tablet and Sonos app. Works really well until 'kevent 0 may have been dropped' and I receive DHCPOFFER and DHCPDISCOVER from the Sonos player but no more DHCPREQUEST or DHCPACK, effectively dropping the device from the wired network.

Trying the patch now.

RobinMcCorkell commented 7 years ago

After being up for nearly 8 hours, I have not seen a single non-BCDC packet warning or kevent error with the patch. At this point I think the BCDC error is fixed, however I would like more data before confirming a fix for the kevent issue.

dickontoo commented 7 years ago

It's looking pretty good here, since applying the patch in the other bug. Give it a couple of weeks, and I'll be quite happy to close this one. Think you've found it, @JamesH65 ...

anthem commented 7 years ago

I've rebuilt kernel 4.9.21-v7+ with your patch and am now running it on all 4 of my RPi3 access points. So far, so good. I too turned off the "Found a cloned skb" message because I was reliably seeing that message every 1-2 seconds. No BCDC messages so far, and everything's stable so far. I'll check in in a couple days or if I encounter any additional failures.

husseinj commented 7 years ago

I'm running 4.9.21 with this patch, and dmesg is looking good so far. I'm going to try to stress this as much as possible over the next few days. Fingers crossed!

RobinMcCorkell commented 7 years ago

Running for 24 hours now, not a single non-BCDC message, dropped kevent, or other Ethernet hang. Looking good! Great job @JamesH65 😄

Minims commented 7 years ago

Running OK too on 4.9.21, but I have many 'Found a cloned skb' messages in the logs, like @dickontoo:

pi@rpi3-dev:~$ dmesg | grep 'Found a cloned skb' | wc -l
3639
pi@rpi3-dev:~$

JamesH65 commented 7 years ago

That's a debugging message that will be removed in the final patch. It's there to indicate how often the error could have occurred.

burtyb commented 7 years ago

Patched 4.9.21-v7+ is looking good here too.

Abraham1220 commented 7 years ago

Added the bugfix to a 4.4.50-v7+ kernel. A lot of 'skb log' messages but the issue with the lost LAN connection is gone. On my setup the issue appeared immediately, so the bugfix seems to solve it!

Thanks @JamesH65!

JamesH65 commented 7 years ago

Sounds like good news, thanks for the reports. I'll make some saner patches and send to the Linux kernel netdev mailing list for peer review and assessment.

JamesH65 commented 7 years ago

Here is a better (hopefully) patch that uses the correct mechanism to sort out the buffer handling. Could people un-apply the previous patch and try this one? I've actually removed some driver code as I think it is replaced by the functionality in the skb_cow_header call, but this will need some decent testing.

diff --git a/drivers/net/usb/smsc95xx.c b/drivers/net/usb/smsc95xx.c
index df60c98..7895922 100644
--- a/drivers/net/usb/smsc95xx.c
+++ b/drivers/net/usb/smsc95xx.c
@@ -2067,6 +2067,12 @@ static struct sk_buff *smsc95xx_tx_fixup(struct usbnet *dev,
        /* We do not advertise SG, so skbs should be already linearized */
        BUG_ON(skb_shinfo(skb)->nr_frags);

+       /* Make writable and expand header space if required */
+       if (skb_cow_head(skb, overhead)) {
+               return NULL;
+       }
+
+       /*
        if (skb_headroom(skb) < overhead) {
                struct sk_buff *skb2 = skb_copy_expand(skb,
                        overhead, 0, flags);
@@ -2075,6 +2081,7 @@ static struct sk_buff *smsc95xx_tx_fixup(struct usbnet *dev,
                if (!skb)
                        return NULL;
        }
+       */

        if (csum) {
                if (skb->len <= 45) {
diff --git a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/fwsignal.c b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/fwsignal.c
index a190f53..4940369 100644
--- a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/fwsignal.c
+++ b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/fwsignal.c
@@ -899,6 +899,12 @@ static u8 brcmf_fws_hdrpush(struct brcmf_fws_info *fws, struct sk_buff *skb)
        fillers = round_up(data_offset, 4) - data_offset;
        data_offset += fillers;

+       /* Possible we might receive a cloned skb, if this happens
+        * we must ensure we can write to the header and that we have enough space
+        * TODO - what happens if skb_cow_head fails?
+        */
+       skb_cow_head(skb, data_offset);
+
        skb_push(skb, data_offset);
        wlh = skb->data;

@@ -2100,6 +2106,7 @@ int brcmf_fws_process_skb(struct brcmf_if *ifp, struct sk_buff *skb)
        int rc = 0;

        brcmf_dbg(DATA, "tx proto=0x%X\n", ntohs(eh->h_proto));
+
        /* determine the priority */
        if ((skb->priority == 0) || (skb->priority > 7))
                skb->priority = cfg80211_classify8021d(skb, NULL);

dickontoo commented 7 years ago

Running now.

JamesH65 commented 7 years ago

Sorry, yet another patch to the smsc driver. It's a valid change requested by the linux kernel net devs.

I'm still working on a proper fix for the wireless driver. That code is somewhat harder to fix properly due to its... nature...

diff --git a/drivers/net/usb/smsc95xx.c b/drivers/net/usb/smsc95xx.c
index df60c98..f6661e3 100644
--- a/drivers/net/usb/smsc95xx.c
+++ b/drivers/net/usb/smsc95xx.c
@@ -2067,13 +2067,13 @@ static struct sk_buff *smsc95xx_tx_fixup(struct usbnet *dev,
    /* We do not advertise SG, so skbs should be already linearized */
    BUG_ON(skb_shinfo(skb)->nr_frags);

-   if (skb_headroom(skb) < overhead) {
-       struct sk_buff *skb2 = skb_copy_expand(skb,
-           overhead, 0, flags);
+   /* Make writable and expand header space by overhead if required */
+   if (skb_cow_head(skb, overhead)) {
+       /* Must deallocate here as returning NULL to indicate error
+        * means the skb won't be deallocated in the caller.
+        */
        dev_kfree_skb_any(skb);
-       skb = skb2;
-       if (!skb)
-           return NULL;
+       return NULL;
    }

    if (csum) {
-- 

JamesH65 commented 7 years ago

Looks like this patch has been accepted by the Linux netdevs, so once it is merged upstream we can pull it back to the Pi kernel tree. Looks like the same issue has now been found in at least 6 other drivers unrelated to Pi, and I am pretty sure something similar is in the wireless stack we use, and I am currently trying to sort that out. Thanks for all the testing help. I'll leave this issue open until it is merged.
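
A toy model of why a cloned skb is a problem when two drivers each prepend their own header, and what taking a private copy first (the idea behind `skb_cow_head`) changes. This is a sketch under stated assumptions, not kernel code: `Skb`, `SharedData`, `transmit_both` and the header bytes `ETHCMD01` and `BCDC` are hypothetical stand-ins, and the sizes are chosen only for illustration.

```python
class SharedData:
    """Stands in for the packet data area that clones share."""

    def __init__(self, buf):
        self.buf = buf
        self.users = 1  # number of Skb objects pointing at this buffer


class Skb:
    """Toy stand-in for struct sk_buff: a private view onto shared bytes."""

    def __init__(self, shared, data_start):
        self.shared = shared
        self.data_start = data_start  # offset of the first valid byte

    @classmethod
    def alloc(cls, headroom, payload):
        return cls(SharedData(bytearray(headroom) + bytearray(payload)), headroom)

    def clone(self):
        # New descriptor, same underlying bytes.
        self.shared.users += 1
        return Skb(self.shared, self.data_start)

    def cow_head(self, needed):
        # Take a private copy if the data is shared or headroom is too small.
        if self.shared.users == 1 and self.data_start >= needed:
            return
        extra = max(0, needed - self.data_start)
        copied = bytearray(extra) + self.shared.buf
        self.shared.users -= 1
        self.shared = SharedData(copied)
        self.data_start += extra

    def push(self, header):
        # Prepend a header by writing into the headroom.
        start = self.data_start - len(header)
        if start < 0:
            raise ValueError("not enough headroom")
        self.shared.buf[start:self.data_start] = header
        self.data_start = start

    def data(self):
        return bytes(self.shared.buf[self.data_start:])


def transmit_both(use_cow):
    """Bridge one broadcast to two interfaces; return what each driver would send."""
    eth = Skb.alloc(headroom=8, payload=b"payload")
    wifi = eth.clone()
    if use_cow:
        eth.cow_head(8)
    eth.push(b"ETHCMD01")  # hypothetical 8-byte wired TX header
    if use_cow:
        wifi.cow_head(4)
    wifi.push(b"BCDC")     # hypothetical 4-byte wireless header
    return eth.data(), wifi.data()


if __name__ == "__main__":
    print("without copy:", transmit_both(use_cow=False))
    print("with copy:   ", transmit_both(use_cow=True))
```

Without the copy, the second push overwrites part of the first driver's header in the shared headroom; with it, each driver writes to bytes nobody else is using.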

dickontoo commented 7 years ago

Lovely, thank you. It's been running cleanly for the last seven hours here, but I haven't really stressed it yet.

c2tb34 commented 7 years ago

Hey yall. I'm late to the conversation, but I was experiencing kevent drops culminating in ethernet deadlocks. My environment is an RPi3B running bleeding edge OpenWRT. I didn't see any kevent drops when running eth0 without wlan0. I was able to consistently reproduce the issue by enabling and bridging wlan0 to eth0 under lan.

I applied both of @JamesH65's patches mentioned above, and I am not seeing this issue in my test setup. Prior to his patches, I was seeing this issue almost immediately after bridging. Post patches, I have been running some mild throughput automation and I haven't seen a single instance of the issue (or any non-BCDC packet messages, for that matter).

Thanks for the patches @JamesH65 . I will update quickly if I am able to reproduce the issue post patches or update later with some uptime stats.

JamesH65 commented 7 years ago

Good news, thanks for the report. The BCDC message was a symptom of the issue, but didn't necessarily happen every time. However, it was the reason I was able to track the issue down!

mangodan2003 commented 7 years ago

I've been having this exact same issue for a few weeks and just investigated and found this bug report. How long (days or weeks) do you think it might be before the fixes land via the usual apt-get update / upgrade? I'm wondering whether to bother setting up a cross compiler to build the kernel over the weekend, or if it'll be available by the time I get that done and built.

JamesH65 commented 7 years ago

It has just been accepted on the linux netdev tree, so needs to now be merged to the main kernel tree, then backported to our kernel tree. I have NO idea how long that takes.

I reckon it might be quicker to do it yourself for the moment.

mangodan2003 commented 7 years ago

Ok. thanks very much for your efforts. Looking forward to a stable pi.

mangodan2003 commented 7 years ago

It's only been running 10 minutes so far, but I'm happy to say it seems fixed, or at the very least improved. As others said, it normally hung shortly after or during boot, with the interface being accessible for only 10 seconds during boot. I'd then have to log in locally and restart networking to make it work again. I've rebooted a number of times and every time so far it has stayed working. Now for some extended testing.

6by9 commented 7 years ago

Sorting the kernel tree is easy. It's then a case of prodding the right person to get the Raspbian repo updated which will require a fair amount of testing. There's also the small debate over when to switch from 4.4 to 4.9, so there may be a need to backport to both.

popcornmix commented 7 years ago

I've cherry-picked the commit and it should be available with latest rpi-update kernel.

JamesH65 commented 7 years ago

Excellent - thanks @popcornmix.

-- James Hughes Principal Software Engineer, Raspberry Pi (Trading) Ltd

mangodan2003 commented 7 years ago

Is anyone else seeing corrupted data? I'm not sure if it's related or a separate bug.

Simple example, on Raspbian:

netcat -k -t -l -p 1234

On another host:

cat | netcat raspberrypi 1234

Keep pasting a known line; soon enough it'll come through corrupted at the RPi end.

I discovered this with a websockets-based home automation messaging service I'm using, which kept failing with malformed packets. I wasn't sure if it was a bug in my code that hadn't shown up previously (seemed unlikely, as I've been using it with no problem for a few years now on a number of devices), but a simple test with netcat seems to suggest it's something else.

I've only seen the issue via the wired interface; clients connected via wlan0 seem unaffected.

found this https://www.raspberrypi.org/forums/viewtopic.php?t=178933&p=1140487
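
The manual paste-and-watch test above could be automated along these lines. This is a rough sketch, not a tool used in the thread: `KNOWN_LINE`, `find_corrupt_lines`, `send_lines` and `receive_all` are hypothetical names, and the built-in demo only exercises loopback. On real hardware the listener would run on the Pi and the sender on another wired host.

```python
import socket
import threading

KNOWN_LINE = b"the quick brown fox jumps over the lazy dog\n"


def find_corrupt_lines(received, expected=KNOWN_LINE):
    """Return (index, line) pairs for every received line that differs from expected."""
    lines = received.splitlines(keepends=True)
    return [(i, line) for i, line in enumerate(lines) if line != expected]


def send_lines(host, port, count, line=KNOWN_LINE):
    """Sender side: transmit the same line `count` times, like pasting into netcat."""
    with socket.create_connection((host, port)) as conn:
        for _ in range(count):
            conn.sendall(line)


def receive_all(server_sock):
    """Listener side: accept one connection and read until the sender closes it."""
    conn, _addr = server_sock.accept()
    chunks = []
    with conn:
        while True:
            chunk = conn.recv(4096)
            if not chunk:
                break
            chunks.append(chunk)
    return b"".join(chunks)


if __name__ == "__main__":
    # Loopback self-check only; it does not touch the wired interface.
    server = socket.create_server(("127.0.0.1", 0))
    port = server.getsockname()[1]
    sender = threading.Thread(target=send_lines, args=("127.0.0.1", port, 1000))
    sender.start()
    data = receive_all(server)
    sender.join()
    server.close()
    print("corrupt lines:", len(find_corrupt_lines(data)))
```

Counting mismatches over a fixed number of lines gives a failure rate that can be compared between kernels, rather than an impression from watching the terminal.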

popcornmix commented 7 years ago

@mangodan2003 are you testing with latest rpi-update kernel?

mangodan2003 commented 7 years ago

sorry - I should have said.

My Pi is up to date (as of Saturday) with apt-get update and apt-get upgrade.

I then checked out 4.9.23-v7 from git and manually made the changes above, as they were trivial and I couldn't get the patch to apply cleanly the way I'm used to (patch --dry-run -p1 < /path/to/patch, then repeat without --dry-run if OK, which it wasn't).

Then transferred result (kernel image,modules, dtbs) to pi and rebooted.

popcornmix commented 7 years ago

And are you saying the corruption is only present after applying the patch, or did it also occur before the patch?

mangodan2003 commented 7 years ago

I have only tried and found the issue since applying the patch. I can try again with the stock kernel tonight if that helps.

popcornmix commented 7 years ago

Yes, please do. It would also be useful to test with latest rpi-update kernel to rule out any issue you had with manually applying the patch/building the kernel.

JamesH65 commented 7 years ago

Difficult to see how the patch could cause the problem described. I'll try and set something up here to replicate it, probably not today though.

JamesH65 commented 7 years ago

I'm seeing no errors on either 4.4 or the latest 4.9 from rpi-update. Will look again tomorrow.

mangodan2003 commented 7 years ago

I'm unable to reproduce this easily at the moment. I suspect there is some very specific situation in which it occurs. Yesterday it would happen frequently, maybe 1 in 5 lines of text pasted into netcat.

I just cloned my SD card via the Ethernet to another (it's mounted read-only, BTW, so there is no corruption issue cloning a running FS, and I ran fsck on it afterwards to be sure). I had to restart this process several times (with appropriate skip and seek at each end) as it kept dying with an error about corruption mentioning MAC. I didn't make a note of the exact error, but I suspect this is also related, as it's not an issue I've had before.

Edit: having done a quick Google, I believe this is the message I was seeing:

Corrupted MAC on input. Packet Corrupt

So I now have a second RPi set up and running to test on too. Once I've managed to reliably reproduce the issue, I'll try both the rpi-update kernel and a build of my current kernel without the above changes, and also check it's not confined to happening on just the one Pi.

popcornmix commented 7 years ago

@mangodan2003 One cause of random corruption would be too high an overclock or an insufficient power supply. Are you overclocking? Does vcgencmd get_throttled return non-zero after a failure? Obviously there are other possibilities but we should rule out the obvious ones first.

mangodan2003 commented 7 years ago

not overclocked at all - never have been - not something i'm interested in these days. Running from 3Amp PSU.

Both pis are presently giving loads of errors - keep getting kicked out of ssh sessions because of it, rpi-update failed several times to download.

pi@raspberrypi:~ $ vcgencmd get_throttled
throttled=0x0

The issue persists running the rpi-update kernel. I've not bothered to do a build without the above patch, as the rpi-update kernel appears to be the same thing.

As it's not related to the above changes, anything else I find I'll post in the forum thread linked above about MAC spoofing.

I had wondered if it only happened whilst my active ssh sessions were via the wlan0 of the rpi being the AP but I ruled this out by logging into another host that is only connected to the other rpi (currently with no associated clients) via ethernet and setting it to regularly send strings. They also become corrupted from time to time.

popcornmix commented 7 years ago

rpi-update does include the patch that fixes the bridged wifi issue. If you want to test without this patch then: sudo rpi-update 06c104c37348e104d7bc108b8ad19697df93b589 will get last kernel before the patch.

Testing: sudo BRANCH=stable rpi-update will get you the stable 4.4 kernel. That would be another interesting data point for your netcat test.

mangodan2003 commented 7 years ago

Ah, whoops. That should have been obvious when I rebooted and didn't have to go and plug HDMI and keyboard in to restart networking. Having just tried, I am still getting issues, but I also just checked and the PSU on this clone RPi is only 2 A, and I am seeing:

pi@lime:~ $ vcgencmd get_throttled
throttled=0x50000

I have not checked what that means but am popping out for a bit now. I'll get back to it later. There is nothing other than the keyboard plugged in to need power; I'm not sure if 2 A is sufficient on the RPi 3 or not in that scenario.

Edit: 0x50000 means it has had "under voltage" and "throttled" conditions, so I'll try on this one again with a better PSU. However, the other (original) one with the 3 A PSU has no bits set and an uptime of nearly 50 hours now, so I am happy there is not a power issue there.
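
A small sketch for decoding the bitmask that `vcgencmd get_throttled` prints, to show how 0x50000 breaks down. The bit meanings follow the documented flags for that command; `decode_throttled` is a hypothetical helper name, and only the six flags listed here are handled.

```python
# Bit positions in the get_throttled value and what each one reports.
THROTTLED_BITS = {
    0: "under-voltage detected now",
    1: "ARM frequency capped now",
    2: "currently throttled",
    16: "under-voltage has occurred since boot",
    17: "ARM frequency capping has occurred since boot",
    18: "throttling has occurred since boot",
}


def decode_throttled(output):
    """Parse a line such as 'throttled=0x50000' and list the flags that are set."""
    value = int(output.strip().split("=", 1)[1], 16)
    return [text for bit, text in sorted(THROTTLED_BITS.items()) if value & (1 << bit)]


if __name__ == "__main__":
    for line in ("throttled=0x0", "throttled=0x50000"):
        print(line, "->", decode_throttled(line) or "no flags set")
```

0x50000 is 0x10000 plus 0x40000, that is bits 16 and 18, which matches the "under voltage" and "throttled" reading given above.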