Closed dickontoo closed 7 years ago
@JamesH65 I only need a patch to try the fix :)
Here is a git diff of the proposed fix. It has some debugging in it, and there is a proper skb function for uncloning, so this is not the final version.
diff --git a/drivers/net/usb/smsc95xx.c b/drivers/net/usb/smsc95xx.c
index df60c98..82f618c 100644
--- a/drivers/net/usb/smsc95xx.c
+++ b/drivers/net/usb/smsc95xx.c
@@ -2076,6 +2076,13 @@ static struct sk_buff *smsc95xx_tx_fixup(struct usbnet *dev,
 			return NULL;
 	}
 
+	if (skb_cloned(skb))
+	{
+		printk(KERN_ERR "Found a cloned skb");
+		if (pskb_expand_head(skb, 8, 0, GFP_ATOMIC))
+			return NULL;
+	}
+
 	if (csum) {
 		if (skb->len <= 45) {
 			/* workaround - hardware tx checksum does not work
I think this diff or something very similar will work against various versions down to 4.4, since this area hasn't changed.
Is there a simple set of instructions on compiling a Foundation kernel anywhere?
Ta
It's running, but it's awfully chatty. I get pages and pages of 'Found a cloned skb' on boot. I'm going to remove that and try again, as I'm not going to be able to spot any dropped kevents in that noise:
root@tellypi:~# dmesg | grep 'Found a cloned skb' | wc -l
3652
root@tellypi:~# uptime
10:11:32 up 5 min, 2 users, load average: 0.05, 0.28, 0.17
root@tellypi:~#
Booted, with no non-BCDC packets reported for the first time in months, so definitely progress.
Odd that there are so many; I was getting one every 10 seconds or so, if that. What is your network topology, how many attached devices, etc.?
ATM, two clients using the Pi as an AP (another Pi in the greenhouse, and my Apple laptop), then a bunch of hosts attached via an Ethernet switch (another Apple laptop, desktop (Debian), router (Debian; serving / to the affected Pi), Z-Wave controller (Raspbian), prehistoric 3Com AP, solar PV inverter). There's also my Android phone, which roams between the two APs depending on where I am in the house.
I expect the NFS root will have something to do with it.
OK, so that's a lot of devices, which makes it rather unpredictable what might be travelling over the network at any one time, but it's clearly going to be quite a bit, including lots of broadcasts (the trigger message in my tests), which does explain the number of cloned SKBs you are encountering. Is the phone a Samsung, BTW? Some people are reporting major wireless issues with Samsung phones (https://github.com/raspberrypi/linux/issues/1342).
Anyone have any reports yet ?
There's certainly a fair chunk of ARP on the network; I have a public-facing /28 of IPv4, which gets regular scans from the usual suspects, and quite a few of those addresses are only occasionally used. Plus when the laptop sleeps, the mosh clients running on it will cause the servers to trigger ARPs.
The phone is a 2012 Samsung Galaxy Relay S, running Cyanogenmod. I haven't been seeing the issues in that bug.
I'm also running into this issue in a project / production environment that I really need to get working asap. Running Raspbian with 4.4.50. My Pi is running a bridge, samba (USB Storage) and dnsmasq. Ethernet is into a Sonos Speaker, WiFi with an old Android tablet and Sonos app. Works really well until 'kevent 0 may have been dropped' and I receive DHCPOFFER and DHCPDISCOVER from the Sonos player but no more DHCPREQUEST or DHCPACK, effectively dropping the device from the wired network.
Trying the patch now.
After being up for nearly 8 hours, I have not seen a single non-BCDC packet warning or kevent error with the patch. At this point I think the BCDC error is fixed, however I would like more data before confirming a fix for the kevent issue.
It's looking pretty good here, since applying the patch in the other bug. Give it a couple of weeks, and I'll be quite happy to close this one. Think you've found it, @JamesH65 ...
I've rebuilt kernel 4.9.21-v7+ with your patch and am now running it on all 4 of my RPi3 access points. So far, so good. I too turned off the "Found a cloned skb" message because I was reliably seeing that message every 1-2 seconds. No BCDC messages so far, and everything's stable so far. I'll check in in a couple days or if I encounter any additional failures.
I'm running 4.9.21 with this patch, and dmesg is looking good so far. I'm going to try to stress this as much as possible over the next few days. Fingers crossed!
Running for 24 hours now, not a single non-BCDC message, dropped kevent, or other Ethernet hang. Looking good! Great job @JamesH65 😄
Running OK too on 4.9.21, but I have many 'Found a cloned skb' messages in the logs, like @dickontoo
pi@rpi3-dev:~$ dmesg | grep 'Found a cloned skb' | wc -l
3639
pi@rpi3-dev:~$
That's a debugging message that will be removed in the final patch. It's there to indicate how often the error could have occurred.
Patched 4.9.21-v7+ is looking good here too.
Added the bugfix to a 4.4.50-v7+ kernel. A lot of 'skb log' messages but the issue with the lost LAN connection is gone. On my setup the issue appeared immediately, so the bugfix seems to solve it!
Thanks @JamesH65!
Sounds like good news, thanks for the reports. I'll make some saner patches and send to the Linux kernel netdev mailing list for peer review and assessment.
Here is a better (hopefully) patch that uses the correct mechanism to sort out the buffer handling. Could people un-apply the previous patch and try this one? I've actually removed some driver code as I think it is replaced by the functionality in the skb_cow_head call, but this will need some decent testing.
diff --git a/drivers/net/usb/smsc95xx.c b/drivers/net/usb/smsc95xx.c
index df60c98..7895922 100644
--- a/drivers/net/usb/smsc95xx.c
+++ b/drivers/net/usb/smsc95xx.c
@@ -2067,6 +2067,12 @@ static struct sk_buff *smsc95xx_tx_fixup(struct usbnet *dev,
 	/* We do not advertise SG, so skbs should be already linearized */
 	BUG_ON(skb_shinfo(skb)->nr_frags);
 
+	/* Make writable and expand header space if required */
+	if (skb_cow_head(skb, overhead)) {
+		return NULL;
+	}
+
+	/*
 	if (skb_headroom(skb) < overhead) {
 		struct sk_buff *skb2 = skb_copy_expand(skb,
 			overhead, 0, flags);
@@ -2075,6 +2081,7 @@ static struct sk_buff *smsc95xx_tx_fixup(struct usbnet *dev,
 		if (!skb)
 			return NULL;
 	}
+	*/
 
 	if (csum) {
 		if (skb->len <= 45) {
diff --git a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/fwsignal.c b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/fwsignal.c
index a190f53..4940369 100644
--- a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/fwsignal.c
+++ b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/fwsignal.c
@@ -899,6 +899,12 @@ static u8 brcmf_fws_hdrpush(struct brcmf_fws_info *fws, struct sk_buff *skb)
 	fillers = round_up(data_offset, 4) - data_offset;
 	data_offset += fillers;
+	/* Possible we might receive a cloned skb, if this happens
+	 * we must ensure we can write to the header and that we have enough space
+	 * TODO - what happens if skb_cow_head fails?
+	 */
+	skb_cow_head(skb, data_offset);
+
 	skb_push(skb, data_offset);
 	wlh = skb->data;
@@ -2100,6 +2106,7 @@ int brcmf_fws_process_skb(struct brcmf_if *ifp, struct sk_buff *skb)
 	int rc = 0;
 	brcmf_dbg(DATA, "tx proto=0x%X\n", ntohs(eh->h_proto));
+
 	/* determine the priority */
 	if ((skb->priority == 0) || (skb->priority > 7))
 		skb->priority = cfg80211_classify8021d(skb, NULL);
Running now.
Sorry, yet another patch to the smsc driver. It's a valid change requested by the linux kernel net devs.
I'm still working on a proper fix for the wireless driver - that code is somewhat harder to fix properly due to its....nature....
diff --git a/drivers/net/usb/smsc95xx.c b/drivers/net/usb/smsc95xx.c
index df60c98..f6661e3 100644
--- a/drivers/net/usb/smsc95xx.c
+++ b/drivers/net/usb/smsc95xx.c
@@ -2067,13 +2067,13 @@ static struct sk_buff *smsc95xx_tx_fixup(struct usbnet *dev,
 	/* We do not advertise SG, so skbs should be already linearized */
 	BUG_ON(skb_shinfo(skb)->nr_frags);
 
-	if (skb_headroom(skb) < overhead) {
-		struct sk_buff *skb2 = skb_copy_expand(skb,
-			overhead, 0, flags);
+	/* Make writable and expand header space by overhead if required */
+	if (skb_cow_head(skb, overhead)) {
+		/* Must deallocate here as returning NULL to indicate error
+		 * means the skb won't be deallocated in the caller.
+		 */
 		dev_kfree_skb_any(skb);
-		skb = skb2;
-		if (!skb)
-			return NULL;
+		return NULL;
 	}
 
 	if (csum) {
--
Looks like this patch has been accepted by the Linux netdevs, so once it is merged upstream we can pull it back to the Pi kernel tree. Looks like the same issue has now been found in at least 6 other drivers unrelated to Pi, and I am pretty sure something similar is in the wireless stack we use, and I am currently trying to sort that out. Thanks for all the testing help. I'll leave this issue open until it is merged.
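For anyone wondering why a cloned skb matters here: as I understand the fix, a clone shares its packet data with the original, so a driver that writes its header in place can scribble over data another driver is still using. Below is a userspace toy model of that copy-on-write idea, not kernel code. All the `toy_*` names and the fixed 64-byte buffer are made up for illustration; the real helpers are `skb_cloned()` and `skb_cow_head()` as used in the patch.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for packet data shared between clones (hypothetical type,
 * not the kernel's sk_buff layout). */
struct toy_buf {
	int refcount;
	size_t len;
	unsigned char data[64];
};

/* Toy stand-in for an skb: a handle onto a possibly shared buffer. */
struct toy_skb {
	struct toy_buf *buf;
};

/* Allocate a fresh, unshared buffer holding a copy of payload. */
static int toy_alloc(struct toy_skb *skb, const void *payload, size_t len)
{
	if (len > sizeof(skb->buf->data))
		return -1;
	skb->buf = calloc(1, sizeof(*skb->buf));
	if (!skb->buf)
		return -1;
	skb->buf->refcount = 1;
	skb->buf->len = len;
	memcpy(skb->buf->data, payload, len);
	return 0;
}

/* Clone: a new handle onto the SAME data, as happens when one packet
 * is sent out of more than one interface. */
static void toy_clone(struct toy_skb *dst, const struct toy_skb *src)
{
	dst->buf = src->buf;
	dst->buf->refcount++;
}

static int toy_is_shared(const struct toy_skb *skb)
{
	return skb->buf->refcount > 1;
}

/* Copy-on-write: give skb a private copy if the data is shared.
 * Returns 0 on success, -1 if the copy could not be allocated. */
static int toy_cow(struct toy_skb *skb)
{
	struct toy_buf *copy;

	if (!toy_is_shared(skb))
		return 0;
	copy = malloc(sizeof(*copy));
	if (!copy)
		return -1;
	memcpy(copy, skb->buf, sizeof(*copy));
	copy->refcount = 1;
	skb->buf->refcount--;
	skb->buf = copy;
	return 0;
}

/* Drop this handle; the data is freed when the last handle goes. */
static void toy_free(struct toy_skb *skb)
{
	assert(skb->buf != NULL);
	if (--skb->buf->refcount == 0)
		free(skb->buf);
	skb->buf = NULL;
}
```

In this model, after `toy_clone(&b, &a)` a write through `b` would also change what `a` sees; calling `toy_cow(&b)` first gives `b` a private copy so `a` is left alone. As in the patch, the caller has to handle the failure return.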
Lovely, thank you. It's been running cleanly for the last seven hours here, but I haven't really stressed it yet.
Hey yall. I'm late to the conversation, but I was experiencing kevent drops culminating in ethernet deadlocks. My environment is an RPi3B running bleeding edge OpenWRT. I didn't see any kevent drops when running eth0 without wlan0. I was able to consistently reproduce the issue by enabling and bridging wlan0 to eth0 under lan.
I applied both of @JamesH65's patches mentioned above, and I am not seeing this issue in my test setup. Prior to his patches, I was seeing this issue almost immediately after bridging. Post patches, I have been running some mild throughput automation and I haven't seen a single instance of the issue (or any non-BCDC packet messages, for that matter).
Thanks for the patches @JamesH65 . I will update quickly if I am able to reproduce the issue post patches or update later with some uptime stats.
Good news, thanks for the report. The BCDC message was a symptom of the issue, but didn't necessarily happen every time. However, it was the reason I was able to track the issue down!
I've been having this exact same issue for a few weeks and just investigated and found this bug report. How long (days or weeks) do you think it might be before the fixes land via the usual apt-get update / upgrade? Wondering whether to bother setting up a cross compiler to build a kernel over the weekend, or if it'll be available by the time I get that done and built.
It has just been accepted on the linux netdev tree, so needs to now be merged to the main kernel tree, then backported to our kernel tree. I have NO idea how long that takes.
I reckon it might be quicker to do it yourself for the moment.
Ok. thanks very much for your efforts. Looking forward to a stable pi.
Only been running 10 mins so far but happy to say it seems fixed - or at the very least improved. As others said, it normally hung shortly after or during boot, with the interface accessible for only 10 seconds during boot. I'd then have to log in locally and restart networking to make it work again. I've rebooted a number of times and every time so far it has stayed working. Now for some extended testing.
Sorting the kernel tree is easy. It's then a case of prodding the right person to get the Raspbian repo updated which will require a fair amount of testing. There's also the small debate over when to switch from 4.4 to 4.9, so there may be a need to backport to both.
I've cherry-picked the commit and it should be available with the latest rpi-update kernel.
Excellent - thanks @popcornmix.
-- James Hughes Principal Software Engineer, Raspberry Pi (Trading) Ltd
Anyone else seeing corrupted data? Not sure if it's related or a separate bug.
Simple example - on Raspbian: netcat -k -t -l -p 1234
on another host:
cat | netcat raspberrypi 1234
Keep pasting a known line - soon enough it'll come through corrupted at the RPi end.
I discovered this with a websockets-based home automation messaging service I'm using, which kept failing with malformed packets. I wasn't sure if it was a bug in my code that hadn't shown up previously (seemed unlikely, as I've been using it with no problem for a few years now on a number of devices), but a simple test with netcat seems to suggest it's something else.
I've only seen the issue via the wired interface; clients connected via wlan0 seem unaffected.
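To make the netcat test above less manual, a small checker can count how many received lines differ from the known line. This is only a sketch: `count_corrupt_lines` is a hypothetical helper, it assumes every pasted line is identical, and it assumes each line fits in the 1024-byte buffer.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Read lines from `in` until EOF and compare each one with `expected`.
 * Returns the number of lines that differ; *total receives the number
 * of lines read. Line endings (\n or \r\n) are stripped before comparing. */
static long count_corrupt_lines(FILE *in, const char *expected, long *total)
{
	char line[1024];
	long bad = 0;

	assert(in != NULL && expected != NULL && total != NULL);
	*total = 0;
	while (fgets(line, sizeof(line), in)) {
		line[strcspn(line, "\r\n")] = '\0';
		(*total)++;
		if (strcmp(line, expected) != 0)
			bad++;
	}
	return bad;
}
```

One way to use it would be to call it from a small `main()` with `stdin` and the known line, then pipe the listener on the Pi into that program. Because `-k` keeps the listener open, the totals are only known once the pipe is closed.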
found this https://www.raspberrypi.org/forums/viewtopic.php?t=178933&p=1140487
@mangodan2003 are you testing with latest rpi-update kernel?
Sorry - I should have said.
My Pi is up to date (as of Saturday) with apt-get update and apt-get upgrade.
I then checked out 4.9.23-v7 from git and manually made the changes above, as they were trivial and I couldn't get the patch to apply cleanly the way I'm used to (patch --dry-run -p1 < /path/to/patch, then repeat without --dry-run if OK, which it wasn't).
Then I transferred the result (kernel image, modules, dtbs) to the Pi and rebooted.
And are you saying the corruption is only present after applying the patch, or did it also occur before the patch?
I have only tried and found the issue since applying the patch. I can try again with the stock kernel tonight if that helps.
Yes, please do. It would also be useful to test with the latest rpi-update kernel to rule out any issue you had with manually applying the patch/building the kernel.
Difficult to see how the patch could cause the problem described. I'll try and set something up here to replicate it, probably not today though.
I'm seeing no errors on either 4.4 or the latest 4.9 from rpi-update. Will look again tomorrow.
I'm unable to reproduce this easily at the moment; I suspect it occurs only in some very specific situation. Yesterday it would happen frequently - maybe 1 in 5 lines of text pasted into netcat.
I just cloned my SD card over Ethernet to another card (it's mounted read-only, by the way, so there's no corruption issue from cloning a running FS, and I ran fsck afterwards to be sure). I had to restart this process several times (with appropriate skip and seek at each end) as it kept dying with an error about corruption mentioning MAC. I didn't make a note of the exact error, but I suspect this is also related, as it's not an issue I've had before.
Edit: having done a quick Google, I believe this is the message I was seeing:
Corrupted MAC on input. Packet Corrupt
So I now have a second RPi set up to test on too. Once I've managed to reliably reproduce the issue, I'll try both the rpi-update kernel and a build of my current kernel without the above changes, and also check it's not confined to just the one Pi.
@mangodan2003 One cause of random corruption would be too high an overclock or an insufficient power supply.
Are you overclocking? Does vcgencmd get_throttled return non-zero after a failure?
Obviously there are other possibilities but we should rule out the obvious ones first.
not overclocked at all - never have been - not something i'm interested in these days. Running from 3Amp PSU.
Both pis are presently giving loads of errors - keep getting kicked out of ssh sessions because of it, rpi-update failed several times to download.
pi@raspberrypi:~ $ vcgencmd get_throttled
throttled=0x0
The issue persists when running the rpi-update kernel. I haven't bothered to do a build without the above patch, as the rpi-update kernel appears to be the same thing.
As this isn't related to the above changes, I'll post anything else I find in the forum thread linked above, re MAC spoofing.
I had wondered if it only happened whilst my active ssh sessions were via the wlan0 of the rpi being the AP but I ruled this out by logging into another host that is only connected to the other rpi (currently with no associated clients) via ethernet and setting it to regularly send strings. They also become corrupted from time to time.
rpi-update does include the patch that fixes the bridged wifi issue.
If you want to test without this patch then:
sudo rpi-update 06c104c37348e104d7bc108b8ad19697df93b589
will get last kernel before the patch.
Testing:
sudo BRANCH=stable rpi-update
will get you the stable 4.4 kernel. That would be another interesting data point for your netcat test.
Ah, whoops - that should have been obvious when I rebooted and didn't have to go and plug in HDMI and a keyboard to restart networking. Having just tried, I am still getting issues, but I also just checked and the PSU on this clone RPi is only 2 Amp, and I am seeing:
pi@lime:~ $ vcgencmd get_throttled
throttled=0x50000
I have not checked what that means but am popping out for a bit now. I'll get back to it later. There is nothing other than the keyboard plugged in that needs power - not sure whether 2 Amp is sufficient on the RPi 3 in that scenario.
Edit: 0x50000 means it has had "under voltage" and "throttled" conditions, so I'll try on this one again with a better PSU. However, the other (original) one with the 3 Amp PSU has no bits set and an uptime of nearly 50 hours now, so I am happy there is not a power issue there.
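For reference, the `get_throttled` value is a bit field, so it can be decoded mechanically. A minimal sketch follows, using the bit meanings from the Raspberry Pi documentation for `vcgencmd get_throttled` as I understand them. `describe_throttled` and the table are illustrative names, and newer firmware may define extra bits (such as a soft temperature limit) that are left out here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One flag in the get_throttled bit field. */
struct throttle_flag {
	uint32_t mask;
	const char *meaning;
};

/* Low bits describe the current state; bits 16 and up record that the
 * condition has occurred at some point since boot. */
static const struct throttle_flag throttle_flags[] = {
	{ 1u << 0,  "under-voltage detected now" },
	{ 1u << 1,  "ARM frequency capped now" },
	{ 1u << 2,  "currently throttled" },
	{ 1u << 16, "under-voltage has occurred" },
	{ 1u << 17, "ARM frequency capping has occurred" },
	{ 1u << 18, "throttling has occurred" },
};

/* Write a comma-separated description of `value` into `out`.
 * Returns the number of known flags that were set. */
static int describe_throttled(uint32_t value, char *out, size_t outlen)
{
	size_t n = sizeof(throttle_flags) / sizeof(throttle_flags[0]);
	int count = 0;

	assert(out != NULL && outlen > 0);
	out[0] = '\0';
	for (size_t i = 0; i < n; i++) {
		if (!(value & throttle_flags[i].mask))
			continue;
		if (count > 0)
			strncat(out, ", ", outlen - strlen(out) - 1);
		strncat(out, throttle_flags[i].meaning,
			outlen - strlen(out) - 1);
		count++;
	}
	if (count == 0)
		strncat(out, "no flags set", outlen - strlen(out) - 1);
	return count;
}
```

With `0x50000` this reports "under-voltage has occurred, throttling has occurred", which matches the reading above; `0x0` reports no flags set.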
On a Pi 3, the ethernet will randomly lock up when bridged with the wifi interface. This takes the form of:
smsc95xx 1-1.1:1.0 eth0: kevent 0 may have been dropped
after which the wired ethernet stops working. Wifi continues as usual; devices associated with the Pi when in AP mode can ping it and each other, but (obviously) no packets are forwarded to the wired network. This severely limits its use as a wifi AP.
kevent 0 would appear to be EVENT_TX_HALT, which is triggered when the interrupt handler has too much work to do, and hands the processing off to the kworker thread. For some reason, although the worker thread seems to be executing correctly, the condition isn't cleared, probably in the hardware. I've no idea why this bug seems to be tickled by bridging it with the wifi. I've now reached the end of my kernel knowledge.
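For decoding the number in that message: the usbnet event bits are small integers defined in `include/linux/usb/usbnet.h`. Here is a lookup sketch, with the values as I recall them from kernels of this era (check your own tree before relying on them). `usbnet_event_name` is a made-up helper, not a kernel function, and only the first few events are listed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Event numbers as recalled from include/linux/usb/usbnet.h.
 * Assumption: verify against the kernel tree you are actually running. */
static const char *const usbnet_event_names[] = {
	"EVENT_TX_HALT",    /* 0 */
	"EVENT_RX_HALT",    /* 1 */
	"EVENT_RX_MEMORY",  /* 2 */
	"EVENT_STS_SPLIT",  /* 3 */
	"EVENT_LINK_RESET", /* 4 */
	"EVENT_RX_PAUSED",  /* 5 */
};

/* Map the number from "kevent N may have been dropped" to a name,
 * or "unknown" if N is outside the table above. */
static const char *usbnet_event_name(int event)
{
	size_t n = sizeof(usbnet_event_names) / sizeof(usbnet_event_names[0]);

	if (event < 0 || (size_t)event >= n)
		return "unknown";
	return usbnet_event_names[event];
}
```

So `usbnet_event_name(0)` gives `EVENT_TX_HALT`, the event named in the message above.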
There's a thread here on the forums.
Thanks.