raspberrypi / linux

Kernel source tree for Raspberry Pi-provided kernel builds. Issues unrelated to the linux kernel should be posted on the community forum at https://forums.raspberrypi.com/
Other
11.03k stars 4.95k forks source link

Pi3b+: "kevent 4 may have been dropped" with lan78xx #2447

Closed jens-maus closed 5 years ago

jens-maus commented 6 years ago

Every second or third boot with a 4.14.26 kernel where the LAN78XX support had been enabled I receive the following notices in dmesg:

...
[   16.345186] lan78xx 1-1.1.1:1.0 eth0: kevent 4 may have been dropped
[   16.346150] lan78xx 1-1.1.1:1.0 eth0: kevent 4 may have been dropped
[   16.347155] lan78xx 1-1.1.1:1.0 eth0: kevent 4 may have been dropped
[   16.348168] lan78xx 1-1.1.1:1.0 eth0: kevent 4 may have been dropped
[   16.349165] lan78xx 1-1.1.1:1.0 eth0: kevent 4 may have been dropped
[   16.350173] lan78xx 1-1.1.1:1.0 eth0: kevent 4 may have been dropped
[   16.351150] lan78xx 1-1.1.1:1.0 eth0: kevent 4 may have been dropped
[   16.352149] lan78xx 1-1.1.1:1.0 eth0: kevent 4 may have been dropped
[   16.353164] lan78xx 1-1.1.1:1.0 eth0: kevent 4 may have been dropped
[   16.354165] lan78xx 1-1.1.1:1.0 eth0: kevent 4 may have been dropped
[   16.355150] lan78xx 1-1.1.1:1.0 eth0: kevent 4 may have been dropped
...

Sometimes the system still continues and is able to setup eth0 properly. However, every now and then the system is not able to bring up eth0 and thus the boot up process is interrupted.

Any idea where this might come from or how to debug this situation?

JamesH65 commented 6 years ago

Bugger, thought the kevent issue was ancient history. IIRC, it is because an event is stuck on the system event queue and then another one turns up before it is serviced. But it's not been reported for a couple of years, and that was on the smsc75xx driver. Not sure this is the same thing though, might be a systemd initialisation issue.

jens-maus commented 6 years ago

The system I am developing and compiling the raspberrypi kernel for is not using systemd at all but is based on buildroot without systemd.

JamesH65 commented 6 years ago

OK, do you get the same problem with Raspbian?

jens-maus commented 6 years ago

Haven‘t tested it yet with Raspbian. However, this is the kernel defconfig I am using:

https://github.com/jens-maus/RaspberryMatic/blob/master/buildroot-external/board/raspberrypi3/kernel_defconfig

Bilg21 commented 6 years ago

I have the same issue with Compute Module 3. But it looks like a hardware issue. It's not happening on another setup which is identical.

danijelt commented 6 years ago

I am also using Buildroot to make my own image, and I managed to resolve this issue by disabling VLAN support in kernel. In my case, I do have systemd, and this message appears after systemd starts hostname service:

[  OK  ] Started Hostname Service.
[   17.326467] lan78xx 1-1.1.1:1.0 eth0: kevent 4 may have been dropped
[   17.326507] 8021q: adding VLAN 0 to HW filter on device eth0
[   17.338617] lan78xx 1-1.1.1:1.0 eth0: kevent 4 may have been dropped
[   17.345081] lan78xx 1-1.1.1:1.0 eth0: kevent 4 may have been dropped

After disabling VLAN support (CONFIG_BRIDGE_VLAN_FILTERING, CONFIG_VLAN_8021Q and CONFIG_VLAN_8021Q_GVRP) in latest rpi-4.14-y (4.14.33) this message stopped appearing almost completely, and I don't have the lockup bug anymore.

Firmware is ae9a493932e47e08cabb25a2728037298075fd00 and userland is a343dcad1dae4e93f4bfb99496697e207f91027e.

JamesH65 commented 6 years ago

@danijelt @jens-maus We have committed three or four fixes to VLAN support in the lan78xx driver since 4.14.33, so its probably worth getting the latest kernel and seeing if it has fixed the kevent issue with VLAN enabled. There are also other lan78xx fixes in there, so well worth trying to see if the kevent issues has gone.

If you do find it has gone, please close this issue.

JamesH65 commented 6 years ago

@danijelt @jens-maus If no further reports/updates are forthcoming, I'm inclined to close this report.

danijelt commented 6 years ago

We have released a stable production image without VLAN and I'm currently occupied with another project.

I'll keep this in mind during the next update and report here if kevent/VLAN issue persists.

fhunleth commented 6 years ago

I was able to easily reproduce this issue with the VLAN options disabled when running Linux 4.14.29 at commit c117a8bccf37bfba323065b566cf999ed4629a4a. I just tried updating to the latest rpi-4.14-y (4.14.44) at commit 4fca48b7612da3ff5737e27da15b0964bdf4928f, and I can no longer reproduce the kevent 4 issue.

bcutter commented 6 years ago

I also saw this some times on my Pi 3 B+. Currently the Pi is completely dead, on console screen the

"** 1 printk messages dropped ** [ 3808.632569] lan78xx 1-1.1.1:1.0 eth0: kevent 4 may have been dropped"

messages go wild meaning every second about 50 of those lines are shown. Only one option: hard reset of the system. Resulting in fsck errors and so on.

Raspbian Stretch by the way. 1 Gbit/s ethernet port. Anything I can do on this? Will there be a fix? A workaround?

JamesH65 commented 6 years ago

@bcutter Are you using the latest kernel? You can apt update, to get an official release, or rpi-update to get the latest bleeding edge one. You might need rpi-update, but backup first. There have been a number of kernel updates so worth trying.

bcutter commented 6 years ago

Currently on 4.14.34-v7+. Latest official release. I try to avoid using rpi-update, cause of the risk of running into other issues (maybe fix one and maybe grab some others - no perfect chance/risk ratio :-))... maybe my opinion will change temporarily if this kevent4 keeps on appearing.

vnd commented 6 years ago

I'm also experiencing this issue occasionally, 4.14.48-v7+. upd.: 3B+, error message exactly the same as in the first comment. upd2.: mostly happened on boot, but probably not limited to boot only (power turns off occasinally, so can't be sure).

bcutter commented 6 years ago

Updated to 4.14.48-v7+ recently, will monitor future behaviour. But @vnd´s comment doesn´t make me feel positive about this...

JamesH65 commented 6 years ago

Please note, the 3B error reporting kevent 0 is, I think, a different issue - please keep this thread to 3B+.

For those people using 3B+, do you still see the issue after a rpi-update? Report above seems to indicate it has gone away with the latest, but others say no, but I would like to confirm that it is still present, using the latest kernel on the 3B+.

JamesH65 commented 6 years ago

Also, can I just confirm this only happens during startup? Or have people encountered this error in general usage?

bcutter commented 6 years ago

I always experienced this during normal usage a few hours or even days after startup.

fhunleth commented 6 years ago

I'm no longer able to reproduce it, but when I could, it was by running ip link set dev eth0 down and ip link set dev eth0 up repeatedly.

JamesH65 commented 6 years ago

Just some notes from a a quick look at the driver code. kevent 4 is EVENT_LINK_RESET. This is only fired off in two places, first during opening the driver, and secondly when an interrupt is received (via USB, so not a 'direct line') from the PHY. I believe the only time an interrupt is received requires the link to be reset so this makes sense.

The action taken when the kevent is processed is fairly minimal, most work done is in lan78xx_link_reset, which does a bunch of phy/register reads and writes. Nothing I've seen that would take any real amounts of time. Note though that register reads all go via USB, so they could be delayed if the other end is busy.

JamesH65 commented 6 years ago

Been giving the latest kernel some stress testing on a debug enabled lan78xx driver. Running stress-ng --cpu 4, so loading the CPU's right up, then making the ethernet go up and down as quickly as possible (ip link set dev eth0 down/up), whilst monitoring for any kevents. No errors at all after well over an hour of this. Will leave it running for a few more hours.

JamesH65 commented 6 years ago

Cannot get this to go wrong. Run test scripts for hours, no issues. Clearly not exercising the right runes. If anyone seeing this has anything unusual in their networking setup, please let me know here. But for the moment I have to look at some other stuff.

bcutter commented 6 years ago

The only two „special“ things I have are:

1) Second virtual interface, so: eth0 eth0:0 where the second one has a different IP configuration. 2) Various HDDs connected directly or with a hub via USB to the Pi - which is quite different to many other setups e.g. of friends of mine I can compare to (and they don’t see this issue at all)

Hope this helps a bit.

danijelt commented 6 years ago

I have tried with latest 4.14.50 kernel (commit 3b01f05):

[    0.000000] Linux version 4.14.50 (danijel@dev) (gcc version 5.4.0 (Buildroot 2017.05-git-gd5acf12b1)) #1 SMP Wed Jun 20 10:59:49 CEST 2018

Still having the same problem when I enable VLAN support:

[  OK  ] Started Network Manager Script Dispatcher Service.
         Starting Hostname Service...
[  OK  ] Started Hostname Service.
[   17.853152] 8021q: adding VLAN 0 to HW filter on device eth0
[   17.853178] lan78xx 1-1.1.1:1.0 eth0: kevent 4 may have been dropped
[   17.853924] lan78xx 1-1.1.1:1.0 eth0: kevent 4 may have been dropped

My setup: Buildroot 2017.05, systemd 232, NetworkManager 1.8.0. kevent error appears appears after systemd-hostnamed service is started and kernel notifies that VLAN 0 is added.

JamesH65 commented 6 years ago

@danijelt Does this only occur during startup? Does it ever cause any problems or does everything still work OK?

6by9 commented 6 years ago

Configuring the VLAN mask and setting up multicast subscriptions both use worker threads https://github.com/raspberrypi/linux/blob/rpi-4.14.y/drivers/net/usb/lan78xx.c#L1036 https://github.com/raspberrypi/linux/blob/rpi-4.14.y/drivers/net/usb/lan78xx.c#L2325

VLAN takes data that is precomputed, and multicast works out the settings in the worker, so being requested twice before having been called for the first time should have no side effect (minor assumption that the flag is cleared before the worker task is executed). They're independent work queues, so there shouldn't be a conflict between the two. Unless it's indicative of a worker thread having wedged totally, and that is then preventing any other worker thread from running. I don't know enough about how INIT_WORK works when it comes to scheduling.

danijelt commented 6 years ago

@JamesH65 Yes, only during startup. Once it boots (if kevent 4 doesn't happen), I don't have any problems anymore.

I don't do anything special with the network, just bring it up at boot and leave it that way. I'm not even touching anything VLAN related, adding VLAN 0 is probably NetworkManager's (or kernel's) default behaviour.

JamesH65 commented 6 years ago

I suspect that the kevent reports on startup are benign, but if people are getting them after the startup process is complete, that's a bit more worrisome.

bcutter commented 6 years ago

Well, my craziest experience with this error was when the whole system was unresponsive and out of service. Checking the console gave me a few hundred lines every second:

http://bit.ly/2MpaxS8

k-mlynarski commented 6 years ago

I'm experiencing it only at the end of boot process (using the 4.14.50-v7+) on every 3B+ I have. No further occurrences, or any issues so far.

vnd commented 6 years ago

Update: ethernet port died (mii-tool eth0 => "eth0: no link", "link ok" through USB<->Ethernet adapter, but ping and everything else doesn't work even then), so may be it was an early manifestation of hardware problem.

pjarmuzek commented 6 years ago

I'm using ComputeModule 3, with kernel 4.14.54-v7+, and I'm getting that error too:

root@raspberrypi:~# dmesg | grep eth0 [ 6.275493] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready [ 6.275541] lan78xx 1-1.6:1.0 eth0: kevent 4 may have been dropped [ 6.303397] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready

However, have no issues with LAN78xx at all, it works, just everytime showing that message.

bcutter commented 6 years ago

Today I discovered my Pi "crazy dead" again, to be more specific: exactly same console output as shown in https://github.com/raspberrypi/linux/issues/2447#issuecomment-400570396.

I really don´t know what else to do or try on this! I even rpi-update´d a few weeks ago. Something I can check or provide some logs or anything like that? I don´t see any progress (in my case) :-(

pelwell commented 6 years ago

You need to rpi-update again - a fix went in last week for a network driver bug that seems to have resolved many of the crashes and lockups people have been experiencing.

bcutter commented 6 years ago

Thx for that information. I had the release feed of the GitHub project in my RSS reader but accidentally removed it meanwhile so I had no idea there´s a newer version.

According to "uname -a" the Pi 3 B+ is now on Linux raspberrypi 4.14.54-v7+ #1126 SMP Wed Jul 11 20:01:03 BST 2018 armv7l GNU/Linux

Will monitor it closely the next days and weeks.

bcutter commented 6 years ago

Great. Crashed again, but this time with a different response:

"kevent 0 may have been dropped" at the end of the line, repeated every few seconds. System unresponsive but allowed a controlled reboot with STRG + ALT + ENTF.

So i´m lost again...

JamesH65 commented 6 years ago

@bcutter Going to need to know what you are doing on the Pi to provoke the error, how you have it set up, what going on at the time of error etc. The fixes that have gone in seem to have fixed a lot of lan78 errors, but this one is a new one, so presumably something to do wtih your setup.

bcutter commented 6 years ago

I think so too. At the time of error the following happened:

I can´t provoke the error. Changes in configuration next to the rpi-update from yesterday: removed and disabled swap file, today a message "RAM is low, I´m going to kill php", but that was hours before the kevent 0 thing.

My "special" setup: https://github.com/raspberrypi/linux/issues/2447#issuecomment-397648281

JamesH65 commented 6 years ago

Had a quick glance through the code. Kevent 0 appears to be a tx stall, cause by the USB system returning an EPIPE error. This seems to be caused by a failure in the USB system itself, a protocol stall (drivers/usb/core/hcd.c!700 seems a likely culprit). There are a number of ways in that function that coudl cause the stall. All of these completely out of my realm of 'expertise'. We woudl need a reliable way of replicating this in order to get to the bottom of it.

rstanislav commented 6 years ago

Any updates on this ? This affects every 3b+ i have and makes them close to useless.. about every 1/3 or 1/4 cold boots this bug occurs..

JamesH65 commented 6 years ago

Odd, you appears to be one of the few experiencing this error. I pressume you are running the latest Raspbian. Is there anythingunusual about anythingplugged in to the Pi, or your network setup? Do you have another Pi3B+you can test on, to see if it might be a HW problem?

rstanislav commented 6 years ago

I have tested this on ~4 rpi 3b+, all works the same, OS is not raspbian, i'm using this project(and discussion of this problem): https://github.com/Ysurac/openmptcprouter/issues/128#issuecomment-418644313

But that project using current version of rpi kernel with all patches in this thread applied. And strange thing is - i have compiled custom kernel with VLAN support disabled, and problem is still exist.

JamesH65 commented 6 years ago

Would you be able to test with Raspbian? Since most people do not appear to see this issue, it woudl be useful to know if its OS specific.

akaher commented 6 years ago

Getting this issue and eth0 is not working, using 4.14.54 Linux kernel:

[ 10.101247] lan78xx 1-1.1.1:1.0 (unnamed net_device) (uninitialized): No External EEPROM. Setting MAC Speed [ 10.124946] libphy: lan78xx-mdiobus: probed [ 10.204327] usbcore: registered new interface driver lan78xx [ 10.388623] lan78xx 1-1.1.1:1.0 eth0: No External EEPROM. Setting MAC Speed [ 10.390004] lan78xx 1-1.1.1:1.0 eth0: kevent 4 may have been dropped

rstanislav commented 6 years ago

I don't think its OS specific because openmptcp using 4.14 kernel and has all patches related to lan78xx (https://github.com/Ysurac/openmptcprouter/tree/master/root/target/linux/brcm2708/patches-4.14) also this happens when system if not fully loaded and only if LAN interface connected, without LAN cable it boots without any problems every time.

Also this may be related: this not happens if RPi have no USB devices connected(not many people using 4 modems and that's why this issue not happening for them), and happens more frequently the more devices connected - in my case i'm testing almost every time with 4 LTE USB modems connected (VBUS USB power is cut and modems have their own power supply (with common ground with rpi ofcourse), so its not power problem.

Is there any way to debug this ? i also have uart connection to rpi for console

JamesH65 commented 6 years ago

It would still be worth testing with raspbian since it might be some userspace setup. Also, you may be taxing the USB system quite heavily. Does everything work without the USB devices plugged in?

jens-maus commented 6 years ago

As I am the originator of this ticket, I can say that since I updated to recent kernel+firmware combinations (using 4.14.62 lately) I haven't seen these kevent outputs anymore and everything seems to work flawlessly in my environments (buildroot-based Linux environment).

rstanislav commented 6 years ago

Strange thing is that this happens only at boot, if system loaded it works without any problem for hours, sometimes it can say something about kevent 4 - but like 1 or 2 messages during work (installed in car, so average work time is 2-5 hours between full reboots) so with very high load on usb subsystem (4 modems with mptcp pushing 90-92 mb/sec thru ethernet(that is also on usb chip in rpi from what i know) connected wifi router (using it only for wifi))(wi fi router connected via 100mb link (4 wires)) - i never saw this problem during work, only at boot time.

JamesH65 commented 6 years ago

There's a lot of work going on during boot, wihch is more likely to cause a kevent to be dropped (starved of CPU, a job is added to the system queue, but doesn't get handled before another of same type is added). Perhaps we need to have a retry in the driver...hmm. If I have time I'll take a look.

pelwell commented 6 years ago

These messages, although annoying, should be harmless most of the time - they just indicate that an operation (which ought to be idempotent) is being requested for a second time before the system has managed to handle it the first time. If you are seeing a stream of them and the interface isn't working then the message is likely to be a symptom of a wider problem rather than the cause.