raspberrypi / linux

Kernel source tree for Raspberry Pi-provided kernel builds. Issues unrelated to the linux kernel should be posted on the community forum at https://forums.raspberrypi.com/

[Pi 4B 8GB] When EEE (Energy Efficient Ethernet) is enabled and active on eth0 at gigabit speed, ethernet becomes slow to unresponsive with packets dropping #4289

Open Photopuppet opened 3 years ago

Photopuppet commented 3 years ago

Summary When Energy Efficient Ethernet is enabled and active on a new install of Raspberry Pi OS on a Pi 4B 8GB, gigabit ethernet transfer with the built-in ethernet adapter becomes slow and unresponsive, with packets being dropped. Other computers supporting EEE plugged into the same network behave normally, even when tested on the same port the Pi was connected to. When the Pi is connected with EEE enabled, it also causes some odd behaviour on other devices connected to the same switch, such as making the other device drop packets and/or reconnect constantly. Removing the Pi from the network returns the switch and the other device to normal.

To reproduce When a Pi 4B 8GB is connected to a gigabit switch with default EEE behaviour, the network performance of the Pi deteriorates as above.

Status from ethtool --show-eee eth0:

EEE Settings for eth0:
    EEE status: enabled - active
    Tx LPI: inactive
    Supported EEE link modes:  100baseT/Full
                               1000baseT/Full
    Advertised EEE link modes:  100baseT/Full
                                1000baseT/Full
    Link partner advertised EEE link modes:  100baseT/Full
                                             1000baseT/Full

Temporary fix As soon as EEE is disabled with 'ethtool --set-eee eth0 eee off', ethernet performance returns to normal at gigabit speeds.
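For anyone who wants to check the EEE state from a script rather than by eye, here is a minimal sketch that reads the text printed by ethtool --show-eee. It is not from the thread: the function name parse_eee_status and the returned keys are made up for illustration, and it assumes the output layout shown above.

```python
def parse_eee_status(ethtool_output):
    """Pull the 'EEE status' and 'Tx LPI' fields out of `ethtool --show-eee` text."""
    fields = {}
    for line in ethtool_output.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep and key in ("EEE status", "Tx LPI"):
            fields[key] = value.strip()
    status = fields.get("EEE status", "")
    return {
        # "enabled - active" and "enabled - inactive" both count as enabled
        "enabled": status.startswith("enabled"),
        # only the exact "- active" suffix means EEE was negotiated with the link partner
        "active": status.endswith("- active"),
        "tx_lpi": fields.get("Tx LPI"),
    }


# Sample text copied from the report above.
SAMPLE = """\
EEE Settings for eth0:
    EEE status: enabled - active
    Tx LPI: inactive
"""

if __name__ == "__main__":
    print(parse_eee_status(SAMPLE))
```

On a real system the input text would come from running ethtool --show-eee eth0, for example through subprocess.run with capture_output=True.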

Logs I am unsure which logs would be necessary to help diagnose the problem... please let me know and I will be happy to add them!

Equipment in use: Home network with BT Business Hub in use as modem and switch + Powerline ethernet with built in switch (BT Mini Connector)

A relevant forum thread: https://www.raspberrypi.org/forums/viewtopic.php?t=305820

ganzgustav22 commented 3 years ago

I've read here: https://www.raspberrypi.org/forums/viewtopic.php?t=305820 that it cannot simply be disabled via the ethtool command because it gets re-enabled later for whatever reason. One user fixed this with a script running in a constant loop that disables it again and again once a second, which I'd like to avoid.

Another user there asked: "Does anyone know if dtparam=eee=off still works if in config.txt?"

Could somebody here tell me whether that still works with the Pi4 (I'd test it myself, but I don't have an EEE-enabled switch)? That would be a nice workaround without messy scripts etc.

Photopuppet commented 3 years ago

I have put the entry into the config.txt and will update when a reboot comes to see if it sets the EEE off. :)

neggles commented 3 years ago

Can confirm dtparam=eee=off does not work on a Pi CM4, which makes sense since it's a Pi 3B+ specific thing according to the README (likely flips some stuff in the ethernet+USB chip the 3B+ uses)

Do we know what keeps turning EEE back on?

Wenri commented 1 year ago

I think #3292 is facing the same issue.

TheLevti commented 1 year ago

Have the same issue. I had to replace my old modem, which had a 100mb interface, with a new one that has a 1gb link. Since then I have been facing a constant 10%-20% packet loss. Disabling eee solved it for me as well. It seems very much like an issue with the rpi4. Any solution? eee is quite an old and robust tech; it can't be that we have to disable this nowadays.

pelwell commented 1 year ago

EEE is a hardware feature, and other than disabling it we have no control over how it works. It can be easily disabled with dtparam=eee=off.

Wenri commented 1 year ago

It seems that the dtparam=eee=off hack only works for the Pi 3B+. Disabling EEE via ethtool --set-eee eth0 eee off is the proper way on Pi 4 and CM4. This issue is widely reported (1,2,3,4), and has been confirmed with multiple reputable routers/switches with EEE enabled (in #3292). Should we disable EEE for Pi 4 by default?

pelwell commented 1 year ago

> Should we disable EEE for Pi 4 by default?

No, because for the vast majority of users it saves power and causes no problems.

TheLevti commented 1 year ago

> > Should we disable EEE for Pi 4 by default?
>
> No, because for the vast majority of users it saves power and causes no problems.

Do you have proof of that claim that it causes no problems? It seems like every rpi4 is affected and most just don't care to notice/do not notice. It becomes especially noticeable once you connect to a 1gb ethernet interface.

TheLevti commented 1 year ago

I have not seen any comment yet where someone says it works for him. If someone cares about power consumption and can handle constant connection losses, he can enable this feature. But as it is now, this is causing more trouble than what energy can be saved.

pelwell commented 1 year ago

> Do you have proof of that claim that it causes no problems?

Only my own experience, and that of other RPi engineers. Do you think we are using 100Mbps Ethernet, either in the office or at home?

> I have not seen any comment yet where someone says it works for him.

People don't bother to open or reply to issues for things that work for them. Nice as that would be, this area is for problem reports. Given the age of this thread and the number of contributors, it isn't as widespread a problem as you seem to think.

pelwell commented 1 year ago

> If someone cares about power consumption and can handle constant connection losses, he can enable this feature.

The main consideration behind enabling EEE by default is the enormous number of Pis that are not and will never be connected by Ethernet. If the feature was opt-in, it would almost never be enabled.

TheLevti commented 1 year ago

> > If someone cares about power consumption and can handle constant connection losses, he can enable this feature.
>
> The main consideration behind enabling EEE by default is the enormous number of Pis that are not and will never be connected by Ethernet. If the feature was opt-in, it would almost never be enabled.

This answer was about the claim that it works for the rest. It seems like it does not work at all for any RPI4, and we do not need to wait for someone to reply that it works. Just do a simple test, pick a random rpi4, connect it to a 1gbs interface and see for yourself.

By ignoring this hardware/design flaw, you basically knowingly let everyone suffer. In this state the RPI4 is useless for anything more than homebrew experiments, unless you are that poor guy who spends months debugging and eventually finds out that it's this bug. Can this at least be documented somewhere, if not fixed by disabling EEE by default?

Question about:

> The main consideration behind enabling EEE by default is the enormous number of Pis that are not and will never be connected by Ethernet. If the feature was opt-in, it would almost never be enabled.

Does the ethernet port consume any power at all when it's not used/not connected, such that EEE would even be relevant?

> Given the age of this thread and the number of contributors, it isn't as widespread a problem as you seem to think.

Unfortunately this is not the only place where people complain. The web is full of reports of this bug.

JamesH65 commented 1 year ago

I've got a Pi4 here, that has been connected to our internal 1GB/s network, for the last 14 days. 674 RX errors, 0 TX errors.

So, I have effectively done your experiment, and it clearly works (674 RX errors from 5368536 packets is insignificant)
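To put those counters in perspective, the arithmetic can be written out as a short sketch. The helper name error_rate is made up for illustration; the two numbers are the ones quoted in the comment above.

```python
def error_rate(errors, packets):
    """Return errors as a fraction of total packets."""
    if packets <= 0:
        raise ValueError("packets must be positive")
    return errors / packets


# Figures quoted above: 674 RX errors out of 5368536 packets.
rate = error_rate(674, 5368536)

if __name__ == "__main__":
    print(f"RX error rate: {rate:.4%}")
```

That works out to roughly 0.013%, around three orders of magnitude below the 10-20% packet loss reported earlier in this thread. Note that this only measures errors the interface actually counted.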

TheLevti commented 1 year ago

> I've got a Pi4 here, that has been connected to our internal 1GB/s network, for the last 14 days. 674 RX errors, 0 TX errors.
>
> So, I have effectively done your experiment, and it clearly works (674 RX errors from 5368536 packets is insignificant)

Is 1gb also enabled? Can you post a terminal output of it please (ethtool --show-eee eth0)?

6by9 commented 1 year ago

My CM4 running Raspberry Pi OS connected to a Netgear GS205

pi@raspberrypi:~ $ ethtool --show-eee eth0
EEE settings for eth0:
    EEE status: enabled - active
    Tx LPI: disabled
    Supported EEE link modes:  100baseT/Full
                               1000baseT/Full
    Advertised EEE link modes:  100baseT/Full
                                1000baseT/Full
    Link partner advertised EEE link modes:  100baseT/Full
                                             1000baseT/Full

Pi4 running Ubuntu on the same switch:

pi@pi:~$ ethtool --show-eee eth0
EEE settings for eth0:
    EEE status: enabled - active
    Tx LPI: disabled
    Supported EEE link modes:  100baseT/Full
                               1000baseT/Full
    Advertised EEE link modes:  100baseT/Full
                                1000baseT/Full
    Link partner advertised EEE link modes:  100baseT/Full
                                             1000baseT/Full

Both are connected at 1000Mb/s full duplex.

JamesH65 commented 1 year ago

> > I've got a Pi4 here, that has been connected to our internal 1GB/s network, for the last 14 days. 674 RX errors, 0 TX errors. So, I have effectively done your experiment, and it clearly works (674 RX errors from 5368536 packets is insignificant)
>
> Is 1gb also enabled? Can you post a terminal output of it please (ethtool --show-eee eth0)?

Same result as 6by9's post.

TheLevti commented 1 year ago

If true, that's an interesting outcome. I would assume that if it's a hardware design flaw it would affect all devices, and it does not look like some physical breaking issue. Any idea where this issue might be coming from? Firmware? Does the interaction between both link partners play a role in proper EEE usage?

I would need to try my rpi4 on a different router/modem and see if this issue always appears, to rule out that the other side plays a role here. But because every other device on my router/modem works fine with eee and 1gbs, I would say it's on the rpi4's side.

pelwell commented 1 year ago

There is clearly an incompatibility between some implementations of EEE - if you have an incompatible switch (note, I didn't say non-compliant or broken - compatibility requires both sides to get along) then it will probably fail 100% of the time, but conversely if you have a compatible switch it will be hard to understand what the fuss is about - it just works.

We have never claimed that the EEE implementation found on Pis works with all switches, but as I said above I don't think the non-working combinations are as common as you think.

Wenri commented 1 year ago

> My CM4 running Raspberry Pi OS connected to a Netgear GS205

That is very interesting. I happened to test a Pi 4 (Raspberry Pi 4 Model B Rev 1.2, 4GB) against a Netgear GS108Ev3 several days ago and saw the packet loss issue, with ping loss of about 10-15%. I was not expecting much difference between the GS205 and the GS108Ev3 apart from the plastic or metal enclosure.

> So, I have effectively done your experiment, and it clearly works (674 RX errors from 5368536 packets is insignificant)

Since the issue is caused by packet loss rather than receive errors, I think the RX error count does not reflect the number of packets actually lost. Some packets may be lost before they ever reach the driver, so they never increment the RX error counter. Can you do a ping test from another machine on the same network and let it run for about 5-10 mins to see if you get any packet loss? In this test, the Pi needs to be connected to an EEE-enabled switch (ethtool shows EEE status: enabled - active).
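The ping test suggested above can be scored automatically. Below is a minimal sketch, not something posted in the thread, that extracts the loss percentage from the summary line printed by the common Linux iputils ping. The function name packet_loss_percent and the sample numbers are illustrative only, and other ping implementations may word the summary differently.

```python
import re


def packet_loss_percent(ping_output):
    """Compute packet loss from a ping summary such as
    '600 packets transmitted, 528 received, ...'."""
    match = re.search(
        r"(\d+) packets transmitted, (\d+) (?:packets )?received", ping_output
    )
    if match is None:
        raise ValueError("no ping summary line found")
    sent, received = int(match.group(1)), int(match.group(2))
    if sent == 0:
        raise ValueError("no packets were transmitted")
    return 100.0 * (sent - received) / sent


# Made-up summary for illustration, not a measurement from this thread.
EXAMPLE = "600 packets transmitted, 528 received, 12% packet loss, time 599870ms"

if __name__ == "__main__":
    print(packet_loss_percent(EXAMPLE))
```

Running the test from a second machine towards the Pi, as proposed above, measures loss on the link itself instead of relying on the Pi's own interface counters.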

> People don't bother to open or reply to issues for things that work for them.

Note that the SSH connection is usually ok except for some random lags when the packet is lost during typing and echo. The same for other TCP applications. So I think maybe many people will not notice the problem.

TheLevti commented 1 year ago

> Note that the SSH connection is usually ok except for some random lags when the packet is lost during typing and echo. The same for other TCP applications. So I think maybe many people will not notice the problem.

That is exactly what I experienced with my previous modem, which had a 100mb interface; I guess it's less noticeable then. Even with the 1gbs connection I am able to use ssh, and the application runs okayish. The issue was that I had random connection losses, timeouts etc., and after switching to the new modem (1gbs) it got worse: my containers constantly crashed and had to restart all day.

Since I disabled eee, ssh runs very smooth and so far no connectivity issues.

I have the Pi4B 8gb, running multiple high traffic applications on it.

6by9 commented 1 year ago

FWIW, at home with a Pi4 running TVHeadend running Buster (5.10.63 kernel), connected to a Netgear GS750E

pi@raspberrypi:~ $ uptime
 18:23:18 up 5 days, 23:02,  1 user,  load average: 0.10, 0.12, 0.09
pi@raspberrypi:~ $ ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.2.250  netmask 255.255.255.0  broadcast 192.168.2.255
        inet6 fe80::cfb5:7c6c:486d:ea29  prefixlen 64  scopeid 0x20<link>
        ether dc:a6:32:xx:xx:xx  txqueuelen 1000  (Ethernet)
        RX packets 1982298  bytes 529189652 (504.6 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 934646  bytes 1217709716 (1.1 GiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 67166  bytes 3379900 (3.2 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 67166  bytes 3379900 (3.2 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

wlan0: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        ether dc:a6:32:00:a9:4a  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

pi@raspberrypi:~ $ ethtool --show-eee eth0
EEE Settings for eth0:
    EEE status: enabled - active
    Tx LPI: disabled
    Supported EEE link modes:  100baseT/Full 
                               1000baseT/Full 
    Advertised EEE link modes:  100baseT/Full 
                                1000baseT/Full 
    Link partner advertised EEE link modes:  100baseT/Full 
                                             1000baseT/Full 

Copying the script from https://forums.raspberrypi.com/viewtopic.php?p=1830016#p1830016 to ping my router (on another port of the GS750E, but one that is running multiple VLANs), I get no dropped packets.

JamesH65 commented 1 year ago

> Since the issue is caused by packet loss rather than receive errors, I think the RX error count does not reflect the number of packets actually lost. Some packets may be lost before they ever reach the driver, so they never increment the RX error counter. Can you do a ping test from another machine on the same network and let it run for about 5-10 mins to see if you get any packet loss? In this test, the Pi needs to be connected to an EEE-enabled switch (ethtool shows EEE status: enabled - active).

So, did a 10 or so minute run pinging from Pi 4 to a Pi 3. 722 packets sent, no packet loss. EEE enabled as before.

neggles commented 1 year ago

This issue is not specific to the Pi 4. It can, and does, cause problems on just about any device. I work for an MSP managing several thousand devices and we've seen EEE issues with NICs from every major vendor. Realtek seem to be the worst for it, but some Intel/Broadcom NICs are just as bad, and it varies between individual units of the same model.

EEE is known to be a common cause of weird intermittent packet loss across most of the industry. The switch that's in use makes a much bigger difference than the NIC/device. Anecdotally, we have the most problems with $20 unmanaged D-Link/Netgear switches (people like to buy their own rather than just ask for one, for some reason) and the least trouble with big shiny expensive enterprise managed ones, but it's really hit-and-miss.

As an example, the Aruba 2530 8+2-port PoE managed switch has an EEE implementation that's absolutely garbage - all of the ~6 Pi4/CM4s I have exhibit this problem when connected to it with EEE enabled, along with quite a few other things - but the higher-port-count models in the same lineup are fine 🤷

Point being, while the Pi 4 does seem to have a somewhat higher incidence of EEE problems than the norm, it's not particularly out of the ordinary, but is more of a reflection on the EEE standard itself having compatibility problems in general than a problem with the Pi 4 specifically.

I don't think there's any real argument for disabling EEE by default - it works on the vast majority of switches, for the vast majority of people, the vast majority of the time, and it's not hard at all to find the solution once you search the web for 'pi ethernet packet loss', 'pi network dropout', etc. if it does happen to affect you.

It would maybe be nice to have a bundled-in systemd unit (disabled by default) to execute the ethtool command at startup, but it's trivial to make one yourself once you know what the issue is; as far as I can tell no major Linux distro has such a thing, implying that it's not common enough of an issue to be worth creating/packaging, which kind of says it all.

TheLevti commented 1 year ago

Thank you guys for all the effort with testing and explaining in detail. Especially the last answer was quite helpful to understand the whole picture with eee.

Wenri commented 1 year ago

Thanks a lot to @6by9 and @JamesH65 for the prompt test results. That is solid proof that EEE on the Pi 4 works well with other switches. Meanwhile, many thanks to @neggles for giving insights into how common EEE issues are across the industry. It's true that disabling EEE by default is not necessary given that EEE works for most people.

> It would maybe be nice to have a bundled-in systemd unit (disabled by default) to execute the ethtool command at startup, but it's trivial to make one yourself once you know what the issue is;

Yes, I also agree that having bundled-in scripts to execute the ethtool command would be handy, especially since I found that doing this correctly is not that trivial. The ethtool --set-eee eth0 eee off command fails if the interface has not been brought up, so running this command in rc.local or a systemd unit may have no effect. On the other hand, successfully running the ethtool command forces renegotiation of ethernet modes, which interrupts all network connections for several seconds. So one might want to have this done as early as possible during boot.

neggles commented 1 year ago

> Yes, I also agree that having bundled-in scripts to execute the ethtool command would be handy, especially since I found that doing this correctly is not that trivial. The ethtool --set-eee eth0 eee off command fails if the interface has not been brought up, so running this command in rc.local or a systemd unit may have no effect. On the other hand, successfully running the ethtool command forces renegotiation of ethernet modes, which interrupts all network connections for several seconds. So one might want to have this done as early as possible during boot.

I can help with that; save this as /etc/systemd/system/disable-eee@.service:

[Unit]
Description=Disable EEE on %i on startup
Wants=network.target network-online.target
After=network-online.target

[Service]
Type=simple
RemainAfterExit=true
# Uncomment the below to do a slightly hacky check for whether the link is up.
#ExecStartPre=/bin/bash -c '[ $(cat /sys/class/net/%i/carrier) == "1" ]'
ExecStart=/sbin/ethtool --set-eee %i eee off
# If we get an unclean exit code, retry
Restart=on-failure
# Wait 30s before retrying
RestartSec=30s

[Install]
WantedBy=multi-user.target

then run sudo systemctl daemon-reload && sudo systemctl enable --now disable-eee@eth0.service - optionally you can omit the @ from the unit file name and hardcode the interface name by replacing %i with eth0/en<blah> depending on what your distro calls it.

This will attempt to set EEE off once the network has come online, and will try again every 30s until it succeeds. Optionally, if you uncomment ExecStartPre it will check whether the interface is up before attempting to set the EEE state - this is what I use on my own Pis (and other misbehaving devices), and so far I've found it to be reliable.
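The same "check the carrier first, then disable EEE" logic can also be expressed as a small script. The following is only a sketch under stated assumptions, not part of the unit above: the helper names link_has_carrier and disable_eee_command are made up, and the sysfs_root parameter exists purely so the check can be exercised without real hardware.

```python
from pathlib import Path


def link_has_carrier(iface, sysfs_root="/sys/class/net"):
    """Return True if <sysfs_root>/<iface>/carrier reads '1' (link is up)."""
    carrier = Path(sysfs_root) / iface / "carrier"
    try:
        return carrier.read_text().strip() == "1"
    except OSError:
        # The file is missing for unknown interfaces, and reading it
        # fails while the interface is administratively down.
        return False


def disable_eee_command(iface):
    """Build the same ethtool invocation the unit's ExecStart line uses."""
    return ["/sbin/ethtool", "--set-eee", iface, "eee", "off"]


if __name__ == "__main__":
    if link_has_carrier("eth0"):
        print("would run:", " ".join(disable_eee_command("eth0")))
    else:
        print("eth0 has no carrier yet; retry later")
```

To actually apply the setting, the command list could be passed to subprocess.run with check=True, run as root; the retry behaviour would then have to be supplied by the caller, which the unit above gets from Restart=on-failure.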

The other option would be to write up a patch for the bcmgenet driver adding an eee parameter, so that passing bcmgenet.eee=0 in the kernel command line would disable EEE at initialization; the igb driver used to have such a parameter, but it seems to have since been removed, so upstream might not be very willing to take such a patch 🤷

pelwell commented 1 year ago

I've been looking at the bcmgenet driver and its EEE support. It seems that currently EEE is not explicitly enabled, and yet ethtool thinks/knows it is enabled. I also found that I can't re-enable EEE once it has been disabled - there is a check that the PHY supports EEE advertising(?), and the check fails.

pelwell commented 1 year ago

If I were to create a Pull Request with a patch to the bcmgenet driver, would anyone here who is experiencing EEE problems be happy to apply the patch and build their own kernel to test?

pelwell commented 1 year ago

There's a PR - #5277 - that adds a module parameter and a dtparam (which just sets the module parameter). Either add genet.eee=N to /boot/cmdline.txt or add dtparam=eee=off to config.txt.

pelwell commented 1 year ago

Just a note to say that the current beta kernel (installable with sudo rpi-update) includes Pi 4 support for the eee dtparam. After installing the new kernel, add dtparam=eee=off to /boot/config.txt (or genet.eee=N to /boot/cmdline.txt) and reboot.
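To double-check that the setting really is present, a quick sketch like the one below can scan the text of config.txt. It is an illustration only: the function name eee_disabled_in_config is made up, and it assumes that a later dtparam assignment overrides an earlier one, that several parameters may share one comma-separated dtparam line, and it ignores conditional section filters such as [pi4].

```python
def eee_disabled_in_config(config_text):
    """Return True if the last eee dtparam in the text is set to 'off'."""
    disabled = False
    for raw in config_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line.startswith("dtparam="):
            continue
        for setting in line[len("dtparam="):].split(","):
            name, _, value = setting.partition("=")
            if name.strip() == "eee":
                disabled = value.strip() == "off"
    return disabled


if __name__ == "__main__":
    print(eee_disabled_in_config("dtparam=audio=on\ndtparam=eee=off\n"))
```

After the reboot, ethtool --show-eee eth0 remains the authoritative check of whether EEE is actually off.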

djmaze commented 1 year ago

@pelwell Thanks for the PR. It took quite some time for me to realize that our Pi 4 (in conjunction with our switch) completely took down our entire network every now and then. Even then I did not immediately find this valuable information. Disabling the eee option immediately improved the responsiveness of the Pi in our network (and will hopefully prevent the network downtimes).

I would like to propose adding a prominent notice somewhere in the official documentation, with a hint that there might be ethernet problems and how to fix it. So it will hopefully be easier for other people to find out about this.

pelwell commented 1 year ago

If you have a preferred location and form of wording then you can submit it here: https://github.com/raspberrypi/documentation/pulls or here: https://github.com/raspberrypi/documentation/issues

chris-deluca commented 1 year ago

I'm seeing the behavior described by PhotoPuppet and commented on by pelwell and neggles (and others).

I adjusted the proposed setting and have started to see some more stable behavior - but it is still too soon to be 100% sure it is completely fixed. The attached chart shows the network periodically going up and down - I have 2 different forms of monitoring one internal and one external - while the "internal" chart collection process (shown) fails periodically (see dips), the "external" collection process (not provided) shows no such drop-offs.

The sites being monitored are 3 separate servers hosting 8 different resources. All resources are docker containers. The ONLY change made was to ssh into my pi (running Ubuntu 22.04.1) and execute the following command: ethtool --set-eee eth0 eee off - I didn't bounce containers or do any type of resetting. Also interesting is the "scrape time" shown in the 3rd chart reduced from almost 20 seconds to near 0.

As I said, this isn't conclusive proof as it has only been 1.5 hours since the change was made, but it seems promising.

[Image: Xnip2023-01-30_10-24-17]

Edit: And here is the chart with more time. This definitely made a significant difference.

[Image: Xnip2023-01-31_09-24-33]

turboboost55 commented 1 year ago

I am affected by this issue as well. I would like to try the dtparam=eee=off line in config.txt, but how do I tell if I am running the kernel version that supports that? What version is it available in?

pelwell commented 1 year ago

Any kernel released since December 15th 2022.

turboboost55 commented 1 year ago

@pelwell Thanks!

freddyrios commented 1 year ago

It seems the issue only affects a subset of the devices (CM4 in our case). How severe it is also seems to depend on the device.

I don't have clear stats on it, but observationally it seems to be between 15 and 30%. It is also unclear what triggers it, as a device can work fine in early testing and later exhibit the issue. When unplugging/plugging cables, the issue sometimes stops reproducing.

In other words, the issue can be very hard to track down on some devices and in some environments if one does not already know what to look for. Additionally, a device can easily pass early testing and, in worse cases (ours), the issue is only discovered later on, when some devices exhibit "inexplicable network instability".

freddyrios commented 1 year ago

In another thread, this comment claims one can also keep eee disabled by applying it in /etc/network/interfaces.d/eth0

https://github.com/raspberrypi/linux/issues/3292#issuecomment-777167803.

It seems like a simpler workaround, any caveats?

qrp73 commented 1 month ago

It appears that this issue is related to GENET PHY EEE feature: https://github.com/raspberrypi/linux/issues/6327

In short: with a direct device-to-device 100M ethernet connection using a patch cable instead of a crossover cable, enabled EEE leads to random unexpected link failures and packet drops. Disabling EEE in config.txt solved this issue.