raspberrypi / linux

Kernel source tree for Raspberry Pi-provided kernel builds. Issues unrelated to the linux kernel should be posted on the community forum at https://forums.raspberrypi.com/

unstable ethernet connection (unexpected breaks) between two Raspberry Pi boards (unexpected link down/up events) #6327

Closed qrp73 closed 2 months ago

qrp73 commented 2 months ago

Describe the bug

When I connect a Raspberry Pi 4 to a Raspberry Pi Zero 2W with an Ethernet cable, the connection is unstable. dmesg shows unexpected "Link is Down" and "Link is Up" events at random times.

When I connect the Raspberry Pi 4 or the Raspberry Pi Zero 2W to another Linux device with the same network config, there are no connection breaks.

Also, when I connect the Raspberry Pi 4 to the Raspberry Pi Zero 2W through a router, the issue disappears. The issue happens only when the Raspberry Pi 4 is connected directly to the Raspberry Pi Zero 2W with an Ethernet cable.

On RPI Zero 2W I'm using this USB-ethernet adapter with Realtek RTL8152 chipset: https://www.aliexpress.com/item/1005006625608264.html

Steps to reproduce the behaviour

1) Install the latest Raspberry Pi OS Bookworm aarch64 on RPI4 and on RPI02W

2) Configure eth0 for NetworkManager on both machines:

sudo nano /etc/NetworkManager/system-connections/WIRE.nmconnection

[connection]
id=WIRE
uuid=caf51b2f-12be-ca89-c36c-bc33e773cd7f
type=ethernet
autoconnect-priority=-999
interface-name=eth0

[ipv4]
method=manual
address1=192.168.1.10/24
dns=1.1.1.1;8.8.8.8;

ignore-auto-dns=true
dhcp-send-hostname=true

[ipv6]
addr-gen-mode=stable-privacy
method=disabled

sudo chmod 600 /etc/NetworkManager/system-connections/WIRE.nmconnection

Use address1=192.168.1.10/24 for RPI4 and address1=192.168.1.11/24 for RPI02W.

Note: the gateway= parameter is omitted from the [ipv4] section, because we are using a direct connection with no gateway.

3) Connect RPI4 to RPI02W directly with ethernet cable

4) Reboot both machines

5) Login to RPI4, make sure eth0 is up.

Note: if eth0 is not active, make sure that eth0 is managed by NetworkManager and that there are no eth0 configs in /etc/network/interfaces.d/. Check this on both RPI4 and RPI02W.

6) Run dmesg -w in a terminal on RPI4 to monitor the log for Link Up/Down events

7) Use a second terminal on RPI4 to open an ssh connection to RPI02W: ssh pi@192.168.1.11

8) Wait for some time or run btop in the RPI02W ssh session

Expected result: connection remains uninterrupted

Actual result: the connection breaks at random times, with "Link is Down" and "Link is Up - 1Gbps/Full - flow control rx/tx" messages in dmesg on RPI4 and "r8152 1-1.3:1.0 eth0: carrier off" and "r8152 1-1.3:1.0 eth0: carrier on" on RPI02W, and is then established again. The breaks repeat in a loop with a random period of about 1-30 seconds.

Note: the connection breaks for no apparent reason, at random times. It happens even when I am not pressing anything on the keyboard.

Device(s)

Raspberry Pi Zero 2 W, Raspberry Pi 4 Mod. B

System

RPI4:

$ cat /etc/rpi-issue && vcgencmd version && uname -a
Raspberry Pi reference 2023-09-22
Generated using pi-gen, https://github.com/RPi-Distro/pi-gen, 40f37458ae7cadea1aec913ae10b5e7008ebce0a, stage4
May 24 2024 15:30:04 
Copyright (c) 2012 Broadcom
version 4942b7633c0ff1af1ee95a51a33b56a9dae47529 (clean) (release) (start)
Linux raspi 6.6.32-v8+ #1 SMP PREEMPT Fri Jun  7 16:39:58 UTC 2024 aarch64 GNU/Linux

RPI02W:

$ cat /etc/rpi-issue && vcgencmd version && uname -a
Raspberry Pi reference 2024-03-15
Generated using pi-gen, https://github.com/RPi-Distro/pi-gen, f19ee211ddafcae300827f953d143de92a5c6624, stage2
May 24 2024 15:31:28 
Copyright (c) 2012 Broadcom
version 4942b7633c0ff1af1ee95a51a33b56a9dae47529 (clean) (release) (start)
Linux rpi02w 6.6.31+rpt-rpi-v8 #1 SMP PREEMPT Debian 1:6.6.31-1+rpt1 (2024-05-29) aarch64 GNU/Linux

Logs

RPI4:

[   10.730409] alsactl[772]: memfd_create() called without MFD_EXEC or MFD_NOEXEC_SEAL set
[   11.487677] bcmgenet fd580000.ethernet: configuring instance for external RGMII (RX delay)
[   11.490229] bcmgenet fd580000.ethernet eth0: Link is Down
[   15.583434] bcmgenet fd580000.ethernet eth0: Link is Up - 1Gbps/Full - flow control rx/tx
[   80.095247] bcmgenet fd580000.ethernet eth0: Link is Down
[   84.191661] bcmgenet fd580000.ethernet eth0: Link is Up - 100Mbps/Full - flow control rx/tx
[  113.887179] bcmgenet fd580000.ethernet eth0: Link is Down
[  116.963537] bcmgenet fd580000.ethernet eth0: Link is Up - 100Mbps/Full - flow control rx/tx
[  131.295276] bcmgenet fd580000.ethernet eth0: Link is Down
[  133.344603] bcmgenet fd580000.ethernet eth0: Link is Up - 100Mbps/Full - flow control rx/tx
[  154.847213] bcmgenet fd580000.ethernet eth0: Link is Down
[  157.919640] bcmgenet fd580000.ethernet eth0: Link is Up - 100Mbps/Full - flow control rx/tx
[  162.015186] bcmgenet fd580000.ethernet eth0: Link is Down
[  165.087668] bcmgenet fd580000.ethernet eth0: Link is Up - 100Mbps/Full - flow control rx/tx
[  168.159185] bcmgenet fd580000.ethernet eth0: Link is Down
[  171.231376] bcmgenet fd580000.ethernet eth0: Link is Up - 100Mbps/Full - flow control rx/tx
[  237.791280] bcmgenet fd580000.ethernet eth0: Link is Down
[  240.864777] bcmgenet fd580000.ethernet eth0: Link is Up - 100Mbps/Full - flow control rx/tx
[  276.703314] bcmgenet fd580000.ethernet eth0: Link is Down
[  279.775314] bcmgenet fd580000.ethernet eth0: Link is Up - 100Mbps/Full - flow control rx/tx
[  286.943175] bcmgenet fd580000.ethernet eth0: Link is Down
[  290.016901] bcmgenet fd580000.ethernet eth0: Link is Up - 100Mbps/Full - flow control rx/tx
[  301.279258] bcmgenet fd580000.ethernet eth0: Link is Down
[  303.327520] bcmgenet fd580000.ethernet eth0: Link is Up - 100Mbps/Full - flow control rx/tx
[  306.399182] bcmgenet fd580000.ethernet eth0: Link is Down
[  309.471710] bcmgenet fd580000.ethernet eth0: Link is Up - 100Mbps/Full - flow control rx/tx
[  359.647353] bcmgenet fd580000.ethernet eth0: Link is Down
[  361.695565] bcmgenet fd580000.ethernet eth0: Link is Up - 100Mbps/Full - flow control rx/tx
[  383.199260] bcmgenet fd580000.ethernet eth0: Link is Down
[  386.272879] bcmgenet fd580000.ethernet eth0: Link is Up - 100Mbps/Full - flow control rx/tx
[  403.679285] bcmgenet fd580000.ethernet eth0: Link is Down
[  406.752828] bcmgenet fd580000.ethernet eth0: Link is Up - 100Mbps/Full - flow control rx/tx
[  452.831262] bcmgenet fd580000.ethernet eth0: Link is Down
[  454.880808] bcmgenet fd580000.ethernet eth0: Link is Up - 100Mbps/Full - flow control rx/tx
...

RPI02W

[   16.121579] brcmfmac: brcmf_cfg80211_set_power_mgmt: power save enabled
[   16.136148] r8152 1-1.3:1.0 eth0: carrier on
[   17.650799] systemd[694]: memfd_create() called without MFD_EXEC or MFD_NOEXEC_SEAL set
[  276.536024] r8152 1-1.3:1.0 eth0: carrier off
[  284.431608] r8152 1-1.3:1.0 eth0: carrier on
[  313.859807] r8152 1-1.3:1.0 eth0: carrier off
[  316.838712] r8152 1-1.3:1.0 eth0: carrier on
[  330.955997] r8152 1-1.3:1.0 eth0: carrier off
[  333.766694] r8152 1-1.3:1.0 eth0: carrier on
[  354.943842] r8152 1-1.3:1.0 eth0: carrier off
[  357.691570] r8152 1-1.3:1.0 eth0: carrier on
[  361.975866] r8152 1-1.3:1.0 eth0: carrier off
[  364.654588] r8152 1-1.3:1.0 eth0: carrier on
[  368.672033] r8152 1-1.3:1.0 eth0: carrier off
[  371.386604] r8152 1-1.3:1.0 eth0: carrier on
[  438.171837] r8152 1-1.3:1.0 eth0: carrier off
[  440.950618] r8152 1-1.3:1.0 eth0: carrier on
[  477.243978] r8152 1-1.3:1.0 eth0: carrier off
[  480.023995] r8152 1-1.3:1.0 eth0: carrier on
[  487.279975] r8152 1-1.3:1.0 eth0: carrier off
[  489.922608] r8152 1-1.3:1.0 eth0: carrier on
...
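For anyone who wants to quantify the flapping from logs like the two above, here is a minimal Python sketch that pairs each down event with the next up event and reports how long the link was lost. The function names and the regular expression are illustrative assumptions, not part of any Raspberry Pi tooling; the sample lines are copied from the logs above.

```python
import re
from typing import List, Tuple

# Matches kernel log lines such as:
#   [   80.095247] bcmgenet fd580000.ethernet eth0: Link is Down
#   [  276.536024] r8152 1-1.3:1.0 eth0: carrier off
EVENT_RE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s.*?\b(?P<iface>\w+): "
    r"(?P<event>Link is Down|Link is Up|carrier off|carrier on)"
)
DOWN_EVENTS = {"Link is Down", "carrier off"}


def parse_events(log_text: str) -> List[Tuple[float, str]]:
    """Return (timestamp, 'down' | 'up') for every link event in the log."""
    events = []
    for line in log_text.splitlines():
        match = EVENT_RE.match(line.strip())
        if match:
            state = "down" if match.group("event") in DOWN_EVENTS else "up"
            events.append((float(match.group("ts")), state))
    return events


def outages(events: List[Tuple[float, str]]) -> List[Tuple[float, float]]:
    """Pair each 'down' with the next 'up'; return (start, duration_seconds)."""
    result = []
    down_at = None
    for ts, state in events:
        if state == "down":
            down_at = ts
        elif down_at is not None:
            result.append((down_at, ts - down_at))
            down_at = None
    return result


# Sample lines copied from the RPI4 and RPI02W logs above.
GENET_SAMPLE = """\
[   80.095247] bcmgenet fd580000.ethernet eth0: Link is Down
[   84.191661] bcmgenet fd580000.ethernet eth0: Link is Up - 100Mbps/Full - flow control rx/tx
[  113.887179] bcmgenet fd580000.ethernet eth0: Link is Down
[  116.963537] bcmgenet fd580000.ethernet eth0: Link is Up - 100Mbps/Full - flow control rx/tx
"""
R8152_SAMPLE = """\
[  276.536024] r8152 1-1.3:1.0 eth0: carrier off
[  284.431608] r8152 1-1.3:1.0 eth0: carrier on
[  313.859807] r8152 1-1.3:1.0 eth0: carrier off
[  316.838712] r8152 1-1.3:1.0 eth0: carrier on
"""

if __name__ == "__main__":
    for name, sample in (("RPI4", GENET_SAMPLE), ("RPI02W", R8152_SAMPLE)):
        for start, duration in outages(parse_events(sample)):
            print(f"{name}: link lost at {start:.3f} s for {duration:.3f} s")
```

Feeding it the full output of dmesg from each board would give the outage durations per side. The timestamps are relative to each board's own boot, so the two logs cannot be compared directly without an offset.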

Additional context

The issue happens only when two Raspberry Pi devices are connected directly.

When another Ethernet device is connected in the middle or on the second side, the issue disappears.

I tested with a Linux PC on one side, and both RPI4 and RPI02W work with no breaks in that setup. The connection breaks only when two Raspberry Pi devices are connected together directly with an Ethernet cable.

pelwell commented 2 months ago

Zero 2 W does not have an Ethernet socket.

pelwell commented 2 months ago

...and therefore referring to "direct" connections is a bit misleading.

Are you providing the OTG Ethernet adaptor an external power supply?

qrp73 commented 2 months ago

> Zero 2 W does not have an Ethernet socket.

Yes, but as I said, I'm using a USB-Ethernet adapter based on the RTL8152 chipset. Here is the link: https://www.aliexpress.com/item/1005006625608264.html (see the last WHITE model with micro-usb connector).

> Are you providing the OTG Ethernet adaptor an external power supply?

No, it's not required. RPI4 is powered from the official RPI PSU. I power RPI02W from a 2 A charger with a 2.4 A USB cable. RPI02W together with the OTG Ethernet adapter and a connected rtlsdr consumes at most 550 mA in total during a full-speed 100 Mbps transfer test. But the issue is reproduced without the rtlsdr, so the peak current is about 275-300 mA.

Here are my measurements (taken with a lab PSU and DMM):

rpi02w        = 142-162 mA (idle)
+hub          = 225 mA     (idle, no eth link)
+eth          = 275 mA     (idle & eth link active)
+eth+rtl      = 403 mA     (idle & eth link active + rtlsdr connected)
+eth+transfer = 550 mA     (eth link & data transfer)

The voltage is OK: no voltage drops and no undervoltage flag.

The strange thing is that the connection breaks only when I connect two Raspberry Pi devices directly. When I put a router in the middle, the issue disappears. There are no other error messages, just "Link down/up". There is also a noticeable lag in the ssh session when it happens. You can notice it by holding down a key for a long time... It looks like something is stopping RPI02W for 1-2 sec, but strangely there are no CPU spikes in htop or btop...

I also noticed that it happens more often when I run btop on RPI02W. It happens even with an idle ssh session, but then it takes some time to appear; with btop running it happens every few seconds...

I suspect there is some issue in the firmware, probably related to power saving or something like that, but I'm not sure on which side, RPI4 or RPI02W, because I cannot reproduce this issue when either one is connected to a Linux PC... It happens only when a Raspberry Pi is running on both sides.

PS: I just tried another (shorter) Ethernet cable, with the same result: the link breaks after a random period of time.

qrp73 commented 2 months ago

I just tried connecting them through a 100 Mbps switch (no routing, just an Ethernet switch); that also removes the link breaks.

At a glance it looks like an issue in Ethernet link negotiation. For some unknown reason the link breaks at random times. I think there is an error in the firmware for the GENET 5.0 EPHY or in the firmware for the RTL8152... But it's reproduced only when they are used together.

I think there may be an issue with the Auto MDI-X feature of the Ethernet port, because I'm using a regular Ethernet cable for the connection (not a crossover cable). The Ethernet port should automatically choose the MDI configuration for the cable, and since the link is established, that works, but it is possible that something goes wrong and resets the link after some time. Maybe there is a bug in the firmware...

I think it may be related to a bug in the Auto MDI-X implementation, because there is no issue when automatic MDI selection is not required (when the Ethernet port is connected with a regular cable to a router or switch). The issue happens only when I use a direct connection with a regular cable, which requires automatic MDI configuration.

pelwell commented 2 months ago

The power available from a USB port is not equal to the power available from the supply minus the power consumed by the Pi itself, and it's not necessary to see undervoltage warnings for power to be a contributing factor, but the fact that the addition of a switch fixes things makes it less likely.

qrp73 commented 2 months ago

I can assure you that the power supply has at least 10 times the power reserve. The actual power consumption of USB-Ethernet module is measured as 83 mA / 0.42 Watt when there is no Ethernet link (cable disconnected) and 135 mA / 0.68 Watt with Ethernet link active (cable is connected). This is the power measured from USB connector to USB Ethernet module. There is no sign of voltage drop on the USB Ethernet module and it works very well with no issue when it is connected to the router/switch or to a non Raspberry Pi Ethernet port (tested with x86 PC running Linux Arch with the same NetworkManager config).
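As a quick sanity check of those wattage figures, power is simply current times voltage. The sketch below assumes the nominal 5 V USB bus voltage, which is an assumption on my part since the measured voltage is not stated; the helper name is hypothetical.

```python
USB_NOMINAL_VOLTS = 5.0  # assumption: nominal USB bus voltage, not a measured value


def power_watts(current_ma: float, volts: float = USB_NOMINAL_VOLTS) -> float:
    """Convert a current in milliamps to power in watts at the given voltage."""
    return current_ma / 1000.0 * volts


if __name__ == "__main__":
    # Currents quoted above for the USB-Ethernet module.
    for label, current_ma in (("no link", 83), ("link active", 135)):
        print(f"{label}: {current_ma} mA -> {power_watts(current_ma):.3f} W")
```

Under that assumption the two currents work out to roughly 0.415 W and 0.675 W, which agrees with the rounded 0.42 W and 0.68 W quoted above.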

I also tested different configurations:

LinuxArchPC ===ethernet===> RPI02W = works ok
LinuxArchPC ===ethernet===> RPI4 = works ok
RPI4 ===ethernet===> 1G Ethernet router ===ethernet===> RPI02W = works ok
RPI4 ===ethernet===> 100M Ethernet switch ===ethernet===> RPI02W = works ok
RPI4 ===ethernet===> RPI02W = fails

As you can see, the issue happens only when there is Raspberry Pi OS/firmware on both sides of the cable. It was tested with an Ethernet patch cable (not a crossover cable), so a direct connection involves MDI auto-negotiation to establish the Ethernet link. So it looks like broken MDI auto-negotiation on the Ethernet PHY is a possible reason.

But I'm not sure why it works OK when a Raspberry Pi is connected to a non Raspberry Pi Ethernet port (a Linux Arch PC in my case, which also involves MDI auto-negotiation) and fails when a Raspberry Pi is connected to a Raspberry Pi.

6by9 commented 2 months ago

AIUI MDIX auto-negotiation is not mandatory in the ethernet specs for 10baseT or 100baseT. It is on 1000baseT, but as your r8152 is only 100baseT that isn't relevant.

> But I'm not sure why it works OK when a Raspberry Pi is connected to a non Raspberry Pi Ethernet port (a Linux Arch PC in my case, which also involves MDI auto-negotiation) and fails when a Raspberry Pi is connected to a Raspberry Pi.

Which Pi are you talking about having connected to your PC? The Pi02 with r8152, or the Pi4? Any issues with the r8152 end are unlikely to be investigated by Raspberry Pi as we have no data on that chipset, and should be reported to the mainline Linux devs. Genet we may have a brief look at if it is a generic issue (ie not solely with r8152).

Does the situation improve if you force the Pi4 to only negotiate 100baseT via sudo ethtool -s eth0 speed 100? Is the link stable if you swap the Pi02 to use something other than r8152?

qrp73 commented 2 months ago

> Which Pi are you talking about having connected to your PC? The Pi02 with r8152, or the Pi4?

Both - Pi02 and Pi4.

> Does the situation improve if you force the Pi4 to only negotiate 100baseT via sudo ethtool -s eth0 speed 100?

I will check.

No, the same issue.

> Is the link stable if you swap the Pi02 to use something other than r8152?

I tried to replace Pi02 with Linux Arch PC:

RPI4 ==ethernet==>USB8152 => LinuxArchPC

It shows the same issue. The RTL8152 works on Pi02w with raspios the same as on other Linux distros.

It looks like the issue is on the RPI4 side, maybe a bug in the GENET firmware or a hardware bug in the GENET 5.0 EPHY.

Interestingly, the frequency of the Ethernet link breaks definitely depends on the task running on RPI4. When I run an iperf test, it happens very rarely. When the system is almost idle, it happens more often. But when there is no Ethernet communication at all, it happens rarely. I tried running stress -c 1, but it doesn't seem to affect the issue, so I'm not sure which process affects the frequency of the link breaks.

6by9 commented 2 months ago

> Is the link stable if you swap the Pi02 to use something other than r8152?

> I tried to replace Pi02 with Linux Arch PC: RPI4 ==ethernet==>USB8152 => LinuxArchPC

Swap the r8152 for something else, but still using the Pi4 and Pi02 as the final nodes.

> It looks like the issue is on the RPI4 side, maybe a bug in the GENET firmware or a hardware bug in the GENET 5.0 EPHY.

If it's a generic issue with any back-to-back connection between 2 devices, then we may investigate. If it is limited to Genet to r8152, then it's unlikely.

I wonder if it is Energy Efficient Ethernet (EEE) on the Pi4 upsetting things. Try adding dtparam=eee=on to config.txt on the Pi4.

pelwell commented 2 months ago

dtparam=eee=off?

qrp73 commented 2 months ago

I also tried the USB-RTL8152 on the RPI4 side (using the RTL8152 instead of the onboard GENET 5.0 EPHY).

RPI4 => USB-RTL8152 => Ethernet => LinuxArchPC

The link is stable in this setup.

So, we have a pretty interesting picture:

LinuxArchPC => USB-RTL8152 => Ethernet => GENET RPI4 - fails
RPI4 => USB-RTL8152 => Ethernet => LinuxArchPC - works ok

It shows that the Ethernet link issue is localized to the RPI4 GENET 5.0 EPHY side, because it happens only when the GENET 5.0 EPHY is involved.

I also found a new USB issue in Raspi OS. RPI4 with RaspiOS is unable to detect the RTL8152 on USB at boot time (it is missing from lsusb just after boot); when I reconnect it with the OS running, it is detected and works OK. I tried USB2 and USB3 connectors, and both have this issue. I tested the same on Linux Arch and there is no such issue. I will open a separate issue about it.

qrp73 commented 2 months ago

> I wonder if it is Energy Efficient Ethernet (EEE) on the Pi4 upsetting things. Try adding dtparam=eee=on to config.txt on the Pi4.

> dtparam=eee=off?

Yes! It works!

I added this line to /boot/firmware/config.txt on RPI4:

dtparam=eee=off

and after a reboot tried a direct connection between the RPI4 GENET and the USB-RTL8152 on RPI02W. The link failures are gone, and the link is now stable! :)

Does dtparam=eee=off affect only the GENET PHY or also other components?

There is something very wrong with the EEE feature, because it leads to link failures even when the link is loaded at 100% (95 Mbps continuous stream). It happens very rarely at max speed, but it still happens.

And it is probably related to auto MDI config negotiation, because the issue happens only for direct device-to-device connections with a patch Ethernet cable, which requires swapping the TX/RX pairs. I think this issue may be relevant for 10/100 Ethernet and may not be reproducible with 1G Ethernet, because a GMII PHY uses pairs for transmit and receive in both directions simultaneously. So, it may explain why I catch it with a 100M Ethernet adapter.

pelwell commented 2 months ago

> Does dtparam=eee=off affect only the GENET PHY or also other components?

Only GENET on a Pi 4. It's equivalent to adding genet.eee=N to cmdline.txt.
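To confirm that the setting actually made it into the file, a small check like the following could be run against the contents of /boot/firmware/config.txt (the path mentioned earlier in this thread). This is only a sketch: the function name is hypothetical, and it deliberately ignores conditional section filters such as [pi4] or [all], so it reports the last eee assignment it sees regardless of section.

```python
from typing import Optional


def eee_dtparam(config_text: str) -> Optional[str]:
    """Return the value of the last 'dtparam=...eee=<value>' assignment, or None.

    Simplification: section filters like [pi4] are not evaluated.
    """
    value = None
    for raw_line in config_text.splitlines():
        line = raw_line.strip()
        if line.startswith("#") or not line.startswith("dtparam="):
            continue
        # A dtparam line may carry several comma-separated assignments.
        for assignment in line[len("dtparam="):].split(","):
            name, _, val = assignment.partition("=")
            if name.strip() == "eee":
                value = val.strip()
    return value


if __name__ == "__main__":
    sample = "# Disable Energy Efficient Ethernet on GENET\ndtparam=audio=on\ndtparam=eee=off\n"
    print(eee_dtparam(sample))
```

On a real system the text would come from reading the file, for example with pathlib.Path("/boot/firmware/config.txt").read_text().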

6by9 commented 2 months ago

> There is something very wrong with the EEE feature, because it leads to link failures even when the link is loaded at 100% (95 Mbps continuous stream). It happens very rarely at max speed, but it still happens.

> And it is probably related to auto MDI config negotiation, because the issue happens only for direct device-to-device connections with a patch Ethernet cable, which requires swapping the TX/RX pairs. I think this issue may be relevant for 10/100 Ethernet and may not be reproducible with 1G Ethernet, because a GMII PHY uses pairs for transmit and receive in both directions simultaneously. So, it may explain why I catch it with a 100M Ethernet adapter.

EEE (IEEE 802.3az-2010) is in the same position that Ethernet's duplex mode autonegotiation (IEEE 802.3 clause 28) was in 20-25 years ago(*) - it generally works, but there are some combinations of devices that have inter-operability issues. Neither device is non-compliant with the specification, but put the two together and they don't play nicely. The combination of Genet on Pi4 with r8152 is one such combination, but as you've seen, both devices are quite happy connected to other routers or switches. The solution is to disable it at one end, either via device-tree or other module parameter option, or via ethtool --set-eee eth0 eee off.
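For scripting the runtime variant, a minimal Python sketch along these lines could build and optionally run the ethtool commands. The helper names are hypothetical, eth0 is just the interface name used in this thread, and actually running the commands needs root and an installed ethtool; by default the sketch only prints what it would run.

```python
import shlex
import subprocess
from typing import List


def show_eee_command(iface: str) -> List[str]:
    """Command that queries the current EEE state of an interface."""
    return ["ethtool", "--show-eee", iface]


def set_eee_command(iface: str, enable: bool) -> List[str]:
    """Command that turns EEE on or off for an interface."""
    return ["ethtool", "--set-eee", iface, "eee", "on" if enable else "off"]


def run(cmd: List[str], dry_run: bool = True) -> None:
    """Print the command; execute it only when dry_run is False."""
    print(shlex.join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run(show_eee_command("eth0"))
    run(set_eee_command("eth0", enable=False))
```

A change made this way is a runtime setting, so as far as I know it has to be reapplied after a reboot, whereas the config.txt or cmdline.txt options apply at every boot.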

FWIW The HDMI LA organises "Plug-fests" for their members to meet up with their new devices and have slots to check their devices against other members' devices for inter-operability issues, but even then it doesn't catch all issues. Ethernet isn't a licenced interface in the same way HDMI is, so that sort of event isn't really possible.

(*) https://en.wikipedia.org/wiki/Autonegotiation#Standardization_and_interoperability 1995 for the original spec, and 1998 for the improved version of the spec. So closer to 30 years ago, and I feel old!