Raspberry Pi 4: Network non-functional directly after (re)boot

M-Reimer commented 5 years ago

Describe the bug If I boot directly into an up-to-date Raspbian, the network starts up in some kind of "non-functional" state. This means that dhcpcd does not manage to get an IP address, and manually running dhcpcd on the interface hangs forever.

The problem resets if I unplug and replug the network cable. This triggers fetching a valid IP and properly enables the network interface.

It is also possible to recover from the non-functional state by running sudo mii-tool -r eth0. This also "unblocks" the network card and makes dhcpcd get a new IP.
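The manual recovery described above could be automated along these lines. This is only a sketch: the helper names are hypothetical, the caller is assumed to supply the interface's current addresses, and treating "no address, or only a link-local one" as the stuck state is an assumption, not something the driver reports. The only external command used is the mii-tool -r call from the report, which needs root.

```python
import ipaddress
import subprocess


def has_routable_ipv4(addresses):
    """Return True if any address is IPv4 and neither link-local nor loopback."""
    for text in addresses:
        addr = ipaddress.ip_address(text)
        if addr.version == 4 and not addr.is_link_local and not addr.is_loopback:
            return True
    return False


def renegotiate(interface="eth0", run=subprocess.run):
    """Restart auto-negotiation with the command from the report (needs root)."""
    return run(["mii-tool", "-r", interface], check=True)


def recover_if_stuck(addresses, interface="eth0", run=subprocess.run):
    """Renegotiate only when the interface has no usable IPv4 address.

    Returns True if a renegotiation was triggered, False otherwise.
    """
    if has_routable_ipv4(addresses):
        return False
    renegotiate(interface, run=run)
    return True
```

The run parameter exists only so the logic can be exercised without touching real hardware; in normal use the default subprocess.run is what executes mii-tool.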

To reproduce It seems that not everyone is able to reproduce this bug. Maybe it's even some kind of "hardware problem". But on an affected Raspberry Pi 4 board, all you have to do is reboot. The result is a non-functional network.

Expected behaviour Network should come up without problems every time.

Actual behaviour Network hangs until mii-tool -r is called or the network cable is unplugged and replugged.

Logs I already published some logs here: https://www.raspberrypi.org/forums/viewtopic.php?f=66&t=244061#p1488426 I can provide more if needed.

It doesn't seem to be only a DHCP issue. I configured my RPi 4 for a static IP and rebooted several times. The journal always says that IP, route and DNS are set properly, but it is impossible to reach the RPi.

Then I tried the switch thing. I still have some old 100MBit switch and connected it in place of my 1GBit one (D-Link DGS-108 https://www.amazon.de/dp/B000BCC0LO/). With this switch in place I was able to reboot 5 times and network was always available.

So yes, this changes with changing the switch. But of course I would prefer to run the 1GBit card on a 1GBit switch :stuck_out_tongue:

So I think if I buy a second one, then this will show exactly the same problem on this switch?

M-Reimer commented 5 years ago

Would be nice to get feedback about what to do with my "problematic" board.

If it helps in any way to debug this issue, I would send it in.

But maybe it would be even more helpful for debugging if I keep the unit, as I have the required test setup to trigger the issue. If an update is published, I could test whether it also fixes the issue in my "test environment".

Anyway it would be nice to get some comment about this soon as currently the board is just collecting dust. I have no use for it if network fails regularly.

pelwell commented 5 years ago

This sounds like an auto-negotiation failure. It isn't hard to imagine how some switches might trigger the problem while others don't, but it's harder to explain differences between multiple Pi 4s running the same image unless there is a marginal timing somewhere, e.g. (and this is just a hypothetical example) the first round of auto-negotiation takes too long and one side either gives up completely or, when trying again, falls foul of a driver bug in an error path.

M-Reimer commented 5 years ago

I have a combination (RPi 4 and switch) which makes it possible to reproduce the issue every time.

So if there is a way to find out what is causing the problem, then I could try. So far, the logs I got don't provide anything useful.

pelwell commented 5 years ago

Can you post the output of mii-tool -vv eth0 before and after running mii-tool -r eth0?

M-Reimer commented 5 years ago

Before:

Using SIOCGMIIPHY=0x8947
eth0: negotiated 1000baseT-FD flow-control, link ok
  registers for MII PHY 1: 
    1140 796d 600d 84a2 0de1 c5e1 006f 0000
    0000 0300 0800 0000 0000 0000 0000 3000
    0000 0000 0000 0000 ffff ffff 0000 0000
    0400 ff1f 043e fff1 3403 0000 0000 0000
  product info: vendor 18:03:61, model 10 rev 2
  basic mode:   autonegotiation enabled
  basic status: autonegotiation complete, link ok
  capabilities: 1000baseT-HD 1000baseT-FD 100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD
  advertising:  1000baseT-HD 1000baseT-FD 100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD flow-control
  link partner: 1000baseT-FD 100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD flow-control

After:

Using SIOCGMIIPHY=0x8947
eth0: negotiated 1000baseT-FD flow-control, link ok
  registers for MII PHY 1: 
    1000 796d 600d 84a2 0de1 c5e1 006f 0000
    0000 0300 0800 0000 0000 0000 0000 3000
    0000 0000 0000 0000 ffff ffff 0000 0000
    0400 ff1f 043e fff1 3403 0000 0000 0000
  product info: vendor 18:03:61, model 10 rev 2
  basic mode:   autonegotiation enabled
  basic status: autonegotiation complete, link ok
  capabilities: 1000baseT-HD 1000baseT-FD 100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD
  advertising:  1000baseT-HD 1000baseT-FD 100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD flow-control
  link partner: 1000baseT-FD 100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD flow-control

pelwell commented 5 years ago

Only one register is different, the Basic Mode Control Register at offset 0: 0x1140 becomes 0x1000. However, the bits that are different are either not relevant when auto-negotiation is enabled (bit 8 - Duplex Mode) or reserved (bit 6). It could be that resetting the PHY is (just) a way of shaking the Ethernet driver out of its broken state.
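The comparison above can be reproduced by XOR-ing the two register values and naming the bits that changed. This is a minimal sketch; the bit names follow the BMCR_* constants in Linux's include/uapi/linux/mii.h, which labels bit 6 as the speed-select MSB rather than as reserved. Either way, like the duplex bit, it should not matter while auto-negotiation is enabled.

```python
# Bit names follow the BMCR_* constants in Linux's include/uapi/linux/mii.h.
BMCR_BITS = {
    15: "reset",
    14: "loopback",
    13: "speed select (LSB)",
    12: "auto-negotiation enable",
    11: "power down",
    10: "isolate",
    9: "restart auto-negotiation",
    8: "duplex mode",
    7: "collision test",
    6: "speed select (MSB)",
}


def set_bits(value):
    """Return the positions of the set bits in a 16-bit value, highest first."""
    return [bit for bit in range(15, -1, -1) if value & (1 << bit)]


def describe(value):
    """Return the names of the set bits; unnamed low bits are reserved."""
    return [BMCR_BITS.get(bit, "reserved") for bit in set_bits(value)]


before, after = 0x1140, 0x1000  # register 0 before and after mii-tool -r
changed = before ^ after
print(f"changed mask: {changed:#06x}")  # changed mask: 0x0140
for bit in set_bits(changed):
    print(bit, BMCR_BITS.get(bit, "reserved"))
```

Bit 12 (auto-negotiation enable) is set in both values, which matches the "autonegotiation enabled" line in both dumps.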

M-Reimer commented 5 years ago

The problem doesn't occur every time. In some rare cases I get network directly after booting the "problematic" Pi. So I tried to catch this case where the network works directly after boot and got this directly after booting:

Using SIOCGMIIPHY=0x8947
eth0: negotiated 1000baseT-FD flow-control, link ok
  registers for MII PHY 1: 
    1140 796d 600d 84a2 0de1 c5e1 006f 0000
    0000 0300 3800 0000 0000 0000 0000 3000
    0000 0000 0000 0000 ffff ffff 0000 0000
    0400 ff1f 043e fff1 3403 0000 0000 0000
  product info: vendor 18:03:61, model 10 rev 2
  basic mode:   autonegotiation enabled
  basic status: autonegotiation complete, link ok
  capabilities: 1000baseT-HD 1000baseT-FD 100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD
  advertising:  1000baseT-HD 1000baseT-FD 100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD flow-control
  link partner: 1000baseT-FD 100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD flow-control

So I think it's safe to say that the register values don't matter in this case.
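For anyone who wants to compare such dumps mechanically rather than by eye, here is a small sketch that parses the register rows printed by mii-tool -vv and reports which register indices differ. The function names are made up for this example, and only the four register rows should be passed in, not the surrounding text. The sample data is copied from the dumps in this thread.

```python
import re

# Register rows from the failing boot, before running mii-tool -r.
BAD_BOOT = """
    1140 796d 600d 84a2 0de1 c5e1 006f 0000
    0000 0300 0800 0000 0000 0000 0000 3000
    0000 0000 0000 0000 ffff ffff 0000 0000
    0400 ff1f 043e fff1 3403 0000 0000 0000
"""

# Register rows from the same boot, after running mii-tool -r.
AFTER_RESET = """
    1000 796d 600d 84a2 0de1 c5e1 006f 0000
    0000 0300 0800 0000 0000 0000 0000 3000
    0000 0000 0000 0000 ffff ffff 0000 0000
    0400 ff1f 043e fff1 3403 0000 0000 0000
"""

# Register rows from a boot where the network worked straight away.
GOOD_BOOT = """
    1140 796d 600d 84a2 0de1 c5e1 006f 0000
    0000 0300 3800 0000 0000 0000 0000 3000
    0000 0000 0000 0000 ffff ffff 0000 0000
    0400 ff1f 043e fff1 3403 0000 0000 0000
"""


def parse_registers(rows):
    """Turn whitespace-separated 4-digit hex words into a list of integers."""
    return [int(word, 16) for word in re.findall(r"\b[0-9a-fA-F]{4}\b", rows)]


def diff_registers(a, b):
    """Map register index -> (value in a, value in b) for differing registers."""
    return {i: (x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y}


bad = parse_registers(BAD_BOOT)
for label, other in (("after reset", AFTER_RESET), ("good boot", GOOD_BOOT)):
    for index, (old, new) in diff_registers(bad, parse_registers(other)).items():
        print(f"{label}: register {index}: {old:#06x} -> {new:#06x}")
```

On this data the failing boot and the good boot differ only in register 10, which Linux's mii.h calls MII_STAT1000 (the 1000BASE-T status register). The dump taken after mii-tool -r shows the same register 10 value as the failing boot, so that difference alone does not separate working from non-working states.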

M-Reimer commented 5 years ago

I used "diff" to compare the dmesg output of a "good" and a "bad" start. No relevant differences. Is it possible to get additional info logged from this network card driver so maybe some difference can be found there? I have no knowledge about kernel debugging but recompiling a kernel would be no problem for me if needed. I created my first Pi 4 compatible Arch Linux ARM kernel on my own, too.

M-Reimer commented 5 years ago

I think that's the driver: https://github.com/torvalds/linux/blob/master/drivers/net/ethernet/broadcom/genet/bcmgenet.c

And it seems to have nearly no debug output messages in there. Adding some without knowing which functions may be interesting doesn't make sense.

If someone here can provide a patch to make this driver a bit more communicative at the interesting positions, I could apply this, compile the kernel and check if there is any difference between the outputs in "good" or "bad" state.

pelwell commented 5 years ago

I've got one of the D-Link switches on order, so I hope we can find a switch+Pi 4 combination that exhibits the problem.

iammer commented 5 years ago

I am also having problems with a Pi 4 and a DGS-108. In my case the Ethernet disconnects/reconnects during heavy traffic. See: https://www.raspberrypi.org/forums/viewtopic.php?f=28&t=247257

M-Reimer commented 5 years ago

Interesting. So this switch seems to be problematic in general and it's not just my switch. But there is still the problem that there are RPi 4 boards that work well on this switch. I hope @pelwell finds one which works for reproducing the issue. Restarting the switch does not help for my case. The switch is restarted daily but the problem persists.

M-Reimer commented 5 years ago

Today I received two more boards with 2GB RAM. I want to use them to do some tests with in-home network services to find out a bit more about server performance. Of course, network reliability is important there.

So I rebooted each of the boards 10 times via SSH to see if the network works on every boot. And it did. No problems at all with the two new boards.

So I guess the problem may be a bit rare. It requires the right switch and the right RPi 4 board. If you don't find a board to reproduce the issue, I can still offer to send you mine. But I would recommend that you tell me your full name so I can put a note into the package saying that it has to be forwarded directly to you, so it isn't sent back to me after testing in an environment where the problem is not triggered.

pelwell commented 5 years ago

Drop me an email - phil@raspberrypi.org - and we can exchange details.

pelwell commented 5 years ago

Woot! I have a board that fails with the new D-Link switch (about the 10th I've tried). It doesn't happen every time, but it doesn't take more than a few goes to get a failure. The roughly equivalent Netgear switch I was using hasn't shown the issue in a handful of attempts.

pelwell commented 5 years ago

At the time the Pi gets the IPv4LL/Zeroconf 169.254.x.y address, the PHY link is up and receiving packets, but we knew that already. The DHCP traffic is different in the failure case - dhcpcd doesn't send a DISCOVER packet, and the later REQUEST doesn't include a server ID. It's time to add some logging to the Ethernet driver, as I can't start tcpdump soon enough to catch the very early traffic.
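When checking captured DHCP traffic for the two properties mentioned above (the message type, and whether a server ID is present), a small option parser is enough. This sketch follows the option layout of RFC 2132 (code, length, value; 53 is the message type, 54 the server identifier). It expects only the options bytes that follow the DHCP magic cookie, and the function names are hypothetical. As background, RFC 2131 describes a REQUEST without a server identifier as what a client sends when it tries to reuse a previously held lease, so that observation may not indicate a fault by itself.

```python
MESSAGE_TYPES = {1: "DISCOVER", 2: "OFFER", 3: "REQUEST", 4: "DECLINE",
                 5: "ACK", 6: "NAK", 7: "RELEASE", 8: "INFORM"}
OPT_PAD, OPT_MESSAGE_TYPE, OPT_SERVER_ID, OPT_END = 0, 53, 54, 255


def parse_options(data):
    """Parse the DHCP options field (the bytes after the magic cookie)."""
    options = {}
    i = 0
    while i < len(data):
        code = data[i]
        if code == OPT_END:
            break
        if code == OPT_PAD:  # single-byte padding, no length field
            i += 1
            continue
        if i + 1 >= len(data):
            break  # truncated option: stop rather than guess
        length = data[i + 1]
        options[code] = data[i + 2:i + 2 + length]
        i += 2 + length
    return options


def summarize(data):
    """Return (message type name, True if a server identifier is present)."""
    options = parse_options(data)
    raw = options.get(OPT_MESSAGE_TYPE)
    kind = MESSAGE_TYPES.get(raw[0], "unknown") if raw else "unknown"
    return kind, OPT_SERVER_ID in options


# Hand-built example: a REQUEST carrying only option 50 (requested IP address).
sample = bytes([53, 1, 3, 50, 4, 192, 168, 1, 20, 255])
print(summarize(sample))
```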

rexx-org commented 5 years ago

I just received my Pi4 and can confirm I have the same issue with the lack of network on startup and working after unplugging the network cable. The Pi4 is plugged into a LGS108 with the main router FritzBox 7490. All my other Raspberry Pis from Model 2A to Model 3B+ work without any issue.

pelwell commented 5 years ago

Thanks - we have a critical mass of "Me too"s now.

ptesarik commented 5 years ago

All right. It seems my 1G RPi4 is also affected. Would it help if I captured network traffic from the other end (the switch)?

pelwell commented 5 years ago

I believe nothing will come out in the failure case, but it would be good to confirm that.

pelwell commented 5 years ago

So far I've failed to find a bit of PHY state that explains the error, and failed to find a place to insert the forced renegotiation.

pelwell commented 5 years ago

Hmm - I may have spoken too soon. This test is looking promising...

pelwell commented 5 years ago

See https://github.com/raspberrypi/linux/pull/3121 for what seems like an effective workaround.

popcornmix commented 5 years ago

Latest rpi-update kernel contains a workaround. Can you add genet.force_reneg=y to cmdline.txt and report back if the issue still occurs?
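Since cmdline.txt has to stay a single line of space-separated parameters, adding the option by script takes a little care. Below is a minimal sketch of a helper that sets a key=value parameter and replaces any earlier setting of the same key. The helper name and the sample command line are made up for illustration, and the file location (typically /boot/cmdline.txt on Raspbian) is an assumption; back up the file before changing it.

```python
def set_kernel_param(cmdline, key, value):
    """Return cmdline with key=value set, replacing any existing entry for key.

    The result is always a single line, as cmdline.txt requires.
    """
    params = [p for p in cmdline.split() if p.split("=", 1)[0] != key]
    params.append(f"{key}={value}")
    return " ".join(params)


# Hypothetical existing content; a real cmdline.txt will differ.
current = "console=tty1 root=/dev/mmcblk0p2 rootfstype=ext4 rootwait\n"
print(set_kernel_param(current, "genet.force_reneg", "y"))
```

Calling it again with a different value for the same key replaces the earlier entry instead of appending a second one.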

rexx-org commented 5 years ago

Can confirm the workaround with genet.force_reneg=y worked on my Pi 4B. Thanks.

ialexiad commented 5 years ago

I have the same issue; my pi 4 is connected to a turris omnia (gigabit router). Every few reboots the ethernet interface will fail to get an IP address, unless I manually set eth0 to down and then up, or run mii-tool -r eth0.

Trying to troubleshoot, I ran tcpdump and noticed that DHCP discover packets were being generated by dhcpcd. Then I ran tcpdump on the router and another device on the same LAN and noticed that those DHCP discover broadcasts never reached the other end. Meanwhile tcpdump on the pi would capture broadcasts by the other devices on the LAN.

Then I thought I'd stop the dhcpcd service and set an IP manually, which I did and then tried to ping the router IP. Naturally, the pi4 started sending ARP requests for the router MAC, but never got an answer. It turns out the ARP packets, like DHCP, were not reaching the router, so at some point after they were being captured by tcpdump on the pi4, they were not transmitted to the wire correctly or at all. Same as before, incoming broadcasts were being received.

The strange thing is that when I tried to ping the pi4 from the router, the router sent an ARP request, the pi4 reply reached the router, and the router started sending echo requests and receiving replies from the pi4 without issues. At that point the arp table on the pi4 was populated with the MAC address of the router, so I thought I'd try again to ping the router from the pi4. This time, some requests got through to the router but with a loss of 82% and huge delays (some up to almost 1 second).

To summarize, I noticed that when the ethernet card of the pi4 is in that problematic state, it receives packets without issue, it replies to ARP requests and ICMP echo requests without issue (probably other traffic as well, I haven't tested), but any packets it generates and tries to send itself (ie not as a reply to a received packet) seem to get lost somewhere after the layer where tcpdump captures them (maybe they are corrupted and are discarded?).

I don't know what to make of this so I thought I'd report it here for anyone who can look into it further. Updating the kernel and setting the force_reneg option in cmdline.txt solves this issue for me as well.
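One way to cross-check the observation above from the Pi's side is to sample the kernel's per-interface counters before and after a test such as a ping. This is a sketch under stated assumptions: it reads the standard statistics files under /sys/class/net, the function names are hypothetical, and the counters only show what the driver reports, not what actually reached the wire. If tx_packets rises while the peer sees nothing, the loss is happening below the level these counters account for.

```python
from pathlib import Path

COUNTERS = ("rx_packets", "tx_packets", "tx_errors", "tx_dropped")


def read_counters(interface="eth0", sysfs=Path("/sys/class/net")):
    """Read selected packet counters for one interface from sysfs."""
    stats = Path(sysfs) / interface / "statistics"
    return {name: int((stats / name).read_text()) for name in COUNTERS}


def delta(before, after):
    """Return how much each counter grew between two samples."""
    return {name: after[name] - before[name] for name in before}


# Typical use on the Pi (not executed here):
#   first = read_counters("eth0")
#   ... run the ping test ...
#   print(delta(first, read_counters("eth0")))
```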

pelwell commented 5 years ago

To summarize, I noticed that when the ethernet card of the pi4 is in that problematic state, it receives packets without issue, it replies to ARP requests and ICMP echo requests without issue (probably other traffic as well, I haven't tested), but any packets it generates and tries to send itself (ie not as a reply to a received packet) seem to get lost somewhere after the layer where tcpdump captures them (maybe they are corrupted and are discarded?).

Thanks - that's a useful refinement on my findings.

pelwell commented 5 years ago

Here's an interesting data point: it seems that, without the forced renegotiation workaround, the 5.3 kernel doesn't exhibit this bug.

pelwell commented 5 years ago

Or it could just be that the other 4B on my desk I was bringing up 5.3 on is one of the majority that aren't affected.

As you were.

pelwell commented 5 years ago

Seeing that the PHY gets a link very soon after power-on and wondering if the PHY is in a happier state before the driver starts led to an alternative workaround - skipping a reset step in the driver. This seems as effective as the forced renegotiation, but without adding a delay before getting a valid link and an IP address, so it is enabled by default. In case of regressions, add the following to cmdline.txt to disable it:

genet.skip_umac_reset=n

pelwell commented 5 years ago

The latest rpi-update firmware includes the new workaround.

Adytv1 commented 5 years ago

Hi guys, I have the same issue with eth0: it only shows up on 1000baseT-FD, and neither resetting nor restarting eth0 helps. My kernel version is 4.19.58-v7l+, running on a Raspberry Pi 4 2GB. I bought 3 of them, all with the same problem.

pelwell commented 5 years ago

The latest rpi-update firmware includes the new workaround.

Run sudo rpi-update and retest. Kernels built from rpi-4.19.y since 4.19.65 should be OK, unless you have a different issue.

Adytv1 commented 5 years ago

Thank you for answering in such a short time. I did update the kernel to 4.19.66, but eth0 keeps restarting on 1000baseT, while on 100baseT it works. Any other suggestions?

pelwell commented 5 years ago

That does sound different. Which switch is the Pi 4 connected to? The precise model would help.

Adytv1 commented 5 years ago

The model name is DSL-n55u dual band , brand Asus

ralphrmartin commented 5 years ago

I have a Pi 4 which has started showing something like this issue in the last day or so. I'm on the 4.19.66 kernel, and up to date with apt. I'm not sure if it's the same issue or not, or maybe just hardware failure. Symptoms are:

RPi 4 shows eth0 as down, and attempts to bring it up with sudo ip link set dev eth0 up are to no avail and it stays down; dhcpcd5 appears to get stuck at start up.

RPi3B+ using the same disk (SD + SSD combo) boots and works fine, with eth0 up OK.

(eth0 connected to an EdgeRouter X).

Opened a new issue with further details - see #3195

xtronom commented 5 years ago

My Pi 4 is experiencing this problem too in combination with Cisco SG200-08 switch. Only one port is problematic, because it takes very long to negotiate (link LED comes on) on any ethernet device I connect to it. Good ports negotiate in 2 seconds, but this one needs 12 seconds.
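The negotiation times quoted above were judged from the link LED. A rough software equivalent is to poll the interface's carrier flag in sysfs and time how long it takes to come up. This is only a sketch: the function names are hypothetical, the timer starts when the script starts rather than when the cable is plugged in, and the polling interval limits the resolution.

```python
import time
from pathlib import Path


def carrier_up(interface="eth0", sysfs=Path("/sys/class/net")):
    """Return True if the kernel reports carrier on the interface."""
    try:
        return (Path(sysfs) / interface / "carrier").read_text().strip() == "1"
    except OSError:
        # Reading carrier fails while the interface is administratively down.
        return False


def time_to_link(is_up=carrier_up, timeout=30.0, interval=0.1,
                 clock=time.monotonic, sleep=time.sleep):
    """Poll until the link is up; return elapsed seconds, or None on timeout."""
    start = clock()
    while True:
        if is_up():
            return clock() - start
        if clock() - start >= timeout:
            return None
        sleep(interval)


# Typical use on the Pi (not executed here):
#   elapsed = time_to_link(timeout=30.0)
#   print("no link" if elapsed is None else f"link up after {elapsed:.1f}s")
```

The clock and sleep parameters are there so the loop can be checked with a fake clock instead of waiting in real time.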

So far only the Pi's genet sometimes fails completely to bring the link in a usable state. Restarting negotiation with mii-tool helps most of the time. When it works I don't experience any packet loss.

It's not always easy to reproduce the problem. Some cables make the problem more frequent, however I haven't noticed any effect on negotiation time.

In the latest Raspbian (kernel 4.19.57), I can sometimes reproduce the problem if I unplug the power cable without shutting down and then replug it. Then I notice that ethernet doesn't come up in the bootloader and the link is down for a long time even after the DHCP client starts. If I do a shutdown and then replug the power cable, the issue doesn't occur.

I also use a custom buildroot system. This problem occurs there much more frequently. I suspect this is because the timing at which userland brings up the interface differs from Raspbian. I have tested different kernel versions from the RPi git and none makes much difference.

The most reliable way to reproduce this problem is as follows:

take raspbian boot image (with kernel)
change root entry in cmdline.txt so it's invalid
boot the Pi, the kernel will panic
observe the link LED on the Pi and on the switch

In my case, the LED will be on if I connect the Pi to a good port. If I connect the Pi to the bad port, then LED on the Pi side will remain off, and the link LED on the switch will blink in regular intervals (1.5-2 seconds on, 3 seconds off). Additionally, I have also done this (for comparison) with a USB ethernet adapter connected to the Pi and it works fine, except it takes long for the LED to come on.

pelwell commented 5 years ago

Does the genet.skip_umac_reset=n cmdline.txt setting improve or worsen the symptoms?

ralphrmartin commented 5 years ago

I added genet.skip_umac_reset=y, which seems to cure the problem.

Addendum: I made a mistake in saying this - see below.

pelwell commented 5 years ago

I added genet.skip_umac_reset=y, which seems to cure the problem.

Really? That's the default...

xtronom commented 5 years ago

From my experience genet.skip_umac_reset either has no effect or it worsens the symptoms. Can't really tell, because it works randomly in either case, only the pattern maybe changes slightly. I've just tested again using 4.19.71-v7l+.

If the system boots with working ethernet, I can re-plug the ethernet cable and the network will not always come up. It is interesting, that when ethernet fails, the switch reports 1000 Mbps / FDX for the brief 1.5 - 2 seconds the link is established (but the LED on Pi remains off).

ralphrmartin commented 5 years ago

My mistake, apologies. I added genet.force_reneg=y which cured the problem.

pelwell commented 5 years ago

That was the previous version of the workaround. It was rejected at the time because it increased the time to get the link up significantly - enough to trigger LibreElec's fixed 10 second timeout - but perhaps something similar might still be useful.

ralphrmartin commented 5 years ago

Replacing genet.force_reneg=y with genet.skip_umac_reset=n also seemed to work well in my case.

xtronom commented 5 years ago

I've found another way to reproduce the problem:

remove sdcard
disconnect everything except LAN
power the Pi

When connected to a good port, the LAN LEDs will come up. When connected to a bad port, the Pi LEDs will stay off and the switch LEDs will blink in regular intervals as before. Sometimes LAN will come up even on a bad port if I power-cycle the Pi a few times.

It looks to me there's a race condition somewhere in the boot code or in the HW itself. Is it known yet, that a workaround is the most we can hope for?

pelwell commented 5 years ago

The Pi 4 bootcode currently has no support for network booting and does not initialise the PHY, MAC or any associated clocks, so the behaviour you are seeing is just with the default PHY state.

N.B. Although an interesting data point, we are unlikely to put much effort into improving the handling of a known-bad switch port.

xtronom commented 5 years ago

I understand, thank you for the information.

zafrirron commented 5 years ago

After 3 long continuous days of investigating similar ethernet issues (eth0 down...), I've observed the following:

1. The issue exists on both of my RPi 4 boards.
2. The issue is not related to any network or device connected to the Pi (it can be reproduced easily with no cable attached: eth0 is down).
3. My tests were done on a clean Buster install (before and after ALL updates).
4. I've observed different behaviour related to the monitor attached to the HDMI connections.
5. I could bring eth0 back with a static IP setting: ifconfig eth0 xx.xx.xx.xx
6. On several occasions, performing (5) while connected to the network and then rebooting the Pi with HDMI connected (the one closer to the USB) blocked the boot (no screen); the behaviour is different on the second HDMI port.

The above leads me to suspect some kind of HW design issue. Hopefully this will help someone in a future investigation.

divinehawk commented 5 years ago

On my Rasp4b-2GB, the ethernet link continually goes down under any traffic load. The switch is a TEG-S80G. Let me know if there's anything to test.

raspberrypi / linux

Raspberry Pi 4: Network non-functional directly after (re)boot #3108