openbmc / phosphor-networkd

Apache License 2.0
18 stars 49 forks source link

eth0 stuck in configuring, systemd-networkd-wait-online fails intermittently #39

Closed geissonator closed 3 years ago

geissonator commented 3 years ago

I've seen this intermittently for a while so I finally dug into it a bit. The symptom is that the systemd-networkd-wait-online service will fail.

root@rainier:/lib/systemd/system# systemctl status systemd-networkd-wait-online 
  systemd-networkd-wait-online.service - Wait for Network to be Configured
     Loaded: loaded (/lib/systemd/system/systemd-networkd-wait-online.service; enabled; vendor preset: disabled)
     Active: failed (Result: exit-code) since Thu 2021-05-20 19:22:12 UTC; 18h ago
       Docs: man:systemd-networkd-wait-online.service(8)
    Process: 719 ExecStart=/lib/systemd/systemd-networkd-wait-online (code=exited, status=1/FAILURE)
   Main PID: 719 (code=exited, status=1/FAILURE)

May 20 19:20:12 rainier systemd[1]: Starting Wait for Network to be Configured...
May 20 19:22:12 rainier systemd-networkd-wait-online[719]: Event loop failed: Connection timed out
May 20 19:22:12 rainier systemd[1]: systemd-networkd-wait-online.service: Main process exited, code=exited, status=1/FAILURE
May 20 19:22:12 rainier systemd[1]: systemd-networkd-wait-online.service: Failed with result 'exit-code'.
May 20 19:22:12 rainier systemd[1]: Failed to start Wait for Network to be Configured.

In most cases, it seems like this service runs early enough that it passes (i.e. the network interface(s) get into a passable state. But in most cases, later in the boot to BMC Ready, they go back into states that would cause this server to fail. If the service happens to run a bit later in the boot, it will fail due to this.

Here's a witherspoon system running master (2.10.0-dev-1257-g90f181d3e). System is at Ready state and if I run the service directly, it fails:

root@witherspoon-Y230UF71K03T:/lib/systemd/system# /lib/systemd/systemd-networkd-wait-online
Event loop failed: Connection timed out

The issue appears to be that eth0 is stuck in "configuring"

root@witherspoon-Y230UF71K03T:/lib/systemd/system# networkctl 
IDX LINK TYPE     OPERATIONAL SETUP      
  1 lo   loopback carrier     unmanaged  
  2 eth0 ether    routable    configuring
  3 sit0 sit      off         unmanaged  

Here's what I hope are the relevant journal entries for this:

May 19 15:01:54 witherspoon-Y230UF71K03T kernel: ftgmac100 1e660000.ethernet eth0: irq 20, mapped at 9afe8c31
May 19 15:01:54 witherspoon-Y230UF71K03T systemd[1]: Listening on Network Service Netlink Socket.
May 19 15:01:54 witherspoon-Y230UF71K03T systemd[1]: Starting Network Service...
May 19 15:01:57 witherspoon-Y230UF71K03T kernel: 8021q: adding VLAN 0 to HW filter on device eth0
May 19 15:01:57 witherspoon-Y230UF71K03T kernel: ftgmac100 1e660000.ethernet eth0: NCSI: Handler for packet type 0x82 returned -19
May 19 15:01:56 witherspoon-Y230UF71K03T systemd-networkd[113]: /etc/systemd/network/00-bmc-sit0.network: Gateway= without static address configured. Enabling GatewayOnLink= option.
May 19 15:01:56 witherspoon-Y230UF71K03T systemd-networkd[113]: Enumeration completed
May 19 15:01:56 witherspoon-Y230UF71K03T systemd[1]: Started Network Service.
May 19 15:01:57 witherspoon-Y230UF71K03T systemd-networkd[113]: eth0: Link UP
May 19 15:01:57 witherspoon-Y230UF71K03T systemd-networkd[113]: eth0: Gained carrier
May 19 15:01:59 witherspoon-Y230UF71K03T systemd-networkd[113]: eth0: Gained IPv6LL
May 19 15:02:10 witherspoon-Y230UF71K03T systemd[1]: Starting Wait for Network to be Configured...
May 19 15:02:11 witherspoon-Y230UF71K03T systemd[1]: Starting Network Name Resolution...
May 19 15:02:11 witherspoon-Y230UF71K03T systemd[1]: Starting Network Time Synchronization...
May 19 15:02:13 witherspoon-Y230UF71K03T systemd[1]: Started Network Time Synchronization.
May 19 15:02:14 witherspoon-Y230UF71K03T systemd[1]: Listening on phosphor-ipmi-net@eth0.socket.
May 19 15:02:14 witherspoon-Y230UF71K03T systemd[1]: Started Network Name Resolution.
May 19 15:02:15 witherspoon-Y230UF71K03T systemd[1]: Reached target Network.
May 19 15:02:15 witherspoon-Y230UF71K03T systemd[1]: Reached target Host and Network Name Lookups.
May 19 15:02:21 witherspoon-Y230UF71K03T kernel: ftgmac100 1e660000.ethernet eth0: Set package 0x0, channel 0x0 as preferred
May 19 15:02:46 witherspoon-Y230UF71K03T systemd[1]: Starting Phosphor Network Manager...
May 19 15:02:52 witherspoon-Y230UF71K03T avahi-daemon[269]: Joining mDNS multicast group on interface eth0.IPv6 with address fe80::72e2:84ff:fe14:23e1.
May 19 15:02:52 witherspoon-Y230UF71K03T avahi-daemon[269]: New relevant interface eth0.IPv6 for mDNS.
May 19 15:02:52 witherspoon-Y230UF71K03T avahi-daemon[269]: Joining mDNS multicast group on interface eth0.IPv4 with address XXXXXXX
May 19 15:02:52 witherspoon-Y230UF71K03T avahi-daemon[269]: New relevant interface eth0.IPv4 for mDNS.
May 19 15:02:52 witherspoon-Y230UF71K03T avahi-daemon[269]: Network interface enumeration completed.
May 19 15:02:52 witherspoon-Y230UF71K03T avahi-daemon[269]: Registering new address record for fe80::72e2:84ff:fe14:23e1 on eth0.*.
May 19 15:02:52 witherspoon-Y230UF71K03T avahi-daemon[269]: Registering new address record for XXXXXX on eth0.IPv4.
May 19 15:02:52 witherspoon-Y230UF71K03T systemd[1]: Started Phosphor Network Manager.
May 19 15:02:55 witherspoon-Y230UF71K03T systemd-networkd[113]: eth0: DHCPv6 lease lost
May 19 15:02:55 witherspoon-Y230UF71K03T systemd-timesyncd[182]: Network configuration changed, trying to establish connection.
May 19 15:02:55 witherspoon-Y230UF71K03T systemd[1]: Stopping Network Service...
May 19 15:02:57 witherspoon-Y230UF71K03T systemd[1]: systemd-networkd.service: Succeeded.
May 19 15:02:57 witherspoon-Y230UF71K03T systemd[1]: Stopped Network Service.
May 19 15:02:58 witherspoon-Y230UF71K03T systemd[1]: Starting Network Service...
May 19 15:03:03 witherspoon-Y230UF71K03T systemd-networkd[384]: /etc/systemd/network/00-bmc-sit0.network: Gateway= without static address configured. Enabling GatewayOnLink= option.
May 19 15:03:03 witherspoon-Y230UF71K03T systemd-networkd[384]: eth0: Gained IPv6LL
May 19 15:03:04 witherspoon-Y230UF71K03T systemd-timesyncd[182]: Network configuration changed, trying to establish connection.
May 19 15:03:04 witherspoon-Y230UF71K03T systemd-timesyncd[182]: Network configuration changed, trying to establish connection.
May 19 15:03:04 witherspoon-Y230UF71K03T systemd-timesyncd[182]: Network configuration changed, trying to establish connection.
May 19 15:03:04 witherspoon-Y230UF71K03T systemd-networkd[384]: Enumeration completed
May 19 15:03:04 witherspoon-Y230UF71K03T systemd-timesyncd[182]: Network configuration changed, trying to establish connection.
May 19 15:03:04 witherspoon-Y230UF71K03T systemd-timesyncd[182]: Network configuration changed, trying to establish connection.
May 19 15:03:04 witherspoon-Y230UF71K03T systemd-timesyncd[182]: Network configuration changed, trying to establish connection.
May 19 15:03:04 witherspoon-Y230UF71K03T systemd-timesyncd[182]: Network configuration changed, trying to establish connection.
May 19 15:03:04 witherspoon-Y230UF71K03T systemd-networkd[384]: eth0: Could not set MAC address, ignoring: Device or resource busy
May 19 15:03:05 witherspoon-Y230UF71K03T systemd-timesyncd[182]: Network configuration changed, trying to establish connection.
May 19 15:03:07 witherspoon-Y230UF71K03T systemd[1]: Started Network Service.
May 19 15:03:08 witherspoon-Y230UF71K03T systemd[1]: Finished Wait for Network to be Configured.
May 19 15:03:10 witherspoon-Y230UF71K03T systemd[1]: Reached target Network is Online.
May 19 15:03:12 witherspoon-Y230UF71K03T systemd[1]: Started Network IPMI daemon.
May 19 15:06:12 witherspoon-Y230UF71K03T ipmid[381]: getHostNetworkData failed for GetSysBootOptions.
May 19 15:06:12 witherspoon-Y230UF71K03T ipmid[381]: getHostNetworkData failed for GetSysBootOptions.
May 19 15:06:12 witherspoon-Y230UF71K03T ipmid[381]: getHostNetworkData failed for GetSysBootOptions.
May 19 15:06:13 witherspoon-Y230UF71K03T ipmid[381]: getHostNetworkData failed for GetSysBootOptions.
May 19 15:06:24 witherspoon-Y230UF71K03T ipmid[381]: getHostNetworkData failed for GetSysBootOptions.

I need to dig into the systemd service a bit more, I'm wondering if this is what's getting it stuck?

eth0: Could not set MAC address, ignoring: Device or resource busy
geissonator commented 3 years ago

Some further digging found that my witherspoon system had a weird /etc/systemd/network/00-bmc-eth0.network file. It was configured for a static IP, but also had "DHCP=ipv6" in it. I changed that to "DHCP=no" and all is good now. On my rainier, it looks like we have a legit issue with bringing up DHCP on eth1.