systemd / systemd

The systemd System and Service Manager
https://systemd.io
GNU General Public License v2.0
13.25k stars 3.79k forks source link

systemd-networkd is not (re-)started after initrd-switch-root #27718

Closed floppym closed 1 year ago

floppym commented 1 year ago

systemd version the issue has been seen with

253.4

Used distribution

Gentoo

Linux kernel version used

6.1.29

CPU architectures issue was seen on

x86_64

Component

systemd-networkd

Expected behaviour you didn't see

systemd-networkd.service should be started after systemd is re-execed after switching root from the initrd (dracut-059).

Unexpected behaviour you saw

systemd-networkd.service is de-activated after initrd-switch-root.service starts and systemd is re-execed. It is not restarted automatically. This results in a system with no network connectivity after a reboot, until the service is started manually.

Steps to reproduce the problem

  1. Create an initrd using dracut with the systemd-networkd module enabled (this happens by default on my system).
  2. Boot the system using this initrd.

Also see https://github.com/systemd/systemd/issues/27683, https://github.com/systemd/systemd/pull/27617, https://github.com/systemd/systemd/pull/27649.

Additional program output to the terminal or log subsystem illustrating the issue

% journalctl -b -1 | grep -E "networkd|system mode|switch"
May 19 22:46:41 gentoo kernel: Spectre V2 : Spectre v2 / SpectreRSB mitigation: Filling RSB on context switch
May 19 22:46:41 gentoo kernel: Console: switching to colour frame buffer device 200x75
May 19 22:46:41 gentoo systemd[1]: systemd 253 running in system mode (+PAM -AUDIT -SELINUX -APPARMOR +IMA +SMACK +SECCOMP +GCRYPT -GNUTLS +OPENSSL +ACL +BLKID +CURL +ELFUTILS -FIDO2 +IDN2 -IDN -IPTC +KMOD +LIBCRYPTSETUP +LIBFDISK +PCRE2 -PWQUALITY -P11KIT +QRENCODE -TPM2 +BZIP2 +LZ4 +XZ +ZLIB +ZSTD -BPF_FRAMEWORK -XKBCOMMON +UTMP -SYSVINIT default-hierarchy=unified)
May 19 22:46:41 gentoo systemd[1]: Listening on systemd-networkd.socket.
May 19 22:46:41 gentoo systemd[1]: Starting systemd-networkd.service...
May 19 22:46:42 gentoo systemd-networkd[452]: lo: Link UP
May 19 22:46:42 gentoo systemd-networkd[452]: lo: Gained carrier
May 19 22:46:42 gentoo systemd-networkd[452]: Enumeration completed
May 19 22:46:42 gentoo systemd[1]: Started systemd-networkd.service.
May 19 22:46:43 gentoo systemd[1]: systemd-networkd.socket: Deactivated successfully.
May 19 22:46:43 gentoo systemd[1]: Closed systemd-networkd.socket.
May 19 22:46:43 gentoo systemd[1]: Reached target initrd-switch-root.target.
May 19 22:46:43 gentoo systemd[1]: Starting initrd-switch-root.service...
May 19 22:46:44 naomi systemd[1]: systemd 253 running in system mode (+PAM -AUDIT -SELINUX -APPARMOR +IMA +SMACK +SECCOMP +GCRYPT -GNUTLS +OPENSSL +ACL +BLKID +CURL +ELFUTILS -FIDO2 +IDN2 -IDN -IPTC +KMOD +LIBCRYPTSETUP +LIBFDISK +PCRE2 -PWQUALITY -P11KIT +QRENCODE -TPM2 +BZIP2 +LZ4 +XZ +ZLIB +ZSTD -BPF_FRAMEWORK -XKBCOMMON +UTMP -SYSVINIT default-hierarchy=unified)
May 19 22:46:44 naomi systemd[1]: systemd-networkd.service: Deactivated successfully.
May 19 22:46:44 naomi systemd[1]: initrd-switch-root.service: Deactivated successfully.
May 19 22:46:44 naomi systemd[1]: Stopped initrd-switch-root.service.
May 19 22:46:44 naomi systemd[1]: Stopped target initrd-switch-root.target.
May 19 22:46:44 naomi systemd[1]: Listening on systemd-networkd.socket.
May 19 22:46:45 naomi kernel: Console: switching to colour dummy device 80x25
May 19 22:46:45 naomi kernel: Console: switching to colour frame buffer device 200x75
RA-Kooi commented 1 year ago

Is systemd-networkd activated in the host system?
What is the output of systemctl status systemd-networkd?

floppym commented 1 year ago

Is systemd-networkd activated in the host system?

I'm not sure what you mean by that.

What is the output of systemctl status systemd-networkd?

○ systemd-networkd.service - Network Configuration
     Loaded: loaded (/usr/lib/systemd/system/systemd-networkd.service; enabled; preset: enabled)
     Active: inactive (dead) since Sun 2023-05-21 12:15:09 EDT; 32s ago
   Duration: 1.661s
TriggeredBy: ● systemd-networkd.socket
       Docs: man:systemd-networkd.service(8)
    Process: 459 ExecStart=/usr/lib/systemd/systemd-networkd (code=exited, status=0/SUCCESS)
   Main PID: 459 (code=exited, status=0/SUCCESS)
     Status: "Shutting down..."
        CPU: 28ms

May 21 12:15:08 gentoo systemd[1]: Starting systemd-networkd.service...
May 21 12:15:08 gentoo systemd-networkd[459]: lo: Link UP
May 21 12:15:08 gentoo systemd-networkd[459]: lo: Gained carrier
May 21 12:15:08 gentoo systemd-networkd[459]: Enumeration completed
May 21 12:15:08 gentoo systemd[1]: Started systemd-networkd.service.
floppym commented 1 year ago

It's also worth noting that this doesn't happen on every boot. There seems to be some kind of race condition.

If systemd-networkd.service happens to get stopped before systemd is re-execed, everything works as it should.

RA-Kooi commented 1 year ago

I'm not sure what you mean by that.

I mean if it's enabled outside of the initrd.

    Loaded: loaded (/usr/lib/systemd/system/systemd-networkd.service; enabled; preset: enabled)
    Active: inactive (dead) since Sun 2023-05-21 12:15:09 EDT; 32s ago

That is weird, it's clearly enabled yet it's not activated. Are there any services that failed (systemctl --failed)?

It's also worth noting that this doesn't happen on every boot. There seems to be some kind of race condition.

If systemd-networkd.service happens to get stopped before systemd is re-execed, everything works as it should.

I wonder if this is an issue with dracut's configuration of systemd-networkd. Seeing how only the local interface is enumerated in the initrd, I assume you don't need networking in the initrd. Try disabling dracut's networking module if possible.

floppym commented 1 year ago

That is weird, it's clearly enabled yet it's not activated. Are there any services that failed (systemctl --failed)?

No failed services.

Try disabling dracut's networking module if possible.

I have no need for networking support in the initrd. It is simply enabled by default.

Disabling the network module in dracut avoids the problem, since systemd-networkd.service never gets activated in the initrd.

RA-Kooi commented 1 year ago

Well I'd say do that for now. If you ask me, this is most likely a problem on dracut's side of things, but the actual devs may be able to comment on that better.

floppym commented 1 year ago

This is a bug report intended for the systemd developers. I'm not seeking assistance in working around the issue.

Most of the systemd-related initrd logic is shipped with systemd. My best guess is this is a systemd issue, not a dracut issue.

floppym commented 1 year ago

I dug into the initrd a bit, and I can't find anywhere that systemd-networkd.service is being activated explicitly.

I think it is actually being activated by a netlink message via systemd-networkd.socket.

RA-Kooi commented 1 year ago

I think you are looking for: https://github.com/dracutdevs/dracut/blob/master/modules.d/01systemd-networkd/module-setup.sh#L66

floppym commented 1 year ago

I think you are looking for: https://github.com/dracutdevs/dracut/blob/master/modules.d/01systemd-networkd/module-setup.sh#L66

Nope. That sets up a symlink at /etc/systemd/system/multi-user.target.wants/systemd-networkd.service. However, multi-user.target never gets activated in an initrd. The default unit is initrd.target, which does not actually pull in systemd-networkd.service.

RA-Kooi commented 1 year ago

That sets up a symlink at /etc/systemd/system/multi-user.target.wants/systemd-networkd.service.

Er... you are correct, I misremembered thinking it has WantedBy=sysinit.target. Could there be something that pulls in network.target or network-online.target?

Now I wonder if enabling it the way they do isn't erroneous to begin with.

floppym commented 1 year ago

systemd-networkd.service isn't pulled in by either network.target or network-online.target.

Anyway, I ran a boot with log_level=debug, and it seems to confirm that systemd-networkd.service is being socket-actived as I suspected. That's likely a mistake in the dracut module setup. However, I don't think it directly relates to the problem I originally reported in this issue.

RA-Kooi commented 1 year ago

Can you check your journal log for Stopped target Socket Units. before switch root? That should be taking systemd-networkd down with it.

EDIT:

systemd-networkd.service isn't pulled in by either network.target or network-online.target.

Not directly, but enabling systemd-networkd will also enable systemd-networkd-wait-online, which does get pulled in by network-online.target.

floppym commented 1 year ago

Here's a full journal from a boot that exhibits the issue. systemd-networkd.socket gets deactivated before systemd is re-execed, but systemd-networkd.service does not. The re-exec occurs on line 1163.

https://gist.github.com/floppym/951950c0aff2b95146b2c964b87372c6

yuwata commented 1 year ago
floppym commented 1 year ago

Here's a boot with systemd.log_level=debug.

https://gist.github.com/floppym/a6b9d95105e743340cb019f43511d656

floppym commented 1 year ago

I'm seeing the same behavior with v253.3.

yuwata commented 1 year ago

Thanks. From the log,

  1. in initrd, networkd.socket triggers networkd.service
    May 21 20:05:44 gentoo systemd[1]: systemd-networkd.socket: Incoming traffic
  2. on switching root, networkd.service is not stopped, as it does not have Conflicts= for switching root.
  3. however, on switching root, dbus connection is closed, and networkd is stopped (why??):
    May 21 20:05:45 gentoo systemd[1]: dbus-broker.service: Thawing unit.
    May 21 20:05:45 gentoo systemd-networkd[446]: Bus bus-api-network: changing state RUNNING → CLOSING
    May 21 20:05:45 gentoo dbus-broker[506]: Dispatched 248 messages @ 1(±2)μs / message.
    May 21 20:05:45 gentoo systemd-networkd[446]: Bus bus-api-network: changing state CLOSING → CLOSED
    May 21 20:05:46 gentoo systemd-networkd[446]: Terminate operation initiated.
    (switching root)
    May 21 20:05:46 naomi systemd[1]: systemd-networkd.service: Got notification message from PID 446 (STOPPING=1, STATUS=Shutting down...)
    May 21 20:05:46 naomi systemd[1]: systemd-networkd.service: Changed running -> stop-sigterm
  4. at the time PID1 is restarted on switching root, the old invocation of networkd.service is still running, hence PID1 does not queue start job for the service after switching root.
yuwata commented 1 year ago

BTW, why networkd is included in initrd? I'd like to support that, but at least currently, networkd does not work well in initrd, IIRC.

RA-Kooi commented 1 year ago

It's included by default with dracut. What are the issues with networkd in the initrd? I cannot reproduce his issue on my Xen VM, networkd is restarted as normal. However in my initrd, it is wanted by sysinit.target.

yuwata commented 1 year ago

If networkd.service is wanted by sysinit.target, then initrd-cleanup.service stops networkd.service before switching root, IIRC.

What are the issues with networkd in the initrd?

networkd.service removes some (not all) IP addresses, routes, and friends on stop. Hence, restarting networkd during switching root may disconnect network. So, if the rootfs is backed by a network filesystem, system may fail to boot.

RA-Kooi commented 1 year ago

networkd.service removes some (not all) IP addresses, routes, and friends on stop. Hence, restarting networkd during switching root may disconnect network. So, if the rootfs is backed by a network filesystem, system may fail to boot.

This is solved by

[Network]
KeepConfiguration=yes

If networkd.service is wanted by sysinit.target, then initrd-cleanup.service stops networkd.service before switching root, IIRC.

I presumed so. This is why I think dracut is in the wrong for simply doing systemctl enable systemd-networkd in the initrd. They should probably do an override setting WantedBy=sysinit.target or manually enable the service(s).

yuwata commented 1 year ago

networkd.service removes some (not all) IP addresses, routes, and friends on stop. Hence, restarting networkd during switching root may disconnect network. So, if the rootfs is backed by a network filesystem, system may fail to boot.

This is solved by

[Network]
KeepConfiguration=yes

Yes. That's exactly for such setup. However, currently the setting is not perfect. e.g. DHCPv6 address is not kept, IIRC.

If networkd.service is wanted by sysinit.target, then initrd-cleanup.service stops networkd.service before switching root, IIRC.

I presumed so. This is why I think dracut is in the wrong for simply doing systemctl enable systemd-networkd in the initrd. They should probably do an override setting WantedBy=sysinit.target or manually enable the service(s).

Yeah, at least that should be a good workaround.

RA-Kooi commented 1 year ago

Yes. That's exactly for such setup. However, currently the setting is not perfect. e.g. DHCPv6 address is not kept, IIRC.

Hmmm, I see. I did not run into that issue, as my network with the NFS server is mainly IPv4, but this is good to keep in mind. I did run in an issue where systemd-networkd assigns an extra IP address, but for my specific case that would be fixed by #27706. Not sure if there's a setting that stops networkd from generating an extra ipv4 from the same subnet with real hardware. Though I suppose most setups will use static assignment on the DHCP server.

floppym commented 1 year ago

BTW, why networkd is included in initrd?

dracut automagically includes it if certain system packages are installed.

dracut pulls in systemd-networkd if the "network" module is enabled, and neither connman nor NetworkManager are installed.

https://github.com/dracutdevs/dracut/blob/059/modules.d/40network/module-setup.sh#L19

The "network" module is enabled/required if any network filesystem utilities are installed (eg. nfs-utils, cifs-utils).

https://github.com/dracutdevs/dracut/blob/059/modules.d/95nfs/module-setup.sh#L21

DaanDeMeyer commented 1 year ago

This is extremely easy to reproduce in mkosi btw, it'll happen 90% of the time. Just run systemctl status systemd-networkd after boot and it'll show you that it's enabled yet it'll have exited.

DaanDeMeyer commented 1 year ago

So we send SIGTERM to systemd-networkd during switch-root, but don't actually wait for it to shutdown, which means that if networkd takes a little time to stop, we'll have already deserialized and processed the unit graph in the rootfs, but won't have queued a job for networkd in the rootfs because we think its still running, but then it shuts down and by the time we get to starting systemd-networkd-wait-online, it isn't running anymore and the dependency fails. At least that's what I think is happening here.

livelace commented 1 year ago

it seem that #27791 doesn't work for me. i have the same problem on my gentoo systems (5.14.112, 253.3).

yuwata commented 1 year ago

@livelace Please provide debugging logs.

felixonmars commented 9 months ago

it seem that #27791 doesn't work for me. i have the same problem on my gentoo systems (5.14.112, 253.3).

I think I hit this problem as well. The behavior here suggests that initrd's systemd (pid1) thinks systemd-networkd's service has been stopped already:

image

While it's not, because the same pid still outputs something after switch-root:

image

The service is then left in inactive state after switch-root:

image

I'll try to get some debug logs later.