Open keszybz opened 1 year ago
I think the problem in your case is that systemd-journald-audit.socket
was left enabled in the initrd but disabled in the host. I'm not sure whether this was intended (maybe the initrd was not rebuilt after reinstalling the package with the journal audit socket change?), but it had the side effect of leaving the socket running during the switch-root transition and stopping it once PID 1 was re-executed in the host.
Indeed, once PID 1 was re-executed in the host, it noticed that the socket was no longer needed (since you forgot to call systemctl preset
in your package scriptlet) and attempted to stop it, which seems like the correct thing to do. However, I'm not sure why systemd spammed the logs the way it did when it realized that the socket was still receiving data while being stopped. Maybe we should log such an event only once.
Well, yes. But it should be OK to have a socket enabled in the initrd and disabled in the host. This certainly should not end with an infinite loop and the socket never being closed.
Sure, I was trying to describe what was happening in your case. But I didn't mean that the infinite loop was something expected.
I quickly tried to reproduce the problem but couldn't. The results are a bit different depending on whether debug logs are enabled or not, but I couldn't trigger the infinite loop in either case.
Unfortunately I won't be able to look at this in detail before next week.
So we have logic to disable a listening socket when it has a stop job scheduled, to prevent incoming data from triggering unnecessary events. This is done by flushing all incoming connections and draining all data accumulated in the socket buffers.
However the "flushing all incoming connections" logic doesn't apply to netlink sockets, see: https://github.com/systemd/systemd/blob/v253-rc2/src/basic/socket-util.c#L1117
And because the socket buffer is also drained, there is always room in the buffer to accept more incoming data, which is probably why you're seeing the infinite loop.
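To make the draining behavior concrete, here is a minimal Python sketch (the real logic lives in C in src/basic/socket-util.c; a Unix datagram socketpair stands in for the audit netlink socket). It shows why draining frees up buffer space: immediately after the drain, the peer can enqueue new datagrams again, so a chatty sender can keep retriggering the I/O event source.

```python
import socket

def drain_datagram_socket(sock):
    """Read and discard all pending datagrams without blocking.

    Returns the number of datagrams drained. Mirrors the idea of
    emptying the socket receive buffer during socket shutdown.
    """
    drained = 0
    sock.setblocking(False)
    while True:
        try:
            sock.recv(65536)
        except BlockingIOError:
            # Buffer is empty; new sends will now succeed again,
            # which is exactly how the loop can restart.
            return drained
        drained += 1

# Usage: three pending "audit records", then a drain.
a, b = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
for _ in range(3):
    a.send(b"audit record")
print(drain_datagram_socket(b))  # -> 3
```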
At that point, I'm not really sure whether it's worth improving the logic, because it only flushes the incoming connections and data accumulated at a given point in time; it doesn't prevent new ones from triggering new events. So in theory an application that keeps connecting to the listening socket could trigger the same infinite loop, I think.
Maybe instead we should give the event source dealing with the run queue a higher priority than the socket I/O event sources have, giving PID 1 a chance to proceed with the stop job.
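In sd-event terms this would presumably be done with sd_event_source_set_priority(), where a numerically lower priority value is dispatched first. A toy Python model of that dispatch ordering (the names and priority values are illustrative, not taken from the systemd source):

```python
# Hypothetical pending event sources; lower "priority" value means
# dispatched earlier, mirroring sd-event's ordering convention.
pending = [
    {"name": "socket-io", "priority": 0},    # normal priority
    {"name": "run-queue", "priority": -10},  # bumped so stop jobs make progress
]

dispatch_order = [s["name"] for s in sorted(pending, key=lambda s: s["priority"])]
print(dispatch_order)  # -> ['run-queue', 'socket-io']
```

With the run-queue source ahead of socket I/O, the scheduled stop job would get a chance to run even while the socket keeps becoming readable.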
We discussed this during the video meeting. @poettering's idea: set ratelimit on the socket source.
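sd-event does expose a rate-limiting knob, sd_event_source_set_ratelimit(source, interval_usec, burst), which allows at most `burst` dispatches per `interval`. As a rough sketch of those windowed-burst semantics (a simplified Python model, not the actual sd-event implementation):

```python
class RateLimit:
    """Simplified windowed-burst limiter: at most `burst` dispatches per
    `interval` seconds, loosely modeling sd_event_source_set_ratelimit()."""

    def __init__(self, interval, burst):
        self.interval = interval
        self.burst = burst
        self.window_start = None
        self.count = 0

    def allow(self, now):
        # Start a fresh window if the current one has expired.
        if self.window_start is None or now >= self.window_start + self.interval:
            self.window_start = now
            self.count = 0
        if self.count >= self.burst:
            return False  # source would be temporarily muted
        self.count += 1
        return True

rl = RateLimit(interval=1.0, burst=5)
allowed = sum(rl.allow(now=0.0) for _ in range(10))
print(allowed)  # -> 5 (the other 5 dispatches are suppressed this window)
```

Applied to the audit socket's event source, this would cap how often the readable socket can fire per interval, breaking the busy loop while the stop job is pending.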
Yes, I guess that would work too.
@fbuihuu any update on this?
Not really sorry. And I have no free time to spend on this one currently.
Ok, will move to the next milestone then
https://github.com/systemd/systemd/pull/25687 removed static enablement of systemd-journald-audit.socket. It is now managed by normal enable/disable symlinks and the presets logic.
This also missed that journald uses Sockets=audit.socket, so the audit socket is always implicitly enabled anyway.
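For context, a simplified fragment of what such a service unit declaration looks like (not the verbatim systemd-journald.service contents): listing a socket unit in Sockets= ties it to the service's activation, independently of whether the socket is enabled via symlinks.

```ini
[Service]
Sockets=systemd-journald-audit.socket
```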
systemd version the issue has been seen with
253-rc1
Used distribution
Fedora rawhide
Linux kernel version used
No response
CPU architectures issue was seen on
None
Component
systemd
Expected behaviour you didn't see
https://github.com/systemd/systemd/pull/25687 removed static enablement of
systemd-journald-audit.socket
. It is now managed by normal enable/disable symlinks and the presets logic. When building the package in Fedora, I forgot to add a scriptlet that'd call systemctl preset
and re-enable the socket. That is fixed now, and this bug is about what happens when the socket is (intentionally or not) disabled. From https://bugzilla.redhat.com/attachment.cgi?id=1940497:
Unexpected behaviour you saw
No response
Steps to reproduce the problem
No response
Additional program output to the terminal or log subsystem illustrating the issue
No response