systemd / systemd

The systemd System and Service Manager
https://systemd.io
GNU General Public License v2.0

systemd-shutdown hangs all the time (if DM devices with no targets exist) #34283

Open huyubiao opened 1 week ago

huyubiao commented 1 week ago

systemd version the issue has been seen with

255

Used distribution

Fedora 39

Linux kernel version used

kernel-6.6.0-136.x86_64

CPU architectures issue was seen on

None

Component

systemd

Expected behaviour you didn't see

The computer is shut down or restarted.

Unexpected behaviour you saw

systemd-shutdown hangs all the time

Steps to reproduce the problem

During shutdown, systemd-shutdown detaches all DM devices. For each device it first calls fsync_path_at() and then issues the DM_DEV_REMOVE ioctl() to delete it. However, the fsync on the DM device hangs:

static int delete_dm(DeviceMapper *m) {
...
        r = fsync_path_at(AT_FDCWD, m->path);
        if (r < 0)
                log_debug_errno(r, "Failed to sync DM block device %s, ignoring: %m", m->path);

        return RET_NERRNO(ioctl(fd, DM_DEV_REMOVE, &(struct dm_ioctl) {
...
}

According to the DM device information parsed from the vmcore, the DM device was created successfully, but its mapping table was never loaded. The I/O issued by the fsync is therefore deferred. However, by the time systemd-shutdown runs, all other processes have been stopped, so the mapping table can no longer be loaded. As a result, the I/O stays suspended and systemd-shutdown hangs as well.

Apart from /dev/watchdog (neither /dev/watchdog nor /dev/watchdog0 exists on my machine), are there any other ways to prevent the system from getting stuck? Or could systemd check whether the DM mapping table exists before calling fsync()?
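
To make the second question concrete, here is a rough sketch of what such a check might look like. It asks the kernel for the device status via the DM_DEV_STATUS ioctl and looks at DM_ACTIVE_PRESENT_FLAG and target_count. The helper names dm_request_init() and dm_has_live_table() are hypothetical and not part of systemd. The sketch assumes the device is addressed by its DM name and that /dev/mapper/control is the control node.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/dm-ioctl.h>

/* Hypothetical helper: fill in a dm_ioctl header that addresses a device by name. */
static int dm_request_init(struct dm_ioctl *io, const char *name) {
        if (strlen(name) >= sizeof(io->name))
                return -ENAMETOOLONG;

        memset(io, 0, sizeof(*io));
        io->version[0] = DM_VERSION_MAJOR;
        io->data_size = sizeof(*io);
        io->data_start = sizeof(*io);
        strcpy(io->name, name);
        return 0;
}

/* Hypothetical helper: returns 1 if the device has a live table with at least
 * one target, 0 if it has none, and a negative errno value on failure. */
static int dm_has_live_table(const char *name) {
        struct dm_ioctl io;
        int fd, r;

        r = dm_request_init(&io, name);
        if (r < 0)
                return r;

        fd = open("/dev/mapper/control", O_RDWR);
        if (fd < 0)
                return -errno;

        /* Save the ioctl result before close() can overwrite errno. */
        r = ioctl(fd, DM_DEV_STATUS, &io) < 0 ? -errno : 0;
        close(fd);
        if (r < 0)
                return r;

        return (io.flags & DM_ACTIVE_PRESENT_FLAG) && io.target_count > 0;
}
```

A caller could skip the fsync when dm_has_live_table() returns 0. Such a check would only cover this one cause of a blocked fsync.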

Additional program output to the terminal or log subsystem illustrating the issue

No response

poettering commented 1 week ago

I prepped a fix in #34330 that should address the issue here. It's entirely generic and puts a global timeout on any kind of fsync() we issue from the shutdown binary. It does not try to be smart in any way, but that's a good thing, and should mean we won't hang unbounded here.

As I understand it, this is triggered by a local mishandling of the device. I think it's hence fine if we handle this via a timeout here. We don't have to try to be "clean" in shutdown.c because the triggering condition isn't clean either.
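
For readers who want to picture what bounding a blocking fsync() can look like, here is a minimal sketch. It is not taken from #34330 and may differ from it entirely. It assumes one possible approach: run the fsync() in a forked child and have the parent stop waiting after a deadline. The helper name fsync_path_with_timeout() and the 10 ms polling step are made up for this illustration.

```c
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: fsync() a path in a child process and give up after
 * timeout_ms. Returns 0 on success, -ETIMEDOUT if the deadline passed, and
 * another negative errno value on failure. */
static int fsync_path_with_timeout(const char *path, int timeout_ms) {
        pid_t pid = fork();
        if (pid < 0)
                return -errno;

        if (pid == 0) {
                /* Child: the potentially blocking call happens here. */
                int fd = open(path, O_RDONLY);
                if (fd < 0)
                        _exit(1);
                _exit(fsync(fd) < 0 ? 1 : 0);
        }

        /* Parent: poll for the child in 10 ms steps until the deadline. */
        for (int waited = 0; ; waited += 10) {
                int status;
                pid_t r = waitpid(pid, &status, WNOHANG);
                if (r < 0)
                        return -errno;
                if (r == pid)
                        return WIFEXITED(status) && WEXITSTATUS(status) == 0 ? 0 : -EIO;
                if (waited >= timeout_ms)
                        break;
                nanosleep(&(struct timespec) { .tv_nsec = 10 * 1000 * 1000 }, NULL);
        }

        /* Deadline passed. A child stuck in uninterruptible I/O may not react
         * to the signal, so we deliberately do not wait for it again. */
        (void) kill(pid, SIGKILL);
        return -ETIMEDOUT;
}
```

With this shape, a hung child is simply left behind, which matches the point above that shutdown does not need to be "clean" in this situation.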

poettering commented 1 week ago

Anyway, would be great if you could check if #34330 makes the issue go away for you?