rhkdump / kdump-utils

Kernel crash dump collection utilities
GNU General Public License v2.0

kdump over NFS fails for f41 #52

Open jbtrystram opened 2 days ago

jbtrystram commented 2 days ago

Using the following config on Fedora CoreOS 41:

          nfs 10.0.2.2:/
          path /crash
          core_collector makedumpfile -l --message-level 1 -d 31
          extra_bins /sbin/mount.nfs 
          extra_modules nfs nfsv3 nfs_layout_nfsv41_files blocklayoutdriver nfs_layout_flexfiles nfs_layout_nfsv41_files

The initramfs waits indefinitely after mounting kdumproot.mount:

[    3.103234] systemd[1]: Mounted kdumproot.mount - /kdumproot.
[  OK  ] Mounted kdumproot.mount - /kdumproot.
[    3.009698] systemd[1]: Mounted kdumproot.mount - /kdumproot.
[    3.106639] systemd[1]: Reached target remote-fs.target - Remote File Systems.

[    3.012510] systemd[1]: Reached target remote-fs.target - Remote File Systems.
[  *** ] Job dev-disk-by\x2dpath-pci\x2d0000…tart running (1min 48s / no limit)

I am using kexec-tools-2.0.29-1. I tried downgrading nfs-utils-coreos to the F40 RPM, but that did not fix the issue.

The same setup works fine in F40.

coiby commented 1 day ago

I found that downgrading systemd to systemd-stable-255.10 on F41 makes kdump work again. Note that I built systemd-stable-255.10 from source, because the latest systemd on F41 is v256.

licliu commented 1 day ago

Another thing I found is that pci-0000:04:00.0-part only shows up on F41, and it is a directory.

ls -lh /dev/disk/by-path/ |grep pci-0000:04:00.0-part$
drwxr-xr-x. 7 root root 140 Oct 25 03:18 pci-0000:04:00.0-part

On f40:

ls -lh /dev/disk/by-path/ |grep pci-0000:04:00.0-part$
echo $?
1

licliu commented 1 day ago

This symbolic link is created by /usr/lib/udev/rules.d/60-persistent-storage.rules. Those udev rules were introduced by this commit: https://github.com/systemd/systemd/commit/3af66c089b930b7191c1964f2ea30448fe9688de

For an unmounted NFS target, dracut cannot determine its fstype with findmnt -e -v -n -o 'FSTYPE' --source "$_find_dev", so it uses 0:0 as the maj:min. Unfortunately, the output of stat -L -c '%t:%T' /dev/disk/by-path/pci-0000:04:00.0-part/ is also 0:0, so dracut treats the two as the same device and writes the latter as a persistent name to dracut/hooks/initqueue/finished/devexists-\x2fdev\x2fdisk\x2fby-path\x2fpci-0000:04:00.0-part.sh.
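The 0:0 collision described above can be reproduced outside of dracut. The following is only a minimal sketch, assuming GNU stat: the helper maj_min_of, the scratch directory, and the nfs_fallback variable are hypothetical names for illustration and are not dracut's actual code. It shows that a plain directory reports 0:0 for %t:%T, which is the same value assumed here for an unresolved NFS target.

```shell
#!/bin/sh
# Hypothetical helper, not dracut's actual code: print the device
# major:minor (in hex) that stat reports for a path.
maj_min_of() {
    stat -L -c '%t:%T' "$1" 2>/dev/null
}

# A scratch directory stands in for /dev/disk/by-path/pci-0000:04:00.0-part/.
fake_by_path_dir=$(mktemp -d)

# Assumed fallback value for a target whose fstype could not be resolved.
nfs_fallback="0:0"

# Directories are not device nodes, so their device type is 0:0.
dir_majmin=$(maj_min_of "$fake_by_path_dir")
echo "directory maj:min = $dir_majmin"

# A comparison on maj:min alone cannot tell the two apart.
if [ "$dir_majmin" = "$nfs_fallback" ]; then
    collision=yes
else
    collision=no
fi
echo "collision = $collision"

rmdir "$fake_by_path_dir"
```

Any path that is not a block or character device reports 0:0 here, so matching on maj:min alone cannot distinguish "unknown device" from "not a device at all".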

I think this bug consists of three parts:

  1. NFS is used as the dump target in kdump, and it is neither mounted nor listed in fstab.
  2. dracut cannot find the fstype of the "NFS device" using findmnt, so dracut thinks it is a local device and sets its maj:min to 0:0.
  3. Because of the udev rule update, an entry with a maj:min of 0:0 happens to exist in the /dev/disk/by-path directory.

Together, these factors cause the second kernel to wait for an impossible task: mounting /dev/disk/by-path/pci-0000:04:00.0-part/
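To check whether a system is exposed to the third factor, one can list the entries under /dev/disk/by-path that do not resolve to a block device. This is a rough diagnostic sketch only. The function list_non_block_entries is a hypothetical name, and the demo runs against a scratch directory rather than the real /dev tree.

```shell
#!/bin/sh
# Hypothetical diagnostic: print the names of entries in a directory
# that exist but do not resolve to a block device.
list_non_block_entries() {
    for entry in "$1"/*; do
        [ -e "$entry" ] || continue   # skip an unmatched glob or a dangling link
        [ -b "$entry" ] || printf '%s\n' "${entry##*/}"
    done
}

# Demo against a scratch directory that mimics the F41 layout,
# where pci-0000:04:00.0-part is a directory.
demo_dir=$(mktemp -d)
mkdir "$demo_dir/pci-0000:04:00.0-part"

non_block=$(list_non_block_entries "$demo_dir")
echo "$non_block"

rm -r "$demo_dir"
```

On a real machine the call would be list_non_block_entries /dev/disk/by-path. Based on the ls output earlier in this thread, F41 would be expected to print pci-0000:04:00.0-part and F40 would not.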