openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

ZOL 0.8 Not Loading Modules or ZPools on Boot #8885

Open f1yankees opened 5 years ago

f1yankees commented 5 years ago

System information

Distribution Name: Fedora
Distribution Version: 30
Linux Kernel: 5.0.17-300
Architecture: x86_64
ZFS Version: 0.8
SPL Version: 0.8

Describe the problem you're observing

After updating to 0.8, the zfs modules fail to load at boot. Additionally, after manually loading the modules, no pools are discovered. A manual zpool import -d /path/to/disks is required every time to get any zpool imported. An /etc/zfs/zpool.cache file is created, but it doesn't seem to have any effect.

I have replicated this on VMs running Fedora, CentOS (zfs-testing), and OpenSUSE, and the result is the same.

In each case the problem was fixed by rolling back to 0.7.13.

Describe how to reproduce the problem

1) Install ZFS
2) Create a dummy zpool on a VM
3) Reboot
4) No pools found
5) modprobe zfs
6) zpool import "tank"
7) No pools found
8) Import using zpool import tank -d /full/path/

Include any warning/errors/backtraces from the system logs

johnnyjacq16 commented 5 years ago

Ensure that zfs-dracut is installed from the zfs repos: dnf install zfs-dracut on Fedora, yum install zfs-dracut on CentOS.
Re-create the cache file with zpool set cachefile=/etc/zfs/zpool.cache <name of your pool>.
Ensure the zfs-import-cache service is enabled: check with systemctl status zfs-import-cache and, if needed, enable it with systemctl enable zfs-import-cache.
Re-generate your initramfs with dracut -fv --kver 5.1.7-300.fc30.x86_64 on Fedora, or dracut -fv --kver <your kernel version> on CentOS.
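
For convenience, here are those steps as one shell session on Fedora; the pool name tank is a placeholder, and uname -r stands in for the explicit kernel version used above:

# Install the dracut integration from the zfs repo (use yum on CentOS)
sudo dnf install zfs-dracut

# Re-create the pool cache file (replace tank with your pool name)
sudo zpool set cachefile=/etc/zfs/zpool.cache tank

# Make sure the cache-based import runs at boot
sudo systemctl enable zfs-import-cache

# Rebuild the initramfs for the running kernel
sudo dracut -fv --kver "$(uname -r)"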

f1yankees commented 5 years ago

Thanks for the info. That was a great lead. It appears there is some issue with zfs-import-cache. This is the error output I get:

[me@localhost ~]$ systemctl status zfs-import-cache
● zfs-import-cache.service - Import ZFS pools by cache file
   Loaded: loaded (/usr/lib/systemd/system/zfs-import-cache.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Tue 2019-06-11 19:27:58 CDT; 2min 15s ago
     Docs: man:zpool(8)
  Process: 808 ExecStart=/sbin/zpool import -c /etc/zfs/zpool.cache -aN (code=exited, status=1/FAILURE)
 Main PID: 808 (code=exited, status=1/FAILURE)

Jun 11 19:27:58 localhost.localdomain systemd[1]: Starting Import ZFS pools by cache file...
Jun 11 19:27:58 localhost.localdomain zpool[808]: The ZFS modules are not loaded.
Jun 11 19:27:58 localhost.localdomain zpool[808]: Try running '/sbin/modprobe zfs' as root to load them.
Jun 11 19:27:58 localhost.localdomain systemd[1]: zfs-import-cache.service: Main process exited, code=exited, status=1/FAILURE
Jun 11 19:27:58 localhost.localdomain systemd[1]: zfs-import-cache.service: Failed with result 'exit-code'.
Jun 11 19:27:58 localhost.localdomain systemd[1]: Failed to start Import ZFS pools by cache file.
johnnyjacq16 commented 5 years ago

In your log it shows: Jun 11 19:27:58 localhost.localdomain zpool[808]: The ZFS modules are not loaded. The ZFS modules are not loaded; please run /sbin/modprobe zfs as root to load them. If the zfs module loads without issues, run systemctl restart zfs-import-cache, then systemctl status zfs-import-cache to check that the service is now running, and then continue with the above instructions. If there are issues loading the modules, please ensure that zfs is installed properly. If you installed zfs from the zfs repo, make sure that zfs-dkms is installed with rpm -qa | grep zfs-dkms; if it is not installed, run dnf install zfs-dkms for Fedora or yum install zfs-dkms for CentOS. Also ensure that the modules have been built for your kernel by dkms, with the command dkms status.
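
Condensed, the check looks something like this (assuming the zfs-repo packaging described above):

# Load the module by hand and retry the cache import
sudo modprobe zfs
sudo systemctl restart zfs-import-cache
systemctl status zfs-import-cache

# Verify the DKMS package is present and the module was built for the running kernel
rpm -qa | grep zfs-dkms
dkms status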

f1yankees commented 5 years ago

Thanks for your reply.

I was able to load the zfs modules without issue with modprobe zfs. After that, I restarted zfs-import-cache and it looks fine:

[me@localhost ~]$ systemctl status zfs-import-cache
● zfs-import-cache.service - Import ZFS pools by cache file
   Loaded: loaded (/usr/lib/systemd/system/zfs-import-cache.service; enabled; vendor preset: enabled)
   Active: active (exited) since Tue 2019-06-11 20:44:02 CDT; 1min 32s ago
     Docs: man:zpool(8)
  Process: 3833 ExecStart=/sbin/zpool import -c /etc/zfs/zpool.cache -aN (code=exited, status=0/SUCCESS)
 Main PID: 3833 (code=exited, status=0/SUCCESS)

Jun 11 20:44:02 localhost.localdomain systemd[1]: Starting Import ZFS pools by cache file...
Jun 11 20:44:02 localhost.localdomain zpool[3833]: no pools available to import
Jun 11 20:44:02 localhost.localdomain systemd[1]: Started Import ZFS pools by cache file.

And the dracut output:

[me@localhost ~]$ sudo dracut -fv --kver 5.0.17-300.fc30.x86_64 
dracut: Executing: /usr/bin/dracut -fv --kver 5.0.17-300.fc30.x86_64
dracut: zfsexpandknowledge: host device /dev/dm-0
dracut: zfsexpandknowledge: host device /dev/sda1
dracut: zfsexpandknowledge: host device /dev/dm-1
dracut: zfsexpandknowledge: device /dev/dm-0 of type ext4
dracut: zfsexpandknowledge: device /dev/dm-1 of type swap
dracut: zfsexpandknowledge: device /dev/sda2 of type LVM2_member
dracut: zfsexpandknowledge: device /dev/sda1 of type ext4
dracut: dracut module 'busybox' will not be installed, because command 'busybox' could not be found!
dracut: dracut module 'stratis' will not be installed, because command 'stratisd-init' could not be found!
dracut: dracut module 'biosdevname' will not be installed, because command 'biosdevname' could not be found!
dracut: zfsexpandknowledge: host device /dev/dm-0
dracut: zfsexpandknowledge: host device /dev/sda1
dracut: zfsexpandknowledge: host device /dev/dm-1
dracut: zfsexpandknowledge: device /dev/dm-0 of type ext4
dracut: zfsexpandknowledge: device /dev/dm-1 of type swap
dracut: zfsexpandknowledge: device /dev/sda2 of type LVM2_member
dracut: zfsexpandknowledge: device /dev/sda1 of type ext4
dracut: dracut module 'busybox' will not be installed, because command 'busybox' could not be found!
dracut: dracut module 'stratis' will not be installed, because command 'stratisd-init' could not be found!
dracut: *** Including module: bash ***
dracut: *** Including module: systemd ***
dracut: *** Including module: systemd-initrd ***
dracut: *** Including module: nss-softokn ***
dracut: *** Including module: i18n ***
dracut: *** Including module: network-legacy ***
dracut: *** Including module: network ***
dracut: *** Including module: ifcfg ***
dracut: *** Including module: drm ***
dracut: *** Including module: plymouth ***
dracut: *** Including module: dm ***
dracut: Skipping udev rule: 64-device-mapper.rules
dracut: Skipping udev rule: 60-persistent-storage-dm.rules
dracut: Skipping udev rule: 55-dm.rules
dracut: *** Including module: kernel-modules ***
dracut: *** Including module: kernel-modules-extra ***
dracut: *** Including module: kernel-network-modules ***
dracut: *** Including module: lvm ***
dracut: Skipping udev rule: 64-device-mapper.rules
dracut: Skipping udev rule: 56-lvm.rules
dracut: Skipping udev rule: 60-persistent-storage-lvm.rules
dracut: *** Including module: zfs ***
dracut: *** Including module: resume ***
dracut: *** Including module: rootfs-block ***
dracut: *** Including module: terminfo ***
dracut: *** Including module: udev-rules ***
dracut: Skipping udev rule: 40-redhat.rules
dracut: Skipping udev rule: 50-firmware.rules
dracut: Skipping udev rule: 50-udev.rules
dracut: Skipping udev rule: 91-permissions.rules
dracut: Skipping udev rule: 80-drivers-modprobe.rules
dracut: *** Including module: dracut-systemd ***
dracut: *** Including module: usrmount ***
dracut: *** Including module: base ***
dracut: *** Including module: fs-lib ***
dracut: *** Including module: shutdown ***
dracut: *** Including modules done ***
dracut: *** Installing kernel module dependencies ***
dracut: *** Installing kernel module dependencies done ***
dracut: *** Resolving executable dependencies ***
dracut: *** Resolving executable dependencies done ***
dracut: *** Hardlinking files ***
dracut: *** Hardlinking files done ***
dracut: *** Stripping files ***
dracut: *** Stripping files done ***
dracut: *** Generating early-microcode cpio image ***
dracut: *** Constructing GenuineIntel.bin ***
dracut: *** Store current command line parameters ***
dracut: *** Creating image file '/boot/initramfs-5.0.17-300.fc30.x86_64.img' ***
dracut: *** Creating initramfs image file '/boot/initramfs-5.0.17-300.fc30.x86_64.img' done ***

However, the steps did not resolve the issue. After a reboot, the zfs modules are still not loaded, and the zpool can only be imported with -d /path/to/disk.

johnnyjacq16 commented 5 years ago

Check your current grub configuration with cat /proc/cmdline and cat /etc/default/grub

sayap commented 5 years ago

create a dummy zpool on a VM

For this step, are you using files or actual block devices? I believe the kernel module auto-loading from #7287 only works for actual block devices.

f1yankees commented 5 years ago

I am using files, not physical drives. I was just testing out 0.8 on a VM before I push it out to my production servers. Is this only an issue for files, and not drives?

Here is my output:

[me@localhost ~]$ cat /proc/cmdline
BOOT_IMAGE=(hd0,msdos1)/vmlinuz-5.0.17-300.fc30.x86_64 root=/dev/mapper/fedora-root ro resume=/dev/mapper/fedora-swap rd.lvm.lv=fedora/root rd.lvm.lv=fedora/swap rhgb quiet

[me@localhost ~]$ cat /etc/default/grub
GRUB_TIMEOUT=5
GRUB_DISTRIBUTOR="$(sed 's, release .*$,,g' /etc/system-release)"
GRUB_DEFAULT=saved
GRUB_DISABLE_SUBMENU=true
GRUB_TERMINAL_OUTPUT="console"
GRUB_CMDLINE_LINUX="resume=/dev/mapper/fedora-swap rd.lvm.lv=fedora/root rd.lvm.lv=fedora/swap rhgb quiet"
GRUB_DISABLE_RECOVERY="true"
GRUB_ENABLE_BLSCFG=true

Thanks for your replies.

jameslikeslinux commented 5 years ago

I have seen this problem on a couple of my systems and I tracked it down to a udev race condition. Namely:

  1. zfs-import-cache depends on systemd-udev-settle
  2. systemd-udev-settle depends on systemd-udev-trigger
  3. systemd-udev-trigger causes udev to scan and process all of the devices at boot
  4. udev is responsible for loading the zfs kernel module when it finds a zfs device (/lib/udev/rules.d/90-zfs.rules)

What I observed was that systemd-udev-settle was finishing before all of the devices had been processed, and as a result the zpool import ran before the devices were ready, leading to a failed import:

May 31 17:10:59 eagle systemd[1]: systemd 241 running in system mode. (+PAM -AUDIT -SELINUX +IMA -APPARMOR +SMACK -SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT -GNUTLS +ACL -XZ +LZ4 +SECCOMP +BLKID -ELFUTILS +KMOD -IDN2 -IDN +PCRE2 default-hierarchy=hybrid)
...
May 31 17:10:59 eagle systemd[1]: Started udev Wait for Complete Device Initialization.
May 31 17:10:59 eagle systemd[1]: Starting Import ZFS pools by cache file...
May 31 17:10:59 eagle systemd[1]: Condition check resulted in Import ZFS pools by device scanning being skipped.
May 31 17:10:59 eagle kernel:  nvme0n1: p1 p2 p3
May 31 17:10:59 eagle kernel: usb 3-8: new high-speed USB device number 2 using xhci_hcd
May 31 17:11:00 eagle kernel: usb 3-8: New USB device found, idVendor=0451, idProduct=8142, bcdDevice= 1.00
May 31 17:11:00 eagle kernel: usb 3-8: New USB device strings: Mfr=0, Product=0, SerialNumber=1
May 31 17:11:00 eagle kernel: usb 3-8: SerialNumber: 13000869458A
May 31 17:11:00 eagle kernel: hub 3-8:1.0: USB hub found
May 31 17:11:00 eagle kernel: hub 3-8:1.0: 4 ports detected
May 31 17:11:00 eagle kernel: usb 3-11: new high-speed USB device number 3 using xhci_hcd
May 31 17:11:00 eagle kernel: usb 3-11: New USB device found, idVendor=0835, idProduct=2800, bcdDevice=41.43
May 31 17:11:00 eagle kernel: usb 3-11: New USB device strings: Mfr=1, Product=2, SerialNumber=0
May 31 17:11:00 eagle kernel: usb 3-11: Product: USB2.0 Hub
May 31 17:11:00 eagle kernel: usb 3-11: Manufacturer: GenesysLogic
May 31 17:11:00 eagle kernel: hub 3-11:1.0: USB hub found
May 31 17:11:00 eagle kernel: hub 3-11:1.0: 4 ports detected
May 31 17:11:00 eagle kernel: usb 3-8.1: new high-speed USB device number 4 using xhci_hcd
May 31 17:11:00 eagle kernel: usb 3-8.1: New USB device found, idVendor=0451, idProduct=8142, bcdDevice= 1.00
May 31 17:11:00 eagle kernel: usb 3-8.1: New USB device strings: Mfr=0, Product=0, SerialNumber=1
May 31 17:11:00 eagle kernel: usb 3-8.1: SerialNumber: 24010869458A
May 31 17:11:00 eagle kernel: hub 3-8.1:1.0: USB hub found
May 31 17:11:00 eagle kernel: hub 3-8.1:1.0: 4 ports detected
May 31 17:11:00 eagle kernel: usb 3-13: new full-speed USB device number 5 using xhci_hcd
May 31 17:11:00 eagle systemd-modules-load[2084]: Inserted module 'zfs'
May 31 17:11:00 eagle kernel: ZFS: Loaded module v0.8.0-r0-gentoo, ZFS pool version 5000, ZFS filesystem version 5
May 31 17:11:00 eagle kernel: systemd-modules (2084) used greatest stack depth: 13608 bytes left
May 31 17:11:00 eagle systemd[1]: Started Load Kernel Modules.
May 31 17:11:00 eagle systemd[1]: Starting Apply Kernel Variables...
May 31 17:11:00 eagle zpool[3742]: cannot import 'eagle': no such pool or dataset
May 31 17:11:00 eagle kernel: random: fast init done
May 31 17:11:00 eagle kernel: zpool (3742) used greatest stack depth: 13400 bytes left
May 31 17:11:00 eagle zpool[3742]:         Destroy and re-create the pool from
May 31 17:11:00 eagle zpool[3742]:         a backup source.
May 31 17:11:00 eagle systemd[1]: zfs-import-cache.service: Main process exited, code=exited, status=1/FAILURE
May 31 17:11:00 eagle systemd[1]: zfs-import-cache.service: Failed with result 'exit-code'.
May 31 17:11:00 eagle systemd[1]: Failed to start Import ZFS pools by cache file.

I worked around the problem by adding a third ExecStart to /lib/systemd/system/systemd-udev-trigger.service:

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/bin/udevadm trigger --type=subsystems --action=add
ExecStart=/bin/udevadm trigger --type=devices --action=add
ExecStart=/bin/udevadm trigger --type=devices --action=change

Then I rebuilt the dracut image. The "trigger changes" action seems to cause udev to fully wait for all devices to be processed before systemd-udev-settle is allowed to exit. I got this workaround from a related issue: https://github.com/systemd/systemd/issues/11255#issuecomment-449681593.

I believe this udev race condition has always existed in systemd, because I haven't been able to find any recent udev changes that would explain the problem. I think the timing changes resulting from https://github.com/zfsonlinux/zfs/commit/a6cc97566c54255c98612c801f2c9d9be2d0e8f8 are what uncovered the problem.
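
A variant that avoids editing the packaged unit directly is to put the extra line in a systemd drop-in (a sketch, assuming systemd's standard override mechanism; if the change also needs to take effect inside the initramfs, the dracut image still has to be rebuilt with the drop-in included):

# /etc/systemd/system/systemd-udev-trigger.service.d/override.conf
# create with: sudo systemctl edit systemd-udev-trigger
[Service]
ExecStart=/bin/udevadm trigger --type=devices --action=change

Afterwards, run sudo systemctl daemon-reload so systemd picks up the drop-in. For Type=oneshot units, an ExecStart= line added in a drop-in is appended to the ones already defined in the packaged unit.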

sayap commented 5 years ago

I am using files, not physical drives. I was just testing out 0.8 on VM before I push it out to my production servers. Is this only an issue for files, and not drives?

Yes, I think using files only is a use case that wasn't covered by #7287. Try attaching another disk to the VM and then creating a pool on it. Make sure the zfs-import-cache and zfs-mount services are enabled. Then, after a reboot, everything should work (except when you hit the race condition outlined by @iamjamestl, which I haven't encountered yet in my limited testing).
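
For example, on the VM that might look like this (the device name /dev/vdb and the pool name tank are placeholders for whatever disk you attach and however you name the pool):

# Create a test pool on the newly attached block device
sudo zpool create tank /dev/vdb

# Record it in the cache file and enable the boot-time import and mount services
sudo zpool set cachefile=/etc/zfs/zpool.cache tank
sudo systemctl enable zfs-import-cache zfs-mount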

f1yankees commented 5 years ago

I backed everything up on one of my systems, upgraded to 0.8.1, and took the plunge. You guys were right: it works fine on physical disks, but it does not work on files. Thanks for identifying it.

ianc1215 commented 5 years ago

I have noticed what I think is the same issue when I installed a new ZFS-based root on CentOS 7.6. When the machine would boot it would immediately flip out and drop me into a dracut emergency shell. I kept seeing the same error: The ZFS modules are not loaded please run /sbin/modprobe zfs as root to load them. When I tried to load the modules, the system said they weren't there. I took a peek inside my initrd for the kernel and I think I found the issue: zfs.ko is not being added to the initrd. I captured a log file showing this taking place during the construction of the initrd.

dracut.log

I caught a line of the dracut output as it was building saying "failed to install module zfs". This was even after I added zfs explicitly to a config file in /etc/dracut.conf.d/zfs.conf.

zfs.conf.txt

I am running ZFS 0.8.1 out of the zfs-testing repo since I wanted access to the mountpoint generator. Any idea what is going on? Has anyone got an idea of a workaround if it is in fact a bug?

ianc1215 commented 5 years ago

I can confirm that I am seeing the same bug on CentOS 7.6 with ZFS 0.8.1 from the zfs-testing repo.

dewi-ny-je commented 5 years ago

I had the same issue on Ubuntu.

I had to manually enable all the services related to ZFS to get the pools imported on boot, as suggested in https://unix.stackexchange.com/a/338522/118068

sudo systemctl enable zfs-import-cache.service
sudo systemctl enable zfs-import-scan.service
sudo systemctl enable zfs-mount.service
sudo systemctl enable zfs-share.service
sudo systemctl enable zfs-zed.service
sudo systemctl enable zfs.target

The installation of the packages should already enable these services, since 0.8.0 is no longer in beta...
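
To check whether the units actually ended up enabled after installation, a quick query (standard systemctl usage, with the unit names listed above):

# Prints enabled/disabled for each unit
systemctl is-enabled zfs-import-cache.service zfs-mount.service zfs-share.service zfs-zed.service zfs.target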

cf-sewe commented 5 years ago

The issue still exists with ZFS 0.8.1 on a CentOS 7.6 virtual machine. The ZFS module gets loaded too late, so mounting the ZFS root pool fails and boot ends in a dracut emergency shell.

Has anybody found a workaround?

dreamcat4 commented 5 years ago

Well, I'm on ubuntu 19.04 / zfs 0.8.1 over here and kind of stumped by this. I can manually run the following command:

sudo systemctl start zfs-import-cache

And it will import my zpool when I start that systemd service manually after boot. However, the service is enabled, so it is supposed to come up automatically after the next reboot. But it does not!

Things I have tried so far:

# add zfs modules to initramfs
# grep -q -E "^zfs$" /etc/modules || echo "zfs" | sudo tee -a /etc/modules > /dev/null
grep -q -E "^zfs$" /etc/initramfs-tools/modules || echo "zfs" | sudo tee -a /etc/initramfs-tools/modules > /dev/null
sudo update-initramfs -c -k all
sudo update-grub

# enable remounting after boot
# sudo systemctl enable zfs-import.target
# sudo systemctl enable zfs-mount.target
sudo systemctl preset zfs-import-cache zfs-import-scan zfs-mount zfs-share zfs-zed zfs.target

# this needs to be manually enabled
sudo systemctl enable zfs-import-cache
sudo systemctl start zfs-import-cache

sudo systemctl enable zfs-import-cache.service
sudo systemctl enable zfs-import-scan.service
sudo systemctl enable zfs-mount.service
sudo systemctl enable zfs-share.service
sudo systemctl enable zfs-zed.service
sudo systemctl enable zfs.target

# make sure it's enabled
grep -q "ZFS_MOUNT" /etc/default/zfs 2> /dev/null || echo "
ZFS_MOUNT=yes
ZFS_UNMOUNT=yes
" | sudo tee -a /etc/default/zfs

I also appended one extra line at the end of this file:

$ cat /lib/systemd/system/systemd-udev-trigger.service
#  SPDX-License-Identifier: LGPL-2.1+
#
#  This file is part of systemd.
#
#  systemd is free software; you can redistribute it and/or modify it
#  under the terms of the GNU Lesser General Public License as published by
#  the Free Software Foundation; either version 2.1 of the License, or
#  (at your option) any later version.

[Unit]
Description=udev Coldplug all Devices
Documentation=man:udev(7) man:systemd-udevd.service(8)
DefaultDependencies=no
Wants=systemd-udevd.service
After=systemd-udevd-kernel.socket systemd-udevd-control.socket
Before=sysinit.target
ConditionPathIsReadWrite=/sys

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/bin/udevadm trigger --type=subsystems --action=add
ExecStart=/bin/udevadm trigger --type=devices --action=add
ExecStart=/bin/udevadm trigger --type=devices --action=change

So what else haven't I tried yet, at this point?

Also: isn't there some command in systemd that lets us see the sequence of events that occurred during the entire boot sequence? I think this can be output by some simple command line; I just forget what it is. It shows all the unit file dependencies, etc.
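
The commands being half-remembered here are most likely systemd-analyze and journalctl (an assumption, but both are standard systemd tooling):

# Time spent in each unit during the last boot
systemd-analyze blame

# The chain of units a given unit waited on
systemd-analyze critical-chain zfs-import-cache.service

# This boot's journal entries for the import service
journalctl -b -u zfs-import-cache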

id:~] $ sudo systemctl status zfs-import-cache
● zfs-import-cache.service - Import ZFS pools by cache file
   Loaded: loaded (/usr/lib/systemd/system/zfs-import-cache.service; enabled; vendor preset: enabled)
   Active: active (exited) since Tue 2019-08-20 16:32:15 BST; 4min 4s ago
     Docs: man:zpool(8)
  Process: 7156 ExecStart=/sbin/zpool import -c /etc/zfs/zpool.cache -aN (code=exited, status=0/SUCCESS)
 Main PID: 7156 (code=exited, status=0/SUCCESS)

Aug 20 16:32:14 apex systemd[1]: Starting Import ZFS pools by cache file...
Aug 20 16:32:15 apex systemd[1]: Started Import ZFS pools by cache file.
[id:~] $ sudo journalctl -u zfs-import-cache
-- Logs begin at Mon 2019-07-22 15:24:00 BST, end at Tue 2019-08-20 16:36:20 BST. --
Aug 20 13:09:28 apex systemd[1]: Starting Import ZFS pools by cache file...
Aug 20 13:09:29 apex systemd[1]: Started Import ZFS pools by cache file.

<above is the manual starting of the service on the command line by typing 'systemctl start...'>

... in between these two timestamps in the log (above ^^ and below vv), I enabled the service and rebooted the computer multiple times.

-- Reboot --

<below is the most recent manual starting of the service on the command line by typing 'systemctl start...'>

Aug 20 16:32:14 apex systemd[1]: Starting Import ZFS pools by cache file...
Aug 20 16:32:15 apex systemd[1]: Started Import ZFS pools by cache file.
[id:~] $ 
dreamcat4 commented 5 years ago

this one was not enabled!

sudo systemctl enable zfs-import.target

Now it works. So we can at least say that one is essential.

However (as you can see in my previous comment above ^^), there were also a lot of other things I enabled without clearly knowing which mechanism(s) were actually necessary. So it's still a rather confusing picture, especially if you need to track down some other bug and do not know what the likely culprits are or where to look.

dewi-ny-je commented 5 years ago

Last time I compiled 0.8.0rc? I also found those services disabled and I had to enable them manually to get the pools on reboot.

I wonder why they were not enabled automatically; they are supposed to be (check the sources for the systemd-related folders).

dreamcat4 commented 5 years ago

Yeah, it is strange, because if you type systemctl status <service-name> or do a systemctl preset <service-name>, it shows that the OS default should be enabled. And yet these services are not enabled after the package install.

IMHO what is really missing here is a simple and crystal-clear set of official installation instructions for 0.8.x and newer versions of zfs on debian / ubuntu, including those few missing command-line steps to run after the apt-get install ... step. Perhaps this is already up on the zfs wiki somewhere @rlaager ?

We can see there is a brief Getting Started page there on the wiki. But it does not mention such steps to enable the import on reboot.

Then there is the new wiki page for zfs root. But with all due respect, that is a more complicated page, and it does not specifically say there either that a user needs to manually enable zfs-import.target, which seemed to be necessary at least on my newer zfs 0.8.1 release. Unless there is something else I am missing here, like it only working that way for new clean installs and not when a user upgrades over an older zfs 0.7.12 version. I don't know; I am just guessing at this point.

padrino121 commented 5 years ago

Struggled with this as well for some time, first tried this with no luck:

$ cat /lib/systemd/system/systemd-udev-trigger.service

[Unit]
Description=udev Coldplug all Devices
Documentation=man:udev(7) man:systemd-udevd.service(8)
DefaultDependencies=no
Wants=systemd-udevd.service
After=systemd-udevd-kernel.socket systemd-udevd-control.socket
Before=sysinit.target
ConditionPathIsReadWrite=/sys

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/bin/udevadm trigger --type=subsystems --action=add
ExecStart=/bin/udevadm trigger --type=devices --action=add
ExecStart=/bin/udevadm trigger --type=devices --action=change

Also enabled all zfs services, regenerated my cache, etc. In the end, with 0.8.x (I tried .0 and .1) the issue is the disks not being ready in time: I have a total of 36 HDDs in the system, and with 0.8.x the events don't seem to sequence properly to bring the pool up.

I ended up adding a sleep to the settle service to get the pool mounting at boot every time. I don't like it, but it's working...

$ cat /lib/systemd/system/systemd-udev-settle.service

[Unit]
Description=udev Wait for Complete Device Initialization
Documentation=man:udev(7) man:systemd-udevd.service(8)
DefaultDependencies=no
Wants=systemd-udevd.service
After=systemd-udev-trigger.service
Before=sysinit.target
ConditionPathIsReadWrite=/sys

[Service]
Type=oneshot
TimeoutSec=180
RemainAfterExit=yes
ExecStartPre=/usr/bin/sleep 30
ExecStart=/bin/udevadm settle

dewi-ny-je commented 5 years ago

After installing or updating zfs, run the following and pools will be imported at boot normally, provided that they were imported at shutdown/reboot:

sudo systemctl preset zfs-import.target
sudo systemctl preset zfs-mount.service
sudo systemctl preset zfs.target
sudo systemctl preset zfs-zed.service
sudo systemctl preset zfs-import-cache.service
memmerson commented 4 years ago

The zfs module was not autoloading for me on Centos 8. I had to manually specify the zfs module to load.

As superuser: echo "zfs" > /etc/modules-load.d/zfs.conf

noc0lour commented 4 years ago

I have seen this problem on a couple of my systems and I tracked it down to a udev race condition. [...] I worked around the problem by adding a third ExecStart to /lib/systemd/system/systemd-udev-trigger.service. [...] I got this workaround from a related issue: systemd/systemd#11255 (comment).

According to the referenced systemd issue, this is supposed to be fixed with v240-2, but I still observe the race condition with v244. Adding the provided fix as an override in /etc/systemd/system/systemd-udev-settle.service.d/ solves the issue.

rlaager commented 4 years ago

FWIW, the approach of "waiting for all devices on the system" is fundamentally broken. We really need something that incrementally builds up a list of devices and triggers the import of a particular pool once either A) all the devices for that pool are present, or B) enough devices are present to import that pool and a timeout has expired.

lastb0isct commented 4 years ago

I'm getting this same issue -- zfs-import-cache service failure. It could be because my drives are not fully up at the time the service is starting (as @rlaager suggested above).

Here is the error:

Sep 21 09:23:59 PLEX.morty zpool[1306]: cannot import Volume1: no such pool or dataset
Sep 21 09:24:00 PLEX.morty zpool[1306]: cannot import Volume2: no such pool or dataset
Sep 21 09:24:01 PLEX.morty zpool[1306]: cannot import Volume3: no such pool or dataset
Sep 21 09:24:01 PLEX.morty zpool[1306]:         Destroy and re-create the pool from
Sep 21 09:24:01 PLEX.morty zpool[1306]:         a backup source.
Sep 21 09:24:01 PLEX.morty zpool[1306]:         Destroy and re-create the pool from
Sep 21 09:24:01 PLEX.morty zpool[1306]:         a backup source.
Sep 21 09:24:01 PLEX.morty zpool[1306]:         Destroy and re-create the pool from
Sep 21 09:24:01 PLEX.morty zpool[1306]:         a backup source.
Sep 21 09:24:01 PLEX.morty systemd[1]: zfs-import-cache.service: Main process exited, code=exited, status=1/FAILURE
Sep 21 09:24:01 PLEX.morty systemd[1]: zfs-import-cache.service: Failed with result 'exit-code'.
Sep 21 09:24:01 PLEX.morty systemd[1]: Failed to start Import ZFS pools by cache file.

No those aren't actually my zpool names ;)

lastb0isct commented 4 years ago

I ended up adding a sleep to the settle service to get the pool mounting at boot every time, don't like it but it's working...

$ cat /lib/systemd/system/systemd-udev-settle.service

[Unit]
Description=udev Wait for Complete Device Initialization
Documentation=man:udev(7) man:systemd-udevd.service(8)
DefaultDependencies=no
Wants=systemd-udevd.service
After=systemd-udev-trigger.service
Before=sysinit.target
ConditionPathIsReadWrite=/sys

[Service]
Type=oneshot
TimeoutSec=180
RemainAfterExit=yes
ExecStartPre=/usr/bin/sleep 30
ExecStart=/bin/udevadm settle

I can confirm that this fixed my issue... things are mounting on boot now. It would be nice if we had this built in going forward.

MichaelHierweck commented 4 years ago

I can confirm this issue on Debian 10 with ZFS 0.8.4. Depending on the timing, zpools may not be detected/imported, or they are imported in a degraded state.

Adding a decent delay to "/lib/systemd/system/systemd-udev-settle.service" as already suggested helps.

Depending on systemd-udev-settle.service does not seem to be a solution. See notes about systemd-udev-settle.service.

unilock commented 3 years ago

I can once again confirm this issue on Raspberry Pi OS Lite arm64 (Debian testing (11 / bullseye)) on my Raspberry Pi 4B with 4GB RAM. My ZFS volume is set up as a hardware RAID0 in a dual-HDD enclosure over USB. I suspect the issue is due to the USB enclosure not being enumerated until later in the boot process, which is why the above solution, adding ExecStartPre=... to systemd-udev-settle.service, works fine (for now). Notably, my sleep binary is located at /bin/sleep, and I was able to get away with setting the sleep time to just 10 seconds.

EDIT: And this "fix" seems to be reset every time systemd updates, so a better solution is much preferred.
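
One way to keep such a change across systemd package updates is to express it as a drop-in override instead of editing the packaged unit (a sketch, assuming systemd's standard override mechanism; the 30-second delay is just the value used earlier in this thread, and an ExecStartPre= line added in a drop-in is appended to the unit's existing ones):

# /etc/systemd/system/systemd-udev-settle.service.d/override.conf
# create with: sudo systemctl edit systemd-udev-settle
[Service]
ExecStartPre=/usr/bin/sleep 30

Run sudo systemctl daemon-reload afterwards so systemd picks up the override.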