stuartthebruce opened this issue 3 years ago
I have now seen this on a second CentOS 8.3 + ZFS 2.0.2 system with a similarly configured 60-drive external SAS JBOD pool.
Perhaps this is a bug in one of the Requires= or After= dependencies of zfs-import-cache, or perhaps additional dependencies should be added?
[root@gwosc-zfs1 ~]# cat /etc/systemd/system/zfs-import.target.wants/zfs-import-cache.service
[Unit]
Description=Import ZFS pools by cache file
Documentation=man:zpool(8)
DefaultDependencies=no
Requires=systemd-udev-settle.service
After=systemd-udev-settle.service
After=cryptsetup.target
After=multipathd.target
After=systemd-remount-fs.service
Before=zfs-import.target
ConditionPathExists=/etc/zfs/zpool.cache
ConditionPathIsDirectory=/sys/module/zfs
[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/sbin/zpool import -c /etc/zfs/zpool.cache -aN
[Install]
WantedBy=zfs-import.target
Looks like this might be a duplicate of #10891?
As recommended at https://zfsonlinux.topicbox.com/groups/zfs-discuss/Td5ea0587e058439b-M40eec800f31527397ca91aa2, the following works as a temporary solution:
ExecStartPre=/usr/bin/sleep 30
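For anyone who wants to apply this without editing the packaged unit file, the same delay can go into a systemd drop-in (the path below is the standard override location that systemctl edit creates; the 30 second value is just the delay suggested above):
# /etc/systemd/system/zfs-import-cache.service.d/override.conf
[Service]
# Give slow-to-enumerate JBOD/multipath devices time to show up before the import runs
ExecStartPre=/usr/bin/sleep 30
followed by systemctl daemon-reload so systemd picks up the override.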
I'm also seeing this on my Ubuntu Focal server. 10+ disks are connected via an HBA. Adding a sleep timeout works for me too.
On Fedora 33 with two small (2-disk) LUKS-encrypted arrays I've had to work through a couple of similar problems...
echo "zfs" > /etc/modules-load.d/zfs.conf
systemctl edit zfs-import-cache.service
Add:
[Unit]
Requires=systemd-modules-load.service
After=systemd-modules-load.service
systemctl edit zfs-import-cache.service
Add:
[Service]
ExecStartPre=/usr/bin/sleep 5
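For reference, the combined result of the two edits above should end up in a single drop-in along these lines (a sketch only; the 5 second delay is the value used above):
# /etc/systemd/system/zfs-import-cache.service.d/override.conf
[Unit]
# Make sure the zfs module is loaded before the import is attempted
Requires=systemd-modules-load.service
After=systemd-modules-load.service
[Service]
# Short settle delay for the LUKS-backed devices
ExecStartPre=/usr/bin/sleep 5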
This is still an issue (arguably worse) with ZFS 2.1.0. In particular, a Rocky Linux 8.4 test zpool with 91 multipathd-managed FC LUNs generates the following error at boot with ZFS 2.0.5 (without the above unit file edit):
[root@zfsbackup1 ~]# systemctl status zfs-import-cache
● zfs-import-cache.service - Import ZFS pools by cache file
Loaded: loaded (/usr/lib/systemd/system/zfs-import-cache.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Tue 2021-07-06 16:41:48 PDT; 1min 33s ago
Docs: man:zpool(8)
Process: 14276 ExecStart=/sbin/zpool import -c /etc/zfs/zpool.cache -aN (code=exited, status=1/FAILURE)
Main PID: 14276 (code=exited, status=1/FAILURE)
Jul 06 16:41:45 zfsbackup1 zpool[14276]: cannot import 'temp': no such pool or dataset
Jul 06 16:41:48 zfsbackup1 zpool[14276]: cannot import 'temp': no such pool or dataset
Jul 06 16:41:48 zfsbackup1 zpool[14276]: Destroy and re-create the pool from
Jul 06 16:41:48 zfsbackup1 zpool[14276]: a backup source.
Jul 06 16:41:48 zfsbackup1 zpool[14276]: cachefile import failed, retrying
Jul 06 16:41:48 zfsbackup1 zpool[14276]: Destroy and re-create the pool from
Jul 06 16:41:48 zfsbackup1 zpool[14276]: a backup source.
Jul 06 16:41:48 zfsbackup1 systemd[1]: zfs-import-cache.service: Main process exited, code=exited, status=1/>
Jul 06 16:41:48 zfsbackup1 systemd[1]: zfs-import-cache.service: Failed with result 'exit-code'.
Jul 06 16:41:48 zfsbackup1 systemd[1]: Failed to start Import ZFS pools by cache file.
and after patching to ZFS 2.1.0, a reboot spun its wheels for 3+ hours before I gave up.
Note that this system is able to successfully run zfs-import-cache at boot with both ZFS 2.0.5 and 2.1.0 when using the above unit file edit.
Thanks for confirming this is still an issue with 2.1.
Only recently (with OpenZFS 2.1.0-1 on Linux) I have found that if, by some mishap, a ZFS pool is lost or missing during systemd boot, I get this message:
A start job is running for Import ZFS pools by cache file (1h 23min 22s / no limit)
and I am unable to boot.
I still have not edited any files, but as soon as I "forget" to zpool export a pool prior to reboot, and that pool is not present at boot, I am unable to start up my machine.
(One detail is that I am using ZFS with external USB drives, so sometimes I forget to export one of the USB pools, leading to this issue.)
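One possible mitigation, offered only as a sketch based on standard systemd unit options and not tested here, is to cap how long the import job may block boot with JobTimeoutSec= in a drop-in, so a missing USB pool fails the import instead of hanging with "no limit":
# /etc/systemd/system/zfs-import-cache.service.d/timeout.conf (file name is arbitrary)
[Unit]
# Give up on the import job after 5 minutes instead of waiting forever
JobTimeoutSec=5min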
Any chance of accepting a one-word change from After=multipathd.target
to After=multipathd.service
in /etc/systemd/system/zfs-import.target.wants/zfs-import-cache.service
for Linux systems? I have confirmed that this is still needed on multiple Rocky Linux 8 systems with a large number (60) of multipath SAS drives running ZFS 2.1.5, to keep zpool import from starting before multipathd is done at boot time, and it doesn't seem to cause any problems for smaller systems. I am still a systemd newbie and there may be a better solution, but simply changing .target to .service works on my systems, and it would be nice not to have to patch this unit file after every ZFS upgrade.
The downside of not patching this is that some users may be running pools that successfully imported multipath-capable devices without the path redundancy they expect, and they will only discover that the hard way when the single path that zpool import happened to grab fails.
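In the meantime, a drop-in keeps a change like this from being lost on every ZFS upgrade (a sketch assuming the stock unit name; note that a drop-in can only add ordering, it cannot remove the existing After=multipathd.target):
# /etc/systemd/system/zfs-import-cache.service.d/multipath.conf
[Unit]
# Wait for the multipathd daemon itself, not just the target, so all paths are assembled first.
# After= is ordering only and does not pull multipathd in on systems that don't use it.
After=multipathd.service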
Tried a lot of different variants from here and around the internet. Finally I found a workaround by adding a custom service unit for the pool, in addition to adding zfs to /etc/modules-load.d/zfs.conf.
/etc/systemd/system/zfs-import-storage.service:
[Unit]
DefaultDependencies=no
Before=zfs-import-scan.service
Before=zfs-import-cache.service
After=systemd-modules-load.service
[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/sbin/zpool import -N -o cachefile=none storage
# Work-around to preserve zpool cache:
ExecStartPre=-/bin/mv /etc/zfs/zpool.cache /etc/zfs/preboot_zpool.cache
ExecStartPost=-/bin/mv /etc/zfs/preboot_zpool.cache /etc/zfs/zpool.cache
[Install]
WantedBy=zfs-import.target
Then:
echo "zfs" > /etc/modules-load.d/zfs.conf
systemctl enable zfs-import-scan.service
systemctl enable zfs-import-cache.service
systemctl enable zfs-import-storage.service
After that the pool is imported and ready on boot, although I still get these messages:
Apr 01 17:52:38 host systemd[1]: Dependency failed for Install ZFS kernel module.
Apr 01 17:52:38 host systemd[1]: Dependency failed for Import ZFS pools by device scanning.
Apr 01 17:52:38 host systemd[1]: Dependency failed for Import ZFS pools by cache file.
I can live with that for now.
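For anyone else trying this, the leftover "Dependency failed" messages can be traced, and the import verified, with standard commands, e.g.:
# Show which units failed during boot and why
systemctl --failed
journalctl -b -u zfs-import-cache.service
# Confirm the pool actually came up via the custom unit
systemctl status zfs-import-storage.service
zpool list storage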
I used the systemd service above as a reference, but took a slightly different approach, more similar to a service I created years ago to work around a race condition involving bind mounts and ZFS. I have only tested on one system (Fedora 38), which uses zfs-import-cache, but it worked there. This service really does nothing other than cause systemd to reorder the services involved.
cat <<EOF > /etc/systemd/system/zfs-import-race-condition-fix.service
[Unit]
DefaultDependencies=no
Before=zfs-import-scan.service
Before=zfs-import-cache.service
After=systemd-modules-load.service
[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/bin/cat /dev/null
[Install]
WantedBy=zfs-import.target
EOF
systemctl enable zfs-import-race-condition-fix
echo "zfs" > /etc/modules-load.d/zfs.conf
The system I am testing on is a Fedora-38 box. I do not see the errors mentioned above, but am seeing warnings related to a deprecated service called by the zfs-import-* services.
Jun 17 09:55:16 myhost udevadm[980]: systemd-udev-settle.service is deprecated. Please fix zfs-import-cache.service, zfs-import-scan.service not to pull it in.
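To see exactly which units still pull the deprecated settle service into boot on a given system, something like this should work:
# List the units that depend on systemd-udev-settle.service
systemctl list-dependencies --reverse systemd-udev-settle.service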
@sbbeachvball wrote:
The system I am testing on is a Fedora-38 box. I do not see the errors mentioned above, but am seeing warnings related to a deprecated service called by the zfs-import-* services.
Very nice, I just tried this and it works on a Debian 12 machine as well. Everything is imported on boot without the cache-restoration hacks.
Thanks for sharing!
I ran into a similar issue with Debian 12 on an SBC but figured out a different solution:
$ echo zfs | sudo tee -a /etc/modules-load.d/modules.conf
zfs-import-cache.service is disabled by default, so I have to enable it manually, but it relies on the zfs-import service, which is masked somehow on my distribution and I can't even unmask it. Thus the enabled service, whose symlink is created in /etc/systemd/system/zfs-import.target.wants, won't execute at all. I had to rebuild the dependency chain in /etc/systemd/system/zfs.target.wants by creating the symlink manually:
$ sudo ln -s /lib/systemd/system/zfs-import-cache.service /etc/systemd/system/zfs.target.wants/zfs-import-cache.service
zfs-import-cache.service:
[Unit]
...
After=systemd-modules-load.service
...
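For what it's worth, systemctl can create that wants symlink itself, which may be a little more robust than a hand-made ln -s (assuming a reasonably recent systemd):
# Equivalent to the manual symlink above
systemctl add-wants zfs.target zfs-import-cache.service
systemctl daemon-reload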
@sbbeachvball's suggested fix didn't work for me unfortunately. I had to insert a sleep 30 instead (as mentioned by @stuartthebruce).
In my case, the ZFS pool is located on a block device which corresponds to a RAID array (megaraid_sas). I tried including an After= line for the systemd device unit of that block device (as per advice here), but that didn't make any difference.
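For anyone attempting the same thing, the device unit name for a block device can be derived with systemd-escape and the ordering added via a drop-in; this is only a sketch of the approach described above (/dev/sdX below is a placeholder for the actual RAID volume), and as noted it did not help in this case:
# Find the systemd device unit name for the block device, e.g. dev-sdX.device
systemd-escape --path --suffix=device /dev/sdX
# /etc/systemd/system/zfs-import-cache.service.d/wait-for-raid.conf
[Unit]
# Order the import after the RAID volume's device unit and wait for it to appear
After=dev-sdX.device
Wants=dev-sdX.device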
System information
Describe the problem you're observing
zfs-import-cache.service fails to import a pool with 60 HDDs from an external SAS JBOD at boot time.
Describe how to reproduce the problem
/sbin/shutdown -r now
Include any warning/errors/backtraces from the system logs
The zfs-import-cache failure appears to be a race condition with the kernel discovering/registering all of the HDDs, e.g., from syslog: