openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

zfs-import script does not honor the cachefile #3777

Closed ryao closed 8 years ago

ryao commented 9 years ago

A pool should not be imported at boot unless it is already recorded as imported in the cachefile; otherwise we risk problems with multipath. I had not scrutinized the zfs-import script closely when @FransUrbo proposed it, and unfortunately this got by me. Another Gentoo developer who set up a system with a separate pool for /home has the boot process breaking because the script is not finding the pool, but if the cachefile's contents were properly obeyed, this would not be an issue. Scanning should not be the default behavior; the script should import only when it knows an import is acceptable, rather than all the time. For example, a pool on a USB key would be auto-imported by the current script, which is likely not what the system administrator wants.

FransUrbo commented 9 years ago

A pool should not be imported unless it is already imported on the system in the cachefile

I'm not exactly sure I understand what you're saying… Are you saying that it imports the pool TWICE?!

Bronek commented 9 years ago

@ryao I think you meant to say "A pool should not be auto-imported unless it was imported previously and that fact has been persisted in the cachefile". At least, this is what your justification seems to point to.

However, this would prevent auto-importing of pools which are not configured to use a cachefile, or are configured to use a non-standard cachefile location. Since the cachefile is on its way to obsolescence, this does not seem like a very good idea. Perhaps the problem the other Gentoo user experienced could be addressed differently. FWIW I use a similar setup (root, i.e. /, in one pool and two directories under root in two other pools, all auto-imported on boot) and it works for me without issues, under Arch Linux. It worked when I was using a cachefile (and saved it to the kernel RAM image) and it still works now that I have updated cachefile to none for all pools (and removed it from the kernel RAM image).

ryao commented 9 years ago

@FransUrbo The import/export is not like mounting or RAID assembly. An import says "this pool belongs to this system, so do not touch it" while an export says "this pool no longer belongs to this system and is fair game". The way that the import persists across reboots is by something called a verbatim import, which reads the details of the imported pools from the cachefile. The fact that an import is being done explicitly in zfs-import makes using accurate terminology here difficult. What should be done at boot is a verbatim import from the cachefile, rather than a normal import. A normal import should never be done by automated scripts.
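The distinction can be illustrated with the stock zpool commands (a sketch only; the pool name `tank` is a placeholder):

```shell
# Normal import: scans devices for importable pools and claims the
# named pool for this host. Per the argument above, automated boot
# scripts should never do this.
zpool import tank

# Verbatim import: re-import exactly the pools recorded in the
# cachefile, without scanning devices or claiming anything new.
# -a = all pools in the cachefile, -N = do not mount datasets.
zpool import -c /etc/zfs/zpool.cache -aN
```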

@Bronek People have talked about getting rid of the cachefile in the past, but it is required for things like multipath to be handled sanely, it accelerates the boot process, and no alternative offers similar semantics. I used to think that the cachefile should be eliminated, but upon actually trying to make that work, I found the rationale for eliminating it to be invalid.

Rather than trying to get rid of the cachefile and having regressions, we should be modifying our systems to better integrate with it. For instance, initramfs archives could read the cachefile from the bootfs by doing a readonly forced import, mounting the bootfs, grabbing the cachefile, unmounting, exporting and then doing the verbatim import from the cachefile. This would eliminate the need to have the cachefile in the initramfs and make things behave like they should. initramfs generators such as Gentoo's genkernel do not currently do this, but they should in the future.
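The initramfs sequence described above could look roughly like this (a sketch under assumptions: pool name `rpool`, bootfs `rpool/ROOT/default`, and mountpoint `/sysroot` are all placeholders, and error handling is omitted):

```shell
# 1. Read-only forced import of the root pool; -N skips mounting.
#    Read-only means nothing is written to the pool's labels.
zpool import -f -N -o readonly=on rpool

# 2. Mount the boot filesystem and grab its cachefile.
mount -t zfs -o ro rpool/ROOT/default /sysroot
cp /sysroot/etc/zfs/zpool.cache /etc/zfs/zpool.cache
umount /sysroot

# 3. Export, then do the verbatim import from the copied cachefile.
zpool export rpool
zpool import -c /etc/zfs/zpool.cache -aN
```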

As for why things work fine on Arch Linux, that uses systemd, which is properly using the cachefile:

https://github.com/zfsonlinux/zfs/blob/master/etc/systemd/system/zfs-import-cache.service.in

FransUrbo commented 9 years ago

@FransUrbo The import/export is not like mounting or RAID assembly. An import says "this pool belongs to this system, so do not touch it" while an export says "this pool no longer belongs to this system and is fair game". The way that the import persists across reboots is by something called a verbatim import, which reads the details of the imported pools from the cachefile.

Still don't get it…

For instance, initramfs archives could read the cachefile from the bootfs by doing a readonly forced import, mounting the bootfs, grabbing the cachefile, unmounting, exporting and then doing the verbatim import from the cachefile.

That requires the first cache file to be accurate. AND up to date! Which is the problem: they seldom are, which is why people are having problems with this. And why we want to remove it...

If the first cache file is accurate enough to import the pool, there's no NEED to get a second cache file and import it again. If the import was successful the first time, there's no need to do it again.

This would eliminate the need to have the cachefile in the initramfs and make things behave like they should.

Ok, so NOT using a cache file for the first import. Ok. But how do you import (or get to the bootfs) without one, if an import without a cache file doesn't work? If the pool can be imported without one, then there's no point in having the cache file in the first place, which (again) negates the need to get the cache file.

Bronek commented 9 years ago

@ryao I do not have cachefile, and yet all my pools are mounted automatically on boot:

root@gdansk ~ # zpool get cachefile
NAME   PROPERTY   VALUE      SOURCE
zdata  cachefile  -          default
zpkgs  cachefile  -          default
zroot  cachefile  -          default

root@gdansk ~ # ls -al /etc/zfs
total 57
drwxr-xr-x  3 root root    9 Sep 12 16:40 .
drwxr-xr-x 59 root root  143 Sep 13 12:46 ..
-rw-r--r--  1 root root  165 Sep 12 16:17 vdev_id.conf.alias.example
-rw-r--r--  1 root root  166 Sep 12 16:17 vdev_id.conf.multipath.example
-rw-r--r--  1 root root  520 Sep 12 16:17 vdev_id.conf.sas_direct.example
-rw-r--r--  1 root root  152 Sep 12 16:17 vdev_id.conf.sas_switch.example
drwxr-xr-x  2 root root   12 Sep 12 16:40 zed.d
-rwxr-xr-x  1 root root 9355 Sep 12 16:17 zfs-functions
-rw-r--r--  1 root root 6516 Apr  6 22:58 zpool.cache.bak

jameslikeslinux commented 9 years ago

I agree with @ryao. This is of particular importance when you consider a SAN environment, where a pool may be visible to multiple hosts (this is very common in virtualization environments, and for certain high-availability setups). You do not necessarily want the first system to boot to grab all of the visible pools. Hosts should consult the cachefile to determine which pools should be imported at boot. This is the way Solaris works, and Linux should follow.

Bronek commented 9 years ago

It makes sense. I was thinking about using hostid (you could mount read-only, compare the hostid stored in the pool, and then re-mount for write if it's yours), but then it occurred to me that I'd rather get rid of hostid and keep the cachefile than the other way around.

FransUrbo commented 9 years ago

If all you want to do is ignore ("not import") certain pools, use ZFS_POOL_EXCEPTIONS. That currently takes the name(s) of the pool(s), but it could very easily be extended to accept *.

FransUrbo commented 9 years ago

I still have no idea what this ticket is about! Nothing in the comments by @ryao makes any sense… !

jameslikeslinux commented 9 years ago

Everything @ryao has said makes sense, and he's right. I can see how it doesn't make sense when your plan (https://github.com/zfsonlinux/zfs/pull/3526) involves getting rid of the cachefile completely. I wish I had seen that earlier, because it is, in my opinion, a very bad idea. Let me explain why:

The cache file as a concept has existed since the beginning of ZFS. From the manpage, zpool(1M), we learn that:

Discovering all pools on system startup requires a cached copy of the configuration data that is stored on the root file system. All pools in this cache are automatically imported when the system boots.

The contents of the cachefile are effectively controlled by which pools you import and export. When you import a pool, its details are saved in the cache and the system will reimport the pool every time the system boots, until you export it. There is never a need to export a pool until you are ready to import it on another system. It was simple, and beautiful, and all of your previously imported pools could be imported with one command:

zpool import -c /etc/zfs/zpool.cache -a

Compare this to your new code, which takes:

awk '/^do_import/ { flag=1 }; /^}/ { flag=0 }; flag { print }' /etc/init.d/zfs-import | wc -l
185

185 lines.

And your code is incorrect. It tries to import all available pools. You say you can ignore pools by setting ZFS_POOL_EXCEPTIONS (funny, I don't remember that on my Solaris certification exam), but that is neither user- nor system-administrator-friendly. For users: are they supposed to set the ZFS_POOL_EXCEPTIONS variable for every USB zpool they might insert, so that it doesn't automatically get imported at boot?

For system administrators: are they supposed to update that variable every time a new pool appears on their SAN?

So you say the variable can be modified to accept wildcards, but that just gives you the same situation in reverse. Now I have to ask, is the user supposed to update the variable every time they plug in a USB zpool that they want mounted at boot? Is a system administrator supposed to update that variable every time they attach a new pool to their SAN that they want automatically imported at boot?

These are exactly the sort of challenges ZFS was designed to eliminate. It's the same reason we don't have to update /etc/fstab for every new ZFS filesystem we want to mount.

I say to you: we already had a solution to this problem, a cache file. Do not eliminate it. And let's fix the zfs-import service to maintain the way ZFS has always worked: import pools based on the cache file, and do not export pools at shutdown. Only the user or system administrator should export pools.

FransUrbo commented 9 years ago

As I said in https://github.com/zfsonlinux/zfs/pull/3800#issuecomment-141643856, the correct fix is to set:

ZPOOL_IMPORT_OPTS="-c /etc/zfs/zpool.cache"
ZPOOL_CACHE=""

in the defaults config file (/etc/defaults/zfs or the corresponding location on Red Hat/Fedora, etc.).

That setup will enforce the use of cache file, and ONLY use the cache file (i.e., try once with that ONLY).

jameslikeslinux commented 9 years ago

And as I said in https://github.com/zfsonlinux/zfs/pull/3800#issuecomment-141687925, @FransUrbo's suggested fix doesn't work because:

  1. The current infrastructure exports pools at shutdown, which completely negates the point of the cache.
  2. Unless made the default, the current infrastructure is still vulnerable to https://github.com/zfsonlinux/zfs/pull/3526#issuecomment-141617698 and the number of issues raised by @ryao in this issue.

ryao commented 9 years ago

The following patch from the genkernel zfs branch makes genkernel read the cachefile from the pool and verbatim import pools from it. It will be merged to genkernel HEAD after review has been completed:

https://gitweb.gentoo.org/proj/genkernel.git/commit/?h=zfs&id=00e8b41e573acbebd60275792eb6b4a89b834882

For systems where an initramfs is not used to load the pool, the boot scripts should just load from the cachefile. That should be a fairly simple matter since they already have access to it. I am fairly confident that the cachefile issues people have had are caused by the initramfs having its own copy of the cachefile.

This is analogous to what the current systemd unit files do.

FransUrbo commented 9 years ago

I don't see the point of doing it this way! If the root pool could be imported read only, why bother with all this??

FransUrbo commented 9 years ago

Also, it doesn't take failure into account - there's no check for a failed import!

FransUrbo commented 9 years ago

Oh, and the third problem is that if the ro import succeeded, but there is no cache file in there (because the user has elected not to have one), everything fails… even though the pool could be imported!

Bronek commented 9 years ago

@ryao I like this approach.

I don't see the point of doing it this way!

... the pool might be unusable for read-write, e.g. because it is on a SAN and is being accessed by a different node at the time. I can think of other reasons. This does not prevent reading one file in a well-known location, though.

Also, it doesn't take that fact into account - there's no check for a failed import!

good point

Oh, and the third problem is that if the ro import succeeded, but there is no cache file in there

also a good point, but this probably calls for a different boot switch (see the edit below). There are basically two approaches, and I like the one presented by @ryao better than what ZOL is doing now, i.e. trying to import all available pools (including those in ZVOLs)

EDIT: @ryao's solution requires root=ZFS=$DATASET to be provided at boot. I like the way it's explicit: you know what to expect just by glancing at the kernel options, and it happens to fit my expectation of how ZFS should work, as a lowly user (I managed some FC SANs in the past, if that matters). Having said that, I can imagine a whole class of users who might want ZFS to automagically import all available pools. It seems possible (i.e. intuitively understandable) to perform auto-import when there is no dataset name following root=ZFS. However, when there is one, I would really prefer that auto-import did not take place.

FransUrbo commented 9 years ago

If you don't want any pools to auto import at boot up (other than the root pool), just don't enable the zfs-import script!

If you want a subset of all available pools to import and you don't want to (or can't) setup exceptions, ok, that is a problem.

But there's a much better way to do this than to rewrite the whole shebang (which works for the large majority as it is).

BTW, most (all?) distributions (not Gentoo??) have always required root=ZFS= (or something to that effect). That is nothing new. Granted, Debian GNU/Linux (and derived distributions) also require a boot=zfs, but that's only to avoid putting absolutely every single conceivable option into ONE humongous script (which I personally find retarded! :).

Bronek commented 9 years ago

If you want a subset of all available pools to import and you don't want to (or can't) setup exceptions, ok, that is a problem.

there you go. My /boot is ext2, my root and two more pools are ZFS and, unfortunately, at startup I often have other ZFS pools attached which I am not comfortable importing (e.g. because they are in ZVOLs, or because I left a USB drive attached when starting - they tend to be random). It might be an unusual scenario (but I think, perhaps naively, that it's not) and one which, apparently, the cachefile is designed to serve best. I am not saying that other use cases are not better served by auto-import. FWIW my distribution is Arch Linux, and the active maintainer of the ZFS modules there is not very active on this list, so I will probably have to massage the scripts myself. For that I really, really need to understand what I'm doing.

FransUrbo commented 9 years ago

If there's only a few, known, pools (which it sounds like) that you don't want imported, use ZFS_POOL_EXCEPTIONS.

In #3559, I've now added a ZFS_POOL_IMPORT variable which does exactly what you want - it is a list of pools to auto-import.

ryao commented 9 years ago

@FransUrbo The check for a failed import will be added. Thanks for pointing it out.

As for editing configuration files, it should not be necessary. The zpool import command should make all the changes to ensure a persistently imported pool. That is how it works on Illumos, FreeBSD and systemd systems. Gentoo is going back to this soon. It should not work any other way by default.

ryao commented 9 years ago

@FransUrbo There is code in genkernel that will allow people to drop to a shell and execute a manual import of the pool if boot fails because the pool is not imported. It had not been intended to be used this way, but now that we are switching to reading the cachefile from the system, it will be. I am looking into including instructions for handling the root pool missing from the cachefile, or the cachefile being missing entirely, as that is a real concern. In particular, telling the user to execute zpool import -N $pool in the shell, press Ctrl+D, and then execute zpool set cachefile= $pool (note the space after the equals sign) as root after the system has booted.
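The recovery procedure described above can be sketched as follows ($pool stands for the actual pool name; the shell prompt context is noted in the comments):

```shell
# In the initramfs rescue shell: import the root pool by hand,
# without mounting its datasets, then resume boot with Ctrl+D.
zpool import -N $pool

# Later, as root on the booted system: reset the cachefile property
# to its default. The empty value after the equals sign (with the
# pool name as a separate argument) restores the default location,
# causing the cachefile to be rewritten with this pool's entry.
zpool set cachefile= $pool
```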