'zpool import' dynamically via udev

behlendorf commented 13 years ago

There are times where it would be desirable for zfs to not automatically import a pool on module load. Add a zfs_autoimport_disable module option to make this tunable without resorting to removing the cache file.

bill-mcgonigle commented 12 years ago

would it make sense to implement this as a pool property?

My use case: I have a Xen server with the VM's using zvols for disks, but now I'd like to have one VM actually use ZFS directly, so I'd like to mark its (different) pool to not be imported by the Dom0 on system boot. I'll assign the physical disks to the DomU. In this case it would be OK for me to import it in rc.local.

I suppose ideally this would look like:

autoimport_host=guid|any|none

so I could avoid the rc.local and put spl/zfs on initrd to get to it earlier for other cases. A default 'any' gets the current behavior, setting each pool to 'none' gets a 'zfs_autoimport_disable' behavior.

It would raise the question of what to use for a uuid. The zfs module would probably need to read /etc/zfs/zfs-host.uuid or something like that (perhaps even creating it if missing). I think I'd do best with a symlink to /sys/hypervisor/uuid.

Then again, zfs knows if I've moved a pool between machines - does it already have a host [g,u]uid?

behlendorf commented 12 years ago

That might be nice, but part of the trouble is it's not so straight forward to retrieve a pool property until that pool is imported. Now these properties are cached in the zpool.cache file, so we could perhaps store them there but that seems a bit clunky.

My feeling is the right way to do this on Linux is to disable all imports during module load, and then integrate the actual import with a udev helper. The udev helper will be invoked every time a new device appears. This helper can read the zfs label from the disk which fully describes the pool configuration. From this it can determine if all the needed devices are available any only then trigger the import. This solves at least two problems which have come up.

The pool import must not occur until all (or almost all) of the devices are available. With asynchronous device detection there's no good reason to believe all of your devices will be available when the zfs modules are loaded. This becomes increases likely the more devices you have.
The behavior for what pools to import on boot could be moved to a more standard /etc/zfs/zfs.conf file. This would include not just the pools you wanted imported, but the criteria for importing them. Perhaps one person wants to import a raidz pool automatically when it's missing 1 disk and another person doesn't. This gives us one place to set this policy information on a per-pool basis.

bill-mcgonigle commented 12 years ago

even better! Event-driven should have many benefits over time.

GregorKopka commented 12 years ago

The idea of having yet another file for zfs to update in the initramfs (apart from zpool.cache with the related troubles that one already gives) dosn't sound that good to me.

Maybe a module parameter that can be given via bootloader would do the trick more nicely?

behlendorf commented 12 years ago

Actually, the idea here it to have one less file that needs to be updated in the zpool.cache file. All pools could be detected automatically and only those with the required property (or passed somehow via bootloader) would be imported. You would no longer need a zpool.cache file at all, although we'd still probably support it for legacy reasons.

Rudd-O commented 11 years ago

This is a great effort. Can this be merged?

prakashsurya commented 11 years ago

@Rudd-O If you're asking about this commit:

https://github.com/prakashsurya/zfs/commit/dd7ce6e932617495305f8fcf9f815f73f08fc713

then, I'd say no, it's not ready to be merged. The commit message states the missing functionality that needs to be added to the patch before it is anywhere near complete, and then once that is added, testing is needed to ensure it works correctly. Unfortunately, I don't see myself working on that patch in the very near future, so if you have some spare cycles I would highly encourage you to pick it up and run with it.

Rudd-O commented 11 years ago

Do you need financial support to see this through so you can work unmolested by other concerns? I would be willing, depending on what factors you see are necessary to see thru.

prakashsurya commented 11 years ago

I'm not even sure that's possible with the way my current employer, LLNL, works. I'll see what I can do about getting some time to work on this.

Rudd-O commented 11 years ago

I just want a kernel module option to auto import disable. That is all.

behlendorf commented 11 years ago

@Rudd-O You can just comment out spa_config_load() in spa_init() as a quick fix to prevent the cache file from being read during import. Alternately, if you don't care about the cache file at all (which is where I'd like to go anyway) you can probably just set the spa_config_path mount option to something like /dev/null or perhaps better mktemp -u /etc/zfs/zpool.cache.XXXXXXXX.

Rudd-O commented 11 years ago

Brian, I want the cache file to be read during import. I just don't want ZFS to read it and then attempt to import on module zfs.ko load.

Today marks 20 hours of debugging a bug -- of the worst kind: initrd bugs -- caused by this. I have a mirrored pool where one disk is encrypted (and consequently not available when zfs.ko loads, on demand and as a side effect of udev events or whatnot, during the dracut cmdline phase) and the other disk isn't (thus available when zfs.ko loads). As a consequence, before the first zpool import in our mount-zfs.sh even executes, the pool is already half-assedly imported (missing one leg).

I tried the spa_config_path workaround, and it doesn't work well. Mainly if I set it to a dumb path, then later on it writes the file in that dumb path. And, of course, it does NOT read the (valid) cache file, which is what I want to prevent zfs force from being necessary. Even replaced my /dev/null with a regular file.

I ended up having to write this dumb workaround: https://github.com/Rudd-O/zfs/blob/master/dracut/90zfs/parse-zfs-pre.sh.in but this breaks the use of zpool.cache anyway, so I ended having to do zfs_force=1 anyway. Well, at least this doesn't totally break import of mirrored pools.

So, pretty please, now that we know it is inopportune to just import all pools on module load, can we please, please have a flag to turn this cancerous hell off?

behlendorf commented 11 years ago

@Rudd-O Try commit d028b00731ef15a86f957d52639c1e0505f578fa, it should get you the behavior your looking for. If it solves your issue we can merge it.

Rudd-O commented 11 years ago

My tree's draces fixes those boot issues caused by import-at-module-load-time. Enjoy!

ryao commented 10 years ago

@behlendorf A behavior like what you describe in the initial comment can be obtained using zpool set cachefile=none $POOLNAME.

FransUrbo commented 10 years ago

@behlendorf @ryao @Rudd-O @bill-mcgonigle @prakashsurya Why is this still open? Considering that we now/already have a zfs_autoimport_disable and that the zpool.cache will (eventually) be removed all together...

prakashsurya commented 10 years ago

FWIW, my old patch that's loosely related to this, #1587, was intended to go beyond the functionality of zfs_autoimport_disable, zpool set cachefile=none, and zpool.cache. That patch was intended to allow a user to specify which pools should get imported by default, under what condition(s) (e.g. importing pool with missing vdevs), and potentially run hooks at various steps along the device enumeration and import process; all without the need for the zpool.cache file.

FransUrbo commented 10 years ago

@prakashsurya Is this possible without a cache file etc? The pull request mentioned seem to create a special cache file for this, but that's seems a little dumb, since we're removing the 'real' one... ?

prakashsurya commented 10 years ago

It depends what you mean by "cache file". It creates a temporary file to maintain state as udev populates the devices, but it's different than the existing zpool.cache file. The idea was, as udev populates ZFS disks, we can read in the ZFS labels from each disk and basically create the existing zpool.cache on the fly.

This temporary file was only done as an optimization, alternatively you could probe every disk on the system as each new disk that is initialized by udev to build up the pool configuration, but that turns a O(N) algorithm into O(N^2). And with "a lot" of disks, that'd take an absurd amount of time.

Do you have a better idea about how to automatically import a pool after a reboot, once the zpool.cache file is removed? AFAIK, currently, if the modules are loaded before the disks are populated via udev, the pool will not be imported. Also, I don't think a "degraded" pool will be automatically imported with the current code; which will probably be needed for large production JBOD configurations. It's been awhile since I looked at all that machinery though, so perhaps I'm overlooking something.

FransUrbo commented 10 years ago

How about using ZED for this instead of UDEV?

Rudd-O commented 10 years ago

In my uninformed opinion, udev should signal to zed when ZFS component devices appear and disappear, while zed should (in some cases) take action based on whether the devices can conform a pool or not. but this raises all sorts of questions like, should zed start an incomplete pool and then add devices that should have been in the pool to begin with, what happens when zed is not running at the time of the first few udev notifications, what happens if zed crashes, et cetera.

It's a complicated thing. I think ultimately what we all want is something like, some running program gets some signal (or perhaps a program is executed at that point) when a pool is available for import, and this program decides whether to import the pool or not. With my old systemd-based system, I sorta had something like that: I'd open the cachefile and ask it what devices to wait for, what pools they conformed, and then generated unit files to import the pools based on those dependencies, which then would be dependencies in their own right to the file systems within them. Made for a fairly reliable system. Now I'm back to mainline zpool import all unit files, which fail.

FransUrbo commented 10 years ago

Well, the 'udevadm monitor' should be reasonably simple to emulate inside zed. If zed is dealing with this, the script zed execute for this could have a config file where this is set. And if zed isn't running, well the same problem/argument could go for udevd, nothing is completely bulletproof. But considering zed is going to take care about a lot of action regarding the pool (such as spares, keep an eye open for checksum and i/o errors and in the future most likely to deal with sharing/unsharing of smb/iscsi/nfs etc, etc), it makes sense that zed is responsible for deciding what pool should be imported and not. So the init script might instead of doing the mount/share etc, just issue those commands to zed and have it do the actual work (through one of it's scripts).

I think that would simplify a lot! Currently there's probably five different init scripts (all different, but do basically the same thing) in ZoL. On top of that, there's the different packages init files (the one we have in pkg-zfs for example). I've tried to rectify that into TWO init scripts (one import+mount/umount and one share/unshare) that is supposed to work on ALL the platform. But it's somewhat cludgy and possibly ugly in places because of this. If we could have zed do this, there wouldn't be any need for this - one action, one [zed] script and it works everywhere because zed is...

prakashsurya commented 10 years ago

Integrating udev into the zed infrastructure might not be a bad way to go, now that the zed work has landed. When I originally started that work, zed was far from finished (wasn't even started IIRC). But, now that it's here, I'm not opposed to designing the import infrastructure more tightly with it.

One main thing we'd need to work out to make that happen, is how to allow userspace processes the ability to issue "events" into the zed infrastructure. I'm not too familiar with all of the zed machinery, but IIRC zed currently only consumes events issued via zpool events; and there isn't currently a way for a userspace process (e.g. a udev helper) to push events into the kernel to be processed by zed.

Without thinking too hard about it, leveraging udev to submit events to zed (e.g. disk /dev/sdX has appeared/disappeared), but keeping the policy decisions and configuration within zed (e.g. 9/10 leaf vdevs are present of this raidz2, go ahead and import) seems like a really clean solution to me.

behlendorf commented 10 years ago

@dun and I have talked about how the zed infrastructure could be extended to provide a few additional bits of functionality which might be useful here.

1) The ability for user space utilities to post arbitrary events to the kernel. These would then be consumed though the existing zed machinery.

2) The ability to modify some key/value pairs associated with the event in the kernel. This would provide a relatively easy way to build up some semi-persistent state without each script resorting to its own cache file.

3) The ability to post blocking events. In this case the kernel would post the event and then block until it was consumed by the zed and a return value passed back. This would provide a nice portable mechanism to replace the usermodehelper code. It's also a big part of what might be needed to move all the majority of the libshare code over to being scripts called by the zed.

All of this functionality might be helpful in this context.

FransUrbo commented 10 years ago

@behlendorf @dun A 'cron like' event driver would be nice to...

Having ZED taking care of daily/weekly/monthly scrubs and maintenance seems possibly more appropriate than cron.

FransUrbo commented 10 years ago

@behlendorf Tag this with 'zed' as well?

behlendorf commented 10 years ago

Good thought

h3lo commented 2 years ago

It's been a few years since there's been any chatter on this issue. Has anything changed related to this import behavior? I expect that the issue still persists in some form as I found my way here from a very recent document on OpenZFS (https://openzfs.github.io/openzfs-docs/Getting%20Started/Ubuntu/Ubuntu%2022.04%20Root%20on%20ZFS.html#mpt2sas).

Just wondering what to expect here as I do in fact use the mpt2sas driver with LSI HBA cards, and well into the double digits of drives to be initialized. I'm moving over to ZFS from mdadm.

openzfs / zfs

'zpool import' dynamically via udev #330