openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

Support freeze/thaw #260

Closed: behlendorf closed this issue 2 years ago

behlendorf commented 13 years ago

ZFS has hooks for suspending the filesystem but they have not yet been integrated with their Linux freeze/thaw counterparts. This must be done before you can safely hibernate a system running ZFS.
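
For context, the Linux-side freeze/thaw entry point is the FIFREEZE/FITHAW ioctl pair, usually driven through util-linux's fsfreeze. A rough illustration of the gap (mount points are examples, and the exact error text may differ): because ZFS does not wire its suspend hooks into the freeze_fs/unfreeze_fs super_block operations, the ioctl has nothing to call into.

    # works on filesystems that implement freeze_fs, e.g. ext4 or XFS
    fsfreeze --freeze /srv/data && fsfreeze --unfreeze /srv/data

    # on a ZFS dataset this is expected to be refused with something like
    # "Operation not supported", since the hooks are not integrated
    fsfreeze --freeze /tank/home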

asdf8dfafjk commented 3 years ago

Sorry, but your point makes no sense. I do not claim that there are no issues. But posting repeated iterations of "this is a must", "I need this", "what is the status on this", "this should be done" does not help progress the issue, especially coming from people who have made zero contributions to this project. (In fact, when I saw the first "must" comment my first thought was that someone speaking so authoritatively must be an active member, but alas, he has zero pull requests.)

IMHO once the issue is known, the only valid contribution is identifying and discussing a fix, as some other kind members have already done, and @behlendorf seems to be very prompt in his responses.

Let's not bother the maintainers anymore with noise. I have already proposed a workaround. If you're too lazy to use it, please don't expect the maintainers to do the work for you.

bghira commented 3 years ago

this is also not a priority for any of the corporate sponsors of ZFS.

dm17 commented 3 years ago

For a root ZFS filesystem, is swap required for any type of hibernation or suspension?

Atemu commented 3 years ago

Hibernation yes, suspension no.
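
As a quick way to check what your machine is set up for, a sketch using standard interfaces (output and device names will differ per system): suspend-to-RAM only needs "mem" in /sys/power/state, while hibernation ("disk") additionally needs swap space to hold the image and a resume device configured.

    # sleep states the kernel supports ("mem" = suspend-to-RAM, "disk" = hibernation)
    cat /sys/power/state

    # hibernation writes the image to swap ...
    swapon --show

    # ... and reads it back from the configured resume device (major:minor)
    cat /sys/power/resume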

danielmorlock commented 2 years ago

Using hibernation results in corrupted zpools sooner or later. That tells me that there might be an issue with freezing or thawing: I used a corrupted zpool for weeks without knowing it, since resuming from hibernation always restored a seemingly proper pool state, so even "zpool status" reported no errors in the resumed system. After a reboot, I was not able to import the zpool because of:

[root@sysrescue /mnt]# zpool import -f -R /mnt/rpool_ws1 -o readonly=on  rpool_ws1
cannot import 'rpool_ws1': I/O error
    Destroy and re-create the pool from
    a backup source.

So either there should be a more prominent warning in the docs that warns users about hibernating with ZFS on a desktop PC, or the hibernation process should be cancelled by ZFS. The current state is dangerous since users can fall into this trap and lose data.
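
For what it's worth, before destroying a pool that fails like this, a rewind import is sometimes worth a try; whether it helps depends entirely on what got damaged, so treat this only as a sketch (altroot and pool name taken from the output above):

    # dry run: check whether discarding the last few transactions would make
    # the pool importable again, without modifying anything on disk
    zpool import -nF -R /mnt/rpool_ws1 rpool_ws1

    # if the dry run looks sane, the same command without -n performs the rewind
    # zpool import -F -R /mnt/rpool_ws1 rpool_ws1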

dm17 commented 2 years ago

Is there any evidence of suspend doing this? As far as I know this only applies to "hibernation" (aka suspend to disk, not suspend to memory aka "s3").

bghira commented 2 years ago

ZFS has no control over this.

bghira commented 2 years ago

also, even if the distribution disables hibernate on ZFS, people just go out of their way to avoid all warnings and do whatever they want: https://askubuntu.com/questions/1266599/how-to-increase-swap-with-zfs-on-ubuntu-20-04-to-enable-hibernation

eblau commented 2 years ago

@bghira It doesn't seem believable to me that ZFS has no control over this. Surely there are callbacks that can be registered with the kernel to be invoked when hibernate is invoked. At the very least these could be implemented to hang or kernel panic instead of allowing hibernate to proceed and silently corrupt the zpool.

bghira commented 2 years ago

as i understand, they're GPL-only symbols.

danielmorlock commented 2 years ago

After digging into the kernel hibernation (suspend-to-disk) process with @problame, we figured out that ZFS (even with hibernate) did not cause the zpool corruption. Further, during our (rough) analysis, we did not find a reason why ZFS wouldn't work with hibernation, provided the swap is outside of ZFS. Even without the freeze_fs hooks mentioned by @kohlschuetter, hibernating with ZFS should "just work". I guess the hooks are relevant for hibernating to a swap that is inside the ZFS pool.

TL;DR: The problem was in genkernel (Gentoo's automatic kernel building scripts), which includes a script for the initramfs. This script handles LUKS decryption and boots from whatever is listed in the boot options. In my case I have an encrypted swap and an encrypted root containing the ZFS pool. The initramfs script decrypts the root and imports the zpool BEFORE it decrypts the swap holding the RAM state for hibernation. So the pool is always imported first, and hibernate then resumes a system state in which the zpool was already imported and online. I guess that is probably the reason for my corrupted zpool.

Thanks @problame for your support.
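
For anyone hitting the same ordering problem, a minimal sketch of what the initramfs should do instead, assuming swap lives outside the pool (device paths, pool name and mount points are placeholders, and lsblk must be available in the initramfs):

    # 1. decrypt the swap device first
    cryptsetup open /dev/disk/by-uuid/<swap-uuid> cryptswap

    # 2. hand its major:minor to the kernel; if a hibernation image exists the
    #    kernel resumes right here and this script never continues, otherwise
    #    the write is effectively a no-op and the cold boot carries on
    echo "$(lsblk -ndo MAJ:MIN /dev/mapper/cryptswap)" > /sys/power/resume

    # 3. only a cold boot gets this far, so importing the pool is now safe
    cryptsetup open /dev/disk/by-uuid/<root-uuid> cryptroot
    zpool import -d /dev/mapper -N -R /new_root rpool_ws1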

AttilaFueloep commented 2 years ago

Well, I can add to that. I've been hibernating regularly for a couple of years now without any problem so far (knock on wood). I have root on ZFS but boot and swap on LUKS on top of mdadm mirrors. I'm using mkinitcpio, if that matters.

luke-jr commented 2 years ago

Maybe ZFS should refuse to mount read-write without the user forcing it if it believes it was mounted at hibernation? (I'm assuming a read-only mount won't change anything on disk...)

bghira commented 2 years ago

if you can help point out which docs might need updating to include these hints then we might make some progress on it

i know there's a few install guides that are hosted by the OpenZFS project that could be enhanced. each one could link to a page of caveats, which would mean just one spot to maintain them.

i would suggest that it be added to the man pages but ever since they were 'modernised' by breaking them out into separate pages, i have found them to be less useful and rarely search them for information that's now easier to find on Google.

eblau commented 2 years ago

@danielmorlock @problame wow, thanks for that hint on the initramfs scripts issue! I use Arch Linux and think that I'm in the same situation with the scripts importing the zpool and then decrypting swap on resume.

I store the swap partition's key in the zpool itself so on resume it does the following:

  1. Runs "cryptsetup open" to prompt for the password to open the encrypted LUKS partition with the zpool on it.
  2. Imports the zpool.
  3. Invokes "cryptsetup open" to open the encrypted LUKS swap partition using the swap partition's key in the zpool.
  4. Unmounts the root dataset and exports the zpool.
  5. Resumes from hibernate.

I'm assuming that step 2 is the issue since the state of the zpool on disk could then differ from the in-memory state in swap that we resume from. Would this work if the pool is imported read-only in step 2?

problame commented 2 years ago

@eblau yeah, that sounds unhealthy.

Maybe ZFS should refuse to mount read-write without the user forcing it if it believes it was mounted at hibernation? (I'm assuming a read-only mount won't change anything on disk...)

zpool import should actually fail. I guess many scripts use zpool import -f to shoot themselves in the foot :) @danielmorlock can you confirm it was using zpool import -f?

Regardless, I think -f shouldn't be sufficient to import a pool that was hibernated. Idea:

Hibernation workflow:

  1. Generate a random hibernation cookie.
  2. Store the hibernation cookie in the in-DRAM spa_t, and somewhere on disk, let's say in the MOS config.

Resume workflow:

To prevent accidental imports, we extend zpool import / spa_import such that they fail by default if a hibernation cookie is present in the MOS config. This behavior can be overridden by a new flag, zpool import --discard-hibernation-state-and-fail-resume.
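
To make the intent concrete, a purely hypothetical transcript of how that could look from the command line; none of this exists today, and the error text is made up for illustration:

    # boot-to-resume path accidentally attempts a normal import
    zpool import rpool
    # proposed behavior: refuse, because the on-disk hibernation cookie is set
    #   cannot import 'rpool': pool has a hibernation image; resume instead

    # explicit opt-out, discarding the hibernation state (and the ability to resume)
    zpool import --discard-hibernation-state-and-fail-resume rpool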

Thoughts on this design, @behlendorf ?

eblau commented 2 years ago

@eblau yeah, that sounds unhealthy.

Maybe ZFS should refuse to mount read-write without the user forcing it if it believes it was mounted at hibernation? (I'm assuming a read-only mount won't change anything on disk...)

zpool import should actually fail. I guess many scripts use zpool import -f to shoot themselves in the foot :) @danielmorlock can you confirm it was using zpool import -f?

My initcpio script is not using -f. Here are the exact commands it runs:

    modprobe zfs
    mkdir /crypto_key_device
    zpool import -d /dev/mapper -R /crypto_key_device zroot
    cryptsetup open --type=luks --key-file /crypto_key_device/etc/swapkeyfile --allow-discards /dev/nvme0n1p3 cryptswap
    zfs unmount -f /crypto_key_device
    zpool export zroot

danielmorlock commented 2 years ago

zpool import should actually fail. I guess many scripts use zpool import -f to shoot themselves in the foot :) @danielmorlock can you confirm it was using zpool import -f?

I'm pretty sure that -f was not used; man genkernel says:

       dozfs[=cache,force]
           Scan for bootable ZFS pools on bootup. Optionally use cachefile or force import if necessary or perform both actions.

My kernel command line was:

options dozfs crypt_root=UUID=612a36bf-607c-4c8f-8dfd-498b87ea6b7f crypt_swap=UUID=8d173ef7-2af5-4ae5-9b7f-ad06985b1dd0 root=ZFS=rpool_ws1/system/root resume=UUID=74ef965e-688b-495d-95b4-afc449c15750 systemd.unified_cgroup_hierarchy=0

In the initrd phase, the zpool import is attempted before resuming from suspend-to-disk. So this is equivalent to importing an already-imported zpool from a different kernel, isn't it? Does ZFS track that it is already imported (from another system)?

bghira commented 2 years ago

I'm assuming that step 2 is the issue since the state of the zpool on disk could then differ from the in-memory state in swap that we resume from. Would this work if the pool is imported read-only in step 2?

@behlendorf would be best to answer, but from past discussions i recall him saying this would need to be very carefully handled (and might not be possible at all, since many symbols surrounding the hibernation code are GPL-only, possibly why nvidia's implementation sucks as well) as it could lead to undefined behaviour and crashes, as the ZFS code is currently written.

AttilaFueloep commented 2 years ago

@problame

That design sounds reasonable to me.

@eblau

I'm assuming that step 2 is the issue

Yes, definitely. I once wrecked a pool beyond repair by accidentally resuming: I had written to it from a different system between hibernation and resume, and after the resume things went south. But even an import/export cycle alone will change the pool so it won't match the state which is stored in the hibernation image.

Would this work if the pool is imported read-only in step 2?

IIRC there was an issue with read only import modifying some state of the pool. Not sure what the current situation is.

@danielmorlock

In the initrd phase, the zpool import is attempted before resuming from suspend-to-disk. So this is equivalent to importing an already-imported zpool from a different kernel, isn't it?

Yes.

Does ZFS track that it is already imported (from another system)?

Not by default. MMP (multi-modifier protection, aka multihost) handles that, but I can't tell if it would work in this case.
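
For anyone who wants to experiment with that: MMP needs a stable hostid and is enabled per pool. A sketch (pool name is an example), with no claim that it actually catches the hibernate/resume case:

    # give the machine a persistent hostid, then enable multihost on the pool
    zgenhostid
    zpool set multihost=on rpool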

Greek64 commented 2 years ago

@eblau Without getting too far off-topic, may I ask why you are doing this 2-stage LUKS unlocking? Is it because you only need to enter the unlock passphrase once (since the swap is then unlocked by a keyfile)?

If so, why not use something like decrypt_derived or decrypt_keyctl?

decrypt_derived basically generates a key based on an unlocked LUKS container. So you could unlock the root container normally, and the swap container is then automatically unlocked by a derived key based on the root container.

decrypt_keyctl is basically a key cache store. If both containers use the same passphrase, you only insert the passphrase once, it is stored in cache and then used for all containers.
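
For reference, on Debian-style initramfs-tools setups decrypt_derived is wired up via a crypttab keyscript, roughly like this (device paths and mapping names are examples; Arch's mkinitcpio needs a different mechanism):

    # /etc/crypttab
    # root is unlocked interactively; the swap key is derived from the already
    # open cryptroot mapping, so nothing has to be read from the zpool first
    cryptroot  /dev/nvme0n1p2  none       luks
    cryptswap  /dev/nvme0n1p3  cryptroot  luks,keyscript=decrypt_derived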

Also - in a hibernation sense - wouldn't it make more sense to decrypt the swap first before the root?

eblau commented 2 years ago

@Greek64 I will check out the decrypt_derived and decrypt_keyctl suggestions, thank you. I do indeed do 2-stage LUKS unlocking to avoid typing a password twice.

I do it this way due to ignorance. :) I researched LUKS on Arch Linux wiki pages and implemented that using the 2-stage unlock approach and then added hibernate/resume later without recognizing the bad interaction between the two.

Definitely it makes more sense to decrypt the swap before the root. That's why when I saw the explanation from @danielmorlock and @problame, I immediately realized the error of my ways.

Sorry for troubling the ZFS folks with this issue. The subject of this issue made me think that some ZFS support was missing. The crazy thing is that I hibernated every day for like 2 years using this approach and only hit zpool corruption like 3 times. Luckily I take backups religiously and never lost much, thanks to the magic of zfs send/receive.

danielmorlock commented 2 years ago

I've opened a bug ticket for genkernel: https://bugs.gentoo.org/827281

problame commented 2 years ago

We should close this issue since @behlendorf's original issue description is misleading people into believing that freeze & thaw are related to, or required for, supporting hibernation.

Barring some uncertainty from my side about in-flight IOs, my current understanding is that it's safe to hibernate a system with an imported pool if and only if the swap space into which the hibernation image is written resides outside of the zpool. As @danielmorlock and I figured out it's a very brittle process though, since ZFS has no safeguards if your initrd scripts accidentally import the zpool on boot-to-resume. I have outlined a design to improve this situation above and will create a new issue for it.

A few words on freeze_fs / unfreeze_fs, since they have been mentioned in this thread. The idea behind these callbacks is to block out all VFS ops to a single struct super_block through an rwlock. Here's the kernel function freeze_super that is invoked from the ioctl, if freeze_fs != NULL. As I understand it, the idea is that, with a one-super-block-per-bdev type of filesystem like XFS or Ext4, userspace can freeze the filesystem using the ioctl, then create a block-level snapshot or backup, then thaw it again using another ioctl. If that understanding is correct and an exhaustive description of the use case, then I believe it is ill-advised to implement the callbacks for ZFS, since other datasets (= other super blocks) will continue to do IO to the same zpool. And even if userspace is careful to freeze all datasets, the Linux VFS isn't aware of the ZFS management operations that perform IO to the pool (send/recv, properties, ...).
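
To spell out that use case for a one-super-block-per-bdev filesystem such as ext4 on LVM (device, volume group and mount names are examples):

    # block out all VFS writes to this one super_block
    fsfreeze --freeze /srv/data

    # take a block-level snapshot while the filesystem is quiescent
    lvcreate --snapshot --size 2G --name data-snap /dev/vg0/data

    # let writes continue
    fsfreeze --unfreeze /srv/data

With ZFS, the equivalent quiescing would have to happen at pool scope rather than per super block, which is exactly why the per-super_block freeze_fs hook is a poor fit.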

Note that there are also the super block ops freeze_super / thaw_super (yep, confusing naming). A filesystem can implement these instead of freeze_fs / unfreeze_fs if it wants to implement the infrastructure for locking out VFS ops itself instead of using the one provided by the freeze_super function.

Note also: btrfs. It's the mainline filesystem most similar to ZFS with regard to pooled storage (multiple blockdevs!) and multiple super blocks on top. Btrfs implements freeze_fs/unfreeze_fs. But the btrfs secret sauce is that a single btrfs filesystem (= pool in ZFS terms) only has a single struct super_block - the subvolumes (= ZPL datasets in ZFS terms) are implemented through mount_subtree.

problame commented 2 years ago

Maybe one more remark regarding just how brittle hibernation currently is: hibernation does a system-wide sync prior to "freezing" userspace and kernel threads. But there is no transactionality here, i.e., userspace and kernel threads can continue to dirty DMU state, issue and execute ZIOs to the pool, until they are "frozen" (sigh, naming, it should be called "descheduled for hibernation").

What does this mean?

behlendorf commented 2 years ago

@problame your general design makes good sense to me. It's been a while since I looked at this (years), but I agree it should largely be a matter of suspending the txg engine when hibernating and then resuming it after the system state has been restored.

generate a random hibernation cookie
store the hibernation cookie in the in-DRAM spa_t, and somewhere on disk, let's say in the MOS config

One avenue you may want to explore here is leveraging the existing zio_suspend() / zio_resume() machinery. Fundamentally this code implements the logic to cleanly halt the txg engine, including suspending the zio pipeline and halting any inflight zios. When resuming, the pipeline is restarted and any halted zios are resubmitted and allowed to proceed as if nothing happened. Today this is what's used to suspend the pool when, due to drive failures, there's no longer sufficient redundancy to allow the pool to continue operation. But I could imagine extending it so the pool could be intentionally put into this suspended state for hibernation and a cookie stored.
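
From the admin's point of view, today's version of that machinery is what you see when a pool loses access to its devices; a rough illustration (pool name is an example):

    # a pool whose I/O has been suspended shows up with state SUSPENDED
    zpool status -x rpool

    # once the devices are reachable again, clearing the errors resumes the
    # halted zios and the txg engine picks up where it left off
    zpool clear rpool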

Opening a new issue to track this work sounds like a good idea to me. I'd just suggest that we somehow reference this old issue from the new one to make it easy to find. I don't mind closing this one at all once we have a replacement to continue the discussion.

problame commented 2 years ago

@behlendorf I have created two follow-up issues:

I think you can close this issue now.

alek-p commented 2 years ago

Possibly related to https://github.com/openzfs/zfs/issues/13879