
Support freeze/thaw #260

Closed: behlendorf closed this issue 2 years ago

behlendorf commented 13 years ago

ZFS has hooks for suspending the filesystem but they have not yet been integrated with their Linux freeze/thaw counterparts. This must be done before you can safely hibernate a system running ZFS.

devsk commented 13 years ago

This is a MUST if I ever want to be able to use ZFS on my laptop. If not rootfs, I would still want to use ZFS for my other FSs on the laptop.

It is possible that some of the issues I faced when running ZFS as the root filesystem on my laptop were caused by this.

devsk commented 13 years ago

Brian, this is a much needed feature. Any idea how much work it is?

behlendorf commented 13 years ago

I haven't carefully scoped it but my gut tells me it's probably not that much work. Why is this feature so critical? It's really only important for laptops, correct? (Which I agree is important if you're using a laptop.)

What needs to be done here is to tie the Linux freeze/unfreeze hooks to the zfs_suspend_fs()/zfs_resume_fs() functions in zfs_vfsops.c. That should be just a couple lines of code, but then we need to review that change and make sure it's working as expected. Plus there will be the needed compatibility code for older kernels.

I'm not going to be able to get to this anytime soon but if you want to dig in to it I'm happy to review changes and comment. But I don't have time for the actual leg work on this right now.

devsk commented 13 years ago

It is important because 1. I absolutely need it on the laptop, 2. I need it on my desktop, which has been suspending to RAM/disk every night for the last 7 years. It is the best of both worlds: I save energy, I don't heat up my room in summer, and I get to restore my desktop workspaces just as they were the previous day.

Native ZFS has broken that tradition for me. And I would never want to blame ZFS for anything...;-)

I will dig into it though to see if I can come up with a patch for you.

kohlschuetter commented 13 years ago

This feature can be very important for home NAS environments, too.

These boxes are kept idling most of the time anyway, and S2R/hibernation can save a significant amount of power (about 15 W with my setup).

I encourage implementing this, maybe this is the missing link to get suspend-to-RAM fully working on my Zotac Fusion NAS :)

kohlschuetter commented 13 years ago

To easily test freeze/thaw, we could use xfs_freeze (from xfsprogs). It is documented to work on other FSes, too. Currently, of course, for ZFS, it reports that it is unable to do so.
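For a quick sanity check, the test would look something like this (the /tank mountpoint is only an illustration; today the freeze call is the step that fails on ZFS):

    xfs_freeze -f /tank    # freeze: flush dirty data and block new writes
    xfs_freeze -u /tank    # unfreeze (thaw)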

kohlschuetter commented 13 years ago

So, would this do?

    .freeze_fs  = zpl_freeze,
    .unfreeze_fs    = zpl_unfreeze,

added to zpl_super.c's const struct super_operations zpl_super_operations, with

    static int zpl_freeze(struct super_block *sb)
    {
            zfs_sb_t *zsb = sb->s_fs_info;

            return zfs_suspend_fs(zsb);
    }

    static int zpl_unfreeze(struct super_block *sb)
    {
            zfs_sb_t *zsb = sb->s_fs_info;
            const char *osname = /* what goes in here? */;

            return zfs_resume_fs(zsb, osname);
    }

What about returned error codes? Are they compatible?

behlendorf commented 13 years ago

That's going to be the gist of it. However, the devil's in the details, and that's why this isn't a trivial change. The questions you're asking are the right ones, but someone needs to sit down and read through the code to get the right answers. A few things to be careful of.

kohlschuetter commented 13 years ago

Some updates on this feature in my branch: https://github.com/kohlschuetter/zfs/commits/freeze (see https://github.com/kohlschuetter/zfs/commit/f9e8ae55b777031b332da9b748ad4fd842a96933 )

Freeze/unfreeze seems to work with Linux >= 2.6.35 and silently fails with 2.6.32/RHEL6. I haven't tried it with earlier kernels, though.

Before 2.6.35, freeze requires a block device set in the superblock, which zfs does not provide. The RHEL6 kernel can be patched easily by back-porting a few changes.

With a compatible kernel, freezing/unfreezing seems to work with xfs_freeze, but unfreezing fails with util-linux's fsfreeze (you can freeze with fsfreeze -f, but unfreezing only works with xfs_freeze -u). The reason is that the filesystem really freezes completely: you cannot even perform an fstat (which fsfreeze does before freezing or unfreezing).

I am not sure about the expected behavior here. Changes to the freeze behavior are in fact outside the scope of this patch; they should probably be made at the ZFS suspend/resume level.
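A small illustration of the behavior described above, assuming a ZFS filesystem mounted at /tank (the path is hypothetical):

    fsfreeze -f /tank      # freeze: works with either tool
    # fsfreeze -u /tank    # would hang: fsfreeze stat()s the mountpoint before
                           # thawing, and stat() blocks on the frozen filesystem
    xfs_freeze -u /tank    # thaw: works, as it does not touch the FS first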

baryluk commented 13 years ago

I actually think freeze/thaw is more important for backup scenarios and for cases where the underlying storage has its own snapshot/cloning mechanism (like iSCSI, or LVM on a local or remote machine, or snapshots of a zvol over iSCSI, etc.).

Freeze will make sure the underlying devices are in a consistent state, that all direct/synchronous data has actually been pushed to the devices, and will block all processes from further writes to the whole filesystem (this constraint could be relaxed as long as there is enough memory and no fsync/fdatasync/create/close/unlink/rename is performed; such calls should block only if actual write I/O would have to be performed). After successfully freezing the filesystem, one can safely create a snapshot/clone on the storage (LVM snapshot, zvol snapshot, NetApp snapshot), then unfreeze ZFS and use the snapshot for something (like dumping it to a tape streamer or another machine).
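Once freeze/thaw is wired up, that workflow would look roughly like this (the mountpoint, volume group, and snapshot names are illustrative):

    fsfreeze -f /tank/data                            # flush and block writes at the VFS level
    lvcreate --snapshot --name zfsbackup --size 10G \
        /dev/vg0/zfs-backing                          # snapshot the backing LVM volume
    fsfreeze -u /tank/data                            # resume writes
    # back up /dev/vg0/zfsbackup at leisure, then: lvremove /dev/vg0/zfsbackup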

behlendorf commented 13 years ago

@baryluk: Yes, after more carefully reviewing the kernel code, you're exactly right. In fact, supporting freeze/thaw appears to be useful only if you want to support md/lvm style snapshots under the ZFS vdev layer. That's something we probably don't care about. This support isn't needed for proper suspend/resume behavior. In fact, upon further inspection the filesystem doesn't really need to do anything to support this. So my question is... in practice, why doesn't it work today?

kohlschuetter commented 13 years ago

So we actually have two problems now:

  1. Suspend/resume doesn't work regardless of freeze/unfreeze.
  2. Freeze now works, unfreeze hangs the FS because fstat hangs on the frozen FS.

paulhandy commented 10 years ago

Has there been any progress on this issue in the last 3 years?

behlendorf commented 10 years ago

@paulhandy this hasn't been a priority for any of the developers. In part this has been because the interfaces provided for freeze/thaw by the Linux kernel have been horrible until fairly recently. I don't think this would be a ton of work if people were interested in working on it.

cyberius0 commented 9 years ago

Since I started using ZFS I have used a custom pm-sleep script which exports the pool before the system enters sleep mode. So I guess there is no better way to do it?
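A sketch of such a hook, assuming a pm-utils setup and a pool named RAID (the path, pool name, and the re-import on resume are all assumptions):

    #!/bin/sh
    # /etc/pm/sleep.d/10-zfs-export  --  export the pool before sleep, re-import afterwards
    case "$1" in
        suspend|hibernate)
            zpool export RAID
            ;;
        resume|thaw)
            zpool import RAID
            ;;
    esac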

kernelOfTruth commented 9 years ago

@cyberius0 so you basically log out of X and run the script to initiate suspend?

I've been wondering how to do this if /home is on a zpool - but this is probably the only way.

cyberius0 commented 9 years ago

Sorry, I didn't read the whole thread; my /home isn't on a zpool. The zpool is mounted at /RAID. Without the export before going to suspend, the filesystem "freezes". Then every attempt to access it, e.g. "ls /RAID/", leads to a frozen console/shell and I have to reboot the system to access the RAID again.

ccic commented 7 years ago

@kohlschuetter and @behlendorf , the above implementation has two issues:

  1. When the call of zfs_suspend_fs returns, the caller still holds two locks: 'z_teardown_lock' and 'z_teardown_inactive_lock'. The process cannot exit while holding those locks, otherwise it causes issues. So a possible modification is to call zfs_suspend_fs, sleep for a specified duration, and then call zfs_resume_fs to release those locks. In other words, call zfs_suspend_fs and zfs_resume_fs in pairs in order to acquire and release the locks correctly.
  2. zfs_suspend_fs/zfs_resume_fs is not sufficient to freeze/thaw the pool. Consider a scenario where several filesystems share one pool: if you suspend writes from some of the filesystems, the other filesystems are still allowed to write to the pool. Moreover, even while a filesystem is suspended, setting pool properties is still allowed. So that is not a "real" freeze of the disk. We should explore other methods, for example freezing the uberblock, to implement a "real" freeze.

behlendorf commented 7 years ago

@ccic thanks for taking the time to investigate how this functionality could be implemented. The changes proposed here should be considered an initial prototype/WIP to help us investigate the various issues. Unfortunately, adding this functionality hasn't been a priority for the team. Regarding your specific concerns:

  1. When the call of zfs_suspend_fs returns, the caller still holds two locks: 'z_teardown_lock' and 'z_teardown_inactive_lock'. The process cannot exit withholding those locks,

Good point. So one possible avenue worth exploring might be to have a freeze take a snapshot and then use the existing rollback code to effectively pivot on to that snapshot. That would allow us to use the existing model of suspend/rollback/resume except that you'd be resuming on an immutable snapshot.

  2. zfs_suspend_fs/zfs_resume_fs is not sufficient to freeze/thaw the pool.

Using a snapshot would provide a solid guarantee of immutability. As for allowing the pool to be manipulated, or other non-frozen filesystems to keep writing, it's not clear that's a problem. The VFS is only requesting that a specific super block be frozen. If freezing an entire pool is needed, then alternate interfaces will be needed.

ccic commented 7 years ago

@behlendorf thanks for sharing your thoughts. I know this feature is not a priority; I just want to get some clues about how to design and implement it. For the existing suspend/rollback/resume model, I have checked the code: zfs_ioc_rollback already contains the logic to suspend and then resume. So a possible avenue is to (1) take a snapshot of the specified filesystem, (2) roll back to it but wait for a while after suspending (here we freeze the fs), then (3) resume it. That would have the effect of freezing the filesystem. Am I correct?

behlendorf commented 7 years ago

@ccic yes, it should have that effect.

isegal commented 7 years ago

+1 Definitely, would love to have this feature for backup snapshot scenarios in the cloud.

For example on AWS one can perform a point-in-time snapshot of an attached EBS drive. Some backup tools rely on FS flushing and freezing so that the snapshot data is consistent. For example with xfs_freeze we are able to snapshot a raid array with no consistency issues.

An example of this is the mysql-ebs-backup script that's currently tailored for XFS on EBS: https://github.com/graphiq/mysql-ebs-snapshot.
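The XFS-on-EBS pattern that script relies on looks roughly like this (the volume ID and mountpoint are placeholders; the missing piece for ZFS is an equivalent of the xfs_freeze step):

    xfs_freeze -f /data                               # flush and block writes
    aws ec2 create-snapshot --volume-id vol-0123456789abcdef0 \
        --description "consistent point-in-time backup"
    xfs_freeze -u /data                               # resume writes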

If anyone knows of a workaround (the sync command, perhaps?), please do share.

ccic commented 7 years ago

A quick workaround for a single ZFS filesystem may be: (1) Expose zfs_suspend_fs/zfs_resume_fs to users through the zfs command. That takes less than 100 lines, I think. Since zfs_suspend_fs still holds two locks, it requires users to specify another input: how long to suspend, for example 5 seconds (I think that is long enough). (2) Execute "zfs suspend 5" ('suspend' is the command, 5 means 5 seconds) to freeze. (3) Run the other logic on the MySQL side. (4) Resume the ZFS filesystem automatically after the suspend timeout.

For a complete feature that freezes a zpool, we have to (1) flush dirty pages and (2) suspend writes to the pool. As it stands, suspending prevents data from being written to disk, but unfortunately it excludes synchronous writes. This still needs more investigation.

isegal commented 7 years ago

Does /bin/sync flush dirty ZFS pages to disk?

ccic commented 7 years ago

'sync' asks the filesystem to flush pending writes to disk, but I'm not sure about the answer.

MyPod-zz commented 7 years ago

@isegal no, sync doesn't guarantee that data lands on disk, only that it reaches at least the zil. The issue on O_DIRECT should explain why, iirc.

gasparch commented 7 years ago

Hi all!

So what is the progress on this? I'm trying to set up TuxOnIce + Ubuntu root/home on ZFS + LUKS to encrypt everything under ZFS.

If I use ZFS as the root fs for the system and hibernate to some other swap partition (which saves all kernel memory + some page cache), what can potentially go wrong with internal ZFS structures? Even if there is data still to be written to the main disk, it will be in the ZIL and will be saved as part of memory during the hibernation process, won't it?

If the system fails to restore from the hibernated state, that would be equivalent to a poweroff from the OS point of view, which is an entirely possible scenario during normal operation as well, and I'm not sure I want to protect against that one.

If the hibernation process restores properly and continues execution, how does it hurt ZFS? What structures can become corrupt? Does ZFS rely on the kernel page cache (since that is partially cleaned during hibernation)?

candlerb commented 6 years ago

The primary application for me would be taking live block-level backups of virtual machine images, where the VM (guest) is using ZFS.

In theory, if I took a snapshot while it was in the middle of flushing a transaction group or writing the ZIL to disk, the filesystem image would still be consistent and recoverable. I would be happier if I didn't have to rely on this.

There is a Qemu hook to "quiesce" a guest's filesystem before snapshotting it, by means of an agent running in the guest: see

It appears that under the hood, guest-fsfreeze-freeze just calls fsfreeze, presumably on all mounted filesystems. If a guest has ext4 for root and zfs for /data, I'd like the call to succeed, i.e. to report that it successfully froze all filesystems.
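For reference, libvirt exposes the guest-agent freeze/thaw calls via virsh, so the host-side sequence is roughly the following (the domain name is a placeholder; today the freeze step cannot cover a ZFS mount inside the guest):

    virsh domfsfreeze guest1                                      # issues guest-fsfreeze-freeze via the agent
    virsh snapshot-create-as guest1 backup --disk-only --atomic   # block-level snapshot of the guest disks
    virsh domfsthaw guest1                                        # issues guest-fsfreeze-thaw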

hermes-pimentel commented 6 years ago

behlendorf opened this issue on May 31, 2011...

candlerb commented 6 years ago

@gasparch:

Even if there is data to be written to main disk - it will be in zil and will be saved as part of memory during hibernation process, isn't it?

No: the ZIL is only ever used for synchronous writes.

What happens in normal operation is that ZFS batches up writes in RAM into a transaction group, and then flushes that transaction group to disk. Since ZFS doesn't overwrite any blocks in-place, you can in principle pull the plug at any point, and see either the filesystem state entirely as it was prior to starting to write the TG, or after the complete successful write of the TG. So at least the filesystem state is always consistent. (*)

But what happens to the blocks in an open transaction group during a hibernate/suspend, I don't know.

(*) This should be safe as long as the underlying disk isn't doing a scary amount of write reordering, such that the uberblock is written out before the blocks it points to.

dankimmel commented 6 years ago

If your goal is to serialize filesystem IOs before / after some event, and you don't care exactly when that event is with respect to other events outside of ZFS, I think you can work around the absence of this feature using zpool checkpoint today. You'd follow this procedure:

  1. when you're getting ready to take a hardware snapshot of the pool, run zpool checkpoint <pool> (which serializes the IOs to be before / after the next txg_sync and stores the uberblock so you can reference the old layout of the disk later)
  2. take a hardware snapshot of the pool
  3. discard the checkpoint on the live system using zpool checkpoint --discard (so that the spa can free the space taken by overwritten data again)
  4. if you ever restore from the hardware snapshot, run zpool import --restore-from-checkpoint to force it to go back to the serialization point stored in the checkpoint

This has the added advantage of not requiring all user IOs to pause while you take the snapshot, but you can't synchronize it with external events since the serialization point just happens "sometime" between when the zpool checkpoint command starts and when it finishes.
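A hedged sketch of that procedure (the pool name tank is illustrative, and the hardware-snapshot step depends on your storage):

    zpool checkpoint tank              # 1. record a rewind point at the next txg sync
    # 2. take the hardware snapshot of the underlying devices here
    zpool checkpoint --discard tank    # 3. free the space held by the checkpoint
    # 4. to restore, import the pool from the hardware snapshot using the
    #    checkpoint-rewind option documented in zpool-import(8), then discard
    #    the checkpoint again once the pool is back up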

dankimmel commented 6 years ago

Actually, there is also a step 5: after restoring and doing import --restore-from-checkpoint, you'd also want to discard the checkpoint with zpool checkpoint --discard (for the same reason as in step 3).

eblau commented 4 years ago

Wow, I did not realize this support was missing. I run ZFS as a root filesystem on a laptop. I've been successfully hibernating and resuming for years up until I upgraded to Linux 5.7.7 with ZFS 0.8.4. I've corrupted my zpool twice in the past month from hibernating and resuming with these versions.

I presume this is "expected" since these hooks are not present for ZFS. Should we never hibernate and resume with ZFS? Is there any way ZFS can block hibernation for safety if there are zpools online? Otherwise, how are users supposed to know this?

xuanruiqi commented 4 years ago

Hmm, so is this issue the only thing blocking hibernation support for ZoL?

cytrinox commented 4 years ago

Is the missing implementation only problematic for root on ZFS, or for every scenario? I want to build an 8x4TB raidz2 on my new workstation where root is on mdraid/ext4, but I need hibernate (suspend-to-disk) every day.

AttilaFueloep commented 3 years ago

Just a random data point: I've been using hibernation with root on ZFS for years without any issue, thanks to Arch Linux and zfs-git keeping me on current versions. Same use case: notebook, hibernated at least daily. It wouldn't hurt to checkpoint the pool before hibernating though.

8573 commented 3 years ago

As a different data point, in 2015 or 2016, and possibly as recently as 2019 (looking at Git history), I would occasionally accidentally suspend (or hibernate, I don't know) my laptop with root on ZFS, because logind would default to doing so on certain hardware events (esp. lid close). This would result in ZFS-related error messages printed on the kernel console that frightened me into trying harder to configure logind not to do that, and into remembering not to close the lid all the way.

(Edit: I don't know whether it was "suspend" or "hibernate", but it produced scary warning messages.)

bghira commented 3 years ago

^ hibernate and suspend are NOT the same thing. suspend is safe, hibernate is not - and doesn't work, anyway, because required initrd code isn't there.

Greek64 commented 3 years ago

@misterbigstuff I have most definitely been using hibernate on a daily basis (with the occasional clean reboot every 1-2 weeks) on a zfs-on-root Debian testing system for about 1.5 years now, without any problems...

bghira commented 3 years ago

with swap on ZFS? no non-ZFS partitions?

Greek64 commented 3 years ago

Well yes, the swap is on a separate non-zfs LUKS encrypted partition. But this issue is about freeze/thaw support for hibernation (Not hibernating using swap on zfs)

bghira commented 3 years ago

there are multiple issues. pure ZFS hibernate will not work because of both issues, but your setup will occasionally fail because of just this one.

eblau commented 3 years ago

@Greek64 @AttilaFueloep I was in the same situation as you. Daily hibernates with a LUKS-encrypted ZFS root and a non-ZFS LUKS swap partition were working fine for me for years. Then I got multiple zpool corruptions within a month of each other and came across this GitHub issue. I would strongly recommend not hibernating. Suspend works perfectly. If you do continue using hibernate, be sure to have a recent backup.

AttilaFueloep commented 3 years ago

@eblau I really appreciate your recommendation, and I know that I'm operating outside the specification here. What I do to protect against corruption is to run zpool checkpoint in the process of hibernating. The hope is to be able to rewind to this checkpoint should the pool become corrupt. I have recent file- and snapshot-based backups as well, of course, not least because I'm running the zfs master branch. So I think I'm going to continue to use hibernation unless bitten by pool corruption (once bitten, twice shy ;-).

By the way, our setups differ: mine has root on ZFS and just /boot and swap on LUKS. Maybe this makes a difference?
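A minimal sketch of that checkpoint-before-hibernate approach as a systemd sleep hook (the script path, the pool name rpool, and the choice to discard on resume are all assumptions; systemd-sleep(8) documents the pre/post arguments):

    #!/bin/sh
    # /usr/lib/systemd/system-sleep/zfs-checkpoint
    case "$1/$2" in
        pre/hibernate|pre/hybrid-sleep)
            zpool checkpoint -d rpool 2>/dev/null || true   # drop any stale checkpoint
            zpool checkpoint rpool                          # record a rewind point
            ;;
        post/hibernate|post/hybrid-sleep)
            zpool checkpoint -d rpool                       # discard after a successful resume
            ;;
    esac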

kloczek commented 3 years ago

Can someone summarise the current state of this issue? What is done? What needs to be done? Anything else?

asdf8dfafjk commented 3 years ago

Since there's been no reply, I can tell you something that definitely fails: ZFS on root, hibernated to disk. Upon turning the machine on, if the same root is booted again (which will eventually start the wake-up of the hibernated state), there is a chance your system will go kaput. (I've had 4 ZFS dataset losses this way.)

Extremely simple solution:

I now have two installations (both sharing the swap partition): one where I do my main work, which I keep suspending all the time, and a second which I use for the main GRUB entry and which is the one that actually wakes up my original work (thus avoiding an import of the pool).

EDIT: After 4 broken installations (in probably 3-4 months) before I came across this, I've been using my solution for 4 months (as of 2021-03-02) and things are perfect.

luke-jr commented 3 years ago

Ugh, shouldn't ZFS at least inhibit hibernation so long as this is a problem? (So the hibernation attempts fail instead of corrupting stuff)

xuanruiqi commented 3 years ago

I agree. If root is on ZFS, ZFS should probably inhibit hibernation.

asdf8dfafjk commented 3 years ago

@luke-jr @xuanruiqi

Guys, I use hibernate too, and sorry if it's rude, but I don't think it's appropriate to use "should" for volunteer work. These folks are giving away such an amazing (in the true sense of the word: schmucks like me now have mirrors, so I don't care about an HDD crash) piece of software for free, and telling them what should or should not happen doesn't seem appropriate.

Perhaps you could start a new issue and discuss design for your request and consider a PR?

Atemu commented 3 years ago

Discussing what should be done about an issue, including not immediately obvious short-term mitigations, is in no way disrespectful.
OpenZFS is a FOSS project and therefore given to us "for free" but that does not mean that it's free of issues. Discussing these is absolutely fine (and necessary IMO) as long as it's done constructively and in a nice tone.