openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

ability to suspend (and resume) all zpool IO (`zpool suspend|resume`) #12843

Open problame opened 2 years ago

problame commented 2 years ago

(This is a feature request extracted from https://github.com/openzfs/zfs/issues/260#issuecomment-982124508 )

Background

Linux supports freezing (and thawing) a mounted filesystem through an ioctl. Freezing suspends all IO requests to the underlying block device, which enables block-device-level snapshots, e.g., when the filesystem is deployed on top of a snapshot-capable volume manager. Note that freeze is not used during hibernation, contrary to what's stated in the opening comment of ZFS issue https://github.com/openzfs/zfs/issues/260. As far as I'm aware, the above is an exhaustive description of the use case for freeze & thaw.
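
For concreteness, here is a minimal userspace sketch of that ioctl path on a mainline filesystem such as ext4 or XFS (not ZFS); the mount point path is an assumption used only for illustration:

```c
/* freeze_demo.c -- drive the Linux FIFREEZE/FITHAW ioctls from userspace.
 * Requires root. "/mnt/data" is a placeholder for any mounted filesystem
 * that implements the freeze hooks. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>   /* FIFREEZE, FITHAW */

int main(void)
{
    /* Any open fd on the mounted filesystem works; a directory is typical. */
    int fd = open("/mnt/data", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    if (ioctl(fd, FIFREEZE, 0) < 0) { perror("FIFREEZE"); close(fd); return 1; }

    /* ... take the block-device-level snapshot (e.g. an LVM snapshot) here ... */

    if (ioctl(fd, FITHAW, 0) < 0) { perror("FITHAW"); close(fd); return 1; }

    close(fd);
    return 0;
}
```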

The Linux VFS provides two mechanisms for filesystems to support freeze & thaw. The first is to implement the freeze_fs / unfreeze_fs super block operations. The second is to implement the freeze_super / thaw_super super block operations.

If a filesystem implements the _fs type of operations, the VFS takes care of locking out all VFS operations by means of a set of rwlocks. Here's the kernel function freeze_super that is invoked from the freeze ioctl in that case. (Don't confuse the kernel function freeze_super with the ->freeze_super super block operation.)

If a filesystem implements the _super type of operations, the ioctls map more or less directly to these callbacks.
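
To make the two hook styles concrete, here is a sketch of how a hypothetical mainline filesystem might register them. This is illustrative code against include/linux/fs.h, not OpenZFS or any real filesystem, and exact callback signatures vary across kernel versions:

```c
#include <linux/fs.h>

/* Style 1: freeze_fs/unfreeze_fs. The VFS's own freeze_super() locks out
 * writers first and then calls into the filesystem. */
static int myfs_freeze_fs(struct super_block *sb)    { /* flush dirty data */ return 0; }
static int myfs_unfreeze_fs(struct super_block *sb)  { return 0; }

/* Style 2: freeze_super/thaw_super. The FIFREEZE/FITHAW ioctls map more or
 * less directly onto these, and the filesystem does its own locking. */
static int myfs_freeze_super(struct super_block *sb) { return 0; }
static int myfs_thaw_super(struct super_block *sb)   { return 0; }

static const struct super_operations myfs_super_ops = {
    /* Style 1 hooks: */
    .freeze_fs    = myfs_freeze_fs,
    .unfreeze_fs  = myfs_unfreeze_fs,
    /* Style 2 hooks; when present, the freeze ioctl path prefers these: */
    .freeze_super = myfs_freeze_super,
    .thaw_super   = myfs_thaw_super,
};
```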

However, neither of the hooks above is suitable for ZFS. The reason is that the Linux concept of freeze & thaw expects one super block to have exclusive control of N block devices, whereas with ZFS, M super blocks (= ZPL datasets) share the storage of N block devices. And then there are also management operations such as zpool scrub and zfs recv that perform IO and are not represented by super blocks at all.

Of course, it makes sense to look at how btrfs does it in this case. It's the mainline filesystem most similar to ZFS with regard to pooled storage (multiple block devices!) and multiple super blocks on top. Btrfs implements freeze_fs/unfreeze_fs. But the btrfs secret sauce is that a single btrfs filesystem (= pool in ZFS terms) only has a single struct super_block - the subvolumes (= ZPL datasets in ZFS terms) are implemented through mount_subtree.

UX Proposal

Instead of implementing the {freeze,unfreeze}_* callbacks, I propose to implement two new zpool subcommands. Here's the man page text that describes how the feature behaves towards the end user.

zpool freeze POOL TAG

  Request the zpool to be temporarily frozen.
  On success,
    - all currently dirty data in zpool POOL has been synced to disk,
    - active ZILs are empty and will not need replay,
    - claimed but unreplayed ZILs _will_ need replay,
    - all I/O operations to all vdevs in the pool are suspended,
    - new DMU operations (= all ZVOL and VFS operations) that dirty data will block
  until the corresponding `zpool unfreeze` operation.

  The TAG identifies this freeze request and must be passed to
  `zpool unfreeze` for undoing the freeze operation.
  TAGs must only have characters in `A-Za-z0-9-_:`.

  If a `freeze` for the same TAG is already present, the freeze operation fails.

  A frozen pool cannot be exported.
  If the system crashes while the pool is frozen, the import fails
  unless the pool is imported with zpool import --unfreeze TAG.

zpool unfreeze POOL TAG

  Removes the freeze for TAG that was created by a previous `zpool freeze`.

  Once the last freeze on POOL is removed, I/O operations for POOL are resumed.

zpool freezes [-p] POOL [TAG]

  zpool freeze POOL lists all active `freeze` TAGs on POOL.
  Each row contains the freeze TAG along with the first line of the --description if one was provided.

  zpool freeze POOL TAG
    - fails with an error to stderr if a freeze with name TAG does not exist on the POOL or
    - succeeds and prints the --description provided on freeze to stdout.
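
For illustration, here is a sketch of how a block-device-level backup could drive the proposed commands end to end. Everything below (the pool name tank, the TAG nightly-backup, and the --description flag) is hypothetical and exists only in this proposal, not in current zpool syntax:

  # quiesce the pool so the backing block devices are stable
  zpool freeze --description "nightly LVM snapshot" tank nightly-backup

  # take block-device-level snapshots of every vdev (volume-manager specific)

  # list active freezes to confirm the TAG is present
  zpool freezes tank

  # drop the freeze; once the last TAG is removed, I/O resumes
  zpool unfreeze tank nightly-backup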

Notes:

bghira commented 2 years ago

wouldn't this just hang forever if the system is on a ZFS root?

rincebrain commented 2 years ago

Secretly, zpool freeze is already a command, albeit with big flashing "don't DO that" notes: https://github.com/openzfs/zfs/blob/f04b97620059d08b37d9e80ada397e742bb2f311/cmd/zpool/zpool_main.c#L10958-L10974

problame commented 2 years ago

I'm aware of zpool freeze. As stated in the comment, that's for debugging. We should probably rename to something like zpool slog-test-freeze or whatever to prevent misremembered commands if we add zpool suspend|resume.

rincebrain commented 2 years ago

> I'm aware of zpool freeze. As stated in the comment, that's for debugging. We should probably rename to something like zpool slog-test-freeze or whatever to prevent misremembered commands if we add zpool suspend|resume.

Sure, sorry, I wasn't trying to suggest it would serve here, merely remarking that the name is used, and given ZFS's strong disinterest in breaking prior expectations for things, it might be uphill to convince people to rename even such an internal thing. (Also, I don't assume anyone knows it exists; I was quite surprised when I found it in the test suite one day.)

Then again, zdb has had options renamed around a few times, so maybe nobody will blink.

GregorKopka commented 2 years ago

The UX proposal has some typos at the end; it should read freezes instead of freeze:

zpool freezes [-p] POOL [TAG]

  zpool **freeze** POOL lists all active `freeze` TAGs on POOL.
  Each row contains the freeze TAG along with the first line of the --description if one was provided.

  zpool **freeze** POOL TAG
    - fails with an error to stderr if a freeze with name TAG does not exist on the POOL or
    - succeeds and prints the --description provided on freeze to stdout.

Also: I would prefer zpool freezes to only list the freezes, with a -v to also get the full descriptions.

deliciouslytyped commented 2 years ago

I don't remember my use-case right now, but can this be used like the suspension that happens when a storage device is physically yanked off the bus? (Without actually physically yanking it off the bus.)

MasterCATZ commented 1 year ago

So what should be done to allow the PC to be suspended?

Every time I have suspended, I end up with a failed pool with unrecoverable errors.

With raidz3, thankfully, so far enough data has stayed intact that it would resilver; I still do not want to end up in the situation I had with raidz2, where I lost everything.

All the disks show read/write/checksum errors when it resumes. If I leave the PC on and thrash the drives for a month I have 0 errors, so it's not a hardware issue, but every time I try suspend-to-RAM / hibernate, the pool falls apart.

And considering the HDDs are chewing up 1.2 kW and I only use the stored data 1% of the time, it would be good to be able to hibernate the NAS. Trying to unmount/export is a pain; even -force sometimes refuses, and I have no choice but to change fstab options, reboot, manually cut power to the disks, and turn them back on and mount when I do need them.

Not ZFS on root, it's just a storage pool.

5.18.19-051819-generic zfs-2.1.99-1389_g48cf170d5 zfs-kmod-2.1.99-1389_g48cf170d5

GregorKopka commented 1 year ago

> All the disks show read/write/checksum errors when it resumes. If I leave the PC on and thrash the drives for a month I have 0 errors, so it's not a hardware issue, but every time I try suspend-to-RAM / hibernate, the pool falls apart.

My suspicion is that the suspend/hibernation kicks in in the middle of a TXG, with some data already written to disk but the referencing metadata (including metaslabs and the uberblock) not yet persisted. A subsequent import then reads metaslabs that are out of date from the perspective of the frozen state, picks 'free' space that already contains new data, and repurposes it. When the hibernated state is later restored, the in-flight TXG continues to commit to disk, writing metadata that references data it believes is stable on disk but that was overwritten by the import (which is invisible to the hibernated state).

Check the logic inside your initramfs: most likely your distribution first imports the pool read/write and only afterwards checks for a hibernation state (which loads an in-memory state), triggering the scenario above. See https://github.com/openzfs/zfs/issues/14118#issuecomment-1303563790

nh2 commented 2 months ago

> So what should be done to allow the PC to be suspended?

@MasterCATZ Can you clarify: you mean "hibernated", not "suspended" as in suspend-to-RAM, right?