openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

A way to set a default volblocksize for a pool #15947

Open Rudd-O opened 7 months ago

Rudd-O commented 7 months ago

Currently there appears to be no way to set up a default volblocksize for a pool, or a dataset, such that any volume created within the container has a designated volblocksize.

What this means is that any software which creates volumes (I'm thinking the storage driver in Qubes OS as an example) must manually specify a hard-coded volblocksize, with no input from the administrator. This isn't always possible.
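To illustrate the point, here is a minimal sketch of what volume-creating software has to do today: pass the block size explicitly on every `zfs create -V` call. The helper name `zfs_create_volume_argv` and the dataset name `tank/vm/disk0` are made up for illustration; only the `zfs create -V <size> -o volblocksize=<value> <name>` form is the real command line.

```python
def zfs_create_volume_argv(name, size, volblocksize=None):
    """Build the argv for `zfs create -V` (hypothetical helper).

    If volblocksize is None, no -o option is passed and ZFS applies
    its built-in default; there is no per-pool or per-dataset default
    to fall back on, which is the gap this issue describes.
    """
    argv = ["zfs", "create", "-V", size]
    if volblocksize is not None:
        argv += ["-o", f"volblocksize={volblocksize}"]
    argv.append(name)
    return argv


if __name__ == "__main__":
    # The caller must already know the right value; the administrator
    # has no pool-level place to configure it.
    print(" ".join(zfs_create_volume_argv("tank/vm/disk0", "10G", "4K")))
```

The sketch only builds the argument list and does not run `zfs`, so it can be tried on a machine without a pool.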

(For context: the default volblocksize of 16K, combined with ext4 file systems atop the volumes created that way, is resulting in a guest->host write amplification of 3-4X, which is ridiculous.)
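A rough model of where a figure in that range could come from, assuming (this is an assumption, not something measured in this issue) that the guest issues isolated, aligned 4K writes and that each one forces the host to rewrite a whole volume block:

```python
def rmw_amplification(guest_write_bytes, volblocksize_bytes):
    """Bytes rewritten on the host per byte written by the guest.

    Simplified model: the write is aligned to the start of a volume
    block, and every touched block is rewritten in full. Metadata,
    compression, and write coalescing are ignored.
    """
    # Ceiling division: number of volume blocks the write touches.
    blocks_touched = -(-guest_write_bytes // volblocksize_bytes)
    return blocks_touched * volblocksize_bytes / guest_write_bytes


if __name__ == "__main__":
    print(rmw_amplification(4096, 16384))   # 4K guest write, 16K volblocksize
    print(rmw_amplification(4096, 4096))    # block sizes matched
```

Under these assumptions a 4K write onto a 16K block rewrites four times the data, while matched sizes rewrite only what was written. Real workloads will differ.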

Describe the feature you would like to see added to OpenZFS

A way to set volblocksize on a dataset, which in turn causes any volumes created within it to use that volblocksize, just as recordsize is inherited today.

How will this feature improve OpenZFS?

It will be possible to have inheritable policy for creation of volumes according to performance requirements, or global per-pool defaults.

rincebrain commented 7 months ago

FWIW, setting a very low volblocksize is often an exceedingly poor idea, which is why the default was 8K, and is now 16K normally (and higher if you're using draid).

Separate from that remark, at least in recent memory, the closest analogue I can think of is how special_small_blocks used to not do anything on volume datasets even though it was inherited and settable.

But the two important differences here are, A) volblocksize isn't mutable after creation, so a naive implementation would need to walk children for an explicit zfs set if you modified it, and B) it would break people who predated volblocksize being a valid property there, potentially, to not have an explicit property set, leading to...

C) I think this is probably doable if you just make the volume creation step trigger an explicit set if the property would have been inherited, and at that point, we get to

D) I think you would probably want something like default_volblocksize as a property, because I would expect setting volblocksize on filesystems to also break at least DEBUG builds of older code, and at the point where it's not inheritable by the volumes themselves, using a new property to affect defaults seems more reasonable than making the existing property "inheritable" when you can't change it, so inheritance isn't really the right mental model for it.

Does that make sense to you/fit your goal here?

I could also see an argument for a pool-wide property, but that seems janky for a number of reasons, like wanting different ones for different datasets, not being preserved in send-recv, and so on.
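Points C and D can be sketched as a lookup done once, at volume creation time. This is only a model of the proposal: `default_volblocksize` does not exist in OpenZFS, and the function name, the dictionary of per-dataset settings, and the dataset names below are all hypothetical.

```python
FALLBACK_VOLBLOCKSIZE = "16K"  # the current normal default mentioned above


def resolve_volblocksize(volume_name, explicit, default_props):
    """Pick the volblocksize to set explicitly on a new volume.

    default_props maps dataset names to a hypothetical
    default_volblocksize value set on that dataset.
    """
    if explicit is not None:
        return explicit  # the caller's -o volblocksize= always wins
    # Walk from the immediate parent up to the pool root and take the
    # nearest ancestor that sets the hypothetical property.
    parent = volume_name.rsplit("/", 1)[0] if "/" in volume_name else None
    while parent:
        if parent in default_props:
            return default_props[parent]
        parent = parent.rsplit("/", 1)[0] if "/" in parent else None
    return FALLBACK_VOLBLOCKSIZE


if __name__ == "__main__":
    defaults = {"tank": "8K", "tank/vm": "4K"}
    print(resolve_volblocksize("tank/vm/disk0", None, defaults))
    print(resolve_volblocksize("tank/other/disk1", None, defaults))
```

Because the resolved value is written to the volume as an explicit volblocksize at creation, later changes to the default on a parent would not touch existing volumes, which sidesteps the immutability problem in point A.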

GregorKopka commented 7 months ago

In case of having an additional property listed breaking old code: it would happen with any new property. So options are to either fix the old code or never ever introduce anything new... my vote is for the former.

While volblocksize is currently immutable, is there a really good reason why it needs to stay that way?

rincebrain commented 7 months ago

Yes, because it would be quite complicated to implement changing volblocksize on a volume after creation, for the same reasons changing recordsize on a dataset doesn't affect files larger than one record already. If you'd like to go implement it in a performant way and open a PR, by all means, but it's quite involved.

Nobody was suggesting doing nothing, or that the options were do nothing or break things. And entirely new properties get ignored if they're not recognized, generally, so it's not the same at all.

GregorKopka commented 6 months ago

I would expect parsers to only look at properties they expect for the dataset type queried, hence I would classify a parser that fails when encountering volblocksize being returned by zfs get all on a filesystem as defective.

Having looked a bit into options to make volblocksize and (while at it) recordsize mutable for existing files: As ZFS locates the DVA location for a requested offset in the block pointer tree (of a file/volume) by shifting with the set size of the file/volume... the so-far best idea I came up with would be to introduce an indirection for the metadata DVA pointers that allows them to point toward the old block pointer tree (still using the prior block size) for not-yet rewritten blocks - and on record size change write a whole new block pointer tree for the file/volume, that fully indirects back toward the old metadata, on writes (if needed) do R/M/W to pull in data from the old blocksize and free the old (meta +) data if no longer referenced by snapshots.
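The "shifting" part can be shown with a toy calculation. This is a simplified illustration and not OpenZFS code; it assumes power-of-two block sizes and ignores the indirect levels of the block pointer tree.

```python
def block_id(offset, blocksize):
    """Index of the data block that holds a given byte offset."""
    shift = blocksize.bit_length() - 1  # log2 for a power of two
    if blocksize <= 0 or (1 << shift) != blocksize:
        raise ValueError("blocksize must be a power of two")
    return offset >> shift


if __name__ == "__main__":
    offset = 40960  # byte 40K of the volume
    print(block_id(offset, 16384))  # index under a 16K block size
    print(block_id(offset, 4096))   # index under a 4K block size
```

The same offset maps to a different block index as soon as the block size changes, so a tree built for the old size cannot simply be reinterpreted with the new one. That is why the idea above needs either a rewrite or an indirection back to the old tree.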

Downsides I see with this would be the need for a read-incompatible on-disk format change, having to take the indirection for all non-rewritten blocks on reads, and the added code complexity (and the ability for new and exciting bugs to creep in) caused by the indirection.

Given all that... I lean towards a solution that would enable zfs recv to change the recordsize / volblocksize on freshly received files/volumes, which should be way easier to implement as the zfs streamdump format already delivers the data in an on-disk block-size agnostic format (offset+length within the file/volume).

Rudd-O commented 6 months ago

I don't want to change volblocksize on existing volumes. I want a default volblocksize property for newly-created volumes.