openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

"zfs clone" with BRT #15600

Open rincebrain opened 11 months ago

rincebrain commented 11 months ago

Describe the feature you would like to see added to OpenZFS

Now that block cloning has landed, it'd be cute to write an improved version of "zfs clone" that regenerates the entire snapshot source with BRT'd L0 blocks as a "clone" without the parent-child dependency that makes clones so inconvenient for some use cases.

How will this feature improve OpenZFS?

I personally don't use clones, ever, if I can avoid it, because the performance costs when you go to clean it up, not to mention the nasty web of dependencies you incur when you keep cloning things, aren't great to manage.

Making truly independent "clones" that only share data blocks would be ideal for my use case and, I suspect, many others. If the metadata is a significant fraction of the dataset, then obviously a BRT cloned dataset would be more expensive, but it is rather difficult, even if you're trying, to construct a nontrivial dataset where that's true.

Additional context

Of course, with BRT, you could just make a clean new dataset and do cp -a --reflink=auto /some/snapshot/ /new/dataset/ on Linux, subject to copy_file_range not promising it always clones things, and needing to preserve all the subtleties of any ACLs or anything else set on the dataset...or you could make it a single command, since we already have one that purports to do that.
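
For illustration, the manual route can be sketched in Python roughly as follows. This is a minimal sketch, not an OpenZFS tool: clone_file and clone_tree are hypothetical helper names, the paths are whatever you pass in, and it assumes Linux with Python 3.8 or newer for os.copy_file_range. As noted above, copy_file_range may clone or may do an ordinary copy, so the sketch falls back to a plain copy if the call is unavailable or refused. It carries over what shutil.copystat handles (mode bits, timestamps, and extended attributes where supported) and nothing more, and it skips special files and directory metadata.

```python
import os
import shutil


def clone_file(src, dst, chunk=1 << 30):
    """Copy one regular file, letting the kernel share blocks if it can."""
    with open(src, "rb", buffering=0) as fin, open(dst, "wb", buffering=0) as fout:
        remaining = os.fstat(fin.fileno()).st_size
        try:
            while remaining > 0:
                # May clone, may do an ordinary copy: no promise either way.
                n = os.copy_file_range(fin.fileno(), fout.fileno(),
                                       min(chunk, remaining))
                if n == 0:
                    break
                remaining -= n
        except (AttributeError, OSError):
            # copy_file_range missing or refused: restart as a plain copy.
            fin.seek(0)
            fout.seek(0)
            fout.truncate()
            shutil.copyfileobj(fin, fout)
    # Mode bits, timestamps, and xattrs where the platform supports them.
    shutil.copystat(src, dst, follow_symlinks=False)


def clone_tree(src_root, dst_root):
    """Recreate src_root under dst_root, one file at a time."""
    for dirpath, dirnames, filenames in os.walk(src_root):
        rel = os.path.relpath(dirpath, src_root)
        target_dir = os.path.normpath(os.path.join(dst_root, rel))
        os.makedirs(target_dir, exist_ok=True)
        for name in filenames + dirnames:
            src = os.path.join(dirpath, name)
            dst = os.path.join(target_dir, name)
            if os.path.islink(src):
                # Recreate symlinks instead of following them.
                os.symlink(os.readlink(src), dst)
            elif os.path.isfile(src):
                clone_file(src, dst)
            # Real subdirectories are created when os.walk reaches them;
            # devices, FIFOs, and sockets are skipped in this sketch.
```

Calling clone_tree on a snapshot directory and an empty target dataset would approximate the cp invocation above, minus everything cp -a preserves that this sketch does not.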

It'd be nice to make a ZCP do this, perhaps, but I think that would require extending ZCPs to know a lot more about the underlying file objects, or implementing a primitive for "clone all properties of file" and using that, if I recall what ZCPs can and cannot do offhand.

Or we could just make a magic ioctl that does this like the old clone ioctl does, but better.

Majiir commented 11 months ago

the performance costs when you go to clean it [a clone] up

Why would this be any better with a BRT-based filesystem clone? If I understand correctly:

The slow clone deletion method (prior to the livelist method introduced in https://github.com/openzfs/zfs/pull/8416) traverses the block tree of the clone, skipping nodes that were born prior to the clone. In the very worst case where the clone has completely diverged from its parent snapshot, this is equivalent to walking the entire block tree of the clone.

For a BRT clone that has no parent, you would always have to walk through the entire block tree of the clone, even if the clone has barely diverged from the source snapshot. Then, there is BRT accounting on top of each free.

Meanwhile, clone creation is cheap, while BRT-based clones would require a full walk and BRT entries created upon clone.
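
The difference in walk cost argued in this comment can be illustrated with a toy model. This is a sketch under stated assumptions, not OpenZFS code: Block, destroy_classic_clone, and destroy_full_walk are hypothetical names, and the only property relied on is that copy-on-write gives a parent block a birth txg at least as new as any of its children, so a subtree whose root predates the clone origin can be skipped whole.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    birth: int                      # txg in which this block was written
    children: list = field(default_factory=list)


def destroy_classic_clone(node, origin_txg):
    """Return (pointers examined, blocks freed) for a traditional clone."""
    if node.birth <= origin_txg:
        # Born before the clone was created: it belongs to the origin
        # snapshot, so nothing under it needs to be visited.
        return 1, 0
    examined, freed = 1, 1
    for child in node.children:
        e, f = destroy_classic_clone(child, origin_txg)
        examined += e
        freed += f
    return examined, freed


def destroy_full_walk(node):
    """Blocks visited when nothing can be pruned by birth txg."""
    return 1 + sum(destroy_full_walk(c) for c in node.children)


# Clone created at txg 100; a single leaf was rewritten at txg 150,
# which also rewrote its indirect parent and the root.
root = Block(150, [
    Block(150, [Block(150), Block(90)]),
    Block(90, [Block(90), Block(90)]),
])

print(destroy_classic_clone(root, origin_txg=100))  # (5, 3)
print(destroy_full_walk(root))                      # 7
```

In this model the barely diverged traditional clone examines 5 block pointers, while a dataset with no origin to prune against visits all 7 blocks, and the gap widens as the unmodified share of the tree grows.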


not to mention the nasty web of dependencies you incur when you keep cloning things, aren't great to manage.

BRT-based clones would avoid the problem that you cannot delete or split a clone parent (see https://github.com/openzfs/zfs/issues/2105). In that issue, I wrote a comment proposing that block cloning could facilitate a clone parent split operation. Rather than cloning every block, this would scan the clone parent snapshot for the blocks born in that snapshot and clone at most those (minus blocks no longer referenced in the clone). In other words, we can use the BRT more selectively and continue to leverage the snapshot and clone mechanisms that work for us.
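
As a rough sketch of that selection rule, assuming a toy model in which each block has an id and a birth txg: blocks_to_clone, origin_blocks, prev_snap_txg, and clone_refs are hypothetical names, and "born in that snapshot" is read here as "born after the previous snapshot was taken". The real proposal is described in the linked issue.

```python
def blocks_to_clone(origin_blocks, prev_snap_txg, clone_refs):
    """Pick the blocks a clone-parent split would need to BRT-clone.

    origin_blocks: dict of block id -> birth txg, for the origin snapshot
    prev_snap_txg: txg of the snapshot taken before the origin snapshot
    clone_refs:    set of block ids the clone still references
    """
    return {
        block_id
        for block_id, birth in origin_blocks.items()
        # Born in the origin snapshot, and still in use by the clone.
        if birth > prev_snap_txg and block_id in clone_refs
    }


origin = {"a": 10, "b": 40, "c": 45, "d": 20}
print(blocks_to_clone(origin, prev_snap_txg=30, clone_refs={"a", "b", "e"}))
# {'b'}
```

Here only block b qualifies: a and d predate the previous snapshot, c is no longer referenced by the clone, and e was never part of the origin snapshot.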


Clones have an advantage that you can zfs send them, while for BRT this is not (yet?) an option.


In general, many problems can be solved with BRT. In principle, we could reference-count everything (take a snapshot? BRT all the blocks!). There are complicated schemes for doing accounting on blocks so that we can keep certain operations fast. I think we should look for similar ways to solve clone-related problems and use BRT sparingly.

rincebrain commented 11 months ago

The postscript on your comment is condescending and rude, please don't write comments like that on here.

Majiir commented 11 months ago

Sorry that you took it that way, as that wasn't my intent. I edited it to avoid misunderstandings.

rincebrain commented 11 months ago

So let's break this down for a moment.

Block cloned clones would have to walk the entire metadata tree every destroy

You have to do that in either case, because that's how you know what's being freed. It's just that clones right now have to also walk the livelist or the whole old tree to figure out what they can actually free.

And the livelist, if I understand how it works correctly, keeps growing without bound the longer the clone exists (with some condensing), while if you have a bunch of L0 data records cloned, then as the hypothetical BRT-clone diverges, you have fewer things you have to check.

So even if the BRT-clone was more expensive per data record delete, which I don't think it would be by nature, it'd be a win the farther the clone diverges.

Clone blocks more selectively with a clone-promote operation to divorce the clone from the parent

That'd be fine, and isn't really in conflict with this proposal, except that doing that would incur updating all the metadata of the clone now for the newly "written" BRT blocks, which is the same cost you'd have incurred initially for generating it.

You can send clones

You can mark anything as a clone if you feel like it with the zfs receive -o origin= functionality. It just uses nopwrite to write everything again and hopes that'll hide the delta, but the size of the snapshot sent is the full one for either functionality.

Vlad1mir-D commented 11 months ago

I would argue that it's better to add a flag to zfs clone which would allow the user to specify the way this clone should be made, i.e. either with BRT or the good ol' way.

rincebrain commented 11 months ago

I don't think in any way I argued that we should remove the old zfs clone.

People seem to keep making up things I didn't suggest to argue against.

GregorKopka commented 10 months ago

I think he just suggested that this feature should be added in a way that it can be triggered through a flag to zfs clone.

Vlad1mir-D commented 10 months ago

I think he just suggested that this feature should be added in a way that it can be triggered through a flag to zfs clone.

This ^