openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

[Feature] Compression Migration Tool #9762

Open PrivatePuffin opened 4 years ago

PrivatePuffin commented 4 years ago

Describe the problem you're observing

Currently one can change the compression setting on a dataset and this will compress new blocks using the new algorithm. This works perfectly fine for many people during normal use.
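As a minimal sketch of the current behaviour (pool/dataset names here are hypothetical): the property change takes effect immediately, but only for blocks written from that point on, so the reported compressratio only drifts towards the new algorithm as data happens to be rewritten.

```
# Hypothetical dataset: switch the algorithm for future writes only
zfs set compression=zstd tank/data
# Existing blocks keep their old algorithm; the ratio reflects a mix
zfs get compression,compressratio tank/data
```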

However, there are three scenarios where we would want an easy way to recompress a complete dataset:

While it's perfectly possible to send data to a new dataset and thus trigger a recompression, this has a few downsides:

A preferred way to handle this would be a feature which recompresses the existing data on disk "in the background", just like a scrub or resilver. This also has the added benefit of letting us force a recompression if we deprecate, replace or remove an algorithm.

This feature would enable us to go beyond the requested deprecation in #9761.

scineram commented 4 years ago

How do you plan to implement this? What is to happen to snapshots? Recv already does this.

PrivatePuffin commented 4 years ago

@scineram Snapshots would be a problem indeed. I don't have a "plan" to implement this, otherwise I wouldn't file an issue ;)

How do you suggest we handle future removal of compression algorithms and zero-downtime changes of on-disk compression otherwise? I don't think recv covers this use case, or does it?

If so: where is the documentation about using recv in this way? It would have very low downtime, of course...

richardelling commented 4 years ago

this requires block pointer rewrite

PrivatePuffin commented 4 years ago

@richardelling Precisely, I didn't say it was going to be easy ;)

InsanePrawn commented 4 years ago

this requires block pointer rewrite

I personally would be fine if this feature initially behaved like (or leveraged) an auto-resumed local send/receive plus some clone/upgrade-like switcheroo in the background, and obeyed the same constraints, even if that unavoidably means temporarily using twice the storage required by the dataset being 'transformed', as long as it has the user interface of a scrub (i.e. it's triggered through a zfs subcommand, appears in zfs/zpool status, gets resumed after reboots, can be paused, stopped, etc.).

The applications for this go beyond just applying a different compression algorithm:

One could hack something like this together using zfs send/recv; it'd probably involve a clone receive and some upgrade shenanigans, but it would definitely not be the same as having a canonical zfs subcommand with the above-mentioned UX. In particular, such a subcommand would cleanly resolve some "please unshoot my foot" situations that inexperienced and/or sleep-deprived users might get themselves into, for example choosing the wrong compression algorithm/level a year before realizing it, without the need to figure out and possibly script (recursive) zfs send and receive. Also, zfs is probably in a better position to do a much cleaner in-place swap of the two versions of the dataset when the 'rewrite' is done, probably like a snapshot rollback, and will most likely not forget to delete the old version afterwards, unlike my hacky scripts, which break all the time. 😉
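For what it's worth, a rough sketch of the send/recv flavour of that switcheroo might look something like the following (dataset names are hypothetical, incremental catch-up sends and the final unmount window are glossed over, and it temporarily needs space for both copies):

```
# One-off full copy into a dataset created with the new settings
zfs snapshot tank/data@migrate
zfs send tank/data@migrate | zfs recv -o compression=zstd tank/data_new
# ...incremental sends until the delta is small, then a short downtime window...
zfs rename tank/data tank/data_old
zfs rename tank/data_new tank/data
zfs destroy -r tank/data_old   # only after verifying the new copy
```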

Future future work ideas:

PrivatePuffin commented 4 years ago

@InsanePrawn

One could hack something like this together using zfs send/recv, it'd probably involve a clone receive and some upgrade shenanigans, but it would definitely not be the same as having a canonical zfs subcommand with the above-mentioned UX

Yes, that's mostly the point... I think more advanced users can do things that get pretty close (and pretty hacky), but making it "as easy as possible" for the median user was the goal of my feature request...

Lady-Galadriel commented 4 years ago

@InsanePrawn, given enough space, yes, a transparent ZFS send/receive would be a way to go. All new writes go to the new dataset, and any read not yet available in the new dataset would fall back to the old dataset. Once the entire dataset is received, the old dataset is destroyed.

Theoretically, we could almost do it without enough space for the whole dataset: once one file is entirely copied to the new dataset, the file could be deleted from the source dataset.

If something like this were implemented, resuming after a zpool export would also have to be part of the work. Otherwise, the pool would remain in a partially migrated state.

This does have the advantage of re-striping the data. Simple example: you have one vdev, and when it gets fullish you add a second vdev. The data on the first, (if not changed), remains only on the first vdev, and even newly written data may favor the second vdev as it has the most free space. Something like the above can help balance data, even if we don't need to change checksum, compression or encryption algorithms.
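As an aside, the imbalance described above is easy to observe today; per-vdev allocation is reported by zpool (pool name hypothetical):

```
# Shows ALLOC/FREE per vdev, which a rebalancing rewrite would even out
zpool list -v tank
```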

Back to reality, snapshots & possibly even bookmarks would be a problem. Even clones of snapshots that reference the old dataset would still reference the old data & metadata, (be it compression, checksum or encryption changes).

hhhappe commented 4 years ago

I think a simple "reseat" operation for a file/dir would be the most practical interface, i.e. an operation that did this transparently:

cp A TMP
rm A
mv TMP A

Perhaps not the easiest to implement. Lustre has a similar feature called "migrate", which is more about re-striping data.

Snapshots etc should just keep referencing the old data.
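Until something like that exists in ZFS itself, a crude userspace approximation of the "reseat" could be scripted along these lines (paths are hypothetical; it breaks hardlinks, changes inode numbers, is not atomic, and the old blocks remain referenced by snapshots until those are cycled out):

```
# Rewrite every file so its data is re-allocated with the current dataset settings
find /tank/data -type f -print0 | while IFS= read -r -d '' f; do
    tmp="$f.reseat.$$"
    cp -a -- "$f" "$tmp" && mv -- "$tmp" "$f"
done
```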

fredcooke commented 3 years ago

My interest is in copies= changes as per the above-mentioned ticket (2 to 1, in particular). In that specific sub-case it feels like it should be something like:

file handle -> list of pointers to data (and properties thereof), with duplicates for every block if copies=2

So in my mind (and perhaps not in the source code) it would be as simple as looking at that structure, choosing one of each duplicate to release, removing one duplicate from the list, and freeing that region of block device for reuse. And in reverse, iterating over the single set of blocks and writing a new one for each and adding them to the list/set.

I could see the encryption and compression being more difficult as you'd have to decode the existing block and write it again with a new algorithm and then swap the entire block set out for the file in question, somehow atomically. I'm not sure if there's a layer there that would allow two sets of blocks under the hood and the file handle to switch pointers from one to the other.

gmelikov commented 3 years ago

@fredcooke metadata is checksummed too, so we can't easily change those structs. But you're right, the copies case may have room for some hacks to ignore a wrong (freed and already reused) copy; the question is how ugly and how cheap that would be.

fredcooke commented 3 years ago

Surely a checksum can be recalculated and rewritten too, just as if the file itself is modified, no?

What both of these tickets need is a champion who is expert in the guts of this beast to come up with a coherent thorough file-re-write-in-place plan and then delegate the work out to mere mortals like me :-D

gmelikov commented 3 years ago

Surely a checksum can be recalculated and rewritten too, just as if the file itself is modified, no?

Aaand you need to recalculate checksums for all blocks up the Merkle tree (if you try to change existing blocks in place, which we shouldn't do in the ZFS CoW paradigm). For a general solution you might want to look at the "block pointer rewrite" idea, which is hard to implement: https://github.com/openzfs/zfs/issues/3582#issuecomment-123901505
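A toy illustration of why an in-place change ripples upwards, with nothing ZFS-specific involved, just chained checksum files standing in for the tree:

```
# "parent" records the children's checksums, "root" records the parent's
printf 'block A' > a; printf 'block B' > b
sha256sum a b > parent
sha256sum parent > root
# rewriting one block in place invalidates the recorded checksum...
printf 'block A, recompressed' > a
sha256sum -c parent   # FAILED: the parent (and thus the root) must be rewritten too
```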

I don't want to demotivate you, it would be really great to have bprewrite at last!

fredcooke commented 3 years ago

So I watched Matt Ahrens' 1.5 hour 2013 OpenZFS talk on YT, and BPrewrite as he describes it is something that would modify the past, not just the present, and is therefore risky and difficult, as detailed in that video and elsewhere. I don't want that.

I want snapshots to remain immutable and honest; I think anything else is harmful. That said, I do see the uses of BPrewrite for defrag, device removal, rebalancing, etc., though perhaps those will always be pipe dreams in order to keep the project moving, or perhaps an offline-only variant for those would be fine.

However, something lower-level than cp/mv and less painful than send/receive would be nice to have for rewriting files, without doing it in userspace, without doing it globally, and without making a snapshot a lie.

I'd be happy enough with something zfs-aware that could rewrite a tree of files as needed so the latest settings stick, in the knowledge that the rewritten data would exist in addition to what prior snapshots reference and thus require some snapshot cycling to reclaim the space (normal). Then I could gradually migrate a sub-dir or a dataset at a time, and once the earlier snapshots were all gone, the space would naturally be freed and there'd be room to do the next one, etc.

Might be time to start poking around the source instead of talking hypothetically at a high level about something I know nothing about :-D

djdomi commented 2 years ago

Dear all, I would like to support this request: I mostly use lzo or lz4 for compression, and for some kinds of storage I would later like to switch to zstd with maximum compression, or zstd-fast.

The weird thing is that (I know ZFS is not btrfs) btrfs can recompress its files, e.g.: https://askubuntu.com/questions/129063/will-btrfs-automatically-compress-existing-files-when-compression-is-enabled https://wiki.ubuntuusers.de/Btrfs-Mountoptionen/

I am still wondering why this has not been implemented in ZFS.
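For comparison only (this is btrfs, not ZFS): the recompression described in those links is done by rewriting existing files through btrfs' defragment path, roughly like the following (mount point hypothetical; see the later comment in this thread for the pitfalls):

```
# Rewrites existing file data with zstd compression on btrfs
btrfs filesystem defragment -r -czstd /mnt/data
```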

Konrni commented 2 years ago

Possible duplicate of #3013? But I still hope it will come someday.

danieldjewell commented 2 years ago

@djdomi

While I'm not an expert on ZFS by any means, I do know that changing the compression on btrfs is accomplished through the "defragment" mechanism... but this has a number of pitfalls (which, as of now, AFAIK, haven't been solved) -- notably, that it removes deduplication. (Deduplication in btrfs is completely different from ZFS: offline, i.e. applied after data is written, similar to NTFS deduplication, vs. ZFS's online/on-the-fly approach.)

  • If one wants to increase the compression ratio of currently compressed data

For the most part, decompression speeds are really fast with both ZSTD and LZ4. In fact, ZSTD is somewhat unique in that decompression speed is pretty constant regardless of the compression level - this article from the FreeBSD Journal has an excellent analysis of this very thing... (Their results indicate that in some cases ZSTD decompression can be even faster than LZ4... not to mention faster than no compression...) Given that they're both fast and pretty constant, I would suggest that there isn't much to be gained by changing compression in order to improve decompression speed (unless you're not using ZSTD/LZ4).

With that in mind, the problem comes down to: tuning the speed of new writes (which can easily be done with zfs set compress=<whatever> pool/vol) and possibly "upgrading" compression to ZSTD/LZ4 (which, while not perfect, can be done for the most part with a zfs create and rsync).
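A rough sketch of that create-and-copy route (names hypothetical; ZFS snapshots and clones of the old dataset don't carry over, and it needs space for both copies until the old one is destroyed):

```
zfs create -o compression=zstd tank/data_zstd
rsync -aHAX /tank/data/ /tank/data_zstd/
# swap mountpoints (or rename) once the copy is verified, then destroy the old dataset
```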

Given that decompression speed is always (or nearly always?) faster with ZSTD/LZ4 compared to no compression, I can't imagine a scenario where you'd want to remove it...? (And if it isn't faster on your hardware, that's something that should be tested/benchmarked/discovered before putting a system into production.)