compression=lz4hc support

ZeroChaos- commented 11 years ago

The title says it all. I have a lot more processing power than IO throughput and I believe lz4hc would be advantageous over lz4. Can support for compression=lz4hc be added?

DeHackEd commented 11 years ago

If processing power is really that much higher, I'd suggest just using gzip instead. No one's done lz4hc, mainly because the compression speed is staggeringly worse than lz4 itself.

See http://wiki.illumos.org/display/illumos/LZ4+Compression -- The "LZ4 Performance" section links to a spreadsheet detailing compression statistics. Other than decompression speed, using gzip-6 (the default compression level) is superior to lz4hc. Is decompression speed still a big concern?

ZeroChaos- commented 11 years ago

it seemed like it would be a good thing, but if it is a lot of effort for no expected return then feel free to close. If it is perceived a trivial task though, it may be worth it for the testing alone.

aarcane commented 11 years ago

This would be useful for a number of situations. Namely, the WORM cases. Lz4hc does more work on compression to create a smaller, completely binary compatible, compressed chunk that decompressed even faster than lz4.

This would be useful for a number of situations. The main ones being virtual machine distribution, vm master images, and long term data archival. I envision setting compression=lz4hc, writing out the basic data of an image, setting compression to lz4, then cloning to create a live representation.

Because the on disk format is binary compatible, there's no reason to differentiate between lz4 and lz4hc when storing data to disk and every reason not to.

I think the WORM cases provide enough merit to undertake this feature, assuming it's not difficult to accomplish. On Nov 25, 2013 6:46 PM, "ZeroChaos" notifications@github.com wrote:

it seemed like it would be a good thing, but if it is a lot of effort for no expected return then feel free to close. If it is perceived a trivial task though, it may be worth it for the testing alone.

— Reply to this email directly or view it on GitHubhttps://github.com/zfsonlinux/zfs/issues/1900#issuecomment-29263450 .

behlendorf commented 10 years ago

I've no objection to leaving this open as possible feature if someone would like to work on it.

DeHackEd commented 10 years ago

I'll give it a spin if nobody else wants to.

First step is determining if I need a new feature flag or if I can manage without one.

aarcane commented 10 years ago

The only thing you need to change on disk is you need to somehow store whether you're using lz4 or lz4hc, but that can be a user property which will require no feature flag. On Dec 3, 2013 1:50 PM, "DeHackEd" notifications@github.com wrote:

I'll give it a spin if nobody else wants to.

First step is determining if I need a new feature flag or if I can manage without one.

— Reply to this email directly or view it on GitHubhttps://github.com/zfsonlinux/zfs/issues/1900#issuecomment-29755094 .

Sachiru commented 9 years ago

I'd like to add another use case for lz4HC:

/, /bin, /boot and /lib

These are generally immutable and rarely updated, however an increase in compression ratio will always be welcome. Having rootFS and libraries/binaries on LZ4HC would be a godsend because of smaller transfers.

vozhyk- commented 9 years ago

I've made a naive port of LZ4HC (looking at how lz4 was introduced and doing the same) in https://github.com/vozhyk-/zfs/tree/lz4hc. What is left to do is to clean up the code (which isn't fully adapted to the illumos || Linux coding style, and contains parts of the code in lz4.c, lz4.h), test it properly and update the manpages.

So far the compression seems to work correctly (diff not finding any differences between copies of a Linux tree compressed with lz4/lz4hc, and no problems with correctness observed otherwise). Tests done with a 5317MiB (lz4-compressed size) game (The Walking Dead) (with echo 2 > drop_caches done between tests until ARC size dropped to 32M):

lz4hc-compressed copy takes up 4630MiB
making a copy of the lz4-compressed directory with lz4 takes ~12min, with lz4hc - ~13min
the game seems to start a bit faster when it's compressed with lz4hc (max. ~9s difference)

vozhyk- commented 9 years ago

The current code doesn't add any feature flags, but adds a compression=lz4hc property. I'm not sure if this is good, since other ZFS implementations would probably get confused when seeing an unknown compression type. There doesn't seem to be a way to set compression=lz4hc, but have the disk format tell that the blocks can be decompressed with lz4 in current ZFS. Also, LZ4HC supports specifying the compression level (1..16), but currently the default (9) is used.

behlendorf commented 9 years ago

@vozhyk- thanks for working on this. Yes, we may need to add another feature flag to support this even through the existing lz4 implementation can technically decompress it. We want to be very careful about breaking the on-disk format in any way. Although I admit it would be nice to somehow cleanly avoid the need for another feature flag. Clever ideas welcome.

If you could finish any remaining cleanup and open a pull request we can try and get you some feedback.

ilovezfs commented 9 years ago

It seems like the most important thing would be replicating the "gzip | gzip-N" semantics with "lz4 | lz4hc | lz4hc-N."

DeHackEd commented 9 years ago

A dataset property like lz4mode=[lz4 | lz4hc-9 | lz4gc-15] might make sense in that regard. Users may want to choose what compression speeds are on a per-dataset basis.

A module parameter to control L2ARC compression strength is where I was thinking of going with it.

ilovezfs commented 9 years ago

@DeHackEd So legal values involving lz4 would be on (default is now lz4) lz4 lz4hc (== lz4hc-9) lz4hc-1, lz4hc-2, …, lz4hc-16

Any others?

ilovezfs commented 9 years ago

@DeHackEd The CLI should match the CLI for gzip, not introduce a separate mode, unless the same change is also being made for gzip and we're introducing a separate gzip mode dataset property, too.

DeHackEd commented 9 years ago

In that case you'll need a feature flag because the table of compression algorithms used internally will change.

(Not that my suggestion wasn't a dirty hack to begin with. Anything that avoids feature flags is going to be messy).

I'm stepping out.

Bronek commented 9 years ago

@DeHackEd @ilovezfs I was afraid of that, but there does not seem to be sane way out.

ilovezfs commented 9 years ago

@DeHackEd As you said doing it as dataset property could work (assuming it's truly always on-disk compatible), but I'd hope that would be invisible to the user and that zfs create -o compression=lz4hc-13 foo/bar would do the right thing (set compression=lz4 and lz4mode=lz4hc-13), and that zfs get compression foo/bar would return lz4hc-13 not lz4.

Bronek commented 9 years ago

Here is an idea : introduce a new property compressionlevel and make it apply both to gzip and lz4 compression. Actual semantics and validity of compressionlevel will be determined by compression format in context of which it is set.

If compression=gzip-N is set, enforce that compressionlevel is either absent (older pools) or set to the same N (we might also agree on different reconciliation rule, e.g. smaller number wins). If compression=gzip compressionlevel=N is set, make it valid and equivalent to compression=gzip-N .

For lz4 , let the compressionlevel alone decide compression level and style: 0 is lz4 , anything greater than 0 is lz4hc (for example, you might set compression=lz4 compressionlevel=9 to get strong lz4hc compression ; or compression=lz4 compressionlevel=0 to get the default, fastest compression available to ZFS)

For compression formats which do not support choice levels (ie. anything that is not gzip or lz4), this property would be ignored, thus allowing for compression override in parts of filesystem without the necessity to update compressionlevel

The fundamental property of this model is that it decouples compression algorithm (called "level") from compression format , i.e. you can have data with compatible compression format (e.g. gzip, lz4 etc) but written by different algorithms (i.e. "levels", identified by number specific to format)

One downside of this proposal is that we will have to live, for some time, with duplication in gzip compression levels (which I explain at the start). In the face of this downside, perhaps it would make sense to deprecate gzip-N in some future version and at that time issue a new feature flag (i.e. allow import of a pool with "legacy" `compression=gzip-N set, but update to compression=gzip compressionlevel=N property when the pool is upgraded). It would be good to discuss this with other OpenZFS members, if the idea makes sense at all.

Basically, this allows us to postpone creating new feature flag for some time (giving more time for discussion), while allowing to enjoy lz4hc without it.

ilovezfs commented 9 years ago

I think to most users this will feel like a regression:

old syntax: zfs create -o compression=gzip-9 foo/bar

new syntax: zfs create -o compression=gzip -o compressionlevel=9 foo/bar

Call me crazy.

vozhyk- commented 9 years ago

That's what I've been thinking about too (avoiding the use of a feature flag). It appears that properties with : in them can be set without adding any incompatibilities:

zfs set foo.bar:aoeu=htns hts/test

works.

vozhyk- commented 9 years ago

By the way, now I've implemented compression level setting for lz4hc the way it's done for gzip.

vozhyk- commented 9 years ago

A rough benchmark of different lz4hc compression levels and lz4: https://gist.github.com/vozhyk-/1dd8eb04613b641deab2 .

Bronek commented 9 years ago

If we are going to add compression=lz4hc and/or compression=lz4hc-N then new feature flag should be added as well, correct?

vozhyk- commented 9 years ago

We could use something like a org.zfsonlinux:lz4_mode property instead, as @DeHackEd suggested.

Looks like this property can be accessed in zio_write_bp_init() and a different ZIO_COMPRESS_* value passed to zio_compress_data() according to its value (with something like another zio_compress_table).

Anyway, I'm going to clean up the code as much as possible and open a pull request (still without adding a feature flag or doing it this way).

DeHackEd commented 9 years ago

Off-topic, please don't use the @-username notation if you aren't trying to actually get someone's attention with your post. That's what it's for, not for just making a link to a user's profile.

vozhyk- commented 9 years ago

@DeHackEd Sorry about that, will think before making an @-mention next time.

rincebrain commented 10 months ago

Having looked at this, it could be done, but with the integration of zstd, I don't actually know of any use case in my own experiments or seen for anyone else where lz4hc-N or lz4-N would be a better outcome.

It could be done, and refresh the lz4 integrated version for compression while we're at it, but while the decompressor update is a nice improvement, the compression improvements seemed somewhat negligible in all my tests at lz4-0 with 1.9.3, in speed and outcome, and all the nonzero values are, to my recollection, worse than just using an existing zstd value, as a tradeoff.

If people still think this is a useful option and have an example use case where zstd works worse for them, maybe I'll go take another look, but while the work to add it wouldn't be that involved, I couldn't immediately find a use case where it worked better than the existing configuration.

openzfs / zfs

compression=lz4hc support #1900