openzfsonosx / zfs

OpenZFS on OS X
https://openzfsonosx.org/

[feature request] better HFS compression support #654

Open RJVB opened 6 years ago

RJVB commented 6 years ago

(continuation of a topic launched initially on the forum)

O3X currently has transparent support for HFS compression that, as far as I understand, is a bit of a scam: it accepts HFS-compressed files (say, from the afsctool utility), decompresses the data behind the scenes and writes it to the target dataset according to that dataset's current settings. (True decmpfs compression, where the compressed data fits in an HFS attribute, may be handled differently?) There may be scenarios where this is required (when mimicking HFS), but for all practical purposes it is identical to what the OS does itself when copying data to a target that does not support HFS compression.

It would be more efficient (in space and time) if O3X were capable of writing the compressed data as-is, at least when zlib-based HFS compression is used. After all, ZFS datasets can use zlib compression, and a single dataset can contain files compressed with all supported compression types and levels, all of which can be read. AFAIK you don't need to know the zlib compression level to decompress the data, so it should be possible to take zlib-compressed file content and commit it to disk, labelling the file as compressed with zlib rather than with whatever compression is configured for the dataset.
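To illustrate that zlib point, a minimal sketch (plain zlib, nothing ZFS-specific): the decompression side never sees the level that was used.

```c
/* Minimal zlib round trip showing that decompression needs no knowledge of
 * the level used for compression; plain zlib, nothing ZFS-specific. */
#include <stdio.h>
#include <string.h>
#include <zlib.h>

int main(void)
{
    const char msg[] = "the same stream decompresses regardless of level";
    Bytef comp[256], decomp[256];
    uLongf clen = sizeof(comp), dlen = sizeof(decomp);

    /* deflate at the most expensive level ... */
    if (compress2(comp, &clen, (const Bytef *)msg, sizeof(msg), 9) != Z_OK)
        return 1;
    /* ... and inflate without ever passing a level */
    if (uncompress(decomp, &dlen, comp, clen) != Z_OK)
        return 1;
    printf("round trip ok: %d\n", memcmp(decomp, msg, dlen) == 0);
    return 0;
}
```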

The result would combine the best of both FS compression approaches: a fast (LZ4) algorithm for transparent/online compression, and a more space-efficient but (much) more time-expensive algorithm applied "offline" at opportune moments, for instance to minimise the footprint of a large build folder after builds.

I'd be more than willing to adapt my version of afsctool to accommodate such a feature in ZFS. The alternative approach would of course be to allow file-level control of the ZFS compression to be used, opening the way for a "zfsctool" equivalent. More expensive when copying already HFS-compressed files but easier to incorporate upstream.

lundman commented 6 years ago

HFS decmpfs compresses the data in a file into an EA named com.apple.decmpfs, after which it truncates the file's data to 0 bytes length. decmpfs uses a zlib-based algorithm, just as ZFS's gzip (1-9) does (ZFS additionally offers the obsolete lzjb and lz4).

It is true that ZFS "handles" this by letting the software create the com.apple.decmpfs EA, truncate the file to 0 bytes and set the file's UF_COMPRESSED flag. After that, ZFS decompresses the com.apple.decmpfs EA back into regular file data.
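For reference, the userland side of that sequence looks roughly like this; a sketch from memory of what a tool like afsctool does, with the header layout as I remember it from XNU's decmpfs.h (so the field names, the type-3 value and the xattr option flag should be double-checked), and with no error rollback:

```c
/* Rough sketch of userland HFS compression: write a com.apple.decmpfs EA,
 * truncate the data fork, set UF_COMPRESSED. Header layout and type value
 * are from memory and may need adjustment; error handling is omitted. */
#include <sys/xattr.h>
#include <sys/stat.h>
#include <unistd.h>
#include <stdint.h>
#include <zlib.h>

#define DECMPFS_XATTR_NAME "com.apple.decmpfs"

typedef struct __attribute__((packed)) {
    uint32_t magic;              /* 'cmpf' */
    uint32_t type;               /* 3: zlib payload stored inline in the EA */
    uint64_t uncompressed_size;
    unsigned char data[];        /* compressed payload follows */
} decmpfs_hdr;

static int hfs_compress_file(const char *path, const void *buf, size_t len)
{
    unsigned char ea[16384];
    decmpfs_hdr *hdr = (decmpfs_hdr *)ea;
    uLongf clen = sizeof(ea) - sizeof(*hdr);

    hdr->magic = 0x636d7066;     /* 'cmpf' (endianness glossed over here) */
    hdr->type = 3;
    hdr->uncompressed_size = len;
    if (compress2(hdr->data, &clen, (const Bytef *)buf, len,
                  Z_BEST_COMPRESSION) != Z_OK)
        return -1;

    /* 1. the compressed data goes into the EA ... */
    if (setxattr(path, DECMPFS_XATTR_NAME, ea, sizeof(*hdr) + clen, 0,
                 XATTR_SHOWCOMPRESSION) != 0)
        return -1;
    /* 2. ... the data fork is truncated to 0 bytes ... */
    if (truncate(path, 0) != 0)
        return -1;
    /* 3. ... and the file is flagged as compressed. */
    return chflags(path, UF_COMPRESSED);
}
```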

The only reason we added this extra work is that a lot of software, including Apple's own "Mail" app, ignores the filesystem-capabilities reply we return, in which we say we do NOT support decmpfs. It is obviously rather disappointing that Apple ignores its own API for querying capabilities and then proceeds to use a feature we did not list, but it is what it is.

The reason we uncompress it is so that the user can use the ZFS options for compression, checksum, encryption and dedup. If you want an almost identical level of compression to decmpfs, set compression=gzip-9: /usr/share/dict/words of 2493109 bytes compresses to 757310 bytes with decmpfs and to 763904 bytes with gzip-9 (30.37% vs 30.64%).

The biggest argument against decmpfs is probably that it is file-based. If you want to write/append a few more bytes to the end of a file (the most common scenario if you are going to write to a file at all, but really any change: write, truncate, extend), HFS has to decompress the entire file back into file data and remove the EA before it can continue with the write operation. The file loses its compression.

ZFS, which uses block-based compression, decompresses only the last block into the ARC and then lets you write more data. Once a full blocksize has been written, that block is compressed again and written to disk, as is any further data. The file retains its compression.

But perhaps file-based compression is still desired; it has been discussed over in ZOL, I believe, but there has been no action on it yet, perhaps because there are so few advantages.

If we wanted to keep decmpfs-compressed files, we would need to add logic in a few places around the ZFS source. getattr should return the real file size (instead of the 0-byte data length, and not the logical compressed size of the xattr). Reading the data would need to decompress the EA into the ARC first, entirely, since it is not block-based. Writing needs to undo the compression completely (this code already exists, since it is what we do now). Then you would need to add support to zfs send so that we do not try to send decmpfs files: non-OSX pools would not be able to read the file data, since they would not know about the EA to read, and the decmpfs implementation is part of the XNU sources.

zstd compression is also coming to ZFS, adding a whole set of additional compression options.

So I'm not against the idea, and perhaps you can still sell me on it; it just adds code complexity with almost zero advantage. But we should definitely implement clonefile, though; that is needed for compatibility.

RJVB commented 6 years ago

I think there's a misunderstanding here, or I wasn't clear.

Some questions: I presume you preserve the decmpfs-related attribute(s) but that you don't preserve the file-content-in-an-attribute aspect? How do other ZFS implementations see those attributes? Do those broken Apple apps also try to decmpfs files on non-HFS/APFS filesystems, e.g. when O3X doesn't mimic HFS?

I'm not suggesting implementing a decmpfs feature in ZFS. I have two ideas in mind, which I may have presented in the wrong order. Let me try again.

  1. file-level control of the ZFS compression property, probably only via some low-level API, so that it becomes possible to write individual new files with a different compression. Say the dataset uses LZ4, but you know that the file you're about to write will compress significantly better with a different algorithm (and you don't care about write performance in this case), OR you know the file is a huge audio, video or archive file that is already compressed. This evidently includes rewriting an existing file completely, so a utility that takes a bunch of existing files and rewrites them with a different compression (as afsctool does) becomes possible and of potential interest; such a utility would probably be the main justification.
  2. the decmpfs idea I mentioned takes this one step further: once 1. is implemented, it probably becomes possible to accept files that already have decmpfs compression, preserve the compressed data and store it directly as a regular file on the dataset, instead of (and as if) doing what you do now (decompress, then recompress). In other words, the file should be indistinguishable from those you currently get when applying decmpfs compression on O3X, with 2 advantages: a) no waste of CPU cycles to compress, decompress, recompress and b) you get the expected level of compression, not the kind and level currently configured on the dataset.

Re 1: this is equivalent to changing the dataset compression, writing a new file, and restoring the dataset compression (but without the risk that other files are affected by the compression change). Appending data to that new file after the compression is restored will work the same way as it does currently: either it preserves the compression type and level that were active when the file was created, or it rewrites the entire file with the current compression settings (which one is it, in fact?). Preservation would be great; rewriting would correspond to what HFS does.

Re the recompression utility: I use my improved afsctool utility very intensively in my development workflow to keep build directories down to a "reasonable" size so I can preserve them (think Qt5, or the work directory containing all the KF5 software I tinker with). After each batch-controlled build (through the MacPorts driver, to be exhaustive) I run the utility on the build dir and on the ccache directory; I also apply it to big source trees. This allows me to keep my entire MacPorts build directory on an SSD that's only 64GB. I use the same workflow on ZFS under Linux, and I see the big difference in space economy between decmpfs/ZIP-8 and LZ4. I'd love to use ZFS instead of HFS+ on that SSD, but the content simply wouldn't fit, and ZIP-8 compression would increase my build times by too much.

Being able to apply ZIP-8 selectively, replacing LZ4, would be a best-of-both-worlds solution. Cf. a cheap continuous online defragmentation feature coupled with a more powerful offline defragmentation step (for those who remember what fragmentation does to regular file systems ;) )

Sideways related: LZ4 also has different compression levels (and a dictionary feature); exposing control over these could be beneficial.

lundman commented 6 years ago

Some questions: I presume you preserve the decmpfs-related attribute(s) but that you don't preserve the file-content-in-an-attribute aspect?

I reserved a bit for decmpfs: #define ZFS_COMPRESSED 0x0020000000000000ull, which is immediately removed when ZFS decompresses the file. The file will look "non-compressed" to any OSX tools. The flag is not upstreamed as it is not really used; I believe I could rewrite the code to not even need a flag, maybe.

How do other ZFS implementations see those attributes?

That high bit would be set, but they'd ignore it, unless they added logic to handle it. (Or worse, they could also pick the same bit for something else :) )

Do those broken Apple apps also try to decmpfs files on non-HFS/APFS filesystems, e.g. when O3X doesn't mimic HFS?

Yes, ZFS and NTFS. If you say you handle VOL_CAP_INT_EXTENDED_ATTR, even without VOL_CAP_FMT_DECMPFS_COMPRESSION they will blindly decmpfs-compress files. Worse than that, they ignore all error return codes, so you cannot make it fail. Refusing to create the XATTR, refusing to set the UF_COMPRESSED flag, refusing to truncate the file: all ignored, they just carry on. So if you do not handle the XATTR being written, you just lose the data in that file when it is truncated. Who writes code like that?
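For what it's worth, doing the check properly is not hard; here is a sketch of the getattrlist() capabilities query applications are supposed to make before attempting decmpfs (constant and structure names as found in the usual OS X headers, so verify against the SDK you build with):

```c
/* Sketch of the volume-capabilities check an application should make before
 * using decmpfs; only trust the FORMAT bit when the filesystem marks it valid. */
#include <sys/attr.h>
#include <sys/mount.h>
#include <unistd.h>
#include <stdint.h>
#include <string.h>

static int volume_supports_decmpfs(const char *path)
{
    struct attrlist al;
    struct {
        uint32_t len;                    /* size of the returned attributes */
        vol_capabilities_attr_t caps;    /* capabilities + valid bitmaps */
    } __attribute__((packed)) reply;

    memset(&al, 0, sizeof(al));
    al.bitmapcount = ATTR_BIT_MAP_COUNT;
    al.volattr = ATTR_VOL_INFO | ATTR_VOL_CAPABILITIES;

    if (getattrlist(path, &al, &reply, sizeof(reply), 0) != 0)
        return 0;
    /* the bit only means something if the filesystem reports it as valid */
    if (!(reply.caps.valid[VOL_CAPABILITIES_FORMAT] &
          VOL_CAP_FMT_DECMPFS_COMPRESSION))
        return 0;
    return (reply.caps.capabilities[VOL_CAPABILITIES_FORMAT] &
            VOL_CAP_FMT_DECMPFS_COMPRESSION) != 0;
}
```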

file-level control of the ZFS compression property,

Ok, so ignoring decmpfs for a bit: you are toying with the idea of being able to change a file's compression from lz4 to gzip-9 (for example), and the file would be recompressed as desired. Potentially this could be done, since the compression type used is stored in the file's metadata. Incidentally, one of the things Ahrens mentioned in his BSDCan talk as falling short of the target is that when you change compression, it would be neat if a resilver process started and ZFS recompressed the existing data. I think upstream would welcome such a project.

But in your situation with keeping build trees, could you not simply build in a tree with lz4 for speed, and then, when you want to archive it, zfs send | zfs receive it into a dataset with compression=gzip-9? Instead of copying the tree and then using afsctool to recompress everything, you would simply run one command to create your archive.

2. the decmpfs idea I mentioned

This is where it gets tricky. ZFS implemented compression correctly: it is part of the IO pipeline the whole way, the data is still stored in the file's data and expanded when read. HFS's decmpfs is a bit of a hack, added on decades later, that stores the data in an xattr and has a bunch of code to "redirect" reads of the data to the xattr instead. (Why it wasn't done "properly" in APFS is an interesting question. Perhaps because they couldn't stop apps from ploughing on with decmpfs :) )

You might need to add ZFS compression=decmpfs support to all platforms. But I am not certain; I have not looked into the depths of the decmpfs sources in XNU. If you look at FreeBSD's libarchive, you can see it does use zlib as a foundation, but with a different chunk processing: https://github.com/freebsd/freebsd/blob/45410cb9f8b6274751628effb8c39a55ac02a4d7/contrib/libarchive/libarchive/archive_write_disk_posix.c#L1002

I don't know enough of zlib to be able to tell, but my best guess is that ZFS would need extra logic to be able to use the decmpfs data if it were stored "raw".

So, keeping decmpfs data in ZFS will never work with all platforms in mind. It is true that we could make an OSX-only pool feature flag, and if decmpfs were used you could not use that dataset on other platforms' implementations. But seeing as you can't stop applications from using decmpfs, it would have to be a manual toggle for the user to set on the dataset (to let decmpfs data stay, or to immediately recompress it).

a) no waste of CPU cycles to compress, decompress, recompress

I think it is worth keeping in mind that decmpfs will "waste CPU" whenever you make any change to the file, which will cost you more exactly when you don't want it to (as you are just writing to it). If you really do archive with "read only" in mind, you could go even further and just bzip2 -9 the tree, and win completely. But I suspect you want the tree to remain "readable" as-is.

But let's see now; yes, I think a compress-one-file style tool probably could be implemented somehow. If you could pass the desired compression either in the open/create call or immediately afterwards in an fcntl/ioctl, before writing data, the file's metadata could have the desired compression set (and pool features enabled when required). The rest would be automatic.
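Purely as a thought experiment, the calling side could look something like this; ZFS_IOC_SET_FILE_COMPRESSION and its request structure are hypothetical, invented for this sketch, nothing of the sort exists in ZFS today:

```c
/* Hypothetical calling convention for a per-file compression request:
 * neither the ioctl command nor the request structure exist in ZFS today;
 * this only illustrates the "set it before writing any data" idea. */
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <string.h>

/* hypothetical: would have to be defined by the (O3X) kernel module */
struct zfs_file_compress_req {
    char algorithm[32];          /* e.g. "gzip-9", "lz4", "zstd" */
};
#define ZFS_IOC_SET_FILE_COMPRESSION _IOW('Z', 100, struct zfs_file_compress_req)

static int create_with_compression(const char *path, const char *algo)
{
    int fd = open(path, O_CREAT | O_WRONLY | O_EXCL, 0644);
    if (fd < 0)
        return -1;

    struct zfs_file_compress_req req;
    memset(&req, 0, sizeof(req));
    strncpy(req.algorithm, algo, sizeof(req.algorithm) - 1);

    /* set the desired compression on the still-empty file;
     * everything written afterwards would be compressed with it */
    if (ioctl(fd, ZFS_IOC_SET_FILE_COMPRESSION, &req) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```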

That is for a new file. Your userland tool would take care of rewriting existing files.

RJVB commented 6 years ago

This was meant as a quick reply, but in the end I had time to react to most points :)

Who writes code like that?

Good question. Do you have a radar bug ticket, and how is that going? (I might know someone at Apple who could "champion" it.)

Which other apps are affected?

Incidentally, one of the things Ahrens mentioned in his BSDCan talk as falling short of the target is that when you change compression, it would be neat if a resilver process started and ZFS recompressed the existing data. I think upstream would welcome such a project.

I hope they'd make that an optional behaviour.

... then, when you want to archive it, zfs send | zfs receive it into a dataset with compression=gzip-9? Instead of copying the tree and then using afsctool to recompress everything, you would simply run one command to create your archive.

I could, but it'd change my workflow significantly. A different dataset means moving the data to another location (mountpoint), which is not what I want. To stick with the build-dir example: you'd want to be able to re-run an incremental build in that same directory. If not, I could just use a nice little archiving utility that sticks files into a compressed archive, deleting the originals as it goes. Also, how is zfs send | zfs receive one (1) command and more convenient than afsctool [options] <directory>? :)

Ages ago (when an 80MB HDD was big) I wrote a compiler wrapper script that could compress object files after compilation and decompress them when needed for linking. I don't want to go there again; that kind of thing should be handled by the OS and/or the filesystem...

No, the null approach to test how well this might work would be to write the zfsctool little brother of afsctool: it would use existing means to change the dataset compression for the duration of rewriting the specified files and then restore everything (leaving it up to the user not to start any other activity on that dataset in the meantime).

Or maybe that utility should have "resilver" in its name, because evidently the rewritten files would take the current setting of all attributes into account (I've often wanted something similar for the copies attribute, for instance). But I'd want to have a way to specify attribute settings:

%>zresilver -o compression=XX -o checksum=YY [-o foo=ZZ] file1|dir1 [file2|dir2 ...]

2. the decmpfs idea I mentioned

To be clear, I am sloppy in my naming and tend to use decmpfs as a pars pro toto for the complete HFS compression circus. Evidently you'd not implement the file-data-in-an-attribute feature in ZFS unless there's some advantage to having only a directory entry for tiny compressed files.

Couldn't such a hypothetical advantage be the reason they kept it in APFS? After all, they could probably have done without it, no matter how hacked-on it was. Cf. apps like DiskWarrior, which exist to reorganise the directory tree, and AFAIK only on the Mac.

You might need to add ZFS compression=decmpfs support to all platforms.

Not really what I had in mind; my basic idea is that the resulting files would be identical to what you can get currently (except for any simulated attributes). If HFS uses a different kind of zlib compression that needs different decompression (and this isn't handled internally by zlib), then my idea of not having to recompress is probably not feasible.

But let's see now; yes, I think a compress-one-file style tool probably could be implemented somehow. If you could pass the desired compression either in the open/create call or immediately afterwards in an fcntl/ioctl, before writing data, the file's metadata could have the desired compression set (and pool features enabled when required). The rest would be automatic.

That sounds encouraging; one could hope that no one would object to the idea of having a programmatic API to set dataset attributes. Why not add a dedicated function to libzfs?

That is for a new file. Your userland tool would take care of rewriting existing files.

Indeed, as afsctool does currently. The tool would probably just write a new copy of the file and then move it into place. That way it'd also be safe to use on shared libraries that are in use (and you'd want a backup anyway; my afsctool version can even maintain a second backup file for extra-sensitive stuff). Risk levels might be a bit less important when the compression is handled in the FS, though.
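For what it's worth, the move-into-place part is the boring bit; a minimal sketch (metadata/xattr preservation, the backup file and all error reporting left out):

```c
/* Minimal sketch of rewriting a file by copying it to a temp file in the same
 * directory and renaming it into place; preserving ownership, modes, xattrs
 * and keeping a backup (as afsctool does) is left out for brevity. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static int rewrite_in_place(const char *path)
{
    char tmp[4096];
    snprintf(tmp, sizeof(tmp), "%s.rewrite.XXXXXX", path);

    int in = open(path, O_RDONLY);
    int out = mkstemp(tmp);          /* temp file on the same dataset */
    if (in < 0 || out < 0)
        goto fail;

    char buf[1 << 16];
    ssize_t n;
    while ((n = read(in, buf, sizeof(buf))) > 0)
        if (write(out, buf, n) != n)
            goto fail;
    if (n < 0)
        goto fail;

    close(in);
    close(out);
    /* atomically replace the original; readers with the old file open
     * (e.g. loaded shared libraries) keep seeing the old contents */
    return rename(tmp, path);

fail:
    if (in >= 0) close(in);
    if (out >= 0) close(out);
    unlink(tmp);
    return -1;
}
```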

RJVB commented 6 years ago

On Tuesday August 07 2018 17:40:05 Jorgen Lundman wrote:

Who writes code like that?

Come to think of it: Microsoft ;) I can remember two examples from the MSWin 9x days: 1) SAMBA errors not being checked (a remote server or disk could disappear and the client would happily continue writing) and 2) CD-RW errors not being checked until the data was to be committed to the optical disc. Both caused data loss for colleagues of mine in the late 90s.

Good company, eh?

RJVB commented 6 years ago

I'd like to write a PoC demonstrator that can rewrite files with a different set of properties (or at least the compression property).

Is there a minimal example somewhere that shows how to get and set a dataset property via libzfs?

I had a quick look at zfs_main.c to see how zfs get and zfs set do it, but while that's well-written code, it isn't easy to parse with all the option handling going on.
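From what I could distill so far, the minimal pattern seems to be roughly the following (signatures copied from the libzfs.h I have at hand, so please correct me where the O3X tree differs):

```c
/* Rough minimal libzfs get/set-property sequence as distilled from zfs_main.c;
 * build against libzfs/libnvpair and double-check signatures against O3X. */
#include <stdio.h>
#include <libzfs.h>

int main(int argc, char **argv)
{
    if (argc < 3) {
        fprintf(stderr, "usage: %s <dataset> <compression>\n", argv[0]);
        return 1;
    }

    libzfs_handle_t *g_zfs = libzfs_init();
    if (g_zfs == NULL)
        return 1;

    zfs_handle_t *zhp = zfs_open(g_zfs, argv[1], ZFS_TYPE_FILESYSTEM);
    if (zhp == NULL) {
        libzfs_fini(g_zfs);
        return 1;
    }

    /* read the current value of the compression property */
    char oldval[ZFS_MAXPROPLEN];
    if (zfs_prop_get(zhp, ZFS_PROP_COMPRESSION, oldval, sizeof(oldval),
                     NULL, NULL, 0, B_FALSE) == 0)
        printf("current compression: %s\n", oldval);

    /* set the new value, e.g. "gzip-9"; zfs_prop_set() takes strings */
    if (zfs_prop_set(zhp, "compression", argv[2]) != 0)
        fprintf(stderr, "set failed: %s\n", libzfs_error_description(g_zfs));

    zfs_close(zhp);
    libzfs_fini(g_zfs);
    return 0;
}
```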

Something else: would there be a point in doing this with the multi-threaded approach I use in my afsctool version (N threads that each take a file to be rewritten off a list)? Or is that by definition not going to give any processing-time reduction, even on fast drives, once you add the mutexes required to make things safe?

Thanks, R.

RJVB commented 5 years ago

There now is a working PoC: zfsctool (https://openzfsonosx.org/forum/viewtopic.php?p=9287#p9287).