martinellimarco / t2sz

Compress a file into a seekable zstd with special handling for .tar archives
GNU General Public License v3.0
42 stars 0 forks source link

Usability for non-tar archives #3

Closed vasi closed 2 years ago

vasi commented 2 years ago

It might be nice to be able to run t2sz against non-tar-format files, eg: t2sz image.ext4 -o image.ext4.sz. It'd be useful for mounting disk images with ratarmount, for example. My tool pixz allows this sort of thing for writing indexed xz files, and I'd love to see the same functionality for zstd.

mxmlnkn commented 2 years ago

I think this also affects very large files inside TARs, which would not be seekable very much with this tool, because as far as I understand it only starts new seekable zstd frames at file boundaries. But seeking inside files should also be a valid use-case after the archive has been mounted with ratarmount. I would love an option to set an enforced size value as opposed to a minimum value. Something more akin to my createMultiFrameZstd script, which is pretty slow. Maybe, I'll add parallel encoding of multiple blocks.

martinellimarco commented 2 years ago

Let's discuss about this issues, you both have really good points.

Maybe it will help if I explain why I decided to implement it like it is in the first place.

The original use case was for an internal project where we were searching to open small files (few MBs) that were stored in big tar archives (few hundreds of GBs) in the smallest amount of time. Zstd offer the best compression ratio for these particular files but the opening time was less then ideal for us, thus the idea was born of having each file compressed in an independent frame and that's why t2sz align one file per frame per default.

Zstd compress each frame independently. The dictionary is not shared among them and it resets each time a new frame is started. This of course limits the compression ratio when you do one file per frame so we decided to pack together a few files per frame. This helps a lot in lowering the compression ratio if the files have many similarities like in our case and since they are all small we don't have a noticeable performance hit while seeking around.

All that said, besides our internal goal t2sz may grow into something different.

Certainly having a fixed size block size may help to split bigger files in the tar archive and increase the seek performance but we will lose the file / frame alignment.

What if it's a max block size?

Imagine having a 150MB file inside the tar and a 100MB max block size. This will result in a 100MB block followed by a 50MB block. Then the next block will be aligned with the start of the next file. Files smaller than max block size will be compressed in a single frame.

@mxmlnkn what do you think? In your opinion is there any benefit for ratarmount in having this alignment?

@vasi t2sz is really ingrained around the tar format at the moment but I could add another "raw input" mode and simply detect if the input is not tar then fallback to this other mode.

I didn't really considered this until now because like you know there are other tools like the seekable compressor in facebook contrib or the script created by @mxmlnkn, plus I'd tought that facebook would have implemented the seekable format in their own zstd tool by now, but it seems they are not doing much with it?

Anyway, assuming I implement a raw mode, in this mode all the reasoning about aligning files to frames doesn't make sense and there should be a fixed block size. It would be also useful to implement a seektable using the official seekable format. In practice this would be a reimplementation of the seekable compressor, what do your think?

Compressing frames in parallel would also be a nice feature.

I'm open to feedbacks and constructive critiques since you guys, being one the author of pixz and the other the author of ratarmount, are much more experienced than me in all of this.

mxmlnkn commented 2 years ago

What if it's a max block size?

Sounds good to me. That's also what I was thinking assuming that we still would want the file boundaries.

@mxmlnkn what do you think? In your opinion is there any benefit for ratarmount in having this alignment?

It should have a benefit but only when reading file contents. In that case it would lower the latency, which would be an important metric when iterating over many small files, yes.

I didn't really considered this until now because like you know there are other tools like the seekable compressor in facebook contrib or the script created by @mxmlnkn, plus I'd tought that facebook would have implemented the seekable format in their own zstd tool by now, but it seems they are not doing much with it?

My current bash script is very slow. But, I'm gonna replace it with a simple Python loop soon, which is a lot faster (one of the bottlenecks was reopening the file with dd for accessing each block). I already used it to speed up generating test files for my benchmarks. I'm even thinking about letting the setuptools of ratarmount install it under some name, so that it can be used out of the box to prepare files for usage with ratarmount.

The zstd issue is still open. And for the contrib script, do you mean this? It looks really experimental. There aren't even compile instructions and the CLI is basically non-existent. But it seems to work when simply calling make inside that example folder itself. Knowing that, it seems to be somewhat of an alternative indeed but it needs more accessibility.

vasi commented 2 years ago

@mxmlnkn If t2sz could do max-size/fixed-size blocks, would it make sense to distribute it with ratarmount? Maybe we could kill two birds with one stone.

mxmlnkn commented 2 years ago

@mxmlnkn If t2sz could do max-size/fixed-size blocks, would it make sense to distribute it with ratarmount? Maybe we could kill two birds with one stone.

From a user-point view it would definitely make sense. But, I think having t2sz written in C/C++ makes it harder to bundle it in a Python package. I'm glad that I finally was able to make the ratarmount package pure Python by moving out the indexed_bzip2 backend into its own package. Therefore, if I were to bundle t2sz with ratarmount, then I'd prefer t2sz being its own pip package and I simple set it as a dependency in ratarmount. But, I honestly think that implementing similar "pure" Python (using python-zstandard, which has wheels for many platforms) functionality would be easier than all that C/Python build/CI glue code and possible install issues when there are no wheels for a platform and possibly libzstd-dev is not installed.

martinellimarco commented 2 years ago

I've pushed on dev branch a new version that support both functionalities.

I'm not fully convinced of the naming yet but at the moment I have defined two modes.

Raw mode works on any files. It is selected by default if the input file name doesn't end in .tar. In this case -s can be used to specify the fixed block size. Eg. t2sz image.ext4 -s 10M -o image.ext4.sz

Tar mode is enabled when the file name ends in .tar and it is the old behaviour. Only in this mode a new -S flag is added to specify the max block size.

I have yet to clean up the README, ignore that for the moment.

I feel the usage prompt is not too clear at the moment. I'd like for it to be understandable by the end users. Any feedback is appreciated. English is not my forte :/

Now that it does support not only tar format I regret naming it tar 2 -seekable zstd hahaha. I need to find another meaning for the t :)

martinellimarco commented 2 years ago

About the distribution with ratarmount, I fully agree with @mxmlnkn here. I also feel it should be better to have a pure Python implementation.

This tool is rather small, it should be easier to rewrite it in Python than deal with all the boilerplate of binding it.

Truth be told I was planning to rewrite it in Rust at some point as an exercise to learn that language, but this is out of score.

martinellimarco commented 2 years ago

And.. multithread is there too :) It requires libzstd >= 1.5.0 or an older version compiler with ZSTD_MULTITHREAD. On my ancient i7-2600 it performs practically like the official zstd compressor. I will test it on more performant machines in a few days.

vasi commented 2 years ago

Hey, that's super exciting!

vasi commented 2 years ago

Does it imply requiring large block sizes? Because we don't have a thread per block, so the blocks will need to be large enough to have muliple small zstd blocks internally so that the workers all get used.

martinellimarco commented 2 years ago

Yes, the block size should be large enough for it to use all the threads. I should document it better.

martinellimarco commented 2 years ago

I've added a note.

Certainly implementing it one thread per block would increase the CPU utilization with arbitrarily smaller block and reduce the overall execution time but for larger blocks I'm also convinced that the internal implementation is faster than anything I could bake.

vasi commented 2 years ago

Yes, and the simplicity of implementation is nice to have as well!

martinellimarco commented 2 years ago

Hahaha, yes, I was already implementing it with pthread when I stumbled upon this parameter.. I didn't expect it to be this easy :)

Now I'd really like to add the seektable following the seekable format specification. Ratarmount can already take advantage of it using the latest version of indexed_zstd I released a few days ago. It will speedup the 1st mount for archives with a very large number of frames.

martinellimarco commented 2 years ago

Done, now it generate the seek table too. In the next few days I will improve / rewrite the documentations and do some serious testing before releasing a new version.

mxmlnkn commented 2 years ago

I didn't really considered this until now because like you know there are other tools like the seekable compressor in facebook contrib or the script created by @mxmlnkn, plus I'd tought that facebook would have implemented the seekable format in their own zstd tool by now, but it seems they are not doing much with it?

Looking randomly at the libarchive release notes, I just now found out about pzstd, which is installed with the zstd package on ubuntu and which seems to helpfully create multiframe zstd archives!

base64 /dev/urandom | head -c $(( 512*1024*1024 )) > small

zstd small
zstd -l small.zst 
Frames  Skips  Compressed  Uncompressed  Ratio  Check  Filename
     1      0   385.89 MB     512.00 MB  1.327  XXH64  small.zst

pzstd small
zstd -l small.zst 
Frames  Skips  Compressed  Uncompressed  Ratio  Check  Filename
   130     65   385.89 MB                       XXH64  small.zst

createMultiFrameZstd small $(( 1024*1024 ))
zstd -l small.zst 
Frames  Skips  Compressed  Uncompressed  Ratio  Check  Filename
   512      0   385.90 MB     512.00 MB  1.327  XXH64  small.zst

t2sz -s 1MiB small
zstd -l small.zst
Frames  Skips  Compressed  Uncompressed  Ratio  Check  Filename
   514      1   385.90 MB     512.00 MB  1.327  XXH64  small.zst

I don't know what skips are but only pzstd generates a lot of them.

And as can be seen in the manual, there is no control at all as to how many frames are created or at what size or whether they are created at all.

martinellimarco commented 2 years ago

skips are skippable frames. Basically they are frames that are user-defined and are not to be considered when decompressing. They can be used for any kind of purpose.

I have no idea why pzstd add so many skippable frames, I will look into it. One problem I see with this is that the uncompressed frame size is not recorded. That's legal in zstd, but it's troublesome if you want to seek. Building the seek table is costly as you have to decompress each frame to get their size and know where they end.

The skippable frame added by t2sz is the seek table and you can avoid generating it with -j if you want.

mxmlnkn commented 2 years ago

One problem I see with this is that the uncompressed frame size is not recorded.

Ah, I overlooked that. Of course it would make seeking difficult. Maybe those skippable frames contain the uncompressed size?

martinellimarco commented 2 years ago

Maybe those skippable frames contain the uncompressed size?

Yes, it appears they are doing exactly that here.

mxmlnkn commented 2 years ago

Maybe those skippable frames contain the uncompressed size?

Yes, it appears they are doing exactly that here.

Thanks for looking that up. Maybe supporting that could be a new feature for indexed_zstd, similar to the seek table support.

martinellimarco commented 2 years ago

Yes, I'm already working on it, it is a simple format.

martinellimarco commented 2 years ago

My bad, it is not the uncompressed frame size, but the compressed one. It's not useful.