kiyo-masui / bitshuffle

Filter for improving compression of typed binary data.

How to verify if LZF compression is working? #20

Closed: ranaivomahaleo closed this issue 8 years ago

ranaivomahaleo commented 9 years ago

I am currently using bitshuffle. How can I verify that LZF compression is working?

kiyo-masui commented 9 years ago

What do you mean by verify it is working? Do you mean verify that the data is compressed? You can just check the length of the output array. Which interface are you using, C, Python or HDF5?
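
For the HDF5 route, one concrete check is to compare a dataset's logical size with the space its chunks actually occupy on disk after filtering. A minimal sketch with h5py, assuming hypothetical file and dataset names:

```python
import h5py

# File and dataset names here are hypothetical placeholders.
with h5py.File('data.h5', 'r') as f:
    dset = f['dataset_name']

    # Logical (uncompressed) size of the data in bytes.
    logical = dset.size * dset.dtype.itemsize
    # Space the chunks actually occupy on disk, after all filters.
    on_disk = dset.id.get_storage_size()

    print('uncompressed: %d bytes' % logical)
    print('on disk:      %d bytes' % on_disk)
    print('ratio:        %.2f:1' % (logical / float(on_disk)))
```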

ranaivomahaleo commented 9 years ago

Yes. How can I verify that the data is compressed? I am using the Python and HDF5 interfaces.

The command I use to create the dataset is as follows:

import numpy as np
import h5py as hdf
from h5py import h5f, h5d, h5z, h5t, h5s, filters
from bitshuffle import h5

datasetfullpath = '...'
f = hdf.File(datasetfullpath, 'w')

filter_pipeline = (32008, 32000)
filter_opts = ((1000000, h5.H5_COMPRESS_LZ4), ())
h5.create_dataset(f, 'dataset_name', (20000, 9801, 200), np.float32,
                  chunks=(50, 50, 100), filter_pipeline=filter_pipeline,
                  filter_opts=filter_opts)

f['dataset_name'][...] = ...
f.flush()

The size of the resulting HDF5 file is around 250 GB. I think that is too big for a compressed file. I expect a file size of 20,000 x 9,801 x 200 x 4 bytes (around 146 GB) for a non-compressed file, so why do we get 250 GB?

Is there something wrong with my filter configuration above? And how can I add GZIP or another third-party filter to the pipeline (for example, having bitshuffle+LZ4, LZF and GZIP as a pipeline)?

ranaivomahaleo commented 9 years ago

Just a correction to the comment above: I created three datasets using the script above and stored them in a single file. So the expected file size is 3 x 20,000 x 9,801 x 200 x 4 bytes ≈ 438 GiB. The resulting file has a size of 250 GB, so a compression ratio of about 1.75:1. That is good, but how can I save more space (get a smaller file)?

kiyo-masui commented 9 years ago

Okay, a few things:

  1. Your OS probably reports file sizes in GB = 10^9 bytes, so your data should be 470 GB.
  2. There shouldn't be a need to additionally compress the compressed data. Adding LZF (32000) to the pipeline will mostly just slow things down and not compress things much over the LZ4 compression built into bitshuffle. That being said, you can in principle add an arbitrary number of filters to the pipeline in the way you have done. For GZIP you need to add the filter number for DEFLATE (h5z.FILTER_DEFLATE) to the pipeline (see the sketch after this list).
  3. Do you care about speed? If not, bitshuffle is not the compressor for you. BZIP2 is ridiculously slow but gets ridiculously high compression ratios. If you don't want to build the BZIP2 hdf5 filter, just try compressing the file on the command line to see what ratios you get. It will be similar to what you would get out of the filter. LZMA is another option, but I'm not sure if a filter exists for it yet.
  4. Bitshuffle works best if the fastest varying axis of the dataset (the one with length 200 in your case) is the one over which the data is most highly correlated. I.e. if the data doesn't change much from element to element.
  5. You have specified a block size for bitshuffle's internal compression of 1000000. You are probably better off not specifying one (set it to 0), but if you do, be sure to make it a multiple of 8. I should probably document this somewhere.
  6. Float data often doesn't compress well (for any compressor) due to numerical noise. I'm not particularly surprised that you are only getting compression ratios of 1.75:1.
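
As an illustration of point 2 (and the block-size note in point 5), here is a minimal sketch of a pipeline that runs bitshuffle+LZ4 followed by DEFLATE. The file name and the DEFLATE level are arbitrary placeholders, and whether stacking a second compressor is worth the slowdown is exactly the caveat in point 2:

```python
import numpy as np
import h5py as hdf
from h5py import h5z
from bitshuffle import h5

f = hdf.File('example.h5', 'w')  # hypothetical file name

# Bitshuffle+LZ4 (filter 32008) followed by GZIP/DEFLATE (h5z.FILTER_DEFLATE).
# A block size of 0 lets bitshuffle pick its own block size; if you set one
# yourself, make it a multiple of 8.
filter_pipeline = (32008, h5z.FILTER_DEFLATE)
filter_opts = ((0, h5.H5_COMPRESS_LZ4),  # bitshuffle: (block size, LZ4 flag)
               (4,))                     # DEFLATE: compression level, 1-9

h5.create_dataset(f, 'dataset_name', (20000, 9801, 200), np.float32,
                  chunks=(50, 50, 100), filter_pipeline=filter_pipeline,
                  filter_opts=filter_opts)
```
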
ranaivomahaleo commented 9 years ago

Thanks for your feedback. Just a few questions:

  1. ok
  2. ok
  3. Yes, I care about speed. I tested LZMA on the command line but I did not gain much space.
  4. I do not understand the meaning of "the fastest varying axis of the dataset". In my case (the 20,000 x 9,801 x 200 dataset), the data is mostly correlated along the first axis (20,000 values). How can I configure the chunk size and bitshuffle to get more efficient compression? (Note: I changed the chunk size of the HDF5 file to (100, 100, 200).)
  5. I changed the chunk size of the HDF5 file to (100, 100, 200) and specified a block size equal to 64,000,000 for bitshuffle. Is this configuration fine?
  6. Would an integer representation compress better in this scheme?

kiyo-masui commented 9 years ago

For the fastest varying axis, you have two options: either transpose your data so that the index with length 20000 is the last one, or set your chunk size to be (20000, 1, 1). The former is preferred.
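
A minimal sketch of the first option, assuming the data can be written with the 20,000-length (most correlated) axis last; the file name and the chunk shape are just illustrative choices:

```python
import numpy as np
import h5py as hdf
from bitshuffle import h5

f = hdf.File('example_transposed.h5', 'w')  # hypothetical file name

# Same data, laid out so the highly correlated 20,000-length axis is the
# fastest varying (last) one: (200, 9801, 20000) instead of (20000, 9801, 200).
filter_pipeline = (32008,)
filter_opts = ((0, h5.H5_COMPRESS_LZ4),)  # block size 0: let bitshuffle choose

h5.create_dataset(f, 'dataset_name', (200, 9801, 20000), np.float32,
                  chunks=(10, 10, 20000),  # illustrative chunk: long last axis
                  filter_pipeline=filter_pipeline, filter_opts=filter_opts)
```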

I would just let bitshuffle choose the block size. It will be faster.

Yes, integers often compress better (in all compression schemes).

ranaivomahaleo commented 9 years ago

I got a compression ratio of 4:1, the best I have had with my data, with the following configuration:

kiyo-masui commented 9 years ago

Okay, try one more iteration:

This might not improve the ratio much, but it will greatly improve speed. Having the third axis be longer in the chunk is key: bitshuffle needs to see a bunch of these correlated elements in a row to get good compression. 100 isn't enough; 1000 is the bare minimum.
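
To make the "100 isn't enough, 1000 is the bare minimum" rule of thumb concrete, here is a small sketch comparing a few hypothetical chunk shapes for a float32 dataset stored with the correlated 20,000-length axis last (these shapes are illustrative, not the exact configuration discussed above):

```python
import numpy as np

# Hypothetical chunk shapes for a float32 dataset laid out as (200, 9801, 20000),
# with the highly correlated axis last. The last chunk dimension is how many
# correlated elements bitshuffle sees in a row within each chunk.
for chunks in [(100, 100, 100),   # 100 in a row: too short
               (10, 10, 2000),    # 2000 in a row: above the ~1000 minimum
               (4, 4, 20000)]:    # the whole correlated axis in each chunk
    size_mb = np.prod(chunks) * 4 / 1e6
    print('chunks=%s  run length=%d  chunk=%.1f MB' % (chunks, chunks[-1], size_mb))
```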