Closed: ranaivomahaleo closed this issue 8 years ago.
What do you mean by "verify it is working"? Do you mean verify that the data is compressed? You can just check the length of the output array. Which interface are you using: C, Python, or HDF5?
Yes. How do I verify that the data is compressed? I am using the Python and HDF5 interfaces.
The command I use to create the dataset is as follows:
```python
import numpy as np
import h5py as hdf
from h5py import h5f, h5d, h5z, h5t, h5s, filters
from bitshuffle import h5

datasetfullpath = '...'
f = hdf.File(datasetfullpath, 'w')

filter_pipeline = (32008, 32000)
filter_opts = ((1000000, h5.H5_COMPRESS_LZ4), ())
h5.create_dataset(f, 'dataset_name', (20000, 9801, 200), np.float32,
                  chunks=(50, 50, 100),
                  filter_pipeline=filter_pipeline,
                  filter_opts=filter_opts)

f[...] = ...
f.flush()
```
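One way to check whether a filter actually shrank the data (not from the thread; the file/dataset names are placeholders and gzip is used as a stand-in for the bitshuffle pipeline) is to compare the dataset's on-disk storage size with its logical size:

```python
# Hypothetical check using plain h5py: compare bytes actually stored
# against the uncompressed logical size of the dataset.
import numpy as np
import h5py

with h5py.File("compression_check.h5", "w") as f:
    dset = f.create_dataset(
        "data",
        data=np.zeros((100, 100), dtype=np.float32),  # highly compressible
        chunks=(50, 50),
        compression="gzip",  # stand-in for the bitshuffle+LZ4 pipeline
    )
    logical = dset.size * dset.dtype.itemsize   # 100*100*4 = 40000 bytes
    on_disk = dset.id.get_storage_size()        # bytes stored after filtering
    print(logical, on_disk)                     # on_disk should be far smaller
```

If `on_disk` is close to (or larger than) `logical`, the filter is not helping.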
The size of the resulting HDF5 file is around 250 GB. I think that is too big for a compressed file. I would expect a file size of 20,000 x 9,801 x 200 x 4 bytes (around 146 GB) for an uncompressed file, so why do we get 250 GB?
Is there something wrong in my filter configuration above? And how do I additionally configure GZIP or another third-party filter in the pipeline (for example, bitshuffle+LZ4, LZF, and GZIP as a pipeline)?
Just a correction to the comment above: I created three datasets with the script above and stored them in one file, so the expected uncompressed size is 3 x 20,000 x 9,801 x 200 x 4 bytes, around 438 GB. The resulting file is 250 GB, a compression ratio of about 1.75:1. Good, but how can I gain more space (a smaller file)?
Okay, a few things:
To add gzip, append `h5z.FILTER_DEFLATE` to the pipeline.

Thanks for your feedback. Just a few questions:
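As a sketch of what that extended pipeline tuple could look like (the filter IDs are the registered HDF5 values; the bitshuffle option values and the gzip level shown are assumptions, not from the thread):

```python
from h5py import h5z

# Registered HDF5 filter IDs:
BITSHUFFLE = 32008   # bitshuffle's registered filter ID
LZF = 32000          # the 32000 in the script above is h5py's LZF filter

# Hypothetical three-stage pipeline: bitshuffle+LZ4, then LZF, then gzip.
filter_pipeline = (BITSHUFFLE, LZF, h5z.FILTER_DEFLATE)
filter_opts = (
    (1000000, 2),    # block size and LZ4 mode (2 assumed = H5_COMPRESS_LZ4)
    (),              # LZF: no options
    (6,),            # gzip compression level (arbitrary choice)
)
print(filter_pipeline)   # (32008, 32000, 1)
```

Note that stacking a second compressor after bitshuffle+LZ4 often buys little extra, since the data is already compressed once.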
For the fastest varying axis, you have two options: either transpose your data so that the index with length 20000 is the last one, or set your chunk size to be (20000, 1, 1). The former is preferred.
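The preferred option can be sketched in numpy; the shape here is a scaled-down stand-in for the real (20000, 9801, 200) array:

```python
import numpy as np

# Scaled-down stand-in for the (20000, 9801, 200) array.
data = np.zeros((200, 98, 20), dtype=np.float32)

# Move the first (long) axis to the end so it becomes the fastest-varying
# axis, then make the result contiguous in memory before writing it out.
reordered = np.ascontiguousarray(np.transpose(data, (1, 2, 0)))
print(reordered.shape)   # (98, 20, 200): original first axis is now last
```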
I would just let bitshuffle choose the block. It will be faster.
Yes, integers often compress better (in all compression schemes).
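If the data tolerates it, one way to act on this is to quantize the floats to integers before writing. This is a lossy sketch; the scale factor here is an arbitrary assumption and should come from the precision your data actually needs:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=1000).astype(np.float32)  # stand-in for real data

# Hypothetical quantization: keep ~3 decimal digits by scaling and rounding.
scale = 1000.0
as_int = np.round(data * scale).astype(np.int32)  # store this instead

# Lossy round-trip: the error is bounded by roughly 0.5 / scale.
restored = as_int / scale
print(np.max(np.abs(restored - data)))            # worst-case rounding error
```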
I had a gain of 4:1, the best compression that I had with my data, with the following configurations:
Okay, try one more iteration:
This might not improve ratios much, but it will greatly improve speed. Making the third axis longer in the chunk is key: bitshuffle needs to see a run of like elements to get good compression. 100 isn't enough; 1000 is the bare minimum.
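As a concrete (scaled-down, hypothetical) chunk shape following that advice, with h5py's LZF standing in for the bitshuffle pipeline:

```python
import numpy as np
import h5py

# Scaled-down stand-in shape; the point is the chunk's long last axis.
with h5py.File("rechunk_sketch.h5", "w") as f:
    dset = f.create_dataset(
        "data",
        shape=(50, 98, 2000),
        dtype=np.float32,
        chunks=(5, 5, 2000),   # last axis >= 1000, per the advice above
        compression="lzf",     # stand-in for the bitshuffle pipeline
    )
    print(dset.chunks)         # (5, 5, 2000)
```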
I actually use bitshuffle. How can I verify that LZF compression is working?