silx-kit / hdf5plugin

Set of compression filters for h5py
http://www.silx.org/doc/hdf5plugin/latest/

[Question] Is there any filter that can compress np.int(xx)/np.uint(xx) #304

Closed SerGeRybakov closed 1 month ago

SerGeRybakov commented 1 month ago

I'm trying to store rather big arrays of, say, np.int16 or np.uint16 data, and I want to compress them losslessly. None of the provided filters helped me with that. At best, nothing changed; at worst, the file with the "compressed" data was larger than the one with the raw data.

Isn't that possible at all?

t20100 commented 1 month ago

Depending on the data, it's perfectly possible that some compressors (usually the fast ones) end up with a larger size than the initial data (inefficient compression plus the extra information the compressor has to store). AFAIR, in this case some filters are configured to be "optional", so they will store the data uncompressed instead.

Here is an example to illustrate this with the LZ4 compressor outside of HDF5:

import lz4.frame
import numpy as np

data = np.random.randint(255, size=1024, dtype=np.uint8)
compressed = lz4.frame.compress(data)
print(len(compressed), data.nbytes)

-> 1047 1024
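The flip side, for contrast: the same-size array with actual redundancy compresses well. A minimal sketch (using the stdlib zlib compressor instead of LZ4, purely to avoid an extra dependency; any general-purpose compressor shows the same effect):

```python
import zlib

import numpy as np

# Same size as above (1024 bytes), but redundant: a repeating 16-byte pattern
data = np.tile(np.arange(16, dtype=np.uint8), 64)

# zlib stands in for LZ4 here; incompressible random bytes grow slightly,
# redundant bytes shrink a lot
compressed = zlib.compress(data.tobytes())
print(len(compressed), data.nbytes)
```

Here the compressed size is far below the 1024 raw bytes, because the compressor can exploit the repetition.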

One parameter that can affect the efficiency of the compressor and its speed (see #270) is the size of the HDF5 chunks: see Chunked storage in h5py doc. The default chunk size might be rather small.
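To illustrate the chunk-size effect, here is a sketch using h5py's built-in gzip filter on synthetic smooth uint16 data (the file name and data are made up for the example; storage is measured with `Dataset.id.get_storage_size()`):

```python
import h5py
import numpy as np

# Smoothly varying uint16 data that compresses well
data = (1000 + 100 * np.sin(np.linspace(0, 50, 1_000_000))).astype(np.uint16)

with h5py.File("chunks_demo.h5", "w") as f:
    # Tiny chunks: per-chunk overhead and a small compression window hurt the ratio
    small = f.create_dataset(
        "small_chunks", data=data, chunks=(256,), compression="gzip")
    # Larger chunks: better ratio (and usually better speed)
    large = f.create_dataset(
        "large_chunks", data=data, chunks=(262_144,), compression="gzip")
    print(small.id.get_storage_size(), large.id.get_storage_size(), data.nbytes)
```

With 256-element chunks, each chunk is compressed as an independent stream, so the per-chunk overhead is paid thousands of times; the larger chunks store the same data in noticeably less space.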

Finally, depending how they are written to, HDF5 files can accumulate unused space that the h5repack tool will remove:

HDF5 files may accumulate unused space when they are read and rewritten to or if objects are deleted within them ... The h5repack tool can be used to remove unused space in an HDF5 file.

See https://docs.hdfgroup.org/hdf5/develop/_view_tools_edit.html

Could you share a HDF5 file with which you have problem?

SerGeRybakov commented 1 month ago

Thank you, Thomas! I think there is no need to dig into this further, as you've just confirmed my thoughts on the matter. As a result, there is no way to compress integers without loss, as there is not enough redundancy in them to compress. It's better and easier to store them as is.

The only case where lossless compression of an integer array works is an array filled with a single repeated value, like [-30, -30, -30, -30, ...]. In that case compression may help, but if the values are different it won't work.

Thank you very much once again. I'd suggest to close the issue.

vasole commented 1 month ago

As a result, there is no way to compress integers without loss, as there is not enough redundancy in them to compress.

If I am not mistaken, the combination Bitshuffle+LZ4 was giving better results than LZ4 alone. Perhaps it is worth giving it a try.

t20100 commented 1 month ago

To be clear, I didn't say that you cannot compress np.int16 or np.uint16 data, only that the result varies with the data. Also, the different filters offer different speed/compression-ratio trade-offs. You can check the comparison of the different filters with uint16 data "that compresses well".

To get the maximum compression rate, you probably want to: