lspestrip / striptease

Strip TEst Analysis for System Evaluation
MIT License

Add support for compressed HDF5 files #63

Closed ziotom78 closed 2 years ago

ziotom78 commented 2 years ago

From a few tests I did, it seems that we can gain significant advantages by compressing HDF5 files with external compression programs.

The HDF5 file format implements in-file compression, but it supports only a few schemes (gzip being the most interesting one). This PR lets the DataFile class read HDF5 files compressed using any of the following programs:

I did several tests considering the following factors:

In these tests, xz gave the best compression ratio, but it takes ~20 minutes to compress a file (decompressing it is a matter of a few seconds). The best compromise is zstd, which produces files slightly larger than those produced by xz, but both compression and decompression are very fast (usually 10 to 30 seconds for compression, ~4 to 5 seconds for decompression). I would not bother with gzip, which produces sub-optimal compression ratios (.gz files are up to 10 times larger than .zst and .xz files), or with bzip2, which produces files only marginally smaller than gzip at the expense of a significantly longer processing time (several minutes).
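For anyone who wants to repeat this kind of comparison on their own files, here is a minimal sketch of a benchmark. It is not the script I used for the numbers above: the `benchmark` function and the synthetic payload are hypothetical, and only the codecs in the Python standard library are covered (zstd needs a third-party package or the external `zstd` program, so it is left out).

```python
import bz2
import gzip
import lzma
import time

# Standard-library codecs only; zstd is deliberately left out because it
# requires a third-party package or the external `zstd` program.
CODECS = {
    "gzip": (gzip.compress, gzip.decompress),
    "bzip2": (bz2.compress, bz2.decompress),
    "xz": (lzma.compress, lzma.decompress),
}


def benchmark(payload: bytes) -> dict:
    """Return the size ratio and the timings (in seconds) for each codec."""
    results = {}
    for name, (compress, decompress) in CODECS.items():
        start = time.perf_counter()
        packed = compress(payload)
        compress_time = time.perf_counter() - start

        start = time.perf_counter()
        unpacked = decompress(packed)
        decompress_time = time.perf_counter() - start

        # Make sure the codec round-trips before trusting its numbers
        assert unpacked == payload
        results[name] = {
            "ratio": len(packed) / len(payload),
            "compress_s": compress_time,
            "decompress_s": decompress_time,
        }
    return results


if __name__ == "__main__":
    # Synthetic, highly repetitive payload (an assumption, not real data)
    sample = bytes(range(256)) * 4096
    for name, stats in benchmark(sample).items():
        print(
            f"{name:6s} ratio={stats['ratio']:.4f} "
            f"compress={stats['compress_s']:.3f}s "
            f"decompress={stats['decompress_s']:.3f}s"
        )
```

To test a real file, read its bytes and pass them to `benchmark`; ratios and timings depend strongly on the data, so synthetic payloads say little about actual HDF5 files.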

Assuming we adopt zstd as the standard compressor, we can expect these results:

With this PR in place, you can compress HDF5 files from the command line with commands like:

$ zstd my_file.h5  # This creates file "my_file.h5.zst"

and then read them using

from striptease import DataFile

with DataFile("my_file.h5.zst") as inpf:
    # Use inpf as a normal DataFile
    ...
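To give an idea of how a reader can handle compressed files transparently, here is a minimal sketch that picks a decompressor from the file extension and writes the plain HDF5 data to a temporary file. This is an illustration under stated assumptions, not necessarily what DataFile does internally: the `decompress_to_temp` helper is hypothetical, and the `.zst` branch assumes the third-party `zstandard` package is installed.

```python
import bz2
import gzip
import lzma
import shutil
import tempfile
from pathlib import Path

# Suffix -> function opening a compressed file as a binary stream
# (standard library only; ".zst" is handled separately below)
STDLIB_OPENERS = {
    ".gz": gzip.open,
    ".bz2": bz2.open,
    ".xz": lzma.open,
}


def decompress_to_temp(path):
    """Return (plain_path, is_temporary) for a possibly compressed file.

    Hypothetical helper: if `path` has a known compression suffix, its
    contents are decompressed into a temporary file, which the caller
    should delete after use. Otherwise `path` is returned unchanged.
    """
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix not in STDLIB_OPENERS and suffix != ".zst":
        return path, False

    zstd = None
    if suffix == ".zst":
        # Third-party package, imported lazily so that the other
        # formats work even when it is not installed (an assumption)
        import zstandard as zstd

    with tempfile.NamedTemporaryFile(suffix=".h5", delete=False) as dst:
        if zstd is not None:
            with open(path, "rb") as src:
                zstd.ZstdDecompressor().copy_stream(src, dst)
        else:
            with STDLIB_OPENERS[suffix](path, "rb") as src:
                shutil.copyfileobj(src, dst)
    return Path(dst.name), True
```

The temporary file can then be opened with any HDF5 reader and removed afterwards, e.g. with `plain_path.unlink()` when `is_temporary` is true.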

The PR fixes hdf5db.py too, so that databases of HDF5 files include compressed files as well.
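As a rough illustration of what including compressed files in a database scan involves, here is a minimal sketch that recognizes both plain and compressed HDF5 file names. It is not the code in hdf5db.py: the function names and the suffix lists (`.h5`, `.hdf5`, `.gz`, `.bz2`, `.xz`, `.zst`) are assumptions made for the example.

```python
from pathlib import Path

# Both tuples are assumptions made for this example
HDF5_SUFFIXES = (".h5", ".hdf5")
COMPRESSED_SUFFIXES = (".gz", ".bz2", ".xz", ".zst")


def is_hdf5_candidate(path) -> bool:
    """Return True for names like "a.h5" or "a.h5.zst"."""
    name = Path(path).name.lower()
    # Strip at most one compression suffix, then check the HDF5 suffix
    for comp in COMPRESSED_SUFFIXES:
        if name.endswith(comp):
            name = name[: -len(comp)]
            break
    return name.endswith(HDF5_SUFFIXES)


def find_hdf5_files(root):
    """Return the sorted list of plain and compressed HDF5 files under `root`."""
    return sorted(
        p for p in Path(root).rglob("*") if p.is_file() and is_hdf5_candidate(p)
    )
```

For instance, `find_hdf5_files("/path/to/data")` (a placeholder path) would list `my_file.h5` and `my_file.h5.zst` alike, while skipping unrelated files such as `notes.txt.gz`.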