For how to add this in, I think there are 2 options:

1. Add `chunks=(256, 256), compression=4, shuffle=True` here: https://github.com/opera-adt/COMPASS/blob/main/src/compass/utils/h5_helpers.py#L93-L94, and add separate Python code after the fact to zero the mantissa bits after the initial write.
2. Leave the initial creation as-is, then run the `zero_mantissa.py` script I sent to Virginia to zero the mantissa bits, and add chunks/compression/shuffle using `h5repack`.
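For concreteness, here is a rough sketch of what option 1 could look like (the file name, dataset name, and the `zero_mantissa_bits` helper are illustrative; this is not the actual COMPASS code or the exact script mentioned above):

```python
import h5py
import numpy as np


def zero_mantissa_bits(data: np.ndarray, nbits: int = 10) -> np.ndarray:
    """Zero the lowest `nbits` mantissa bits of a float32 array.

    Discarding precision we don't need lets gzip compress the
    (byte-shuffled) data much further.
    """
    mask = ~np.uint32((1 << nbits) - 1)  # e.g. 0xFFFFFC00 for nbits=10
    return (data.view(np.uint32) & mask).view(np.float32)


# Stand-in for a real geocoded burst.
data = zero_mantissa_bits(np.random.rand(4096, 4096).astype(np.float32))

with h5py.File("burst.h5", "w") as hf:
    hf.create_dataset(
        "geocoded_burst",
        data=data,
        chunks=(256, 256),
        compression=4,   # integer compression is h5py shorthand for gzip level 4
        shuffle=True,    # byte-shuffle filter applied before gzip
    )
```

Option 2 would apply the same masking in place on an existing file and then add the filters with something along the lines of `h5repack -f SHUF -f GZIP=4 -l CHUNK=256x256 in.h5 out.h5`.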
@LiangJYu I believe we have addressed this issue. Can you link to the PR where this is done and close the issue? Thanks
Checked for duplicates
Yes - I've already checked
Alternatives considered
Yes - and alternatives don't suffice
Related problems
Adding an HDF5 compression filter will shrink the geocoded burst size on disk from ~700-800 MB down to 100-200 MB. Adding chunking will also make partial data access much easier and faster (for instance, if we want to read a partial area from a large stack of bursts).
Describe the feature request
In order to use a compression filter, chunking is required: https://docs.h5py.org/en/stable/high/dataset.html#filter-pipeline
We should probably specify the chunk size manually; the default from `chunks=True` in h5py isn't bad, but it would differ for bursts of different sizes.

Chunk size
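To see what the default picks for a given burst, the guessed shape can be inspected directly (a small sketch; the file and dataset names are made up). Note that h5py also turns auto-chunking on by itself when a compression filter is requested without an explicit chunk shape:

```python
import h5py
import numpy as np

with h5py.File("auto_chunks.h5", "w") as hf:
    # Let h5py guess the chunk shape for a burst-sized dataset.
    dset = hf.create_dataset(
        "geocoded_burst", shape=(9000, 6000), dtype=np.float32, chunks=True
    )
    print(dset.chunks)  # the guessed shape; it changes with the dataset shape
```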
Since entire chunks are always read at once, the chunk size is a tradeoff between compression efficiency, per-chunk overhead, and how much data a partial read has to pull in:
I did a test of different chunk sizes on a burst: I zeroed the last 10 mantissa bits, used `gzip=4` compression, and varied the chunk size.

Original file: [size listing not captured]

Sizes of different compressed chunks: [table not captured]
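A sweep like this is easy to reproduce; roughly (a sketch, not the original test script, and random data won't compress nearly as well as a real burst):

```python
import os

import h5py
import numpy as np

# Stand-in burst; real data (after mantissa zeroing) compresses far better.
data = np.random.rand(9000, 6000).astype(np.float32)

for n in (64, 128, 256, 512, 1024):
    fname = f"chunks_{n}.h5"
    with h5py.File(fname, "w") as hf:
        hf.create_dataset(
            "data", data=data, chunks=(n, n), compression=4, shuffle=True
        )
    print(f"{n} x {n}: {os.path.getsize(fname) / 1e6:.1f} MB")
```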
The reason larger chunks are getting compressed more is that the `shuffle` filter has "more room" to rearrange bytes and get better compression. The overhead from the number of chunks isn't too large; here's adding only chunks, no compression/shuffle: [table not captured]

Since 256 x 256 is still below the 1 MB mark, I might recommend the 256 x 256 chunking.
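The shuffle effect itself is easy to spot-check by writing the same array with the filter on and off (again a sketch with placeholder names):

```python
import os

import h5py
import numpy as np

# Smooth data, where grouping each float's bytes together helps gzip a lot.
data = np.linspace(0, 1, 4096 * 4096, dtype=np.float32).reshape(4096, 4096)

for shuffle in (False, True):
    fname = f"shuffle_{shuffle}.h5"
    with h5py.File(fname, "w") as hf:
        hf.create_dataset(
            "data", data=data, chunks=(256, 256), compression=4, shuffle=shuffle
        )
    print(f"shuffle={shuffle}: {os.path.getsize(fname) / 1e6:.1f} MB")
```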
Compression level
For how much GZIP compression to use, the write times/disk space for GZIP=2 through 7 were as follows: [table not captured]

Since the returns diminish so quickly with compression level, I might recommend GZIP=3 or 4.
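That write-time vs. disk-space tradeoff can be re-measured the same way (a sketch, not the original benchmark):

```python
import os
import time

import h5py
import numpy as np

data = np.random.rand(9000, 6000).astype(np.float32)  # stand-in burst

for level in range(2, 8):
    fname = f"gzip_{level}.h5"
    t0 = time.perf_counter()
    with h5py.File(fname, "w") as hf:
        hf.create_dataset(
            "data", data=data, chunks=(256, 256), compression=level, shuffle=True
        )
    elapsed = time.perf_counter() - t0
    print(f"GZIP={level}: {elapsed:.1f} s, {os.path.getsize(fname) / 1e6:.1f} MB")
```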