opera-adt / COMPASS

COregistered Multi-temPorAl Sar Slc
Apache License 2.0

[New Feature]: Add chunking + compression #79

Closed scottstanie closed 1 year ago

scottstanie commented 1 year ago

Checked for duplicates

Yes - I've already checked

Alternatives considered

Yes - and alternatives don't suffice

Related problems

Adding an HDF5 compression filter will shrink the geocoded burst size on disk from ~700-800 MB down to 100-200 MB. Adding chunking will also make partial data access much easier and faster (for instance, reading a small area from a large stack of bursts).

Describe the feature request

In order to use a compression filter, chunking is required: https://docs.h5py.org/en/stable/high/dataset.html#filter-pipeline

We should probably specify the chunk size manually. The default from chunks=True in h5py isn't bad, but it would differ between bursts of different sizes.
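For reference, here is a minimal h5py sketch of what explicit chunking plus gzip and shuffle would look like. The dataset path mirrors the h5repack commands below, but the array shape, dtype, and file name are placeholders, not the actual COMPASS writer code:

```python
import h5py
import numpy as np

# Placeholder array standing in for a geocoded burst (CSLC samples are complex float32).
data = np.zeros((4096, 8192), dtype=np.complex64)

with h5py.File("burst_example.h5", "w") as hf:
    hf.create_dataset(
        "science/SENTINEL1/CSLC/grids/VV",
        data=data,
        chunks=(256, 256),        # explicit chunk shape instead of chunks=True
        compression="gzip",       # same as GZIP=4 in the h5repack runs below
        compression_opts=4,
        shuffle=True,             # byte-shuffle filter applied before gzip
    )
```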

Chunk size

Since entire chunks are always read at once, the chunk size trades off compression ratio against the granularity of partial reads.

I did a test of different chunk sizes on a burst. I zeroed the last 10 mantissa bits, used gzip=4 compression, and varied the chunk size. Original file:

-rwxr-xr-x 1 staniewi users 713M Jan  4 08:53 t064_135519_iw1_20220501_VV.h5

Sizes of different compressed chunks:

$ for c in 16 32 64 128 256 512; do h5repack -f science/SENTINEL1/CSLC/grids/VV:SHUF -l science/SENTINEL1/CSLC/grids/VV:CHUNK=${c}x${c} -f science/SENTINEL1/CSLC/grids/VV:GZIP=4 t064_135519_iw1_20220501_VV.h5 test_chunk_${c}.h5; echo "Done with $c"; done

...

$ ls -lh
-rw-r--r-- 1 staniewi users 130M Jan  4 09:21 test_chunk_512.h5
-rw-r--r-- 1 staniewi users 132M Jan  4 09:00 test_chunk_256.h5
-rw-r--r-- 1 staniewi users 138M Jan  4 09:00 test_chunk_128.h5
-rw-r--r-- 1 staniewi users 146M Jan  4 08:59 test_chunk_64.h5
-rw-r--r-- 1 staniewi users 155M Jan  4 08:59 test_chunk_32.h5
-rw-r--r-- 1 staniewi users 187M Jan  4 08:59 test_chunk_16.h5

The reason larger chunks compress better is that the shuffle filter has "more room" to rearrange bytes before gzip. The overhead from the number of chunks itself isn't large (here's chunking only, with no compression/shuffle):

-rw-r--r-- 1 staniewi users 724M Jan  4 09:13 test_chunk_128_only.h5
-rw-r--r-- 1 staniewi users 719M Jan  4 09:13 test_chunk_32_only.h5

Since a 256 x 256 chunk of complex64 samples is 512 KB, still under the 1 MB mark, I might recommend the 256 x 256 chunking.
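To sanity-check the repacked layout and to see the partial-access benefit mentioned above, something like this works (file and dataset names follow the h5repack commands above; the read window is arbitrary):

```python
import h5py

with h5py.File("test_chunk_256.h5", "r") as hf:
    dset = hf["science/SENTINEL1/CSLC/grids/VV"]
    print("chunks:", dset.chunks)                  # expect (256, 256)
    print("compression:", dset.compression, dset.compression_opts)
    print("shuffle:", dset.shuffle)

    # Reading a small window only decompresses the 256 x 256 chunks that
    # intersect it, rather than the whole burst.
    window = dset[1024:1536, 2048:2560]
```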

Compression level

For choosing the GZIP compression level, the repack times and disk sizes for GZIP=2 through GZIP=7 were as follows:

$ for g in 2 3 4 5 6 7; do time h5repack -v2 -f science/SENTINEL1/CSLC/grids/VV:SHUF -l science/SENTINEL1/CSLC/grids/VV:CHUNK=256x256 -f science/SENTINEL1/CSLC/grids/VV:GZIP=${g} t064_135519_iw1_20220501_VV.h5 test_chunk_256_gzip_${g}.h5 >> timings.log; done

| GZIP level | Repack time | Size   |
|-----------:|------------:|-------:|
| 2          | 0m13.249s   | 135 MB |
| 3          | 0m16.391s   | 134 MB |
| 4          | 0m19.035s   | 132 MB |
| 5          | 0m25.464s   | 132 MB |
| 6          | 0m42.260s   | 130 MB |
| 7          | 1m7.643s    | 129 MB |

Since the returns diminish so quickly with compression level, I might recommend GZIP=3 or 4.

scottstanie commented 1 year ago

As for how to add this, I think there are two options:

  1. Add chunks=(256, 256), compression=4, shuffle=True here: https://github.com/opera-adt/COMPASS/blob/main/src/compass/utils/h5_helpers.py#L93-L94, then run separate Python code to zero the mantissa bits after the initial write.

  2. Leave the initial creation as-is, then run the zero_mantissa.py script I sent to Virginia to zero the mantissa bits, and add chunks/compression/shuffle using h5repack (a rough sketch of the mantissa-zeroing step is below).
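The zero_mantissa.py script itself isn't attached to this issue, so the following is only a rough sketch of the bit-masking it presumably performs. The 10-bit count matches the test above; the dataset path and the in-place rewrite are assumptions:

```python
import h5py
import numpy as np

BITS_TO_ZERO = 10  # "last 10 mantissa bits", as in the compression test above
MASK = np.uint32(0xFFFFFFFF ^ ((1 << BITS_TO_ZERO) - 1))

def zero_mantissa_bits(arr: np.ndarray) -> np.ndarray:
    """Zero the low mantissa bits of a complex64 array so it compresses better."""
    # View the interleaved real/imag float32 components as uint32 and mask the
    # low bits of each value (i.e. the low bits of the float32 mantissa).
    ints = arr.view(np.float32).view(np.uint32)
    ints &= MASK
    return ints.view(np.float32).view(np.complex64)

# Rewrite the dataset in place; h5repack would then add chunking/compression/shuffle.
with h5py.File("t064_135519_iw1_20220501_VV.h5", "r+") as hf:
    dset = hf["science/SENTINEL1/CSLC/grids/VV"]
    dset[...] = zero_mantissa_bits(dset[...])
```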

vbrancat commented 1 year ago

@LiangJYu I believe we have addressed this issue. Can you link to the PR where this is done and close the issue? Thanks

LiangJYu commented 1 year ago

> @LiangJYu I believe we have addressed this issue. Can you link to the PR where this is done and close the issue? Thanks

#127 adds chunking and compression