opera-adt / COMPASS

COregistered Multi-temPorAl Sar Slc
Apache License 2.0

[New Feature]: Add chunking + compression #79

Closed scottstanie closed 1 year ago

scottstanie commented 1 year ago

Checked for duplicates

Yes - I've already checked

Alternatives considered

Yes - and alternatives don't suffice

Related problems

Adding an HDF5 compression filter will shrink the geocoded burst size on disk from ~700-800 MB down to 100-200 MB. Adding chunking will also make partial data access much easier and faster (for instance, reading a small area from a large stack of bursts).

Describe the feature request

In order to use a compression filter, chunking is required: https://docs.h5py.org/en/stable/high/dataset.html#filter-pipeline

We should probably specify the chunk size manually. The default from chunks=True in h5py isn't bad, but it would differ between bursts of different sizes.
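For reference, here is a minimal h5py sketch of what explicit chunking plus gzip and shuffle would look like. The dataset path mirrors the h5repack commands below, but the array shape, dtype, and file name are placeholders, not the actual COMPASS writer code:

```python
import h5py
import numpy as np

# Placeholder array standing in for a geocoded burst (CSLC samples are complex float32).
data = np.zeros((4096, 8192), dtype=np.complex64)

with h5py.File("burst_example.h5", "w") as hf:
    hf.create_dataset(
        "science/SENTINEL1/CSLC/grids/VV",
        data=data,
        chunks=(256, 256),        # explicit chunk shape instead of chunks=True
        compression="gzip",       # same as GZIP=4 in the h5repack runs below
        compression_opts=4,
        shuffle=True,             # byte-shuffle filter applied before gzip
    )
```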

Chunk size

Since entire chunks are always read at once, the chunk size trades off compression ratio against the granularity of partial reads.

I did a test of different chunk sizes on a burst. I zeroed the last 10 mantissa bits, used gzip=4 compression, and varied the chunk size. Original file:

-rwxr-xr-x 1 staniewi users 713M Jan  4 08:53 t064_135519_iw1_20220501_VV.h5

Sizes of different compressed chunks:

$ for c in 16 32 64 128 256 512; do h5repack -f science/SENTINEL1/CSLC/grids/VV:SHUF -l science/SENTINEL1/CSLC/grids/VV:CHUNK=${c}x${c} -f science/SENTINEL1/CSLC/grids/VV:GZIP=4 t064_135519_iw1_20220501_VV.h5 test_chunk_${c}.h5; echo "Done with $c"; done

...

$ ls -lh
-rw-r--r-- 1 staniewi users 130M Jan  4 09:21 test_chunk_512.h5
-rw-r--r-- 1 staniewi users 132M Jan  4 09:00 test_chunk_256.h5
-rw-r--r-- 1 staniewi users 138M Jan  4 09:00 test_chunk_128.h5
-rw-r--r-- 1 staniewi users 146M Jan  4 08:59 test_chunk_64.h5
-rw-r--r-- 1 staniewi users 155M Jan  4 08:59 test_chunk_32.h5
-rw-r--r-- 1 staniewi users 187M Jan  4 08:59 test_chunk_16.h5

The reason larger chunks compress better is that the shuffle filter has "more room" to rearrange bytes before gzip. The overhead from the number of chunks itself isn't large (here's chunking only, with no compression/shuffle):

-rw-r--r-- 1 staniewi users 724M Jan  4 09:13 test_chunk_128_only.h5
-rw-r--r-- 1 staniewi users 719M Jan  4 09:13 test_chunk_32_only.h5

Since a 256 x 256 chunk of complex64 samples is 512 KB, still under the 1 MB mark, I might recommend the 256 x 256 chunking.
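To sanity-check the repacked layout and to see the partial-access benefit mentioned above, something like this works (file and dataset names follow the h5repack commands above; the read window is arbitrary):

```python
import h5py

with h5py.File("test_chunk_256.h5", "r") as hf:
    dset = hf["science/SENTINEL1/CSLC/grids/VV"]
    print("chunks:", dset.chunks)                  # expect (256, 256)
    print("compression:", dset.compression, dset.compression_opts)
    print("shuffle:", dset.shuffle)

    # Reading a small window only decompresses the 256 x 256 chunks that
    # intersect it, rather than the whole burst.
    window = dset[1024:1536, 2048:2560]
```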

Compression level

For choosing the GZIP compression level, the repack times and disk sizes for GZIP=2 through GZIP=7 were as follows:

$ for g in 2 3 4 5 6 7; do time h5repack -v2 -f science/SENTINEL1/CSLC/grids/VV:SHUF -l science/SENTINEL1/CSLC/grids/VV:CHUNK=256x256 -f science/SENTINEL1/CSLC/grids/VV:GZIP=${g} t064_135519_iw1_20220501_VV.h5 test_chunk_256_gzip_${g}.h5 >> timings.log; done

| GZIP level | Repack time | Size   |
|-----------:|------------:|-------:|
| 2          | 0m13.249s   | 135 MB |
| 3          | 0m16.391s   | 134 MB |
| 4          | 0m19.035s   | 132 MB |
| 5          | 0m25.464s   | 132 MB |
| 6          | 0m42.260s   | 130 MB |
| 7          | 1m7.643s    | 129 MB |

Since the returns diminish so quickly with compression level, I might recommend GZIP=3 or 4.

scottstanie commented 1 year ago

As for how to add this, I think there are two options:

  1. Add chunks=(256, 256), compression=4, shuffle=True here: https://github.com/opera-adt/COMPASS/blob/main/src/compass/utils/h5_helpers.py#L93-L94, then run separate Python code to zero the mantissa bits after the initial write.

  2. Leave the initial creation as-is, then run the zero_mantissa.py script I sent to Virginia to zero the mantissa bits, and add chunks/compression/shuffle using h5repack (a rough sketch of the mantissa-zeroing step is below).
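The zero_mantissa.py script itself isn't attached to this issue, so the following is only a rough sketch of the bit-masking it presumably performs. The 10-bit count matches the test above; the dataset path and the in-place rewrite are assumptions:

```python
import h5py
import numpy as np

BITS_TO_ZERO = 10  # "last 10 mantissa bits", as in the compression test above
MASK = np.uint32(0xFFFFFFFF ^ ((1 << BITS_TO_ZERO) - 1))

def zero_mantissa_bits(arr: np.ndarray) -> np.ndarray:
    """Zero the low mantissa bits of a complex64 array so it compresses better."""
    # View the interleaved real/imag float32 components as uint32 and mask the
    # low bits of each value (i.e. the low bits of the float32 mantissa).
    ints = arr.view(np.float32).view(np.uint32)
    ints &= MASK
    return ints.view(np.float32).view(np.complex64)

# Rewrite the dataset in place; h5repack would then add chunking/compression/shuffle.
with h5py.File("t064_135519_iw1_20220501_VV.h5", "r+") as hf:
    dset = hf["science/SENTINEL1/CSLC/grids/VV"]
    dset[...] = zero_mantissa_bits(dset[...])
```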

vbrancat commented 1 year ago

@LiangJYu I believe we have addressed this issue. Can you link to the PR where this is done and close the issue? Thanks

LiangJYu commented 1 year ago

> @LiangJYu I believe we have addressed this issue. Can you link to the PR where this is done and close the issue? Thanks

#127 adds chunking and compression