Provide an option allowing bitshuffle with compression enabled to abort when the output would exceed the input size in bytes

Allow bitshuffle, when an embedded compressor such as lz4 is enabled, to be asked to abort when the size of the compressed output exceeds the size of the raw input data. This would let the bitshuffle filter with lz4 compression enabled behave in conformance with the description of the H5Z_FLAG_OPTIONAL flag in the [Defining and Querying the Filter Pipeline](https://docs.hdfgroup.org/hdf5/develop/_f_i_l_t_e_r.html) section of the libhdf5 manual:


| Values for flags | Description |
| --- | --- |
| H5Z_FLAG_OPTIONAL | If this bit is set then the filter is optional. If the filter fails (see below) during an H5Dwrite() operation then the filter is just excluded from the pipeline for the chunk for which it failed; the filter will not participate in the pipeline during an H5Dread() of the chunk. This is commonly used for compression filters: if the compression result would be larger than the input then the compression filter returns failure and the uncompressed data is stored in the file. If this bit is clear and a filter fails then the H5Dwrite() or H5Dread() also fails. |
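The behaviour quoted above can be sketched in a few lines of Python. This is only an illustration of the semantics, not HDF5 code: `FilterFailed`, `deflate_filter` and `write_chunk` are hypothetical names, and zlib stands in for whatever compressor the filter wraps.

```python
import zlib


class FilterFailed(Exception):
    """Raised by a (hypothetical) filter to signal failure for one chunk."""


def deflate_filter(chunk: bytes) -> bytes:
    # Report failure when the compressed result is larger than the input.
    out = zlib.compress(chunk)
    if len(out) > len(chunk):
        raise FilterFailed("compressed output is larger than input")
    return out


def write_chunk(chunk: bytes, filt, optional: bool):
    """Return (stored_bytes, filter_applied) for a single chunk."""
    try:
        return filt(chunk), True
    except FilterFailed:
        if optional:
            # Optional filter: exclude it for this chunk, store raw bytes.
            return chunk, False
        raise  # Mandatory filter: the whole write fails.
```

With `optional=True`, a chunk that does not shrink is stored as-is and marked as unfiltered, while a well-compressible chunk is stored compressed.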

At least to me, it would be the more natural and thus expected behaviour that data is only stored compressed in the HDF5 file when compression actually yields a benefit in size. Further, I do not consider it the application's task to decide whether or not to compress a dataset. At the application level it would always be a wild guess whether the data is compressible and thus likely to need fewer bytes in compressed form than in its uncompressed representation. This decision can only be made while actually compressing the data, when it becomes clear whether the extra bytes needed for the header, housekeeping, code tables and other necessary bits leave the resulting chunk smaller than the input or cause it to exceed the input size.

An example: an array a = numpy.array([2, 3], dtype=numpy.float32) occupies exactly 8 bytes in uncompressed form, excluding metadata. When compressed with bitshuffle + lz4, these 8 bytes of data end up in the HDF5 file with the following storage layout, as reported by h5dump:

      STORAGE_LAYOUT {
         CHUNKED ( 2 )
         SIZE 20 (0.400:1 COMPRESSION)
      }

If I read that correctly, this means the array is expanded by a factor of 2.5 to 20 bytes, so each raw byte occupies 2.5 bytes in the compressed output. I admit this example is very artificial, but it demonstrates the point: poorly compressible data may, even when preprocessed by the bitshuffle filter, occupy more space in the HDF5 file (excluding metadata) after the compression filter has been applied than when stored in its original form.

Take the gzip filter as an example. The H5Z_filter_deflate function, which implements the actual filter in H5Zdeflate.c from line 155 onward, allocates an output buffer with the same size nbytes as the input data. When the libz compress2 call returns Z_BUF_ERROR, indicating that the whole output buffer has been used while some bytes still remain to be processed and stored, compression is aborted because nbytes would be exceeded.
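The expansion of tiny inputs is easy to reproduce with a general-purpose compressor. The sketch below uses Python's zlib module rather than bitshuffle + lz4, so the absolute numbers differ from the h5dump output above; `compression_ratio` is a hypothetical helper that mirrors the raw-to-stored ratio h5dump prints.

```python
import zlib


def compression_ratio(raw_size: int, stored_size: int) -> float:
    # Ratio of raw size to stored size, as printed by h5dump.
    return raw_size / stored_size


# The 8 raw bytes of float32 values 2.0 and 3.0 (little-endian).
raw = b"\x00\x00\x00\x40\x00\x00\x40\x40"
packed = zlib.compress(raw)

# zlib adds a header and a checksum, so 8 bytes cannot shrink.
print(len(raw), len(packed), len(packed) > len(raw))

# The figures from the h5dump output: 8 raw bytes stored in 20 bytes.
print(f"{compression_ratio(8, 20):.3f}:1")  # 0.400:1
```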

I guess there are special situations where compression has to be used, or where its use makes sense regardless of whether the data bytes stored in a dataset are compressible. It would therefore be good to have a choice whether the bitshuffle filter with embedded compression enabled (lz4 or any other supported compressor) should abort when the compressed output, including the header and housekeeping bytes required by the filter, exceeds the input size, or should emit compressed output in any case.
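A minimal sketch of what such a choice could look like, written in Python for brevity. Everything here is an assumption for illustration: `compress_chunk` and `abort_if_larger` are hypothetical names and not part of bitshuffle, the 12-byte header layout is made up, and zlib stands in for the embedded lz4 step.

```python
import zlib
from typing import Optional


def compress_chunk(raw: bytes, abort_if_larger: bool = False) -> Optional[bytes]:
    """Return header + payload, or None when the abort option triggers."""
    payload = zlib.compress(raw)  # stand-in for the embedded compressor
    # Made-up header: 8 bytes raw size, 4 bytes payload size.
    header = len(raw).to_bytes(8, "big") + len(payload).to_bytes(4, "big")
    out = header + payload
    # Compare the full output, header included, against the raw input.
    if abort_if_larger and len(out) > len(raw):
        return None  # caller treats this as a filter failure
    return out
```

In an actual HDF5 filter callback, failure is signalled by returning zero, which is what would let H5Z_FLAG_OPTIONAL take effect; with the option off, the sketch always emits compressed output, matching the current behaviour described in this issue.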

kiyo-masui / bitshuffle #110