Allow requesting that bitshuffle, when an embedded compressor such as lz4 is enabled, abort when the size of the compressed output exceeds the size of the raw input data. This would let the bitshuffle filter with lz4 compression enabled behave in conformance with the description of the `H5Z_FLAG_OPTIONAL` flag in the [Defining and Querying the Filter Pipeline](https://docs.hdfgroup.org/hdf5/develop/_f_i_l_t_e_r.html) section of the libhdf5 manual:

| Values for flags | Description |
| --- | --- |
| `H5Z_FLAG_OPTIONAL` | If this bit is set then the filter is optional. If the filter fails (see below) during an H5Dwrite() operation then the filter is just excluded from the pipeline for the chunk for which it failed; the filter will not participate in the pipeline during an H5Dread() of the chunk. This is commonly used for compression filters: if the compression result would be larger than the input then the compression filter returns failure and the uncompressed data is stored in the file. If this bit is clear and a filter fails then the H5Dwrite() or H5Dread() also fails. |

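
To make the quoted semantics concrete, here is a minimal Python model of how a write could treat a failing filter. It is only a sketch of the documented behaviour, not libhdf5 code: `write_chunk`, `Filter` and `always_fails` are hypothetical names introduced for illustration.

```python
from typing import Callable, Optional, Tuple

# Hypothetical model of a filter: returns the filtered bytes, or None on failure.
Filter = Callable[[bytes], Optional[bytes]]


def write_chunk(raw: bytes, filt: Filter, optional: bool) -> Tuple[bytes, bool]:
    """Return (stored_bytes, filter_applied) for a single chunk."""
    out = filt(raw)
    if out is not None:
        return out, True
    if optional:
        # Optional filter failed: exclude it for this chunk, store the raw bytes.
        return raw, False
    # Mandatory filter failed: the write as a whole fails.
    raise IOError("filter failed and is not optional")


def always_fails(raw: bytes) -> Optional[bytes]:
    """Stand-in for a compression filter that reports failure."""
    return None
```

With `optional=True` the chunk is stored unfiltered, with `optional=False` the write raises, which mirrors the two cases in the table.
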
To me it would be more natural, and thus the expected behaviour, that data is only stored compressed in the hdf5 file when compression actually yields a benefit in size. Further, I do not consider it the application's task to decide whether or not to compress a dataset. At the application level it would always be a wild guess whether the data is compressible and thus likely to need fewer bytes in compressed form than in its uncompressed representation. This decision can only be made while actually compressing the data and finding out whether the extra bytes needed for header, housekeeping, code tables and other necessary bits leave the resulting chunk smaller than the input or cause it to exceed the input size.
An example: an array `a = np.array([2, 3], dtype=np.float32)` occupies exactly 8 bytes in uncompressed form, excluding metadata. When compressing with bitshuffle + lz4, the 8 bytes of data end up in the hdf5 file with the following storage layout as reported by h5dump:
If I read that correctly, the compressed array is expanded by a factor of 2.5 to 20 bytes, so each raw byte occupies 2.5 bytes in the compressed output. I admit this example is very artificial, but poorly compressible data may, even when preprocessed by the bitshuffle filter, occupy more space in the hdf5 file (excluding metadata) after the compression filter is applied than when stored in its original form.
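
The effect is not specific to bitshuffle. As a rough illustration using only the Python standard library, the same 8 bytes can be run through `zlib`, which is used here purely as a stand-in codec and is not bitshuffle + lz4. The exact sizes will therefore differ from the h5dump numbers above.

```python
import struct
import zlib

# The same 8 bytes as the float32 array [2, 3] (little-endian).
raw = struct.pack("<2f", 2.0, 3.0)

# Stand-in codec, NOT bitshuffle + lz4: the fixed stream framing
# (header and checksum) dominates for such a tiny input.
compressed = zlib.compress(raw)

expansion = len(compressed) / len(raw)
print(f"raw={len(raw)} bytes, compressed={len(compressed)} bytes, "
      f"expansion factor={expansion:.2f}")
```
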
Take the gzip filter as an example. The `H5Z_filter_deflate` function, which implements the actual filter in `H5Zdeflate.c` (from line 155 down), allocates an output buffer of the same size (`nbytes`) as the input data. When libz `compress2` returns `Z_BUF_ERROR`, indicating that the whole output buffer has been used while some bytes still remain to be processed and stored, compression is aborted because `nbytes` would be exceeded.
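
The bounded-buffer behaviour described above can be emulated in Python. This is a sketch, not the libhdf5 implementation: Python's `zlib.compress` has no fixed-size output buffer, so the check on the result length stands in for `compress2` reporting `Z_BUF_ERROR`, and `deflate_bounded` is a hypothetical name.

```python
import zlib
from typing import Optional


def deflate_bounded(raw: bytes, level: int = 6) -> Optional[bytes]:
    """Emulate compressing into an output buffer of len(raw) bytes.

    Returns None in the case where a buffer the size of the input would
    have been too small, i.e. where compress2 would report Z_BUF_ERROR.
    """
    compressed = zlib.compress(raw, level)
    if len(compressed) > len(raw):
        return None  # would not have fit: treat as filter failure
    return compressed
```

For the 8-byte example this returns `None`, while a highly compressible chunk such as `bytes(4096)` is returned in compressed form.
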
I guess there are special situations where compression has to be used, or where its use makes sense regardless of whether the data bytes stored in a dataset are compressible. Therefore it would be good to have a choice whether the bitshuffle filter with embedded compression enabled (lz4 or any other supported compressor) should abort when the compressed output, including the header and housekeeping bytes required by the filter, exceeds the input size, or should emit compressed output in any case.
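
The requested choice could look roughly like the following sketch. Everything in it is an assumption made for illustration: `filter_chunk`, the `abort_on_expansion` switch and the `HEADER_SIZE` value are hypothetical and not part of bitshuffle's API, and `zlib` again stands in for lz4.

```python
import zlib
from typing import Optional

HEADER_SIZE = 12  # hypothetical fixed per-chunk header size, illustration only


def filter_chunk(raw: bytes, abort_on_expansion: bool) -> Optional[bytes]:
    """Compress one chunk; optionally fail if the result is larger than raw."""
    # Hypothetical header that just records the uncompressed length.
    header = len(raw).to_bytes(HEADER_SIZE, "big")
    out = header + zlib.compress(raw)  # zlib stands in for lz4 here
    # The comparison includes the header bytes, not only the payload.
    if abort_on_expansion and len(out) > len(raw):
        return None  # report failure so an optional filter can be skipped
    return out
```

With `abort_on_expansion=False` the filter keeps today's behaviour and always emits output. With `abort_on_expansion=True` it reports failure for the 8-byte example, which together with `H5Z_FLAG_OPTIONAL` would leave that chunk stored uncompressed.
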