observingClouds / xbitinfo

Python wrapper of BitInformation.jl to easily compress xarray datasets based on their information content
https://xbitinfo.readthedocs.io
MIT License

How to apply bitrounding to `MPIESM` `grb` #67

Closed aaronspring closed 2 years ago

aaronspring commented 2 years ago

Goal:

Issue:

Solution ideas:

_Originally posted by @aaronspring in https://github.com/observingClouds/bitinformation_pipeline/issues/44#issuecomment-1104422757_

Related:

observingClouds commented 2 years ago

I don't think it makes much sense to apply bitrounding to grib files. Often a lossy compression has already been applied, and the data templates that grib supports neither allow floats (only integer offsets relative to a reference value) nor support compressors with bit shuffling, which makes bitrounding ineffective (see the WMO Manual on Codes).

When post-processing grib files, I would try to save the output in a more user-friendly format like NetCDF or Zarr.

milankl commented 2 years ago

While I share your views on user friendliness with grib, note that `bitinformation(::Array)` is defined for any bits type, meaning that you can also pass in linearly quantized data, for example:

```julia
julia> using BitInformation, LinLogQuantization
julia> A = rand(Float32,100);
julia> sort!(A);
julia> B = LinQuant16Array(A)    # convert to a 16-bit linearly quantized array with offset
100-element LinQuantArray{UInt16, 1}:
 0x0000
 0x0006
 0x0042
 0x0452
 0x0472
 0x063a
 0x0c77
 0x10c8
 0x1792
 0x1a0b
      ⋮
 0xed32
 0xeda3
 0xefdc
 0xefdd
 0xf461
 0xf5ca
 0xf7b8
 0xfefc
 0xffff

julia> bitinformation(B.A)    # the actual uint16-array in B is called A (bad choice I know)
16-element Vector{Float64}:
 0.9273256566659501
 0.7992766062258696
 0.622526732096593
 0.3845423259605024
 0.09759283215962768
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
```

Hence, this would suggest that a 5-bit linear quantization is sufficient. One could then keep bits 6-16 as zeros and apply lossless compression, or directly pack the 5-bit unsigned integers together (if that's supported in grib, I don't know). Lossless compression might be a good idea in general, as the second method wouldn't compress away correlations in the data.
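
To make the two options concrete, here is a rough Python sketch using only the standard library. It is not how BitInformation.jl or grib do it: the helper names (`linquant16`, `keep_leading_bits`, `pack_bits`) are made up for illustration, the quantization is a plain min/max scaling, the trailing bits are truncated rather than rounded, and `zlib` just stands in for whatever lossless compressor one would actually use.

```python
import random
import struct
import zlib


def linquant16(values):
    """Linearly quantize floats onto 0..65535, using min/max as the range."""
    lo, hi = min(values), max(values)
    scale = 65535 / (hi - lo) if hi > lo else 0.0
    return [round((v - lo) * scale) for v in values]


def keep_leading_bits(q, keepbits, nbits=16):
    """Zero all but the leading `keepbits` bits (truncation, not rounding)."""
    mask = ((1 << keepbits) - 1) << (nbits - keepbits)
    return [x & mask for x in q]


def pack_bits(q, keepbits, nbits=16):
    """Pack the leading `keepbits` bits of each value back to back."""
    acc = 0
    for x in q:
        acc = (acc << keepbits) | (x >> (nbits - keepbits))
    total = keepbits * len(q)
    pad = (-total) % 8          # pad the last byte with zero bits
    return (acc << pad).to_bytes((total + pad) // 8, "big")


random.seed(0)
data = sorted(random.random() for _ in range(100))   # mimics sort!(rand(100))
q = linquant16(data)
masked = keep_leading_bits(q, keepbits=5)

# Option 1: keep 16-bit words, zero bits 6-16, then compress losslessly.
size_full = len(zlib.compress(struct.pack(f"<{len(q)}H", *q), 9))
size_masked = len(zlib.compress(struct.pack(f"<{len(masked)}H", *masked), 9))

# Option 2: pack the 5-bit integers directly, without a compressor.
packed = pack_bits(q, keepbits=5)

print("compressed, all 16 bits:", size_full, "bytes")
print("compressed, 5 bits kept:", size_masked, "bytes")
print("5-bit packed, no compressor:", len(packed), "bytes")
```

The packed size is fixed by the bit count alone (100 values × 5 bits, rounded up to whole bytes), whereas the masked-and-compressed size depends on the data, so smooth or sorted fields can shrink further under the compressor.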