milankl / BitInformation.jl

Information between bits and bytes.
MIT License
30 stars 3 forks source link

BitInformation for data previously reduced in precision #46

Open rsignell-usgs opened 1 year ago

rsignell-usgs commented 1 year ago

We have a large dataset from WRF, where the providers already reduced precision on some variables before passing to us. For example:

ALBEDO:number_of_significant_digits = 5 ;          
LAI:number_of_significant_digits = 3 ;
LH:number_of_significant_digits = 5 ;   
...

Can we use this knowledge somehow to create an alternate algorithm for calculating the appropriate keepbits?

milankl commented 1 year ago

Do you know whether they rounded in decimal or in binary? Because while rounding in one base also quantizes in the other it doesn't round (as in make the trailing digits/bits shorter/simpler, the actual idea of rounding) in the other. Example:

julia> a = randn(Float32,5)
5-element Vector{Float32}:
 -0.35390654
  1.223104
 -0.04762434
 -1.5507344
  0.89877653

julia> b = round.(a,digits=3)
5-element Vector{Float32}:
 -0.354
  1.223
 -0.048
 -1.551
  0.899

julia> bitstring.(b,:split)
5-element Vector{String}:
 "1 01111101 01101010011111101111101"
 "0 01111111 00111001000101101000100"
 "1 01111010 10001001001101110100110"
 "1 01111111 10001101000011100101011"
 "0 01111110 11001100010010011011101"

While the mantissa bits aren't all zero after some mantissa bits, the resulting array may still be more compressible as the number of possible mantissa bitpatterns still have been greatly reduced.

But to actually answer your question, you can (roughly) translate between significant digits and bits via

nsb(nsd::Integer) = Integer(ceil(log(10)/log(2)*nsd))

which just arises from the idea that with $d$ digits you have a maximum absolute error of $\tfrac{10^{-d}}{2}$, and with $b$ bits you have a maximum absolute error of $\tfrac{2^{-b}}{2}$, equating the two and rounding up rather than down gives you the number of bits you need to have an error that isn't greater than the error from rounding in decimal. There are some caveats that @czender can elaborate more on.

In your case this means that if you already know that your dataset contains only $d$ digits of precision you can definitely round to $d$ digits $b$ bits
1 4
2 7
3 10
4 14
5 17
6 20
7 24

without losing any information. And in fact, you probably want to because if there's real information in the mantissa bits past those it's artificial from the quantization.

milankl commented 1 year ago

In fact, we already discussed that here https://github.com/nco/nco/issues/250

rsignell-usgs commented 1 year ago

@pnorton-usgs and I checked, and these variables were processed with NCO using args like ppc ALBEDO:5, which specifies the Number of Significant Digits (as opposed to the Decimal Significant Digits).

So a few values look like this:

import struct

a = ds['ALBEDO'][100,500:510,600].values

a
array([0.2371893 , 0.22657919, 0.22525072, 0.21817589, 0.20056486,
       0.2204647 , 0.22222233, 0.22418547, 0.22438478, 0.22910786],
      dtype=float32)
def binary(num):
    return ''.join('{:0>8b}'.format(c) for c in struct.pack('!f', num))

for v in list(a):
    print(binary(v))
00111110011100101110000111000000
00111110011010000000010001100000
00111110011001101010100000100000
00111110010111110110100110000000
00111110010011010110000011100000
00111110011000011100000110000000
00111110011000111000111001000000
00111110011001011001000011100000
00111110011001011100010100100000
00111110011010101001101101000000
milankl commented 1 year ago

Just checked what Charlie means by NSD vs DSD, but the former is a relative error vs the latter is an absolute error. Great that you used nco's ppc option, because then rounding is actually done in binary. Depending on the version of nco this should also do granular bitrounding (see here for more on this) meaning that keepbits is somewhat variable from value to value. I guess it's between 16-18 keepbits here, looking at trailing zeros but it's obv impossible to say for sure without knowing the full precision values.

That gives you an upper bound for the bitinformation, but I'd still just run it over and see what the bitinformation analysis says?

rsignell-usgs commented 1 year ago

Will do @milankl !! Thanks for the great rounding examples also. I feel like I'm starting to understand this stuff! :)