milankl / BitInformation.jl

Information between bits and bytes.
MIT License

Understanding latitudinal bounds of bitrounding absolute error #38

Open aaronspring opened 2 years ago

aaronspring commented 2 years ago

I am trying, and failing, to understand the error introduced by bitrounding when comparing the poles and the tropics.

I take sea surface temperature (tos) data from the CMIP6 historical simulation of MPI-ESM1.2-HR; the bitinformation analysis says I should keep 7 mantissa bits, and I bitround accordingly. Finally, I plot the anomaly in time (to get rid of the pole-tropics gradient). Similar to Klöwer et al. Fig. 3, I enforce decreasing levels of real bitwise information from left to right. (Decreasing by keepbits would be more useful.)

[figure: Fig3_ano_HR]

Below is the absolute error due to bitrounding.

[figure: Fig3_ano_error_HR]

Note: keepbits=6 might have been too strong here, as the absolute error shows spatial features and not pure random noise.

What I cannot explain are the very low absolute errors in the high latitudes compared to the higher rounding errors in the tropics, and the quite abrupt cutoff at frontal lines. Does anyone have an idea? The temperature values at the poles are lower than in the tropics, but that doesn't explain this IMO.

data: CMIP6 tos historical last 10 years

all slides: https://docs.google.com/presentation/d/1VeFqeCTqsFc68dN4KuKsqMXVpieFpQblu6kLioijf_U/edit#slide=id.g1257457c99f_0_78

milankl commented 2 years ago

Very nice plots! I assume you analyse the bitinformation of Kelvin instead of ˚C?

aaronspring commented 2 years ago

Quickly ran xbitinfo wrapping BitInformation.jl on the tropics and extratropics separately:

So it's just the temperature gradient causing this, right?

Analysis and bitrounding were done in degC; the conversion to anomalies is only for plotting.


ds["tos"].attrs
{'standard_name': 'sea_surface_temperature',
 'long_name': 'Sea Surface Temperature',
 'comment': 'Temperature of upper boundary of the liquid ocean, including temperatures below sea-ice and floating ice shelves.',
 'units': 'degC',
 'original_name': 'tos'}
milankl commented 2 years ago

No, you are using ˚C 😄 What you can see are the 4, 8, 16˚C isolines of sea surface temperature. With bitrounding, the rounding error that's introduced is constant in absolute terms within every power of 2, but approximately constant in relative terms across powers of 2. Hence, for a constant number of keepbits, values between 8 and 16 have a higher absolute error under rounding than values between 4 and 8. This is just a consequence of floating-point numbers.
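
A minimal sketch of that effect, reusing the `round(::Array, keepbits)` method that also appears further below: rounding two batches of random values that sit in neighbouring powers of 2 roughly doubles the maximum absolute error from one to the next.

julia> using BitInformation

julia> keepbits = 7
julia> low  = 4 .+ 4 .* rand(Float32, 100_000)    # values in [4, 8)
julia> high = 8 .+ 8 .* rand(Float32, 100_000)    # values in [8, 16)

julia> maximum(abs.(round(low, keepbits) .- low))    # ≈ 4*2^-keepbits/2 = 0.015625
julia> maximum(abs.(round(high, keepbits) .- high))  # ≈ twice as large, 0.03125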

So the underlying question is rather whether floating-point numbers are the best number format to represent sea surface temperature. Given that sea surface temperature is a rather linearly distributed variable (as in, its histogram doesn't span orders of magnitude), one could also use linear quantization to encode these values (similar to the scale/offset integer options in netCDF); this would give you a constant magnitude of the rounding error across the globe.
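
A minimal sketch of such a scale/offset quantization (illustrative only, not a BitInformation.jl function), packing into 16-bit unsigned integers so that the maximum error is half a quantization step everywhere:

julia> sst = 30 .* rand(Float32, 100_000)               # high precision data in ˚C
julia> vmin, vmax = extrema(sst)
julia> scale = (vmax - vmin) / (2^16 - 1)               # one quantization step
julia> packed = round.(UInt16, (sst .- vmin) ./ scale)  # encode as integers
julia> unpacked = vmin .+ scale .* Float32.(packed)     # decode
julia> maximum(abs.(unpacked .- sst))                   # ≤ scale/2, independent of magnitude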

Alternatively, you can also use Kelvin instead of ˚C. For Kelvin, the relevant powers of 2 are 256 and 512, and all sea surface temperature values should fall between the two. Hence all floats used to describe these temperatures are linearly distributed in the required range. However, with Kelvin your bitinformation analysis should tell you to keep more mantissa bits, as the information is shifted towards the tail of the mantissa bits. Your lossless compression algorithm then has to take care of the redundant exponent bits (sign and all exponent bits are identical when all values are in [256, 512)).
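
For illustration (a sketch with made-up SST values in Kelvin, using the bitstring(x, :split) method that also appears further down): every Float32 in [256, 512) has the exponent field 10000111, so sign and exponent bits carry no information and everything ends up in the mantissa.

julia> sst_K = Float32.([271.4, 285.0, 302.7])   # some made-up SSTs in Kelvin
julia> bitstring.(sst_K, :split)                 # :split comes with BitInformation.jl
# all three strings start with "0 10000111 ...": identical sign and exponent bits,
# only the mantissa bits differ, so that's where the information (and keepbits) sit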

aaronspring commented 2 years ago

So it is a bit dangerous to run bitinformation on a field with such a strong gradient in space and then bitround? I am trying to understand how not to run into such a situation without prior knowledge.

For global data, I could assume that there might be a gradient along latitudes and find keepbits for different latitudinal bands.
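
A rough sketch of what that could look like (assuming `tos` is a lon x lat x time Float32 array and `lat` the corresponding latitude vector; both names are placeholders):

julia> using BitInformation

julia> tropics      = tos[:, abs.(lat) .< 30, :]    # tropical band
julia> extratropics = tos[:, abs.(lat) .>= 30, :]   # poleward of 30˚

julia> bitinformation(tropics)       # per-bit information content, may imply
julia> bitinformation(extratropics)  # different keepbits for each band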

aaronspring commented 2 years ago

A similar analysis challenge arises when bitrounding multi-level/depth data, where the ocean surface layer, TOA or atmospheric surface layer has very different values compared to the non-boundary layers. Here, analysing and bitrounding 2D fields separately seems safer IMO (see the sketch below).
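
For example (a sketch, assuming `data` is a lon x lat x level Float32 array), one could analyse every level as its own 2D field:

julia> using BitInformation

julia> info_per_level = [bitinformation(data[:, :, k]) for k in axes(data, 3)]
# derive keepbits per level from each entry instead of using one global value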

milankl commented 2 years ago

So it is a bit dangerous to run bitinformation on a field with such a strong gradient in space and then bitround? I am trying to understand how not to run into such a situation without prior knowledge.

I wouldn't describe it as dangerous, as it still does exactly what it is supposed to. Apparently you are in a situation where you'd like rigid absolute error bounds, not relative ones. Floats guarantee you a hard cap on the relative error, but the absolute errors double with every increasing power of 2. So your choice of absolute errors already conflicts with your choice of binary encoding, floats. Remember that bitinformation does not force you into any binary encoding, so that choice is on you. Sure, you can use loads of mantissa bits to overcome the problem, but the real point is that one shouldn't round to a globally constant keepbits when absolute error bounds are desired and the values span several powers of two.

The underlying problem is that ˚C is, from the perspective of the data, an arbitrary offset from Kelvin. Sure, (fresh) water freezes at 0˚C, but that's about it. So today I learned that it's important to remind people which premises have to be checked / questions have to be answered before doing the bitinformation+round business:

  1. Is the data rather linearly or logarithmically distributed?
  2. Which binary encoding is most appropriate given 1? (integer/fixed-point/linear quantization vs floats)
  3. Analyse the bitinformation within that appropriate encoding.
  4. Bitround in that encoding too (and you'll either have rigid absolute error bounds for linear, or relative for log).

Problem is obviously that most people want to use floats regardless of what data they are handling. Sure, that makes sense. So for linearly distributed data (where you want absolute errors to have rigid bounds) we have to come up with a workaround to better adhere to 1.-4. while using floats. Let's take your sea surface temperature example and say you have the high precision data in ˚C and definitely want to store the data as ˚C, so what can one do?

  1. Shift the data into a range where floats are linear, i.e. all data is within one power of 2. For ˚C you could convert to Kelvin (although I'd advise not to* unless you store in K), but you can also just add 256 (if there are no negative temperatures) or 512; a power of 2 is a good choice. Now your encoding matches your data distribution.
  2. Analyse the bitinformation now. While this gives you mantissa bits, it actually suggests an absolute error and not a relative one for your data, as all exponent bits are identical anyway.
  3. Now subtract your initial offset again and you'll get more mantissa bits of precision for higher temperatures and fewer for lower temperatures, so that your maximum absolute error is globally constant.

A quick illustrative example:

julia> sst = 30*rand(Float32,100_000)    # some high precision data in ˚C
100000-element Vector{Float32}:
 11.733397
 13.493755
 14.13639
 12.861835
 25.827908
  7.3809776
  0.36545455
 16.824009
  4.925149
 17.725273
  ⋮

julia> offset = 256
julia> keepbits = 11    # get that from bitinformation(sst .+ offset)
julia> sst_rounded = sort(unique(round(sst.+offset,keepbits)))
241-element Vector{Float32}:
 256.0
 256.125
 256.25
 256.375
 256.5
 256.625
 256.75
 256.875
 257.0
 257.125
   ⋮
 285.0
 285.125
 285.25
 285.375
 285.5
 285.625
 285.75
 285.875
 286.0

Now you have your data rounded to 0.125 K accuracy (the sort(unique(...)) is just to better illustrate that). And after subtracting the offset you'll still have loads of zero mantissa bits, but the effective keepbits increase with higher temperatures to give a constant max abs error. Yay!

julia> bitstring.(sst_rounded .- offset,:split)
241-element Vector{String}:
 "0 00000000 00000000000000000000000"    # zero or one keepbit here
 "0 01111100 00000000000000000000000"
 "0 01111101 00000000000000000000000"
 "0 01111101 10000000000000000000000"
 "0 01111110 00000000000000000000000"
 "0 01111110 01000000000000000000000"     # at least two here
 "0 01111110 10000000000000000000000"
 "0 01111110 11000000000000000000000"
 "0 01111111 00000000000000000000000"
 "0 01111111 00100000000000000000000"    # > three
 ⋮
 "0 10000011 11010000000000000000000"
 "0 10000011 11010010000000000000000"    # > seven
 "0 10000011 11010100000000000000000"
 "0 10000011 11010110000000000000000"
 "0 10000011 11011000000000000000000"
 "0 10000011 11011010000000000000000"
 "0 10000011 11011100000000000000000"
 "0 10000011 11011110000000000000000"
 "0 10000011 11100000000000000000000"
milankl commented 2 years ago

analysing and bitrounding 2D fields separately seems safer IMO.

Yes, that seems like good practical advice.

aaronspring commented 2 years ago

Problem is obviously that most people want to use floats regardless of what data they are handling.

True.

There is still so much to learn and understand about bits for me...

Never thought about this so far...

  1. Is the data rather linearly or logarithmically distributed?
  2. Which binary encoding is most appropriate given 1? (integer/fixed-point/linear quantization vs floats)
milankl commented 2 years ago

There is still so much to learn and understand about bits for me...

For me too! Thanks for posting these issues, I think together we are creating a great resource of do's and don'ts for data compression. Maybe we can even write them down somewhere in a more concise form!?

aaronspring commented 2 years ago

The first idea of @observingClouds and me was writing this Python wrapper xbitinfo, which makes your Julia code easily callable from Python and integrates some first standards for saving compressed netCDF or zarr. But just being able to call bitinformation isn't effective until there are more guidelines on what to do and what not to do.

Maybe we can even write them down somewhere in a more concise form!?

Definitely. Do you mean something like a Jupyter notebook where code and explanation are close to each other? Maybe even a collection of how to handle all different kinds of variables:

milankl commented 2 years ago

Do you mean something like a Jupyter notebook where code and explanation are close to each other?

Yeah, or just a chapter in the documentation which can illustrate these issues with code and figures. Sure, it won't be interactive like a notebook, but I don't know where to upload all that data for an interactive notebook, which might just be overkill.

A collection of how to handle all different kinds of variables:

* `total precip` as an increasing variable over time

* `tos`/`tas` with a strong latitudinal gradient

* `nox` with strong local hotspots

* `some variable` across many ranges of 2**x as above

* `3D_variable` where boundary layer has different distribution than inside layer

* ...

I reckon that many of these issues can be abstracted to similar problems as described above, but it would be good to discuss them from the user/data perspective and not just abstractly from the bit-perspective!