observingClouds / xbitinfo

Python wrapper of BitInformation.jl to easily compress xarray datasets based on their information content
https://xbitinfo.readthedocs.io
MIT License

Artificial information #209

Open milankl opened 1 year ago

milankl commented 1 year ago

We just discussed the artificial information problem and illustrated it

[figure "artificial information": original (blue), quantized (orange), bitrounded (red), real information only (purple), real + artificial information (black)]

So, ideally, one would have an original dataset (blue), analyse its information, bitround it, and obtain a bitrounded dataset (red) such that the information is simply truncated at, e.g., the 99% threshold. However, most climate data does not come in full precision but is already quantized (orange). For example, all data from ECMWF is quantized, including ERA-5, which is very popular and probably the most downloaded climate dataset out there. Having only quantized data, dequantization often creates artificial information, which complicates the analysis and especially poses users of xbitinfo with the question "what is that?", so we would need to explain artificial information and how to circumvent or avoid this problem when assessing the number of keepbits.

There might be several solutions to this problem; two are illustrated here:

1) If the data comes quantized, actually do the information analysis on the quantized integers (and not on floats after dequantization), obtain some keepbits with respect to the integer encoding, and translate them to an error (relative/absolute) that could then be translated to keepbits for floating-point numbers. How well this works we don't know, but it would be awesome to test (see the sketch after this list).

2) Given an original dataset we have the true information distribution at hand (purple line), meaning that if we obtain artificial information there might be a way of correcting for it. For example, there might be some simple way of flagging some information as artificial so it can be removed, and the keepbits obtained from there. Maybe that's a good supervised learning problem: find a function that maps real + artificial information (black) to real information only (purple). This would be another way of addressing the problem.
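
To make possibility 1 concrete, here is a minimal sketch of what an information analysis directly on the quantized integers could look like. It assumes, as BitInformation.jl does for floats, that information is measured as the mutual information between corresponding bits of adjacent values; the function and the uint16 example are illustrative, not part of xbitinfo:

import numpy as np

def bitwise_mutual_information(a):
    """Mutual information (in bits) between each bit of adjacent elements
    of a 1-D unsigned integer array, lowest bit first."""
    n_bits = a.dtype.itemsize * 8
    x, y = a[:-1], a[1:]
    mi = np.zeros(n_bits)
    for b in range(n_bits):
        xb = (x >> b) & 1
        yb = (y >> b) & 1
        # joint and marginal probabilities of the four bit-pair states
        pxy = np.array([[np.mean((xb == i) & (yb == j)) for j in (0, 1)]
                        for i in (0, 1)])
        px, py = pxy.sum(axis=1), pxy.sum(axis=0)
        for i in (0, 1):
            for j in (0, 1):
                if pxy[i, j] > 0:
                    mi[b] += pxy[i, j] * np.log2(pxy[i, j] / (px[i] * py[j]))
    return mi

# example: a smooth signal quantized to uint16; high bits carry
# information, the lowest bits behave like noise
data = np.sin(np.linspace(0, 100, 10_000))
quantized = np.round((data - data.min()) * (2**16 - 1) / np.ptp(data)).astype(np.uint16)
print(bitwise_mutual_information(quantized))

Bits whose mutual information is indistinguishable from zero would be the ones to round away; translating the surviving integer bits into float keepbits is exactly the open question of this issue.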

faze-geek commented 5 months ago

I am interested in working on this problem. I came across it on the GSoC ideas page. Can someone help me understand the current status of this problem statement now that #234 is in place? What future work is targeted here to extend the project?

observingClouds commented 5 months ago

Hi @faze-geek,

#234 is a first implementation to filter artificial information, but it needs more robust testing and might not work in all cases. We would like to extend this filter with another one that is derived from information theory, is less dependent on subjective thresholds, and can be mathematically proven to be generally applicable.

(cc @milankl )

milankl commented 5 months ago

We tried possibility 2 from the graph above, which is trickier than we thought, because the "somehow" isn't well constrained and so it doesn't generalise as easily. We haven't tried possibility 1 yet, which I hope would be more promising.

thodson-usgs commented 5 months ago

Dequantizing works well. Try something like

import numpy as np
import xarray as xr

def dequantize(array, sig_fig):
    # half the quantization step 10**sig_fig, e.g. sig_fig=-1 for data stored to 0.1
    a = 0.5 * 10**sig_fig
    noise = np.random.uniform(-a, a, size=array.shape)
    return array + noise

ds = xr.tutorial.open_dataset("air_temperature")
ds['air'] = dequantize(ds['air'], sig_fig=-1)

I'm assuming the data were rounded rather than truncated, but I don't think this will make any difference in practice.

milankl commented 5 months ago

I think we're talking in different directions here. Dequantizing is what can cause the artificial information in the first place?

thodson-usgs commented 5 months ago

Ah, sorry for walking in without the correct context. I meant that the function above is a simple way to transform the black line into the purple line: starting with the quantized data (10.2), add some noise to the insignificant figures (10.2 + 0.01023461230874), then compute the mutual information.

JoelJaeschke commented 2 months ago

I played around with this a little out of curiosity. I think it is possible to expand on the idea of @thodson-usgs a bit, considering how the quantization is done.

Assume the scaling factor for quantization is derived as $$s=\frac{2^n-1}{\max(\text{data}) - \min(\text{data})},$$ where $n$ is the number of bits of the quantized data, i.e. 16 in the case of uint16/int16, as is common for GRIB files from the CDS, for example. Consequently, the smallest difference that can actually be stored is $\frac{1}{s}$, and all content below that is likely to be numerical noise.

One way forward I could see: open the file in question as usual, read the scaling factor and the number of bits used to store the quantized data from the file metadata for each affected variable, and then add noise that is maybe one order of magnitude smaller than the $\frac{1}{s}$ threshold before determining the significant bits. A rough sketch of this is below.
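
As a rough illustration, assuming the packing increment $\frac{1}{s}$ has already been read from the file metadata (attribute names differ between GRIB decoders, so that step is left out, and the function name is mine):

import numpy as np

def add_subquantization_noise(da, step, seed=0):
    """Add noise one order of magnitude below the packing increment
    `step` (= 1/s) before running the bitinformation analysis;
    `da` can be an xarray.DataArray or a plain numpy array."""
    rng = np.random.default_rng(seed)
    noise = rng.uniform(-step / 10, step / 10, size=da.shape)
    return da + noise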

I hope I am not pointing out the obvious here; if so, sorry! I would be curious whether there are any gaping holes in my logic or whether the general approach may be worth pursuing further. Some unstructured experiments and sample code can be found here: https://gist.github.com/JoelJaeschke/5951314b62e112b0ccbe98fe75b8b997

milankl commented 2 months ago

No, it's absolutely possible to use information about the linear packing that was previously applied to set some bound on the precision, instead of assuming that a float32 has full float32 precision. But I think it points to the same problem: you need to use some information from before the decoding of the linear packing happened. That's why I thought, why not directly compute the information content from the quantized uint values? That'll give you, say, 14 bits of the uint16 having information, so you chop those last two off if you want to stay in quantised space, or, if you want to decode back to floats, you can translate that to keepbits. In this case the resolvable step is $4/s$, because truncating 2 of the 16 bits multiplies the smallest storable difference $1/s$ by $2^2$.
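
As a worked version of that last sentence (the bit counts are the ones from the example above; the scale factor is an assumed value):

n_total, keepbits_int = 16, 14              # uint16, of which 14 bits carry information
s = 1000.0                                  # assumed packing scale factor
abs_step = 2**(n_total - keepbits_int) / s  # resolvable step after truncation: 4/s = 0.004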

JoelJaeschke commented 2 months ago

What you are suggesting does make sense, and I have played around with this some more, using some ERA5 wind speed data. When running get_keepbits on the dequantized field, I get a value of zero keepbits and can see the artificial information in the tail of the mantissa bits. For the remaining steps, assume that $$x_\text{dequantized} = \text{scale}\cdot x_\text{quantized} + \text{offset}.$$

I don't know if you already have some idea in mind about translating the keepbits from quantized space back to dequantized space, so I thought about the following. The smallest change in dequantized space we can resolve is the scale parameter itself. So if we get $n_\text{quant}$ keepbits in quantized space, and assume $16$ bits for the storage, as you say, we can at best resolve $$2^{16-n_\text{quant}}\cdot\text{scale}$$ in dequantized space.

To translate the integer keepbits to float keepbits, I thought about the following. Assuming the full integer range is used for the quantized storage, i.e. the maximum dequantized value is $$\max(x_\text{dequantized})=32767\cdot \text{scale} + \text{offset},$$ we can derive the previous and next closest powers of two for the maximum dequantized number, $x_\text{dequan,p}$ and $x_\text{dequan,n}$. To derive the available float precision $p_\text{f}$, we can view the mantissa bits as an "interpolation" between the previous and next power of two of a floating-point number, so we get $$p_\text{f}=\frac{x_\text{dequan,n} - x_\text{dequan,p}}{2^{23}}$$ in the case of float32 at the maximum dequantized value. However, in reality $p_\text{f} \gg p_\text{r}$, where $p_\text{r}$ is the "real" precision necessary due to quantization. We can now solve for the number of keepbits in floating-point space, $n_\text{r}$, using $p_\text{r} \approx \text{scale}$, and get $$n_\text{r} \approx \left\lceil\log_2\left(\frac{x_\text{dequan,n} - x_\text{dequan,p}}{\text{scale}}\right)\right\rceil.$$ This way, the floating-point numbers have at least as much precision as the resolvable scale of the truncated quantized values.
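
Transcribed into code, that derivation might look like this (a sketch, assuming int16 packing with a positive maximum value and that scale and offset were already read from the metadata; the function name is illustrative):

import math

def float_keepbits(scale, offset, n_int_bits=16):
    # float32 mantissa keepbits such that the float spacing at the
    # maximum dequantized value is at most `scale`
    x_max = (2**(n_int_bits - 1) - 1) * scale + offset  # 32767 * scale + offset for int16
    x_prev = 2.0 ** math.floor(math.log2(x_max))        # previous power of two
    x_next = 2.0 * x_prev                               # next power of two
    return math.ceil(math.log2((x_next - x_prev) / scale))

For example, float_keepbits(scale=0.001, offset=0.0) gives 15 mantissa keepbits (x_max = 32.767, bracketed by the powers of two 32 and 64).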

I have not really given any thought to the error bounds; however, one thing that came to mind is that the error will likely vary in dequantized space, due to the inhomogeneous distribution of floating-point numbers. Taking the example of wind speed again: near the surface, values may be on the order of $10~\text{m\,s}^{-1}$, while at high altitude they may be on the order of $100~\text{m\,s}^{-1}$. Considering float precision again and setting $n_\text{r}=10$ for the example, near the surface we may get a precision of $$\frac{16-8}{2^{10}}=0.0078125,$$ whereas at high altitudes we get $$\frac{128-64}{2^{10}}=0.0625,$$ so almost an order of magnitude less precision after the decimal point. As the relative error remains about the same, which I suppose is what most people are interested in anyway, this should be fine I guess? From some experimentation, errors near the maximum values seemed to be at most around half of the resolvable scale in dequantized space.

Again, I hope I am not stating the obvious here, but do you see issues with this?