Low number of `keepbits` using `get_keepbits` in `quick-start.ipynb`

observingClouds / xbitinfo

Python wrapper of BitInformation.jl to easily compress xarray datasets based on their information content

https://xbitinfo.readthedocs.io

MIT License

54 stars 22 forks source link

Low number of `keepbits` using `get_keepbits` in `quick-start.ipynb` #162

Open coxipi opened 2 years ago

coxipi commented 2 years ago

xbitinfo version: 0.0.3
Python version: 3.10.4
Operating System: Linux (Pop!_OS 22.04 LTS)

Description

I am trying to verify if the library identifies a reasonable number of bits to keep. I am working with the quick_start.ipynb notebook.

I plotted ds and ds_rounded side-by-side for variable u (I fixed month and level to their first value). I also checked variable v and get similar issues. I expected that by looking at the plots, I would not be able to tell which dataset had been rounded. Instead, I can clearly see which dataset has been rounded.

What I Did

Plotting both original and bitrounded u in the notebook quick_start.ipynb (as taken on current main branch)

import matplotlib.pyplot as plt
var = "u"
fig, (ax1,ax2) = plt.subplots(1,2,dpi=200)
ax1.set_title("uncompressed")
ax2.set_title(f"99% information (keepbits = {keepbits[var][0]})")
ax1.imshow(ds[var].isel(month=0,level=0))
ax2.imshow(ds_bitrounded[var].isel(month=0,level=0))

u_compression

More info

I'm pretty sure this is related to function get_keepbits and _cdf_from_info_per_bit. Prior to keeping only 99% of bit info, there is preliminary cleaning step where only bits that respect: bit info > 1.5*max(last 4 bit info) are kept. This process is shown in the following figure: u_bitinformation

The comment in the _cdf_from_info_per_bit function explaining the cleaning step is # set below rounding error from last digit to zero. Does this mean that the large values of bit info for the last bits are due to this rounding error? The idea behind the procedure is that one should not trust bit info which are of comparable size to these last bit info?

I could agree with that, but then, how would we interpret the first comparison figure I show? The original plot seems more physical than the rounded one. If we were to trust get_keepbits, should we conclude we should not trust the data too much (beyond the level of precision of keepbits)?

Less strict condition

Say that we can still trust mantissa bits with small information content even though the last few bits have suspiciously large information. Could the criteria to trust or not bit information be given with respect to bit information variation rather than its amplitude? We expect that bit information should generally decrease as we look further down the chain (higher mantissa bit number), correct? With this in mind, we could simply identify the bit with minimum information (or the bit with 4th minimal information, to mimic your condition) and drop higher mantissa bits (because the information on the right of this bit will increase, which is the behaviour we consider as problematic). Something more refined, e.g. cutting mantissa bits once a sequence of 3 increasing bit info is observed, could also be considered.

aaronspring commented 2 years ago

you are referring to the visible contour lines in the first plot I guess

we should add such a plot and a difference plot to the quick start

are you suggesting to rather just set the last 4 bits to 0 instead of all bits lower that the max 4 last bits?

@milankl on discussing the criterion

coxipi commented 2 years ago

Yes, the visible contour lines.
Sounds good
Yes, the idea I had for a new method could set the last 4 bits to 0. In the case of my first comment above, it would.

The general idea I had is to identify in some way the bit(s) with minimal information, and use this as a reference to say if the other bits have an important amount of information or not. The method could for instance be: 1) Find the 4 consecutive bits that collectively have the least information. 2) Set the following mantissa bits info to 0, i.e. that would include the last 4 bits in the case shown above. If the last 4 bits generally don't have the minimal information, then yes, the last 4 bits would generally be cut. 3) Use your technique with the 'cleaned' bits (use the 1.5*max(4 bits info of step 1))

This is working with the assumption that bit info in the mantissa bits should decrease more or less in a monotone way (I will read more on this). So if the bit info start increasing again, we could just discard this. But this might lead to other problems, there are cases where the bit information remains at an appreciable level for all mantissa bits. In general, I see many behaviours for how the bit info evolves along the bit chain. Setting a hard cutoff on mantissa bits beyond the minimum might be risky. Maybe the approach I describe in the next comment (that I found in a notebook by @milankl) is better suited in general.

coxipi commented 2 years ago

In the notebook a_tutorial.ipynb accompanying your paper @milankl, there was this method:

keepbits_var = argmax(cumsum(bitinf_var)/sum(bitinf_var) .>= 0.99) - 9    # subtract 9 to count mantissa bits

(for "var" in {T,q,CO2}). Perhaps you also have something like get_keepbits in your library, but I didn't find it.

Using this method in the case I described above, we don't reach the 99% info threshold until the last mantissa bit because of the large information content in the last bits.

This method is safer, it will only cut bits in situations where it is clear the bit information content is negligible compared to the rest. I have seen many different behaviours for bitinformation plots depending on the variables, so I'm not sure the method I proposed in my previous comment can be applied systematically.

Here is an example for precipitations. I used the BitInformation.jl in this case to get bit information (along with the method to get keepbits I show in this comment). A large number of bits is kept (keepbits==21). Even if I don't count the last bit, I still get keepbit==19, still a large number. This a case where the minimum of bit information among mantissa bits happens for 'm6' mantissa bits, and there is slight increasing of the bit information after that.

milankl commented 2 years ago

Unfortunately, your bitwise information has some artificial information included, which becomes evident in the last mantissa bits:

1) u (or v)

I believe this data was somehow lossily compressed before. For example, any kind of quantisation (include the offset+scale options that some netcdfs use) can introduce an artifical mutual informaiton in the last bits which isn't actually physical. I therefore always advise to use untreated Float32/Float64 data to do the bitwise information content analysis. The algorithm itself doesn't know about binary encodings and just treats every bit position independently. Assume you don't have access to a high precision version of the data: what can you do? 1) Given that your data spans both plus and minus and is also smaller and bigger than 1 (because you have non-zero information in the sign and in the first exponent bit) I would recommend trying to use a signed exponent instead (signed_exponent!(::Array) in BitInformation.jl does that for you). 2) Just visually inspecting from the plot, m4 should be kept as well (its information is 0.1 and hence significantly larger than 0, BitInformation.jl does a significance filtering for you, @aaronspring xbitinfo switches that on by default, right?). Having one more keepbit should also remove the visible contourf line.

So yes, you can just set the 4 last mantissa bits to zero and repeat the 99% threshold calculation -- the 1.5*max idea is more applicable when you have information that never really drops to zero, which we'll come to now:

2) precipitation

Precipitation is interesting as it is usually a variable with many zeros but then some hotspots. Many exact zeros means that there is some information even in the last mantissa bits because there is a mutual information going from one grid point to the next. (If it doesn't rain in one grid point it's likely that it doesn't rain in the next either). However, it's not really important to have a comparably high number of keepbits (your analysis says 19 or 21) because rounding doesn't affect the zeros. This is the case where I'd try the c*max(inf in last mantissa bits) because while there is some (presumably real) information in m5 to say m22 it's not really information you are intersted in, and I agree with you m4 looks like a good threshold here. As m23 increases slightly again, I'm wondering whether you are dealing again with data that was previously lossily compressed, please check that.

In general: You clearly found two examples of variables/data where the interpretation of the bitwise information content is tricky and not always straight forward. That's definitely one of the bigger caveats with this method. I believe the algorithm is so naive that it's reliable, but how you count the 99% (or any other <100% threshold) especially when the data was previously compressed isn't always straight forwardad and I haven't found a reliable translation between the bitinformation distribution and # of keepbits that works in less ideal cases too. So thanks for joining the discussion here and I hope my comments are helpful to better understand what's going on.

aaronspring commented 2 years ago

2) Just visually inspecting from the plot, m4 should be kept as well (its information is 0.1 and hence significantly larger than 0, BitInformation.jl does a significance filtering for you, @aaronspring xbitinfo switches that on by default, right?).

Exactly. set_zero_insignificant (bool): defaults to True (julia implementation) https://xbitinfo.readthedocs.io/en/latest/api/xbitinfo.xbitinfo.get_bitinformation.html#xbitinfo.xbitinfo.get_bitinformation

aaronspring commented 2 years ago

My rule of thumb that nearly always works: Take 99% and then two additional bits to be on the safe side

milankl commented 2 years ago

My rule of thumb that nearly always works: Take 99% and then two additional bits to be on the safe side

Haha, sure, but for @coxipi's precipitation example +2 seems like an overkill, it should be more like -14 😉

coxipi commented 2 years ago

Haha, that's true, but better be safe than sorry, thanks for the tip!

It's a very nice topic theoretically with exciting benefits for data management, so I'm glad I can join the discussion! I'll be working on this in the next few months for Ouranos , we will be testing this on our datasets. I'll keep you posted.

Thank you for your detailed explanations @milankl, I understand better limitations and what to look for.

By cutting the last 4 bits in the first example I gave, I get 'm4' or 'm5' (depending on how I exactly I formulate the new condition), so that matches your visual intuition of needing at least one more bit.

As for the questions regarding compression:

The dataset eraint_uvz is loaded from the . I could not find information on the compression procedure, but if I had to guess, I would say it's probably compressed lossily indeed, it's meant to be used in tutorials, probably just to discover Xarray features.
The dataset was taken from an Ouranos server, it's a modification of .
- hourly -> daily, i.e. sum over time
- total precipitation (tp) -> precipitation rate (pr) (m -> kg m-2 s-1), i.e. multiply by a factor
- Compression to zarr

I applied the first two steps on data that I directly download from ERA5. I look at 2005-01-01 for 60.0 < lat < 72.25 and -129.9 < lon < -117.65. This is a subset of what I showed earlier, but it's sufficient to see the same kind of problem. This new dataset should be equivalent to what I retrieve from our server on Ouranos. I show the bit information here:

comparison_plot

The green line is the bit information obtained from the dataset where I applied the first two steps above on the ERA5 dataset, and the red line is the analog dataset's bit information from Ouranos. The bit information are similar, but you can see the red line (Ouranos) is more noisy.

I will have to look more into the compression we use when storing the data. Maybe there are other manipulations I have overlooked. There are noticeable differences in some spots (some spots where one dataset is exactly 0 and not the other, and vice-versa).

The workaround you discussed for poorer quality data could come in handy, I'll keep this in mind. But ideally, we should try to avoid getting there, so I'll make sure I can use the data earlier in the pipeline!

observingClouds commented 2 years ago

@coxipi please keep in mind that ERA5 data is originally written in the GRIB format and even if you select to download the netCDF4 files from e.g. the Copernicus Climate Data Store, the files are just converted from the GRIB files, which are already lossy. As a consequence, the calculation of the mutual information does not provide useful results.

observingClouds commented 2 years ago

Just as an example:

the variables are saved as int16 (with offset and scaling factor)
the history in the downloaded dataset shows that the netCDF-file is just a converted GRIB file

coxipi commented 2 years ago

Ah, got it, thanks!

The example dataset eraint_uvz also has int16 variables. You could use dTdx and dTdy from air_temperature_gradient instead which are float32 (there might be other issues I'm not aware of with this dataset, but setting decode_cf=False as you did for era5 data, I get float32). Using your functions in xbitinfo as-is, with your additional +2 rule of thumb. The +2 rule here seems important it allows to keep mantissa bits with high bit information relatively to other mantissa bits. For dTdy, without the +2, we would see the difference between original and 99% info.

# Compressing dTdx and dTdy
import xbitinfo as xb
import xarray as xr
import matplotlib.pyplot as plt

ds = xr.tutorial.load_dataset("air_temperature_gradient")
bi = xb.get_bitinformation(ds, dim="lon", set_zero_insignificant=True)
kb = xb.get_keepbits(bi, 0.99)
ds_br = xb.xr_bitround(ds, kb+2)

fig = plt.figure(figsize=(14,3))
varss = ["dTdx", "dTdy"]
for var in varss :
    bi[var].plot(marker="o")
plt.legend(varss)
plt.ylabel("bit info")

fig = plt.figure(figsize=(14,3))
for var in varss :
    bi[var].isel(bit32=slice(9,None)).plot(marker="o")
plt.legend(varss)
plt.ylabel("bit info (only Mantissa)")

ti = 0
for var in varss:
    fig, (ax1,ax2) = plt.subplots(1,2,dpi=100)
    ax1.set_title(f"Original {var}")
    ax1.imshow(ds[var].isel(time=ti))
    ax2.set_title(f"99% info (keepbits = {kb[var][0]+2})")
    ax2.imshow(ds_br[var].isel(time=ti))

rsignell-usgs commented 1 year ago

Just discovered this. I love the discussion, but keepbits+2 is not very satisfying because it seems one of the benefits of this method was the quantitative, defensible approach, right?

@coxipi, did you make any more progress on this topic?

coxipi commented 1 year ago

@rsignell-usgs I didn't focus on this aspect lately (still looking at data with lossy compression somewhere in the pipeline as discussed above).

The approach "inspecting all bits and keep 99% info" is surely safer. If the profile of bit information doesn't have close to zero info in the trailing bits, then this method will give more strict rounding (higher keepbit). Still, I think the method here in xbitinfo has the right idea, and finding a way to get better compression is important.

An alternative approach could be to inspect the bit information profile to find where information starts stabilizing in the trailing bits. This would cover the case of variables where bit information converges to a finite value (zero or non-zero). Some tests could be added to raise a flag if this constant value is suspiciously too high.

Even then, it would still leave some probably arbitrary values to determine, like: what is a considered a "small variation" of info that allows to define the plateau. It could be a certain % of the total bit information. Also, once you define a point where the plateau starts, you might find yourself in the same position as before, wanting to add one or two bits to keepbits just to be sure. Here, the argument is geometric from the onset, so maybe it would be more acceptable in this case.

milankl commented 1 year ago

Just a note on artificial information (what we call it in the paper):

it can plateau, as you say, in which case you may want to define some reasonably high threshold and set all information below it to zero
but it can also increase (>0.1bit) in the last mantissa bits in which case you probably don't want to use the plateau approach because you then may also remove some small, but real, information in the exponent/first mantissa bits.
first thing I would always do is to check whether you can get hold of the same data set without any previous lossy compression: Quantization in another format than the one you are analysing the bitinformation in (e.g. linear packing vs float32) can cause artificial information.

It's an open problem how to distinguish between artificial and real information without prior knowledge. So if you come up with any ideas, feel free to communicate them.