milankl / BitInformation.jl

Information between bits and bytes.
MIT License

Where is the best place to discuss usage/interpretation/best practices? #25

Open rsignell-usgs opened 2 years ago

rsignell-usgs commented 2 years ago

@milankl , where is the best place to discuss usage/interpretation/best practices with usage of these tools?

For example, I've got two questions applying these techniques to a new CONUS WRF run. I've got a CONUS404 Jupyter notebook where I started with your tutorial and wrote a little loop to go through all the variables:

[screenshot of the notebook output: bitwise information plots for each variable]

I'm wondering what it means that some variables don't have continuous distributions in the cell [8] figure. Some are continuous like V, but others like SST look pretty strange. Is this expected for some variables, or does it mean I did something wrong or didn't use enough data?

I only used one 2D field here, so I was thinking of making a more robust assessment by selecting 100 fields or 1000 fields through the year and seeing what the histogram of keepbits for each variable looks like. And then picking something on the conservative side to apply. Does that sound reasonable?

milankl commented 2 years ago

where is the best place to discuss usage/interpretation/best practices with usage of these tools?

Here 😄

I'm wondering what it means that some variables don't have continuous distributions

Can you reproduce the figure with a signed exponent for the floating-point representation? I.e. in your notebook:

vals = var.second[:,:]
signed_exponent!(vals)    # add this line
a = bitinformation(vals,dims=1)

While vals will look wrong (the values won't make sense because Julia doesn't know that the exponent bits are now supposed to be interpreted differently; this information is not explicitly stored), bitinformation will work as before. The reason is that in the biased exponent of floats many bits often flip together, e.g. when crossing 2f0 between 1.9f0 and 2.1f0:

julia> bitstring.([1.9f0,2.1f0],:split)
2-element Vector{String}:
 "0 01111111 11100110011001100110011"
 "0 10000000 00001100110011001100110"

which is prevented with

julia> bitstring.(signed_exponent([1.9f0,2.1f0]),:split)
2-element Vector{String}:
 "0 00000000 11100110011001100110011"
 "0 00000001 00001100110011001100110"

For more information see here

milankl commented 2 years ago

Regarding SST, and other data that include some form of missing values: this has given me a lot of headache in the past, because you don't want to calculate the mutual information across boundaries. Ideally, one would use some form of space-filling curve to avoid the missing values in the data, but that's a bigger project. Technically, it should also be possible in the future to pass a masked array to bitinformation, but I haven't decided yet how to best implement that. So for the time being, could you use a large chunk of SST without any land points?
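
For example, something like this minimal sketch (the array name sst and the index ranges are hypothetical placeholders for an ocean-only rectangle in your data):

sst_chunk = sst[200:400, 100:250]    # hypothetical indices of a rectangle without land points
bitinformation(sst_chunk, dims=1)    # bitwise real information along the first dimension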

rsignell-usgs commented 2 years ago

@milankl , ah that makes sense! I hadn't thought about the missing data in some fields -- SOILMOISTURE is missing over the ocean, SST is missing over the land, etc. Okay, I will revamp my notebook to select regions for variables that don't have missing values.

Do you have thoughts on how to best calculate the keepbits for each variable when you have 40 years of simulation?

I only used a single time step of model output for the analysis above, so I was thinking of making a more robust assessment by running BitInformation on increasing numbers of time steps chosen randomly from the simulation until I get a stable distribution of the keepbits value for each variable. Then I'd pick the mode or the highest value from the distribution to use as the keepbits value for compression.

Or maybe it's smarter to average all the bitinformation from multiple time steps before determining the keepbits?

Or is there a different recommended approach?

milankl commented 2 years ago

Yes, you can do that. However, it might be overkill. I found the bitwise information fairly robust in general, so I'd probably start by looking at one time step at the beginning of the 40 years and one at the end. Then maybe also check how the information changes in the different dimensions. We did this analysis for temperature in Fig. S10.

This depends strongly on the resolution in each dimension, but I usually found longitude to have the highest correlation. Let me know if yours differs.
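
A minimal sketch of that check (assuming a 3D Float32 array var with time as the last dimension; the names are placeholders):

ic_first = bitinformation(var[:,:,1],   dims=1)   # one time step at the beginning
ic_last  = bitinformation(var[:,:,end], dims=1)   # one at the end
# if the two information profiles (and the keepbits derived from them) agree,
# a single time step is probably representative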

rkouznetsov commented 2 years ago

@rsignell-usgs The bitinformation method is applicable for relative precision trimming and relies on two assumptions that unfortunately were not put upfront in the paper (@milankl please correct me if I am wrong):

1. the signal must have purely multiplicative noise (for quantities of a small dynamic range an additive component still behaves as multiplicative);
2. the statistical properties of the signal must be uniform for the whole field.

The former assumption is violated for winds, and the second one for SST with missing values. Both are violated for the cloud quantities from the paper, causing an absolutely unrealistic estimate of 1 mantissa bit per value.

As a practical solution for winds, temperatures and a few other quantities I use trimming that keeps absolute precision. Please let me know if you need any further info on that or on lossy compression of WRF data (we did it for a four-year hourly dataset).

milankl commented 2 years ago

The bitinformation method is applicable for relative precision trimming [...] the signal must have purely multiplicative noise

This is correct when used with floats, but that is due to floats and not due to the method itself. You can use other bitencodings like linear quantisation and calculate the bitinformation based on those. This will bound the max absolute error, and therefore using only n bits (as obtained from the bitinformation analysis) corresponds to an absolute precision trimming. I've done that once for surface specific humidity.


(I can't remember the compression factors, but they are presumably similar, given that the lossless compression will compress away the unused exponent bits in the Float32 encoding).

for quantities of a small dynamic range an additive component still behaves as multiplicative

I've always struggled to find a geophysical variable with additive noise but a wide dynamic range. I assumed that this is due to the scale-invariance of the Navier-Stokes equations, such that any error would scale as well. You mention winds, but why do you consider them to have additive noise? Thanks for the input, I think fundamental answers here are very important to take away some of the subjectivity that has always gone into choices of data compression!

temperatures and a few other quantities I use trimming that keeps absolute precision

Yeah, for temperature in Kelvin chopping off mantissa bits preserves an absolute precision as long as all your values are in [256K,512K), because floats are linearly distributed within a power of two. For the same number of mantissa bits, values in [128K,256K) will have twice the absolute precision compared to [256K,512K), etc.
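
For illustration, the absolute spacing of Float32 values doubles at each power of two:

julia> eps(300f0), eps(150f0)   # spacing of Float32 around 300K vs 150K
(3.0517578f-5, 1.5258789f-5)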

the statistical properties of the signal must be uniform for the whole field

Yes, but note that what non-uniform statistical properties mean for the bitwise information content (given a bitwise encoding!) is not trivial. For example a change in the mean (e.g. going from ˚C to K) will flag different mantissa bits with information when using floats, but will not have an impact when using linear quantisation as bitencoding. On the other hand, a scaling (e.g. going from micro to milli-some unit) will not affect the bitwise information in float encoding (and is bitwise reproducible if the scaling is a power of two). In that sense, a change in variance might not classify as non-uniform statistical property. But hey, really understanding what these mean is very important and I'm glad that people ask these questions, because I see them as very essential going forward!

SST with missing values

Yes, I haven't yet implemented a bitpair count (i.e. counting 00, 01, 10 and 11 in adjacent grid points) that allows for a mask, so that bitpairs aren't counted across boundaries. Technically that is possible.

However, @rkouznetsov, on the discussion of non-uniformity, note that nothing stops one from preserving different mantissa bits for, say, different regions of a 2D field, or different vertical levels, in the round+lossless framework. While this opens up a whole new door that I didn't want to address yet, it's no problem to choose n mantissa bits for one region and m mantissa bits for another. This is something that's not possible without major hassle with linear/logarithmic quantisation or other algorithms like Zfp.
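
A minimal sketch of what that could look like (hypothetical 2D array A and index ranges, using the round(A, keepbits) method that also appears further down in this thread):

A[:, 1:100]   .= round(A[:, 1:100], 5)     # keep 5 mantissa bits in one region
A[:, 101:end] .= round(A[:, 101:end], 3)   # keep only 3 mantissa bits in another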

rsignell-usgs commented 2 years ago

@rkouznetsov and @milankl thanks for taking the time to help inform me and the community about how this all works. I think what got me really excited was the possibility to achieve much more compression in a scientifically defensible way, and this discussion seems key.

@rkouznetsov I absolutely would like to see/hear more about what you did with WRF output. Is it okay to discuss this here @milankl , or should we start a new issue?

milankl commented 2 years ago

Happy to keep this open. If there's more on a given topic that risks other points being lost here, feel free to branch off into another issue.

rkouznetsov commented 2 years ago

@rsignell-usgs, @milankl I am not sure if "O3 mixing ratio field keeping 99% of information" (or should one keep 99.9%?) is any more defensible than 'accurate within 1/128 or 0.0625ppb'. At least I know how to interpret the latter for any specific application. Known error bars are better than potentially smaller but unknown ones. Making the precision-truncation separately for levels/subdomains would make the error quantification even more complicated.

@rsignell-usgs, WRF processing is indeed off-topic here, but I see no better place to put the reply. Here is what I use for WRF archiving for SILAM use. It gets a bit more than x12 compression, though with the omission of a few variables which you might wish to keep. All uncertainties are well quantified and are rather conservative according to an expert* (i.e. my :) opinion/experience. You might wish to adjust it here and there though:

#!/bin/bash
infile=$1
outfile=$2

list3d='U,V,T,QVAPOR,QCLOUD,QRAIN,QICE,QSNOW,QGRAUP,CLDFRA'
prc3d="--ppc U,V,T=.3 --ppc QVAPOR,QCLOUD,QRAIN,QICE,QSNOW,QGRAUP=.7 --ppc CLDFRA=.3"
list2d='Q2,T2,PSFC,U10,V10,TSLB,SMOIS,SEAICE,SNOW,SNOWH,SSTSK,LAI,TSK,RAINC,RAINNC,SWDOWN,SWDNB,ALBEDO,UST,PBLH,HFX,QFX,ACHFX,SST'
prc2d='--ppc U10,V10,T2,SMOIS,SEAICE,SSTSK,LAI,TSK,RAINC,RAINNC,ALBEDO,SST,TSLB=.2 --ppc Q2,PBLH,UST=2'
gridvars="SINALPHA,COSALPHA,MAPFAC_MX,MAPFAC_MY,XLONG,XLAT,XLONG_U,XLAT_U,XLONG_V,XLAT_V,MAPFAC_UX,MAPFAC_UY,MAPFAC_VX,MAPFAC_VY"
othervars='XLON.*,XLAT.*,Times,ZNU,ZNW,ZS,DZS,HFX_FORCE,LH_FORCE,TSK_FORCE,HFX_FORCE_TEND,LH_FORCE_TEND,TSK_FORCE_TEND,FNM,FNP,RDNW,RDN,DNW,DN,CFN,CFN1,THIS_IS_AN_IDEAL_RUN,RDX,RDY,RESM,ZETATOP,CF1,CF2,CF3,ITIMESTEP,XTIME,P_TOP,T00,P00,TLP,TISO,TLP_STRAT,P_STRAT,SAVE_TOPO_FROM_REAL,ISEEDARR_SPPT,ISEEDARR_SKEBS,ISEEDARR_RAND_PERTURB,ISEEDARRAY_SPP_CONV,ISEEDARRAY_SPP_PBL,ISEEDARRAY_SPP_LSM,C1H,C2H,C1F,C2F,C3H,C4H,C3F,C4F'

ncks -4 -L5 --baa=5 -v $gridvars,$othervars,$list3d,$list2d $prc3d $prc2d --cnk_dmn bottom_top,1 $infile $outfile

--baa=5 ensures that "bit rounding" is used; the default "bit adjustment algorithm" of NCO has changed a few times during the last 5 years.

Any feedback on this set of precisions would be welcome (personally, or publicly).

* Expert: a person who shows slides. :)

milankl commented 2 years ago

I am not sure if "O3 mixing ratio field keeping 99% of information" (or should one keep 99.9%?) is any more defensible than 'accurate within 1/128 or 0.0625ppb'. At least I know how to interpret the latter for any specific application. Known error bars are better than potentially smaller but unknown ones.

You can always communicate the error (abs/rel) as you prefer. The 99% level is just guidance, hopefully applicable to a wide range of variables, which then translates to a relative error (for floats/log-quantisation) or an absolute error (linear quantisation, or floats within a power of two). You can communicate that explicitly as 'accurate within 1/128', but that information is also stored implicitly if all but the first 8 mantissa bits are zero.

Making the precision-truncation separately for levels/subdomains would make the error quantification even more complicated.

Yes, that's why I'm afraid of using it. But then you agree that the issues around non-uniform statistics of a field aren't actually a drawback of the method, but rather reflect underlying properties of the variable?

rkouznetsov commented 2 years ago

I've always struggled to find a geophysical variable with additive noise but a wide dynamic range.

I would not say so for physical quantities, but I would argue that the majority of model quantities have both additive and multiplicative errors.

Consider e.g. clwc from a model that has two modules, cloud evaporation-condensation (microphysics) and spectral advection (transport), and uses process splitting. The microphysics module (I assume it is coded properly) would set clwc to zero everywhere where rh < 100%. Advection would create some numerical noise over the whole domain due to the finite precision of a Fourier transform. If the output is taken after the microphysics step, the domain will be full of zeros except for some small areas with smooth fields; the bitinformation ("real information") of such a field will be quite high. If the output is taken after the transport step, the field will have a lot of entropy in all bits on average and a low bitinformation.

Note that in both cases it is the same model, and both ways of output are perfectly valid and match observations equally well.

You mention winds, but why do you consider them to have an additive noise?

The difference between 0.01 m/s and 0.0001 m/s (a factor of 100) is negligible, while 10 m/s and 15 m/s differ quite substantially. Therefore, the additive model is more adequate here, if one has to choose. There seems to be a false contradiction between floating point with multiplicative noise/error and linear quantization with additive noise/error. Indeed, floating-point data can have both noise components quite naturally, and both should be specified unless one of them is negligible. By 'accurate within 1/128 or 0.0625 ppb' I meant that the error of each value V is max(V/128, 0.0625 ppb). With such a model one can make an error specification invariant of units and/or scaling. Also, by trimming both absolute and relative precision one can get better compression ratios than with either of them alone.

So I consider bitinformation an interesting and useful tool to study datasets, but it should not be used as the sole grounds to decide on the number of bits to store. Uncertainties and artifacts originating from the model that produces the data, and the sensitivity of the model/analysis downstream, should be considered when deciding on lossy-compression parameters.


rsignell-usgs commented 2 years ago

So I consider bitinformation an interesting and useful tool to study datasets, but it should not be used as the sole grounds to decide on the number of bits to store. Uncertainties and artifacts originating from the model that produces the data, and the sensitivity of the model/analysis downstream, should be considered when deciding on lossy-compression parameters.

@milankl , assuming you agree with this statement, this seems like something very important for readers of the Nature paper and potential users of the BitInformation approach for compression to understand, right? One of the exciting ideas of the paper was a completely quantitative method for determining the true information content and the number of bits to store. If that's not exactly true, it would be great to know the best practices/caveats for using the approach (or do I just need to read the paper more carefully?).

milankl commented 2 years ago

In principle I agree with the argument that one should consider different uncertainty quantifications if possible. But it's rarely straightforward what the uncertainty of a dataset is. Sure, one may think of a measurement device with some lab-quantified error. In that case, true, I would just discard precision according to that measurement error. But with any model simulation data there are so many errors that contribute to the uncertainty, and then it's also a question with respect to what the uncertainty should be quantified against. E.g. I may change the discretization error by increasing the resolution; that'll give me an error estimate, but it won't tell me whether the error that comes from the cloud physics scheme is larger. Similarly, in an ensemble forecast the uncertainty increases over time. Eventually the uncertainty is so large that, rounding the values to match the uncertainty, we are just left with climatology. This is because the ensemble forecast cannot provide any more information than the climatology at such long lead times. Similarly, calculating the bitwise information across the ensemble dimension will round away everything but the climatology (because that's what the bits across the ensemble dimension still have in common). However, if you want to investigate simulated hurricanes then you should keep many of those bits, and the uncertainty relevant to your use case is another one. Calculating the bitwise information in the ensemble dimension and assuming that's the overall uncertainty of the data set is clearly not a good idea.

The bitwise information analysis cannot tell you a precision that is necessary to retain for every use case. So, yes, if possible include other uncertainty quantifications. But, and this is the issue that we, and presumably many others, often face: it is near impossible to obtain such uncertainty quantifications for hundreds of variables, each with respect to hundreds of different scientific analyses. However, many use cases will find the mutual information of bits in adjacent grid points (as it's defined in the paper) essential to preserve; hence we propose this as a new perspective on the problem. In our datasets, unifying the different precisions required for many variables into a single "preserve 99% of real information" works well, and it is the first approach that I know of that tries to generalise precision for gridded data into a single information-based concept.

milankl commented 2 years ago

So on the note of best practices: maybe "real information" is understood by many as "the only true precision one should care about", but this is not the case, nor can it be for any lossy compression method. In fact, "real information" is calculated with respect to something (as discussed further up in this thread), and one is free to change the predictor in the mutual information to account for that.

rsignell-usgs commented 2 years ago

Unifying the different precisions required for many variables into a single "preserve 99% of real information" works well and is the first approach that I know of that tries to generalise precision for gridded data into a single information-based concept.

@milankl, thanks for that pep talk. And I like the notes on best practices. I'm going to revisit the BitInformation calculations we made for our 40-year WRF simulation for the US and report back here with issues!

aaronspring commented 2 years ago

Thanks for the discussion here. We simply want to apply BitRound to my netcdf files, so the files stay compatible with all tools.

We (@observingClouds and I) tried BitRound on monthly climate model output from the MPI-ESM-LR model, on the biogeochemical output streams. I get a compression factor of 7-9 when preserving 99% of information, see https://gist.github.com/aaronspring/32f46291232f1a3145103978a9108459.

However, for fgco2 (the CO2 flux between ocean and atmosphere, with values from roughly -1e-9 to +1e-9) the demo notebook recommends keeping only 1 bit, which leads to 20% relative errors, so I keep 5 more bits (just guessing). It worked out of the box for spco2 (surface ocean CO2 levels) and atmco2 (surface atmospheric CO2 levels).


(sorry for the plot, I'm a first-time Julia user who accidentally locked himself into python2.7 pyplot)


[figure: relative difference up to 20% between raw and rounded values when keeping only 1 bit (instead of 6) for fgco2]

milankl commented 2 years ago

Hi @aaronspring many thanks for sharing this analysis!

1) Could you specify how your data is structured? I see that you are using some less regular grid (is that tripolar?) and also have continents. For the bitinformation algorithm it's essential that adjacent array entries are also adjacent in physical space. For a grid like the one you are showing, you'd ideally use some space-filling curve to arrange the data in an array. While in theory it is possible to adapt the algorithm to masked values, I'm still debating how to do this best and as of v0.4 it's not implemented.

2) Could you share a map of a single time step of fgco2 just so that we can get an idea of the data distribution? Point being: while you say 20% is not acceptable, how do you know that your error is smaller than that? Looking at the bitwise real information I'd assume that adjacent grid points of fgco2 rarely lie within a factor of 2, hence the question why you believe that the uncertainty in your dataset is much lower than the variability you have going from one grid point to the next. Roughly speaking, this is how we define the bitwise information. Please keep 5 mantissa bits if that's better for your use cases, but the idea of the information analysis is that you don't have to make such a subjective choice. Hence the encouragement to find an objective reason why 5 mantissa bits are important.

3) Note that due to floating-point numbers and round-to-nearest, preserving 1 mantissa bit bounds the relative error to 25%: with 1 mantissa bit, one representable number is at most a factor of 1.5 away from the next, and due to round-to-nearest the error is at most half of that. For 5 mantissa bits the relative error is bounded by 2^-6, i.e. ~1.5%.
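
These bounds are quick to check empirically; a sketch using BitInformation.jl's round(A, keepbits) as in the snippets further down (the random test data is just an example):

julia> using BitInformation
julia> x = 1000f0*rand(Float32, 100_000) .+ 1f0;   # positive test data, no zeros
julia> maximum(abs.(round(x, 1) .- x) ./ x)        # ≤ 2^-2 = 25% with 1 mantissa bit
julia> maximum(abs.(round(x, 5) .- x) ./ x)        # ≤ 2^-6 ≈ 1.5% with 5 mantissa bits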

aaronspring commented 2 years ago

The data I analyze is the first and the last year of a 50-year simulation with monthly output. Here's one variable to download. The grid is curvilinear, i.e. the coordinate longitude depends on x and y. But this doesn't mean the data is irregular: adjacent array entries are also adjacent in physical space (I can plot the data with xarray using only x and y, not lon or lat, which is what you see in my plots). The grid has two poles (Greenland and Antarctica).

dimensions:
    time = UNLIMITED ; // (24 currently)
    x = 256 ;
    y = 220 ;
    vertices = 4 ;
    depth = 1 ;
    depth_2 = 1 ;
    depth_3 = 1 ;
variables:
    double time(time) ;
        time:standard_name = "time" ;
        time:units = "hours since 1850-01-31 23:15:00" ;
        time:calendar = "proleptic_gregorian" ;
        time:axis = "T" ;
    double lon(y, x) ;
        lon:standard_name = "longitude" ;
        lon:long_name = "longitude" ;
        lon:units = "degrees_east" ;
        lon:_CoordinateAxisType = "Lon" ;
        lon:bounds = "lon_bnds" ;
    float atmco2(time, depth, y, x) ;
        atmco2:standard_name = "atmospheric_co2" ;
        atmco2:long_name = "Atmospheric CO2" ;
        atmco2:units = "ppm" ;
        atmco2:code = 61 ;
        atmco2:coordinates = "lat lon" ;
        atmco2:_FillValue = -9.e+33f ;
        atmco2:missing_value = -9.e+33f ;

I have a static land-sea mask. When I drop all NaNs in preprocessing, the error is drastically reduced, see https://gist.github.com/aaronspring/c3d646729c9dfb441ced22d580913116 (also for original and rounded individual timestamp maps). However, with the stacking I am not 100% sure whether adjacent array entries are still adjacent in physical space. With xarray I drop NaNs and still keep four input dims:

import numpy as np
import xarray as xr

ds = xr.open_dataset(i)                  # i: input file path, o: output file path
ds = ds.stack(d=ds.dims)                 # flatten all dimensions into one
ds['d'] = np.arange(ds['d'].size)
ds = ds.dropna('d')                      # drop the masked (NaN) points
ds = ds.squeeze()
for n in range(4):                       # pad back to 4 dimensions expected downstream
    if len(ds.dims) < 4:
        ds = ds.expand_dims(f'extra_{n}')
assert len(ds.dims) == 4
ds.to_netcdf(o)

2.

Could you share a map of a single time step of fgco2 just so that we can get an idea of the data distribution?

see https://owncloud.gwdg.de/index.php/s/eRUA8rSTNjgd4Et for original netcdf file

Point being that while you say 20% is not acceptable, how do you know that your error is smaller than that?

New analysis on the input data without NaNs. I calculate the absolute error (diff('method')) and the relative error (smape) in https://gist.github.com/aaronspring/c3d646729c9dfb441ced22d580913116. Now going for 100% real information yields keepbits["fgco2"]=0; the relative error is well below 1%, which is at least much more acceptable compared to before. Well, what is acceptable? Dunno. (A brute-force method would do bitrounding, check whether the error is acceptable (below 1% or 0.1%), and if not keep one more bit and redo until the condition is met - but let's stick to BitInformation.jl 💯.)

  3. Ok. That's an easy heuristic.
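
For reference, a minimal sketch of the brute-force search described above (in Julia, using BitInformation.jl's round(A, keepbits); the function name and error handling are my own and not part of the package):

function keepbits_for_tolerance(x::Array{Float32}, tol::Real)
    for m in 1:23                              # search from 1 to all 23 Float32 mantissa bits
        xr = round(x, m)                       # bitround to m mantissa bits
        relerr = maximum(abs.(xr .- x) ./ max.(abs.(x), floatmin(Float32)))
        relerr <= tol && return m              # first m that meets the tolerance
    end
    return 23
end

keepbits_for_tolerance(fgco2, 0.01) would then return the smallest number of mantissa bits that keeps the maximum relative error below 1% (which, as noted in the next comment, you can also derive directly from the error bound).
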
milankl commented 2 years ago

Thanks for sharing the data, I see several problems with it:

[figure: map of fgco2 with the Fram Strait region marked by a yellow rectangle]

  1. Zooming into the Fram Strait. I don't know which interpolation you've used here, but for some reason the original grid leaves artifacts on the grid that's used for the netcdf variable. If you throw this array into bitinformation you'll get something weird as one original grid cell has been blown up to be represented with several grid cells on the new grid. Hence many grid cells on the new grid are bitwise identical (within the yellow rectangle)
julia> bitstring.(fgco2[118:120,16:25,1],:split)
3×10 Matrix{String}:
 "0 01100001 11110110001110000011100"  "0 01100001 11110110001110000011100"  …  "0 01100001 11110110001110000011100"
 "0 01100001 11110110001110000011100"  "0 01100001 11110110001110000011100"     "0 01100001 11110110001110000011100"
 "0 01100001 11110110001110000011100"  "0 01100001 11110110001110000011100"     "0 01100001 11110010100010001111001"

This creates an artificial mutual information, as is discussed in the NatCompSci paper. The solution would be to actually use the data on the original grid that's apparent in your plots.

  2. Whether you have NaNs or replace them with -9f33 as in the nc file you shared doesn't really matter. The bitinformation algorithm just cares about the bitwise representation of the floats; it does not interpret floats as floats but only as bits. The actual problem is that NaN/-9f33 is a constant bitpattern for many adjacent grid cells, introducing an artificial mutual information similar to point 1. Cutting out a land-free rectangle from fgco2 in the Southern Ocean between South Africa and Antarctica, fgco2[111:190,158:202,:], gives you 3 mantissa bits with information (which bounds the rel error to 6%). That'll give you compression factors of about 8 relative to uncompressed Float32:
julia> using Blosc, BitInformation
julia> Blosc.set_compressor("zlib")
julia> keepbits = 3
julia> sizeof(fgco2)/sizeof(compress(round(fgco2,keepbits)))
8.079668129161835

But again this has to be treated with caution given that I currently find it difficult to grasp what other effects your interpolation method has.

check whether the error is acceptable (below 1% or 0.1%)

I'm not sure I managed to get my point across: If you approach the compression of fgco2 with the prior knowledge that your error should not be higher than 1% then you don't need the bitinformation algorithm. You can directly infer that for a max 1% error you'll need 6 mantissa bits

julia> floor(-log2(0.01))
6.0

The idea of BitInformation.jl is that it can tell you how many bits you need in the case where you don't have a prior knowledge about an error or an uncertainty. That's why I ask again, how do you know that a 1% error is okay, but a 5% error is not?

Ok. That's an easy heuristic.

It's not a heuristic, it's a rigid error bound as long as you stay within the normal floating point numbers

julia> floatmin(Float32),floatmax(Float32)
(1.1754944f-38, 3.4028235f38)

Floats don't guarantee an absolute error bound in the same way, as their spacing increases towards ±infinity. But on a logarithmic axis they are approximately equidistant, which bounds the relative error.

aaronspring commented 2 years ago

Here's the curvilinear grid. At the equator it is similar to a regular grid, but especially the Fram Strait is different. I do not do any interpolation here. This is the native MPIOM grid, and while it is not regular, adjacent array entries are still adjacent in physical space.

The fgco2 CO2 flux figure you show has sharp edges because of a) the low resolution and b) how strongly fgco2 is affected by coastal upwelling (from orography and winds) and sea ice. (I couldn't exactly reproduce your plot, but it looks similar.)

That's why I ask again, how do you know that a 1% error is okay, but a 5% error is not?

I don't know and cannot answer. There is no objective metric I can use here. I just played around with the numbers on the metrics I am most familiar with and decided by 👁️.

I'm not sure I managed to get my point across: If you approach the compression of fgco2 with the prior knowledge that your error should not be higher than 1% then you don't need the bitinformation algorithm. You can directly infer that for a max 1% error you'll need 6 mantissa bits

Now, I get it.

The actual problem is that NaN/-9f33 are a constant bitpattern for many adjacent grid cells introducing an artificial mutual information similar to point 1.

And dropping NaNs from the array violates the rule "adjacent array entries are also adjacent in physical space" at the transitions. Could a graph help here? I'm not an expert on graphs, but can't they know their neighbors? Or is the problem how to deal with these dead ends (land)? How do you handle the poles in your atmospheric model, i.e. what are the neighbors of 90S or 90N?

milankl commented 2 years ago

I see that due to the convergence of the grid lines on Greenland, the plot you first shared has Greenland indeed entirely spread out over the top edge of the plot. Sorry, but this is not what I meant by interpolation.

[figure: zoom into the fgco2 field where grid lines are still visible]

Here you can still see some grid lines. I'm therefore postulating that, because these are fluxes between ocean and atmosphere, the atmosphere has a lower resolution than the ocean, hence the atmospheric model grid has to be interpolated onto the ocean grid, giving us these artifacts. I currently can't see an easy way to correct for this other than analysing the bitwise information for other regions instead. Otherwise, you'll get some artificial information.

Could a graph help here?

Not as BitInformation.jl currently implements the algorithm; the essential lines are these:

https://github.com/milankl/BitInformation.jl/blob/fa2dd50171a48565ef1dc5a3c7268c19ffe844f1/src/mutual_information.jl#L75-L82

From a given array A we create two views A1view, A2view, which have to be of the same size (shape in numpy language) such that for every i, A1view[i] and A2view[i] point at adjacent entries of A. Now I guess we would want to define a new method of bitinformation, something like

bitinformation(A::AbstractArray,mask::BitArray)

such that the user can pass a mask = BitArray(undef,size(A)...) with 0s for unmasked, 1s for masked (= land points in your case). Thinking out loud, one would probably then just loop over all elements in A and check whether they are masked, like so

for i in eachindex(A)[1:end-1]
    if ~(mask[i] | mask[i+1])           # if neither entry is masked
        bitcount_pair!(C,A[i],A[i+1])   # count all bitpairs and increase counter C
    end
end

which would walk along the first dimension. For other dimensions, the array and its mask would need to be transposed/permutedimed upfront as we already do anyway. This still would come with the problem that we'd walk across the domain edges (e.g. we would count bitpairs with A[end,1] and A[1,2] as they lie adjacent in memory). Maybe there's a neat way to solve that too?
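
One option (still just a sketch, reusing the hypothetical bitcount_pair!, mask and counter C from the loop above) would be to walk down each column separately, so that the last entry of one column is never paired with the first entry of the next:

for j in axes(A, 2)
    for i in 1:size(A, 1)-1
        if ~(mask[i,j] | mask[i+1,j])           # skip pairs that include a masked entry
            bitcount_pair!(C, A[i,j], A[i+1,j])
        end
    end
end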

Sorry, literally thinking out loud here. I can potentially implement that in two weeks as I think it would be useful for people. Feel free to use these code snippets here and create a PR though! You are more than welcome to contribute!

aaronspring commented 2 years ago

I'm therefore postulating that, because these are fluxes between ocean and atmosphere, the atmospheric has a lower resolution than the ocean hence the atmospheric model grid has to be interpolated on the ocean grid giving us these artifacts.

yes, that’s the case.

e.g. we would count bitpairs with A[end,1] and A[1,2] as they lie adjacent in memory)

Could chunking be a problem here, i.e. if the array is spatially chunked, adjacent grid points could be separated in memory (maybe that’s only on disk).

A PR sounds good. Happy to test it :)

aaronspring commented 2 years ago

I find it most practical to run bitinformation and then check from your results whether you need 0.99, 0.999, 0.9999, ... or 0.999999999≈1 of the information.

aaronspring commented 2 years ago

Wouldn't bitinformation be even more powerful for 3D variables, i.e. a variable X with dimensions x, y and depth (or level or height), when applying bitround based on bitinformation(X[:,:,i], dims=1) (along x) for each individual level/depth/height i?

milankl commented 2 years ago

Yes, nothing stops you from assessing the bitinformation for subsets of your data arrays, whether this is in the vertical or, say, you want to use a different number of keepbits for the tropics vs high/mid-latitudes. The beauty of the bitrounding is that it is indeed possible to have different precisions within the same array. However, I was always afraid to open up this door because there are just too many ways one could do it. I believe you'd need a good prior justification to do so, which, in turn, goes against the idea that you want the algorithm to tell you the required precision for some data, without particular knowledge about the data or its uncertainties. Hence, yes, it may be more powerful, but the analysis also gets vastly more complex.

Having said that, a variable like nitrogen dioxide concentration is subject to very different dynamics at different vertical levels (industry & ship tracks at the surface, turbulence and radiation higher up in the atmosphere), so if output is already stored on independent vertical levels, a vertically varying keepbits might be a good choice. However, imagine you have 500 variables to output on 100 levels each, that's a lot of keepbits compression parameters to preanalyse and store.
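
For what it's worth, a sketch of what a per-level analysis could look like (X is a hypothetical Float32 array with dimensions (x, y, level); the keepbits99 helper follows the paper's 99%-of-real-information idea of accumulating the information vector, but it is not a function of BitInformation.jl):

using BitInformation

function keepbits99(ic; threshold=0.99)
    # ic is the 32-element information vector for Float32 (1 sign + 8 exponent + 23 mantissa bits)
    lastbit = findfirst(cumsum(ic) .>= threshold*sum(ic))
    return max(lastbit - 9, 0)      # convert the bit position into the number of mantissa bits to keep
end

keepbits_per_level = [keepbits99(bitinformation(X[:,:,k], dims=1)) for k in axes(X, 3)]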

aaronspring commented 2 years ago

Usage suggestion: When the data has an oscillation (seasonality or daily cycle), the training period for bitinformation should include the extremes of that cycle and the threshold level .99 ... 0.999999 should probably be increased with respect to non-oscillatory data when the oscillation isn't averaged before the analysis.

milankl commented 2 years ago

Can you provide an example for why that's an issue that requires special treatment?

aaronspring commented 2 years ago

I was a bit too fast in my previous post: it's hard to explain subtle differences consistently across many variables. Seasonality increases the range of values, not necessarily the real bitwise information content.


Sure, I get different outcomes for the 99% and 100% cutoff bits.

Full and not conclusive analysis: https://gist.github.com/aaronspring/6d949097c7f06b6e7a10c161b60f4e50

milankl commented 2 years ago

No seasonality means you've used the same data as before, but subtracted a mean seasonal cycle? But with and without seasonal cycle this is the bitwise information in longitude?

Could you further clarify what conceptual model you are using to justify that the information with and without the seasonal cycle should be the same? It sounds as if you are saying information(anomaly) = information(anomaly + seasons) should hold, but that's not true on the bit level, even if you think of the seasons as being some constant. For example, using ˚C or K will shift the bitwise information content across bit positions. There's an example in the paper that illustrates that.


For a given autocorrelation in your data, adding or removing a constant will shift the information across bit positions. However, the compressibility does not change as much as you may think, though it's also not identical. You can think of it this way: moving information from the exponent bits into the mantissa bits makes the compression of the exponent bits easier, but the compression of the mantissa bits harder.

In your example I'm actually quite surprised/happy to see how robust the method is against the removal of a seasonal cycle (which supports the conceptual idea that the information of the seasonal cycle is negligible compared to the information of the anomalies). Given that you end up with a discrete number of keepbits in the analysis, having +-1 when using the de-seasonalised data could also be just part of the uncertainty (as in, repeat the analysis with an independent other sample of your data and you may get a somewhat different result).

The exception being that the dissicos keepbits for 100% preserved information jump from 12 to 19, I agree. However, note that this might also be just a consequence of uncertainty: information/entropy is by definition a non-negative quantity; in practice, this means that even for two fully independent variables X,Y the estimated mutual information between them is not zero, but somewhat larger. In order to filter out the non-significant information, bitinformation calls the set_zero_insignificant! function, which sets all information below some threshold to zero. That threshold depends on the number of elements in your data and a confidence level, which also means that there's a non-zero chance the tiny bit of information you are seeing between 99 and 100%, distributed across 11 mantissa bits (from 9 to 19 incl), is actually just insignificant. By subtracting the seasonal cycle you may have just lifted that tiny bit of information above the confidence threshold and therefore pushed the 12 keepbits to 19. That's another reason why I usually went for 99% information, which is much more robust. The last 1% of information is often spread across so many bits that it's not easy to estimate where to draw the line.

On a more practical note, you can run bitinformation(...,set_zero_insignificant=false) and then preserve 99.99% or so; maybe that gives you a more robust estimate of your keepbits for 100% information. The reason you wouldn't use 100% is that there's inevitably a tiny amount of insignificant but necessarily positive information in the very trailing bits, and preserving 99.99% excludes at least some of that.
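
Concretely, something like this (a sketch; A stands for your data array):

julia> ic = bitinformation(A, dims=1, set_zero_insignificant=false)

and then accumulate ic from the most significant bits until 99.99% of sum(ic) is reached.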

aaronspring commented 2 years ago

Short clarification: I compare 49 Januaries (no seasonality) against 49 consecutive months (seasonality). No anomalies are taken.

So the climate states and the data are different. Maybe I should also try the full dataset, raw and with the seasonal cycle removed.

Could you further clarify what conceptual model you are using to justify that the information with and without the seasonal cycle should be the same?

No, I don't think they should be the same.

That's another reason why I usually went for 99% information, which is much more robust. The last 1% of information is often spread across so many bits that it's not easy to estimate where to draw the line.

I like that justification for 99%, or at least for not going for 100% = 0.999999999.