milankl / BitInformation.jl

Information between bits and bytes.

Smallest chunk based on statistics of random information #45

Open milankl opened 1 year ago

milankl commented 1 year ago

@rsignell-usgs asked: How small could I make the chunks if I wanted to recalculate for every chunk of a dataset?

Answer based on the statistics of random information:

TL;DR: Use n > 1280 data points to calculate the keepbits of a 32-bit number format, so that there is only a small chance that the estimated bits of information are significantly higher than the true information.

Long story, definitions:

assumptions:

Then: if for a given bit position we use a 99% confidence level, i.e. accept a 1% chance that this bit introduces random information, then for 31 bits there's

julia> p = 1-(0.99^31)
0.26769663034560265

a p = 0.27..., i.e. a ~27% chance that at least one of the 31 bit positions introduces random information. The acceptable expected random information is q = 0.001 in total, so q = p*h, with h being the entropy a given bit position may add as random information:

julia> q = 0.001;
julia> h = q/p
0.0037355718624809604

This h is reached once there are about n = 1280 data points:

julia> using BitInformation
julia> confidence = 0.99;
julia> n = 1280;
julia> BitInformation.binom_free_entropy(n,confidence)
0.0037423512353189636
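
To make the "about n = 1280" step reproducible without the package, here is a minimal sketch that searches for the smallest n whose free entropy drops to h or below. It assumes that the free entropy is computed from the normal approximation of the binomial distribution at 99% confidence; `binary_entropy`, `free_entropy` and `smallest_n` are illustrative names and not part of BitInformation.jl, and the z-score is hard-coded to avoid extra dependencies.

```julia
# Illustrative re-implementation under stated assumptions, not the package source.

# Entropy in bits of a binary variable with probabilities (p, 1-p).
binary_entropy(p) = (p <= 0 || p >= 1) ? 0.0 : -(p*log2(p) + (1-p)*log2(1-p))

# Free entropy for n samples: take the largest probability p that is still
# compatible with a fair coin at the given confidence (normal approximation),
# then return 1 - H(p). The default z is the z-score of the standard normal
# for a two-sided 99% confidence level.
function free_entropy(n::Integer; z::Real=2.5758293035489)
    p = min(1.0, 0.5 + z / (2*sqrt(n)))
    return 1 - binary_entropy(p)
end

# Smallest n whose free entropy does not exceed the threshold h.
function smallest_n(h::Real; nmax::Integer=10^7)
    for n in 1:nmax
        free_entropy(n) <= h && return n
    end
    return nothing   # threshold not reached within nmax
end

p = 1 - 0.99^31      # chance that at least one of 31 bits adds random information
q = 0.001            # acceptable expected random information (from the text)
h = q / p            # allowed free entropy per bit position

n_min = smallest_n(h)
println("h = ", h, ", smallest n = ", n_min)
```

Under these assumptions `free_entropy(1280)` should agree with the value quoted above to several digits, and the search should land slightly above n = 1280, consistent with the n > 1280 rule of thumb.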

@observingClouds this ☝🏼 is also relevant for our ambitions to make keepbits spatio-temporally variable.

rsignell-usgs commented 1 year ago

Well, this reviewer buys the argument! So if my chunks are dimensioned [144,175,175], that qualifies, right? Awesome work @milankl !
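
As a quick check of the chunk question above: the number of data points per chunk is the product of the chunk dimensions, which can be compared against the n > 1280 threshold from the TL;DR. A minimal sketch, with illustrative variable names:

```julia
chunk_dims = (144, 175, 175)    # chunk shape from the comment above
n_threshold = 1280              # threshold from the TL;DR

n_chunk = prod(chunk_dims)      # number of data points in one chunk
println(n_chunk)                # 4410000
println(n_chunk > n_threshold)  # true
```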