@rsignell-usgs asked: How small could I make the chunks if I wanted to recalculate for every chunk of a dataset?
Answer based on the statistics of random information:
TL;DR: Use n > 1280 data points to calculate the keepbits of a 32-bit number format, so that there is only a small chance that the estimated bitwise information is significantly higher than the true real information.
Long story, definitions:
random information is the information that is created because, by chance, a finite bitstream of length n can have non-zero entropy, as the occurrences of 0 and 1 bits aren’t exactly n/2 and n/2. In other words: there’s a non-zero chance that 10 coin flips produce more or fewer than 5 tails. Every deviation from a 50/50 split of tails/heads creates entropy, such that random information leads to a higher estimate of the true real bitwise information. Think of it as real information + random information = entropy, where the random information is the (positive-only) error bar on the real information.
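To make this concrete, here’s a minimal sketch in plain Julia (my own illustration, not BitInformation.jl’s exact information measure): a bit position that truly carries no information, i.e. a fair coin, still appears to carry 1 - H(p̂) > 0 bits of information when its frequency p̂ is estimated from only n samples.

entropy2(p) = (p == 0 || p == 1) ? 0.0 : -p*log2(p) - (1-p)*log2(1-p)  # binary entropy in bits

n = 100                                  # small sample size → more random information
bits = rand(Bool, n)                     # fair coin flips, true information content = 0
phat = count(bits) / n                   # empirical frequency of 1s, close to but not exactly 0.5
random_information = 1 - entropy2(phat)  # > 0 purely by chance, shrinks as n grows

Averaged over many repetitions this spurious information only vanishes as n → ∞, which is why a minimum number of data points per chunk is needed.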
assumptions:
32-bit number format with 1 bit containing 1 bit of information, which is 100% of the true real information. (This is a worst-case assumption: 1 bit of information spread across several bits would reduce the number of bits that can introduce random information, and so does the relative contribution from other bit positions if there is already more than 1 bit of real information.)
We don’t want the expected random information to add more than q = 0.001, i.e. 0.1% of information (one order of magnitude below the 1% of information we already discard with the 99% threshold used for keepbits).
Then:
If, for a given bit, we allow a 1% chance (i.e. a 99% confidence level) that it introduces random information, then across 31 bits there’s
julia> p = 1-(0.99^31)
0.26769663034560265
a p = 0.27..., i.e. a ~27% chance that at least one bit position introduces random information. The acceptable expected random information is q = 0.001 in total, so q = p*h with h being the entropy a given bit position can add as random information:
julia> h = q/p
0.0037355718624809604
This h is reached once there are at least about n = 1280 data points:
julia> confidence = 0.99
julia> n = 1280
julia> BitInformation.binom_free_entropy(n,confidence)
0.0037423512353189636
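Putting the pieces together, here’s a sketch of how one could search for the minimal chunk size directly (it reuses BitInformation.binom_free_entropy as above; the search range and variable names are just for illustration):

using BitInformation

q = 0.001                      # acceptable expected random information in total
p = 1 - 0.99^31                # ≈ 0.27, chance that at least one of 31 bits is affected
h = q / p                      # ≈ 0.0037, acceptable random entropy per bit position
confidence = 0.99

candidates = 100:10_000        # search range for the number of data points n
i = findfirst(n -> BitInformation.binom_free_entropy(n, confidence) <= h, candidates)
nmin = candidates[i]           # ≈ 1280 data points per chunk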
@observingClouds this ☝🏼 is also relevant for our ambitions to make keepbits spatio-temporally variable.