milankl / BitInformation.jl

Information between bits and bytes.
MIT License

use of bitinformation(dim) #31

Closed aaronspring closed 2 years ago

aaronspring commented 2 years ago

I don't quite understand the dim argument in bitinformation and its implications. Can I just ignore it and use the default dim=1?

Judging by https://github.com/milankl/BitInformation.jl/blob/05bd9ef447fa926a85b514162b51bc0c06afa083/test/information.jl#L37-L39 it seems that dim only matters for sorted dimensions, i.e. dim doesn't matter on raw data.

Your example plots in https://doi.org/10.24433/CO.8682392.v1 use dim=1, meaning longitude. I have data along the dimensions longitude, latitude and time, and intuitively I would run the analysis along time.

milankl commented 2 years ago

That test is indeed confusing. As the array A is not sorted, every entry is independent of the next, hence all those tests just check that the information is zero.

julia> using BitInformation
julia> A = rand(Float32,30,40,50);
julia> bi1 = bitinformation(A,dim=1);
julia> bi2 = bitinformation(A,dim=2);
julia> bi3 = bitinformation(A,dim=3);
julia> hcat(bi1,bi2,bi3)
32×3 Matrix{Float64}:
 0.0  0.0  0.0
 0.0  0.0  0.0
 0.0  0.0  0.0
 0.0  0.0  0.0
 0.0  0.0  0.0
 0.0  0.0  0.0
 0.0  0.0  0.0
 0.0  0.0  0.0
 0.0  0.0  0.0
 0.0  0.0  0.0
 ⋮   

However, if you sort the array along a given dimension, you artificially introduce some information, which is highest in that dimension:

julia> sort!(A,dims=1);
julia> bi1 = bitinformation(A,dim=1);
julia> bi2 = bitinformation(A,dim=2);
julia> bi3 = bitinformation(A,dim=3);
julia> hcat(bi1,bi2,bi3)
32×3 Matrix{Float64}:
 0.0          0.0          0.0
 0.0          0.0          0.0
 0.0          0.0          0.0
 0.0          0.0          0.0
 0.0          0.0          0.0
 0.0067747    0.00508132   0.00538892
 0.292094     0.182393     0.187531
 0.550684     0.265361     0.271625
 0.371526     0.114251     0.118072
 0.237596     0.0441321    0.0441709
 ⋮                         
 0.0          0.0          9.3149e-5
 0.0          0.0          0.0
 0.0          0.0          0.0
 0.0          0.0          0.0
 0.0          0.0          0.000280003
 0.000749177  0.000946589  0.000850585
 0.00515332   0.00430802   0.00508684
 0.0233246    0.0177343    0.0185884
 0.061388     0.0458432    0.0484664

bi1 will have the highest information in the exponent/mantissa bits, but sorting along one dimension also influences the others (with less information, though). The information in the last mantissa bits is due to the poor sampling of rand (see the randfloat function in JuliaRandom/RandomNumbers.jl as an alternative).
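To make "information along a dimension" more concrete, here is a minimal sketch in plain Julia. It assumes that the information of a bit position is the mutual information between that bit in neighbouring elements along dim. The function name adjacent_bit_mutual_information is hypothetical, and this is an illustration, not the package's implementation. In particular, the exact zeros in the output above suggest that bitinformation also filters out statistically insignificant values, which this sketch does not do, so for unsorted random data it returns small non-zero numbers.

```julia
# Mutual information (in bits) of bit position `b` (1 = sign bit of a Float32)
# between neighbouring elements of `A` along dimension `dim`.
# Hypothetical helper for illustration; not part of BitInformation.jl.
function adjacent_bit_mutual_information(A::AbstractArray{Float32}, b::Integer; dim::Integer=1)
    shift = 32 - b                       # right shift that moves bit b to the lowest position
    counts = zeros(Int, 2, 2)            # joint histogram: (bit in element, bit in its neighbour)
    for i in 1:size(A, dim)-1
        left = selectdim(A, dim, i)      # slice at index i along dim
        right = selectdim(A, dim, i + 1) # neighbouring slice at index i+1
        for (x, y) in zip(left, right)
            r = (reinterpret(UInt32, x) >> shift) & 0x1
            s = (reinterpret(UInt32, y) >> shift) & 0x1
            counts[r + 1, s + 1] += 1
        end
    end
    p = counts ./ sum(counts)            # joint probabilities
    pr = sum(p, dims=2)                  # marginal of the first bit
    ps = sum(p, dims=1)                  # marginal of the neighbouring bit
    M = 0.0
    for r in 1:2, s in 1:2
        # skip empty cells, which contribute nothing to the sum
        p[r, s] > 0 && (M += p[r, s] * log2(p[r, s] / (pr[r] * ps[s])))
    end
    return M
end

# Usage: one exponent bit (bit 8) of a sorted random array, along each dimension
A = rand(Float32, 30, 40, 50)
sort!(A, dims=1)
info_bit8 = [adjacent_bit_mutual_information(A, 8; dim=d) for d in 1:3]
```

Under these assumptions the value for the sorted dimension should typically be the largest of the three, matching the pattern in the table above.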

milankl commented 2 years ago

> I have data along dimensions longitude, latitude and time and somehow intuitively would run the analysis along time.

You can run the analysis along any dimension you like. You can also add up the information from several dimensions. The first dimension is usually just the default because that's also how the data is laid out in memory/on disk. The information can differ between dimensions, depending on the resolution. Check the supplement of our paper for some examples.

aaronspring commented 2 years ago

Is it also possible to run bitinformation on all dimensions, and does that make sense?

milankl commented 2 years ago

Yes, that's the same as running it along each dimension separately and averaging the information. Because this is an arithmetic mean, if the information is high in one dimension but low in another, you may end up cutting off too many bits for the high-information dimension. So what I often did was just use longitude alone. A rule of thumb from our data is that information is highest in longitude/time, then latitude, then vertical, then ensemble. But that obviously depends on the spatio-temporal resolution...
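The effect of the arithmetic mean can be illustrated with made-up numbers. The sketch below assumes that the number of bits to keep is chosen as the smallest number of leading bits that preserves a given fraction of the total information. That rule, the 90% threshold, the helper bits_to_keep and the per-bit values are all hypothetical and only serve to show the mechanism.

```julia
# Hypothetical helper: smallest number of leading bits whose summed
# information reaches `fraction` of the total information.
function bits_to_keep(info::AbstractVector{<:Real}, fraction::Real)
    total = sum(info)
    return findfirst(>=(fraction * total), cumsum(info))
end

# Made-up per-bit information for two dimensions (not real data):
# dimension 1 carries information far into the trailing bits,
# dimension 2 only in the first two bits.
info_dim1 = [0.5, 0.5, 0.4, 0.3, 0.2, 0.1]
info_dim2 = [1.0, 1.0, 0.0, 0.0, 0.0, 0.0]
info_mean = (info_dim1 .+ info_dim2) ./ 2   # arithmetic mean over both dimensions

keep_dim1 = bits_to_keep(info_dim1, 0.9)    # 5
keep_dim2 = bits_to_keep(info_dim2, 0.9)    # 2
keep_mean = bits_to_keep(info_mean, 0.9)    # 4
```

With these numbers the averaged information suggests keeping 4 bits, one fewer than dimension 1 alone would need, so information along dimension 1 would be lost.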

aaronspring commented 2 years ago

thank you