milankl / BitInformation.jl

Information between bits and bytes.
MIT License
30 stars 3 forks source link

Bitinformation of masked data #29

Open milankl opened 2 years ago

milankl commented 2 years ago

As requested by @aaronspring in #25, we need to extend the bitinformation function with a method for masked arrays. This is an issue to discuss a potential implementation of this and to discuss progress. The methods we would need are

function bitinformation(A::AbstractArray{NF},    # input array A
                        mask::BitArray,          # mask of A
                        kwargs...) where {T<:Union{Integer,AbstractFloat}}

Then I further suggest separating out the permutation for information along dimension dim,

function roll_dim_forward(A::AbstractArray,dim::Int)

which is basically this bit https://github.com/milankl/BitInformation.jl/blob/fa2dd50171a48565ef1dc5a3c7268c19ffe844f1/src/mutual_information.jl#L66-L73

and moving the zeroing of insignificant information into its own function, like so

function set_insignificant_zero!(H::Vector,nelements::Int,confidence::Real)

which should basically include these lines https://github.com/milankl/BitInformation.jl/blob/fa2dd50171a48565ef1dc5a3c7268c19ffe844f1/src/mutual_information.jl#L48-L54

milankl commented 2 years ago

This issues will be addressed in #30

milankl commented 2 years ago

I've added more tests in #30 and merged the PR as all of them pass. I'd still love to see the impact if you could share your results @aaronspring at some point. Many thanks

milankl commented 2 years ago

Originally posted by @aaronspring in https://github.com/milankl/BitInformation.jl/issues/30#issuecomment-1080539133

@milankl I find differences whether using mask or not, specially the first bits and within 99-100% information:

dim: 4 time image from https://gist.github.com/aaronspring/5de0bc6be5a8d547f3503ff8b1aef8c6

dim: 1 x image from https://gist.github.com/aaronspring/7b0675e36467e5c647f2fe3f546d4bf5

dim analysis: image from https://gist.github.com/aaronspring/52662dd885eebfd6ce4b88939be016c6

milankl commented 2 years ago

Thanks Aaron, that looks awesome. Good to see that all these information patches in the exponent bits disappear! For dissicos could you convert your arrays to signed exponent bits? It looks like there's a bunch of exponent bits that simultaneously flips over (which happens when your data covers a range across floating-point 2)

julia> A = rand(Float32,3,3)
3×3 Matrix{Float32}:
 0.408687   0.922863  0.0622634
 0.838222   0.947521  0.141393
 0.0944574  0.991588  0.576514

julia> signed_exponent(A)
3×3 Matrix{Float32}:
 13.078    7.3829   127.515
  6.70577  7.58017   18.0983
 48.3622   7.93271    4.61211

It looks wrong, because Julia will interpret the exponent bits still as being biased (the information that the exponent bits are now to be interpreted differently is not stored), but you can always check that nothing went wrong by applying the inverse biased_exponent. If you use signed_exponent as a preprocessing step, note that also your mask will change

julia> signed_exponent([-9e-33])
1-element Vector{Float64}:
 -4.739053125085073e32

Originally posted by @milankl in https://github.com/milankl/BitInformation.jl/issues/30#issuecomment-1080575806

aaronspring commented 2 years ago

as you say, signed_exponent added makes the black bar in dissicos disappear

by default 2022-03-28 at 15 21 26
milankl commented 2 years ago

Perfect, that looks really good to me. Now I'd keep these mantissa bits and then redo any analysis that you'd otherwise do with the full precision data set. Feel free to post any of that here.

We haven't addressed the interpolation problem due to the ocean-atmosphere coupling, I'd be curious whether this has any consquences.

aaronspring commented 2 years ago

could you point me to an example how to save a NetCDF inside julia with one of your compressions? I only find sizeof but no writing to disk

milankl commented 2 years ago

Use NetCDF.jl, but switch lossless compression on, like here. If you then pass on arrays with rounded mantissas via round(::Array,keepbits::Int) from BitInformation.jl you'll get it accordingly compressed. Alternatively, use nco like

ncks --baa=8 --ppc varname=3 file.nc file_compressed.nc

where baa=8 is bitrounding, varname=3 means 3 mantissa bits to keep, varname is the name of the nc variables, then input and output file. See this thread too https://github.com/nco/nco/issues/250#issuecomment-1005268313

milankl commented 2 years ago

Branch #main now contains bitinformation(::AbstractArray{T};masked_value::T) where T so that you should be able to do directly

julia> using BitInformation

julia> A = rand(Float32,3,3);

julia> round!(A,1)
3×3 Matrix{Float32}:
 0.5   0.1875    0.25
 1.0   0.046875  0.0625
 0.25  0.5       0.75

julia> bitinformation(A,masked_value=0.5f0)

note that masked_value has to be of the same eltype as the array. That means if eltype(A)=Float32 then masked_value=0.5 will throw an error. However, I've used === to create the mask, so once #34 is merged you can even have NaNs in your data and simply use bitinformation(A,masked_value=NaN) (or NaN32) although == comparison for NaN always yields false 😉

milankl commented 2 years ago

v0.5 got merged into the Julia registry now. Will leave this issue open to eventually pick some of the posts here for the documentation.

aaronspring commented 2 years ago

Now I'd keep these mantissa bits and then redo any analysis that you'd otherwise do with the full precision data set. Feel free to post any of that here.

now with the masking enabled, I am quite happy with the bitrounding results: compared mean and std on monthly and annualized output as well as variance-weighted mean period as my variability analysis. https://gist.github.com/aaronspring/383dbbfe31baa4618c5b0dbef4f6d574

milankl commented 2 years ago

https://gist.github.com/aaronspring/383dbbfe31baa4618c5b0dbef4f6d574

Many thanks for sharing this. Makes me very happy to see that (once we've sorted out the masked array issue, which I wanted to do for a long time anyway!) we basically have another example where the bitwise real information content-approach just works. Sure you are left with the choice of 99% ... to 100% of information, but I believe that this will always be the case and in general dependent on your storage requirements. E.g. is 6x compression enough or will 10x be a game changer (because you can store more ensemble members, a higher temporal resolution, etc)? Is it data that should be kept for archiving purposes but won't be touched much?

If you have any other remarks, suggestions, concerns etc. feel free to reach out here!

aaronspring commented 2 years ago

Is it data that should be kept for archiving purposes but won't be touched much?

I just need to archive some data and probably wont touch it again.

But I think the potential for bitrounding is enormous. I will soon try it on high spatial resolution data. I expect much higher compression gains there. I am going to pitch my results in our ocean group meeting in early may.

milankl commented 2 years ago

But I think the potential for bitrounding is enormous.

I agree 😄 Especially once netcdf supports more lossless codecs, I believe zstd is soon-ish going to be supported (discussion here https://github.com/Unidata/netcdf-c/discussions/2227) which likely yields even higher compression factors in the "archiving won't touch much" case, or higher performance in the "we'll read this data every day" case.