observingClouds / xbitinfo

Python wrapper of BitInformation.jl to easily compress xarray datasets based on their information content
https://xbitinfo.readthedocs.io
MIT License

Thoughts about the mutual information threshold #259

Open thodson-usgs opened 8 months ago

thodson-usgs commented 8 months ago

Running some tests with the BitInfo codec, and it seemed like a good time to revisit whether it's better to trim mutual_information by an "arbitrary" threshold or to use a free-entropy threshold. The former appears to give decent results and better compression but might be losing some real information. I wanted to open the issue before submitting a PR because I assume others have dug more deeply into this.

Here's the code:

import xarray as xr
from numcodecs import Blosc, BitInfo

# Tutorial dataset: 4x-daily air temperature over North America (time, lat, lon).
ds = xr.tutorial.open_dataset("air_temperature")

# Zstd compression preceded by a BitInfo filter that rounds away bits
# carrying less than 99% of the information content.
compressor = Blosc(cname="zstd", clevel=3)
filters = [BitInfo(info_level=0.99)]

encoding = {"air": {"compressor": compressor, "filters": filters}}

ds.to_zarr("codec.zarr", mode="w", encoding=encoding)

By default, ds.to_zarr will chunk this dataset into 730x7x27 blocks for compression.
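
If the chunking matters for the comparison, the chunk shape can be pinned explicitly instead of relying on that default. A minimal sketch, continuing from the snippet above and assuming xarray's "chunks" key in the Zarr encoding; the chunk shape here just mirrors the default mentioned above:

from numcodecs import Blosc, BitInfo

compressor = Blosc(cname="zstd", clevel=3)
filters = [BitInfo(info_level=0.99)]

# Pin the chunk shape explicitly rather than relying on to_zarr's default.
encoding = {
    "air": {
        "compressor": compressor,
        "filters": filters,
        "chunks": (730, 7, 27),  # (time, lat, lon), mirroring the default above
    }
}
ds.to_zarr("codec_chunked.zarr", mode="w", encoding=encoding)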

Here are the results:

Compression                                    Size
None                                           17 MB
Zstd                                           5.3 MB
Zstd + BitInfo (default tol w/ factor = 1.1)   1.2 MB
Zstd + BitInfo (free entropy tol)              2.8 MB
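
For reference, the sizes above can be reproduced by summing the file sizes inside each Zarr store. This small helper is my own sketch, not part of the issue; the store path is a placeholder:

import os

def store_size_mb(path):
    # Sum the size of every chunk and metadata file under the Zarr store directory.
    total = 0
    for root, _, files in os.walk(path):
        total += sum(os.path.getsize(os.path.join(root, f)) for f in files)
    return total / 1e6

print(f"codec.zarr: {store_size_mb('codec.zarr'):.1f} MB")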

(An additional half-baked thought: What about using a convolution to compute info content pixel-by-pixel rather than chunk-by-chunk?)
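
A rough sketch of a coarser, block-by-block variant of that idea (not truly pixel-by-pixel), using xbitinfo's documented get_bitinformation and get_keepbits; the band width of 10 latitudes and the 0.99 information level are arbitrary choices of mine:

import xarray as xr
import xbitinfo as xb

ds = xr.tutorial.open_dataset("air_temperature")

# Estimate keepbits per coarse latitude band instead of once for the whole field.
keepbits_per_band = []
for lat0 in range(0, ds.sizes["lat"], 10):
    band = ds.isel(lat=slice(lat0, lat0 + 10))
    info = xb.get_bitinformation(band, dim="lon")  # information content along longitude
    keepbits_per_band.append(xb.get_keepbits(info, 0.99))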

thodson-usgs commented 8 months ago

Ah, maybe aspects of this are resolved by #234

milankl commented 7 months ago

An additional half-baked thought: What about using a convolution to compute info content pixel-by-pixel rather than chunk-by-chunk?

You can technically do that, but it just gets really expensive. I agree it's a nice academic exercise to see how smooth a continuous field of keepbits would become, but in any practical sense I imagine you'd need to read in and throw out so much memory that I doubt you'd get anywhere near a reasonable compression speed.

In the original paper I experimented with calculating the bitinformation in various directions (lon first, lat first, time first, ensemble dimension first), but I generally found the information in longitude to be effectively an upper bound on the information in the other dimensions. Meaning that if you used the information in the vertical to cut down on the false information, then you'd be ignoring additional information that you have in the longitude dimension. This obviously depends heavily on the resolution you have in the various directions, e.g. a high temporal resolution may carry more information than a coarsely resolved longitude.
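
To illustrate that direction comparison on the tutorial dataset from above, here is a minimal sketch using xbitinfo's get_bitinformation and get_keepbits; the 0.99 information level is an assumption of mine:

import xarray as xr
import xbitinfo as xb

ds = xr.tutorial.open_dataset("air_temperature")

# Keepbits implied by the bitwise information content along each dimension.
for dim in ("lon", "lat", "time"):
    info = xb.get_bitinformation(ds, dim=dim)
    keepbits = xb.get_keepbits(info, 0.99)
    print(dim, keepbits["air"].values)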

But overall I found that in practice you'd want to compute the bitinformation contiguously in memory; that way you can do it in a single pass, and (at least in BitInformation.jl) it reaches about 100 MB/s, which is a reasonable speed people can work with. If it drops to 1-10 MB/s I see limits in any practical big-data application.

Technically you are statistically predicting the state of a bit given some predictor. For the longitude dimension that predictor is the same bit position in the previous grid point in longitude. However, you could use any predictor you like, including any bit anywhere in the dataset. But for practical purposes I found that you'd want the resulting joint probability matrix to be of size 2x2, because for anything larger you'd need to count so many bit-pair combinations that it easily gets out of hand, and I've seen hardly any evidence that this improves anything.
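
To make the 2x2 joint-probability picture concrete, here is a small NumPy sketch (my own illustration, not BitInformation.jl's implementation) that takes one bit position, counts its four bit-pair combinations between adjacent grid points along an axis, and computes the mutual information from the resulting 2x2 matrix:

import numpy as np

def bit_mutual_information(a, bitpos, axis=-1):
    # Mutual information (in bits) of one bit position between adjacent
    # values of float32 array `a` along `axis`, from the 2x2 joint probabilities.
    bits = (a.view(np.uint32) >> np.uint32(bitpos)) & np.uint32(1)
    x = np.moveaxis(bits, axis, -1)[..., :-1].ravel()  # bit in grid point i
    y = np.moveaxis(bits, axis, -1)[..., 1:].ravel()   # same bit in grid point i+1
    joint = np.zeros((2, 2))
    for i in (0, 1):
        for j in (0, 1):
            joint[i, j] = np.mean((x == i) & (y == j))
    px, py = joint.sum(axis=1), joint.sum(axis=0)       # marginal probabilities
    mask = joint > 0
    return np.sum(joint[mask] * np.log2(joint[mask] / np.outer(px, py)[mask]))

# Example: bit position 16 (a mantissa bit of float32) along the longitude axis.
field = np.random.rand(25, 53).astype(np.float32)
print(bit_mutual_information(field, bitpos=16, axis=1))

For random data the result is close to zero; for a real geophysical field the leading mantissa bits show substantial mutual information between neighbouring grid points, which is what the keepbits selection exploits.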

thodson-usgs commented 7 months ago

Those are good insights. I'll look for lower-hanging fruit.

Regarding my initial test: I altered the threshold to 1.1 based on performance with my dataset. Later I realized the problem wasn't the threshold; rather, the data were quantized.