milankl / BitInformation.jl

Information between bits and bytes.
MIT License
30 stars 3 forks source link

Discuss best practices for `xr.Dataset.to_netcdf()` #28

Open aaronspring opened 2 years ago

aaronspring commented 2 years ago

I want to use BitInformation and BitRound on NetCDF output. Anyone with experience with xarray?

What I do: see also https://github.com/milankl/BitInformation.jl/issues/25#issuecomment-1060889412

My routine:

  1. run plot_bitinformation.jl and save keepbits between 0 and 23 to a dict with json
  2. use code from https://github.com/zarr-developers/numcodecs/pull/299/files to apply BitRound for each variable
  3. ds_rounded.to_netcdf(path, encoding, engine='h5netcdf')

I use engine="h5netcdf" because I can use {"compression": True} in encoding. I do this because some compression is needed for BitRound to be effective on disk storage. (I have only very little knowledge of compression and netcdf4 basics.) But probably there is a better way. What I do not want is that the compression impacts the read performance too much, i.e. reading the compressed data should not take much more (subjective I know) time that compared to before (or is this the tradeoff anyways).

https://xarray.pydata.org/en/latest/generated/xarray.Dataset.to_netcdf.html

# https://github.com/zarr-developers/numcodecs/pull/299/files
import numpy as np
from numcodecs.abc import Codec
from numcodecs.compat import ensure_ndarray, ndarray_copy

class BitRound(Codec):
    codec_id = 'bitround'

    def __init__(self, keepbits: int):
        if (keepbits < 0) or (keepbits > 23):
            raise ValueError("keepbits must be between 0 and 23")
        self.keepbits = keepbits

    def encode(self, buf):
        if self.keepbits == 23:
            return buf
        # TODO: figure out if we need to make a copy
        # Currently this appears to be overwriting the input buffer
        # Is that the right behavior?
        a = ensure_ndarray(buf).view()
        assert a.dtype == np.float32
        b = a.view(dtype=np.int32)
        maskbits = 23 - self.keepbits
        mask = (0xFFFFFFFF >> maskbits) << maskbits
        half_quantum1 = (1 << (maskbits - 1)) - 1
        b += ((b >> maskbits) & 1) + half_quantum1
        b &= mask
        return b

    def decode(self, buf, out=None):
        data = ensure_ndarray(buf).view(np.float32)
        out = ndarray_copy(data, out)
        return out

def bitround(data, keepbits):
    codec = BitRound(keepbits=keepbits)
    data = data.copy()  # otherwise overwrites the input
    encoded = codec.encode(data)
    return codec.decode(encoded)

# my code below
import xarray as xr
import sys
print('Require three args: python BitRound_file.py ifile.nc keepbits.json ofile')
assert len(sys.argv) == 3+1
_, ifile, keepbits_json, ofile = sys.argv

import json
print("keepbits_json",keepbits_json)
keepbits_loopup = json.load(open(keepbits_json))

ds = xr.open_dataset(ifile)
ds_round = ds.copy()
for v in ds.data_vars:
    if v in keepbits_loopup.keys():
        keepbits = keepbits_loopup[v]
        # here corrections could be applied
        # some hard fixes maybe
        # if keepbits < 2: 
        #     keepbits = 5
        ds_round[v].values = bitround(ds[v].values,keepbits)

ds_round=ds_round.assign_coords(ds.coords)

encoding={v:{'compression':True} for v in ds.data_vars if v in keepbits_loopup.keys()}
print('BitRound_file.py writes ', ofile)
ds_round.to_netcdf(ofile, mode='w', engine='h5netcdf', encoding=encoding)

EDIT:

Result:


>>> ds_rounded[v].encoding
{'zlib': True, 'shuffle': False, 'complevel': 4, 'fletcher32': False, 'contiguous': False, 'chunksizes': (1, 5, 28, 64), 'source': '/work/bm1124/m300524/experiments/asp_esmControl_PMassim_1850_ATMsfn_over_2006/outdata/hamocc/asp_esmControl_PMassim_1850_ATMsfn_over_2006_hamocc_data_3d_ym_18590101_18591231_comp99.nc', 'original_shape': (1, 40, 220, 256), 'dtype': dtype('float32'), '_FillValue': nan, 'coordinates': 'lon lat'}
aaronspring commented 2 years ago

Some further interesting reading for me:

milankl commented 2 years ago

What I do not want is that the compression impacts the read performance too much, i.e. reading the compressed data should not take much more (subjective I know) time that compared to before (or is this the tradeoff anyways).

Many lossless compressors have a decompression speed that's approximately independent of the compression level. So read performance is rarely something to worry about with this method. E.g. image (from https://www.nature.com/articles/s43588-021-00156-2)

aaronspring commented 2 years ago

compress and save with xarray in xbitinfo: https://xbitinfo.readthedocs.io/en/latest/api.html#save-compressed