pangeo-data / rechunker

Disk-to-disk chunk transformation for chunked arrays.
https://rechunker.readthedocs.io/
MIT License

fill values are not preserved in rechunking. #132

Closed flamingbear closed 1 year ago

flamingbear commented 1 year ago

This is a follow-up to the previous report, #131.

While diffing source.zarr and target.zarr after running rechunker, I also noticed that the fill values are not preserved. The script below is essentially the same as the one in #131, with an additional call to consolidate_metadata.

If you run this script, you will see that the fill value in "foo/bar/.zarray" changes from "fill_value": 1.0 in the source store to "fill_value": null in the target store.

Thanks, Matt

import shutil

import zarr
from rechunker import rechunk


def run_create_input_store():
    # Start from a clean output directory.
    shutil.rmtree('testoutput/', ignore_errors=True)
    store = zarr.DirectoryStore('testoutput/source.zarr')
    root = zarr.group(store=store, overwrite=True)
    foo = root.create_group('foo')
    root.attrs['description'] = 'root description'
    foo.attrs['description'] = 'foo description'
    # The source array ends up with "fill_value": 1.0 in its .zarray.
    bar = foo.ones('bar', shape=(10, 10))
    bar[5, 5] = 3
    bar.attrs['description'] = 'foo description'
    zarr.consolidate_metadata(store)


def rechunkit():
    openstore = zarr.open_consolidated('testoutput/source.zarr')
    # Rechunk foo/bar to 5x5 chunks with a 1MB memory limit.
    array_plan = rechunk(openstore, {'foo/bar': (5, 5)},
                         '1MB',
                         'testoutput/target.zarr',
                         temp_store='testoutput/temp.zarr')
    array_plan.execute()
    zarr.consolidate_metadata('testoutput/target.zarr')


if __name__ == '__main__':
    run_create_input_store()
    rechunkit()
    print('Compare the .zmetadata files in both your source.zarr and target.zarr directories')
    print('You will see that the "fill_value" in the source is 1.0 and it is null in the target.')
    # Print the fill value of the same array in each store.
    source = zarr.open('testoutput/source.zarr')
    target = zarr.open('testoutput/target.zarr')
    print(source['foo']['bar'].fill_value)
    print(target['foo']['bar'].fill_value)
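The same comparison can be made without zarr by reading the two .zarray files directly. This is a minimal sketch that assumes the Zarr v2 directory layout produced by the script above, where each array directory holds a JSON .zarray file. The helper names read_fill_value and compare_fill_values are hypothetical and are not part of zarr or rechunker.

```python
import json
from pathlib import Path


def read_fill_value(store_path, array_path):
    """Return the fill_value recorded in an array's .zarray file (Zarr v2 layout assumed)."""
    meta_file = Path(store_path) / array_path / ".zarray"
    with meta_file.open() as f:
        return json.load(f)["fill_value"]


def compare_fill_values(source_store, target_store, array_path):
    """Return (source_fill, target_fill, are_equal) for one array path."""
    src = read_fill_value(source_store, array_path)
    dst = read_fill_value(target_store, array_path)
    return src, dst, src == dst
```

After running the script, calling compare_fill_values('testoutput/source.zarr', 'testoutput/target.zarr', 'foo/bar') should report the two values side by side. A JSON null is returned as Python None.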

flamingbear commented 1 year ago

Maybe the fill value is written into the data? The grids themselves look the same. I will check my other "real" output.
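One way to test that guess is to count how many chunk files each store actually holds. Zarr returns the fill value for any chunk that was never written, so if every chunk of the target array exists on disk, the fill value is not needed to read the data and the grids would match even with "fill_value": null. The sketch below assumes a Zarr v2 directory store in which every file in the array directory that does not start with a dot is a chunk. chunk_coverage is a hypothetical helper, not a zarr or rechunker function.

```python
import json
import math
from pathlib import Path


def chunk_coverage(store_path, array_path):
    """Return (chunks_on_disk, chunks_expected) for one array in a directory store."""
    array_dir = Path(store_path) / array_path
    meta = json.loads((array_dir / ".zarray").read_text())
    # Number of chunks needed to tile the full array shape.
    expected = math.prod(
        math.ceil(size / chunk) for size, chunk in zip(meta["shape"], meta["chunks"])
    )
    # Assumption: every non-dot file below the array directory is a chunk file.
    on_disk = sum(
        1 for p in array_dir.rglob("*") if p.is_file() and not p.name.startswith(".")
    )
    return on_disk, expected
```

Comparing chunk_coverage('testoutput/source.zarr', 'foo/bar') with the same call on target.zarr would show whether the source relies on its fill value for unwritten chunks while the target stores every chunk explicitly.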

flamingbear commented 1 year ago

I'm closing this and will open a new issue for the same problem that demonstrates it better.