pangeo-data / rechunker

Disk-to-disk chunk transformation for chunked arrays.
https://rechunker.readthedocs.io/
MIT License

Seems like rechunker turned my NaN values into 1e37? #55

Open rsignell-usgs opened 3 years ago

rsignell-usgs commented 3 years ago

My zarr dataset with 32-bit floats had NaN values before rechunker, but after rechunker, the NaN values are now 1e37, which is not so nice: https://nbviewer.jupyter.org/gist/rsignell-usgs/028fc946eb6b560b3d765c4a2dabfc5d

Of course I can use where to reset those values to NaN, but this seems like a bug.
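For readers who want to see the workaround mentioned above, here is a minimal sketch of resetting a sentinel value back to NaN. It uses only the Python standard library so it runs anywhere. The sentinel `1e37`, the tolerance, and the helper name `restore_nan` are assumptions for illustration, not anything rechunker, zarr, or xarray provides, and the exact value in the affected dataset may differ from the rounded 1e37 shown in the notebook.

```python
import math
from array import array

# Assumption: the unwanted sentinel is approximately 1e37, as displayed in
# the report. Check the real data before relying on this value.
SENTINEL = 1e37


def restore_nan(values, sentinel=SENTINEL, rel_tol=1e-6):
    """Return a new float32 array with near-sentinel entries replaced by NaN.

    `restore_nan` is a hypothetical helper, not a library function.
    A relative tolerance is used because float32 cannot store 1e37 exactly.
    """
    return array(
        "f",
        (math.nan if math.isclose(v, sentinel, rel_tol=rel_tol) else v
         for v in values),
    )


# Usage: 32-bit floats, with one entry holding the sentinel.
data = array("f", [1.0, 1e37, 2.5])
cleaned = restore_nan(data)
```

With array libraries the same idea is usually a single masked `where` call, but that only hides the symptom; it does not explain why the values changed.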

rabernat commented 3 years ago

Thanks for the bug report, Rich.

Any chance you could reduce this to a more reproducible example? We can't access your data so we can't reproduce this.

I recommend looking closely at the zarr metadata (.zarray) of the original array, then creating a synthetic array (ideally a small one) with the same exact properties.
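One way to act on this suggestion is to diff the `.zarray` JSON of the original and rechunked arrays. The sketch below uses only the standard library. The two metadata documents are made-up examples following the Zarr v2 layout (where a NaN fill value is stored as the string "NaN"); they are not the metadata from the dataset in this report, and `diff_zarray` is a hypothetical helper name.

```python
import json

# Hypothetical .zarray contents, for illustration only.
ORIGINAL = """
{"chunks": [1, 100, 100], "compressor": null, "dtype": "<f4",
 "fill_value": "NaN", "filters": null, "order": "C",
 "shape": [10, 100, 100], "zarr_format": 2}
"""
RECHUNKED = """
{"chunks": [10, 10, 10], "compressor": null, "dtype": "<f4",
 "fill_value": 0.0, "filters": null, "order": "C",
 "shape": [10, 100, 100], "zarr_format": 2}
"""


def diff_zarray(original_text, rechunked_text):
    """Return {key: (original, rechunked)} for every key whose value differs."""
    a = json.loads(original_text)
    b = json.loads(rechunked_text)
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }


for key, (before, after) in diff_zarray(ORIGINAL, RECHUNKED).items():
    print(f"{key}: {before!r} -> {after!r}")
```

A difference in `chunks` is expected after rechunking. A difference in `fill_value` or `dtype` would be worth including in the reproducible example. In practice the two strings would be read from the `.zarray` file inside each array's directory in the store.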

rsignell-usgs commented 3 years ago

@rabernat, yes I can work on a reproducible example -- I was just hoping a dev would have an aha moment seeing the issue.

Could it be related to the fact that the rechunked array was not written with consolidated metadata like the original? (I thought rechunker would write consolidated metadata by default, but I guess not.)

I could try rechunking the data again with consolidated metadata if I knew how to specify that.

rabernat commented 3 years ago

The general issue is that different layers of this stack (xarray and zarr) have different and possibly overlapping mechanisms for dealing with missing data.

The notebook link you posted above does not contain any obvious use of rechunker or analysis of missing values. Did I miss something, or did you post the wrong link?

rsignell-usgs commented 3 years ago

@rabernat, I posted the notebook that contrasted reading the data from the original and from the rechunked data.

Here is the notebook I used to rechunk.