pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

Reading data after saving data from masked arrays results in different numbers #9374

Open adriaat opened 4 weeks ago

adriaat commented 4 weeks ago

I hit an issue that can go unnoticed by many users.

I am masking my data using dask.array.ma.masked_invalid (my dataset is much larger than my memory). After saving the results and loading them again, the data has changed. In particular, the elements that were masked are set to 0. instead of np.nan or the fill_value, which is 1e+20.

Below is an example that illustrates the issue:

In [1]: import dask.array as da
   ...: import numpy as np
   ...: import xarray as xr

In [2]: ds = xr.DataArray(
   ...:     da.stack(
   ...:         [da.from_array(np.array([[np.nan, np.nan], [np.nan, 2]])) for _ in range(10)],
   ...:         axis=0
   ...:     ).astype('float32'),
   ...:     dims=('time', 'lat', 'lon')
   ...: ).to_dataset(name='mydata')

In [3]: # Mask my data

In [4]: ds = xr.apply_ufunc(da.ma.masked_invalid, ds, dask='allowed') 

In [5]: ds.mean('time').compute()
Out[5]: 
<xarray.Dataset> Size: 16B
Dimensions:  (lat: 2, lon: 2)
Dimensions without coordinates: lat, lon
Data variables:
    mydata   (lat, lon) float32 16B nan nan nan 2.0

In [6]: # Write to file

In [7]: ds.mean('time').to_netcdf('foo.nc')

In [8]: # Read foo.nc

In [9]: foo = xr.open_dataset('foo.nc')

In [10]: foo.compute()
Out[10]: 
<xarray.Dataset> Size: 16B
Dimensions:  (lat: 2, lon: 2)
Dimensions without coordinates: lat, lon
Data variables:
    mydata   (lat, lon) float32 16B 0.0 0.0 0.0 2.0

I expected mydata to be either [np.nan, np.nan, np.nan, 2.0], a numpy.ma.MaskedArray, or [1e+20, 1e+20, 1e+20, 2.0], since:

In [11]: ds.mean('time')['mydata'].data.compute()
Out[11]: 
masked_array(
  data=[[--, --],
        [--, 2.0]],
  mask=[[ True,  True],
        [ True, False]],
  fill_value=1e+20,
  dtype=float32)

instead of [0.0, 0.0, 0.0, 2.0]. No warning is raised, and I do not understand why the fill values are replaced by 0.

One workaround that works for my data (but might not hold in all cases) is to use xarray.where before writing to file:

In [12]: xr.where(ds.mean('time'), ds.mean('time'), np.nan).to_netcdf('foo.nc')

In [13]: foo = xr.open_dataset('foo.nc')

In [14]: foo.compute()
Out[14]: 
<xarray.Dataset> Size: 16B
Dimensions:  (lat: 2, lon: 2)
Dimensions without coordinates: lat, lon
Data variables:
    mydata   (lat, lon) float32 16B nan nan nan 2.0

or mask the data in some other way, e.g. using xarray.where in the beginning instead of xarray.apply_ufunc.
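Another option, sketched here with plain NumPy instead of dask to keep it short, is to convert the masked elements to np.nan explicitly before writing, so the result does not depend on how the writer treats masked arrays. The variable names are only illustrative. I am assuming dask.array.ma.filled behaves the same way on dask-backed masked arrays, but I have not verified that on the full dataset.

```python
import numpy as np

# Stand-in for the dask-backed data in the report: ten copies of a 2x2 slice
# in which only one element is valid.
slice_2d = np.array([[np.nan, np.nan], [np.nan, 2.0]], dtype="float32")
stacked = np.stack([slice_2d] * 10, axis=0)

# Mask NaN/inf values, then reduce over the first ("time") axis.
masked = np.ma.masked_invalid(stacked)
mean = masked.mean(axis=0)

# The mask records which elements are missing. The values stored underneath
# the mask are an implementation detail and should not be relied on.
mask = np.ma.getmaskarray(mean)

# Make the missing values explicit: the result is a plain ndarray with NaN
# wherever the mask was True.
explicit = mean.filled(np.nan)

print(mask)
print(explicit)
```

The array returned by filled is an ordinary ndarray, so anything that writes it out sees NaN rather than whatever happens to be stored under the mask.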

keewis commented 2 weeks ago

I can't reproduce this in my environment; I get the expected masked values (thanks for the code snippet, though).

Can you please add the output of xr.show_versions() to the original post? I suspect this issue is caused by an outdated, or possibly broken, environment.