pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

Reading data after saving data from masked arrays results in different numbers #9374

Open adriaat opened 3 months ago

adriaat commented 3 months ago

I hit an issue that can fail silently for many users.

I am masking my data using dask.array.ma.masked_invalid (my dataset is much larger than my memory). When I save the results and load them again, the data has changed: the elements that were masked are assigned 0.0 instead of np.nan or the fill_value, which is 1e+20.

Below is an example that illustrates the issue:

In [1]: import dask.array as da
   ...: import numpy as np
   ...: import xarray as xr

In [2]: ds = xr.DataArray(
   ...:     da.stack(
   ...:     [da.from_array(np.array([[np.nan, np.nan], [np.nan, 2]])) for _ in range(10)],
   ...:         axis=0
   ...:     ).astype('float32'),
   ...:     dims=('time', 'lat', 'lon')
   ...: ).to_dataset(name='mydata')

In [3]: # Mask my data

In [4]: ds = xr.apply_ufunc(da.ma.masked_invalid, ds, dask='allowed') 

In [5]: ds.mean('time').compute()
Out[5]: 
<xarray.Dataset> Size: 16B
Dimensions:  (lat: 2, lon: 2)
Dimensions without coordinates: lat, lon
Data variables:
    mydata   (lat, lon) float32 16B nan nan nan 2.0

In [6]: # Write to file

In [7]: ds.mean('time').to_netcdf('foo.nc')

In [8]: # Read foo.nc

In [9]: foo = xr.open_dataset('foo.nc')

In [10]: foo.compute()
Out[10]: 
<xarray.Dataset> Size: 16B
Dimensions:  (lat: 2, lon: 2)
Dimensions without coordinates: lat, lon
Data variables:
    mydata   (lat, lon) float32 16B 0.0 0.0 0.0 2.0

I expected mydata to be either [np.nan, np.nan, np.nan, 2.0], a numpy.MaskedArray, or [1e+20, 1e+20, 1e+20, 2.0], since:

In [11]: ds.mean('time')['mydata'].data.compute()
Out[11]: 
masked_array(
  data=[[--, --],
        [--, 2.0]],
  mask=[[ True,  True],
        [ True, False]],
  fill_value=1e+20,
  dtype=float32)

instead of [0.0, 0.0, 0.0, 2.0]. No warning is raised, and I do not understand why the masked values are replaced by 0.
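A plausible mechanism (an assumption on my part, not traced through xarray's write path): somewhere before serialization the masked array is converted to a plain ndarray, e.g. via np.asarray, which silently drops the mask and exposes whatever values happen to sit in the underlying data buffer. After a reduction like mean, those buffer values are often 0. A minimal numpy-only sketch of that mask-dropping behaviour:

```python
import numpy as np

# Hypothetical illustration: a masked array whose underlying buffer holds
# zeros at the masked positions, as can happen after a masked reduction.
m = np.ma.masked_array(
    data=np.zeros(4, dtype='float32'),
    mask=[True, True, True, False],
)
m[3] = 2.0

# Converting to a plain ndarray drops the mask without any warning --
# the masked slots come back as whatever the buffer holds, here 0.0.
plain = np.asarray(m)
print(plain)  # [0. 0. 0. 2.]
```

If a writer receives `plain` instead of `m`, the zeros end up on disk and no error is ever raised.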

A way to address this for my data (though it might not work in all cases) is to use xarray.where before writing to file:

In [12]: xr.where(ds.mean('time'), ds.mean('time'), np.nan).to_netcdf('foo.nc')

In [13]: foo = xr.open_dataset('foo.nc')

In [14]: foo.compute()
Out[14]: 
<xarray.Dataset> Size: 16B
Dimensions:  (lat: 2, lon: 2)
Dimensions without coordinates: lat, lon
Data variables:
    mydata   (lat, lon) float32 16B nan nan nan 2.0

or to mask the data in some other way, e.g. using xarray.where from the start instead of xarray.apply_ufunc.
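The idea behind these workarounds can be sketched in plain numpy (a sketch only; on a larger-than-memory dataset one would go through dask.array.ma.filled so the data stays lazy): explicitly fill the masked positions with NaN before handing the array to the writer, so there is no mask left to lose.

```python
import numpy as np

# A masked array mixing NaN and inf, as in the original use case.
a = np.ma.masked_invalid(np.array([np.nan, np.inf, 2.0], dtype='float32'))

# Replace the masked slots with NaN and drop the mask, so the array
# handed to to_netcdf() is a plain ndarray containing only NaNs.
safe = a.filled(np.nan)
print(safe)  # nan, nan, 2.0
```

A plain ndarray with NaNs round-trips through netCDF as expected, since NaN is the default fill sentinel for floats.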

Output of xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.12.5 | packaged by conda-forge | (main, Aug  8 2024, 18:36:51) [GCC 12.4.0]
python-bits: 64
OS: Linux
OS-release: 5.15.153.1-microsoft-standard-WSL2
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: C.UTF-8
LOCALE: ('C', 'UTF-8')
libhdf5: 1.14.3
libnetcdf: 4.9.2

xarray: 2024.9.0
pandas: 2.2.2
numpy: 2.1.1
scipy: None
netCDF4: 1.7.1
pydap: None
h5netcdf: None
h5py: None
zarr: None
cftime: 1.6.4
nc_time_axis: None
iris: None
bottleneck: None
dask: 2024.9.0
distributed: 2024.9.0
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
fsspec: 2024.9.0
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 73.0.1
pip: 24.2
conda: None
pytest: None
mypy: None
IPython: 8.27.0
sphinx: None
welcome[bot] commented 3 months ago

Thanks for opening your first issue here at xarray! Be sure to follow the issue template! If you have an idea for a solution, we would really welcome a Pull Request with proposed changes. See the Contributing Guide for more. It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better. Thank you!

keewis commented 3 months ago

I can't reproduce this in my environment; I get the expected masked values (thanks for the code snippet, though).

Can you please add the output of xr.show_versions() to the original post? I suspect this issue is caused by an outdated, or possibly broken, environment.

adriaat commented 2 months ago

Done. I also verified that the problem persists as of 2024-09-16, at least with my setup, by creating a new conda environment, installing ipython, xarray, numpy, dask, and netCDF4 from the conda-forge channel, and executing the snippet in this clean environment.

kmuehlbauer commented 2 months ago

@adriaat I have trouble understanding your use case, please bear with me. If you already have NaN in your data, why do you want to apply an additional mask? Do I understand correctly that you want something like this:

import dask.array as da
import numpy as np
import xarray as xr

ds = xr.DataArray(
    da.stack(
        [da.from_array(np.array([[np.nan, np.nan], [np.nan, 2]])) for _ in range(10)], axis=0
    ).astype('float32'), dims=('time', 'lat', 'lon')).to_dataset(name='mydata')
ds.mean('time').to_netcdf('foo.nc', encoding=dict(mydata=dict(_FillValue=np.float32(1e20))))

This actually applies your desired _FillValue as the on-disk value, via the CF _FillValue attribute.

Update: added np.float32 wrapper
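Conceptually (a simplified sketch, not xarray's actual decoding code), CF mask-and-scale decoding on read compares the raw on-disk values against _FillValue and replaces matches with NaN, which is why writing 1e20 sentinels with a matching _FillValue encoding reads back as NaN:

```python
import numpy as np

fill = np.float32(1e20)
# Values as they would be stored on disk: the sentinel at masked
# positions, real data elsewhere.
raw = np.array([fill, fill, fill, 2.0], dtype='float32')

# Simplified sketch of CF decoding: map the sentinel back to NaN.
decoded = np.where(raw == fill, np.float32(np.nan), raw)
print(decoded)  # nan, nan, nan, 2.0
```

This only works if the value actually written at masked positions matches the declared _FillValue, which is exactly what goes wrong in the reported bug: zeros are written instead.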

adriaat commented 2 months ago

@kmuehlbauer I may have confused you with my simple snippet, which I constructed to highlight the issue. To answer your question: my snippet is a simplification of a problem I worked on, where I had a mix of NaN's and inf's; for what I wanted to compute, they were equivalent. If I did not apply any additional masking and simply called ds.mean('time'), I would get inf in many places where I should get finite values. One could think of other workarounds that avoid the mask, e.g. replacing inf's with NaN's, if the size of the dataset allows it.

In any case, what I wanted to share (and the reason I opened the issue) is that when mydata is backed by a masked array, its NaNs are saved (or read?) as 0 instead of NaN. I might be missing something...

Your snippet works, I think, because you define mydata as a plain array, not a masked array. If you instead use a masked array in your simple example, the behaviour of my snippet is reproduced:

ds = xr.DataArray(
    da.stack(
        [da.from_array(np.ma.masked_array(np.array([[np.nan, np.nan], [np.nan, 2]]), np.array([[True, True], [True, False]]))) for _ in range(10)], axis=0
    ).astype('float32'), dims=('time', 'lat', 'lon')).to_dataset(name='mydata')
ds.mean('time').to_netcdf('foo.nc', encoding=dict(mydata=dict(_FillValue=np.float32(1e20))))
print(xr.open_dataset('foo.nc').compute())
# <xarray.Dataset>
# Dimensions:  (lat: 2, lon: 2)
# Dimensions without coordinates: lat, lon
# Data variables:
#    mydata   (lat, lon) float32 0.0 0.0 0.0 2.0
kmuehlbauer commented 2 months ago

Your snippet works, I think, because you define mydata as a plain array, not a masked array. If you instead use a masked array in your simple example, the behaviour of my snippet is reproduced:

Thanks @adriaat, after some thinking I came to the conclusion that there is more involved here. Digging for the root cause is above my pay grade, but I have one more workaround:

import dask.array as da
import numpy as np
import xarray as xr

ds = xr.DataArray(
    da.stack(
        [da.from_array(np.array([[np.nan, np.inf], [np.nan, 2]])) for _ in range(10)], axis=0
    ).astype('float32'), dims=('time', 'lat', 'lon')).to_dataset(name='mydata')
ds = xr.apply_ufunc(da.ma.masked_invalid, ds, dask='allowed')
ds = xr.apply_ufunc(da.ma.filled, ds, dask='allowed')
ds.mean('time').to_netcdf('foo.nc', encoding=dict(mydata=dict(_FillValue=np.float32(1e20))))
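Why this round-trips (my reading of it, at least): da.ma.filled with no argument fills the masked positions with the array's fill_value, which defaults to 1e20 for floats, and the matching _FillValue=1e20 encoding then maps those sentinels back to NaN on read. A numpy-only sketch of the filling step, assuming the default float fill value:

```python
import numpy as np

a = np.ma.masked_invalid(np.array([np.nan, np.inf, 2.0], dtype='float32'))

# filled() with no argument substitutes a.fill_value at masked
# positions, producing a plain ndarray with no mask to lose.
raw = a.filled()
print(a.fill_value)  # 1e+20 (numpy's default fill for floats)
print(raw)           # roughly [1e+20, 1e+20, 2.0]
```

Because the mask is resolved explicitly before the write, nothing downstream can silently replace the masked slots with 0.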
kmuehlbauer commented 2 months ago

when mydata is backed by a masked array, its NaNs are saved (or read?) as 0 instead of NaN.

JFTR, they are saved as zero (0) on disk, so this happens when writing out. Somewhere along the compute/write path things go wrong; I tried to follow the code path, but failed.

max-sixty commented 1 month ago

I noticed the plan-to-close label, but it seems like we have an MCVE since then.

Are masked arrays something we support?