adriaat opened this issue 3 months ago (status: Open)
Thanks for opening your first issue here at xarray! Be sure to follow the issue template! If you have an idea for a solution, we would really welcome a Pull Request with proposed changes. See the Contributing Guide for more. It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better. Thank you!
I can't reproduce this in my environment; I get the expected masked values (thanks for the code snippet, though).
Can you please add the output of xr.show_versions() to the original post? I suspect this issue is caused by an outdated, or possibly broken, environment.
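For completeness, the requested version report can be generated with:

import xarray as xr
xr.show_versions()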
Done. I also tested that the problem persists as of 2024-09-16, at least with my setup, by creating a new conda environment, installing ipython, xarray, numpy, dask, and netCDF4 from the conda-forge channel, and executing the snippet in this clean environment.
@adriaat I have trouble understanding your use case, so please bear with me. If you have NaN in your data, why do you want to apply an additional mask? Do I understand correctly that you want something like this:
import dask.array as da
import numpy as np
import xarray as xr
ds = xr.DataArray(
    da.stack(
        [da.from_array(np.array([[np.nan, np.nan], [np.nan, 2]])) for _ in range(10)], axis=0
    ).astype('float32'), dims=('time', 'lat', 'lon')).to_dataset(name='mydata')
ds.mean('time').to_netcdf('foo.nc', encoding=dict(mydata=dict(_FillValue=np.float32(1e20))))
This will actually apply your wanted _FillValue as the on-disk value via the CF attribute.
Update: added the np.float32 wrapper.
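To double-check what actually lands in the file, one can read it back with CF masking disabled (a quick sketch, assuming the foo.nc written above):

import xarray as xr

raw = xr.open_dataset('foo.nc', mask_and_scale=False)  # skip _FillValue masking on read
print(raw['mydata'].values)  # positions that were NaN before writing should show the 1e20 fill value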
@kmuehlbauer I may have confused you with my simple snippet, which I constructed to highlight the issue. To answer your question, my snippet is a simplification of a problem I worked on, where I had a mix of NaNs and infs; for what I wanted to compute, they were equivalent. If I did not apply any additional masking and simply called ds.mean('time'), then I would get inf in many places where I should get finite values. We can think of other workarounds that avoid the mask, e.g. replacing the infs with NaNs, if the size of the dataset allows it.
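For illustration, that inf-to-NaN workaround would look roughly like this (a sketch only; mydata stands in for the real variable and ds for the real dataset):

import numpy as np

arr = ds['mydata']
ds['mydata'] = arr.where(~np.isinf(arr))  # turn inf into NaN, keep everything else
ds.mean('time')  # skipna=True by default, so the former infs no longer poison the mean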
In any case, what I wanted to share (and the reason for which I opened the issue) was to highlight that when mydata is the result of a masked array, then if it has NaNs these are saved (or read?) as 0 instead of NaNs. I might be missing something...
Using your snippet works, I think, because you define mydata as an array, not a masked array. If you instead use a masked array in your simple example, then we recreate the behaviour of my snippet:
ds = xr.DataArray(
    da.stack(
        [da.from_array(np.ma.masked_array(np.array([[np.nan, np.nan], [np.nan, 2]]), np.array([[True, True], [True, False]]))) for _ in range(10)], axis=0
    ).astype('float32'), dims=('time', 'lat', 'lon')).to_dataset(name='mydata')
ds.mean('time').to_netcdf('foo.nc', encoding=dict(mydata=dict(_FillValue=np.float32(1e20))))
print(xr.open_dataset('foo.nc').compute())
# <xarray.Dataset>
# Dimensions:  (lat: 2, lon: 2)
# Dimensions without coordinates: lat, lon
# Data variables:
#     mydata   (lat, lon) float32 0.0 0.0 0.0 2.0
Thanks @adriaat, after some thinking I came to the conclusion that there is more involved here. Digging for the root cause is above my pay grade, but I have one more workaround:
import dask.array as da
import numpy as np
import xarray as xr
ds = xr.DataArray(
    da.stack(
        [da.from_array(np.array([[np.nan, np.inf], [np.nan, 2]])) for _ in range(10)], axis=0
    ).astype('float32'), dims=('time', 'lat', 'lon')).to_dataset(name='mydata')
# mask the invalid values (NaN/inf), then fill the masked positions to get back a plain array
ds = xr.apply_ufunc(da.ma.masked_invalid, ds, dask='allowed')
ds = xr.apply_ufunc(da.ma.filled, ds, dask='allowed')
ds.mean('time').to_netcdf('foo.nc', encoding=dict(mydata=dict(_FillValue=np.float32(1e20))))
"when mydata is the result of a masked array, then if it has NaNs these are saved (or read?) as 0 instead of NaNs."
JFTR, they are saved as zero (0) on disk, so this happens when writing out to disk. Somewhere in the compute step things go haywire; I've tried to follow the code path, but failed.
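One way to confirm the raw on-disk values (a sketch, assuming the foo.nc produced by the masked-array snippet above):

import netCDF4

with netCDF4.Dataset('foo.nc') as nc:
    nc.set_auto_mask(False)  # return the values exactly as stored, without applying _FillValue
    print(nc['mydata'][:])   # shows 0.0 at the masked positions rather than the 1e20 fill value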
I noticed the "plan to close" label, but it seems like we have an MCVE since then.
Are masked arrays something we support?
I hit an issue that can go unnoticed by many users.
I am masking my data using dask.array.ma.masked_invalid (my dataset is much larger than my memory). When saving the results and loading them again, the data is changed. In particular, 0. is assigned to those elements that were masked, instead of np.nan or the fill_value, which is 1e+20. Below is an example that illustrates the issue:
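The original snippet is not reproduced here; the following sketch, pieced together from the snippets discussed elsewhere in this thread (the array contents and dimensions are assumptions), shows the same behaviour:

import dask.array as da
import numpy as np
import xarray as xr

ds = xr.DataArray(
    da.stack(
        [da.from_array(np.array([[np.nan, np.inf], [np.nan, 2]])) for _ in range(10)], axis=0
    ).astype('float32'), dims=('time', 'lat', 'lon')).to_dataset(name='mydata')
ds = xr.apply_ufunc(da.ma.masked_invalid, ds, dask='allowed')  # lazily mask NaN/inf
ds.mean('time').to_netcdf('foo.nc', encoding=dict(mydata=dict(_FillValue=np.float32(1e20))))
print(xr.open_dataset('foo.nc')['mydata'].values)
# reported result: 0.0 at the masked positions instead of np.nan or 1e+20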
I expected mydata to be either [np.nan, np.nan, np.nan, 2.0], a numpy.MaskedArray, or [1e+20, 1e+20, 1e+20, 2.0], instead of [0.0, 0.0, 0.0, 2.0]. No warning is raised, and I do not get why the fill values are replaced by 0.
A way to address this for my data (but it might not work for all cases) is to use xarray.where before writing to file, or to mask the data in some other way, e.g. using xarray.where at the beginning instead of xarray.apply_ufunc; see the sketch below.
Output of xr.show_versions():