Thomas-Z opened this issue 2 months ago (status: Open)
@Thomas-Z Thanks for the well written issue.
The first issue is with Timedelta decoding. If you remove the units attribute, the pipeline works. This indicates that there is a regression in that part. I'll have a closer look in the next few days. One remark here: packed data can't be of type int64 (see https://cfconventions.org/Data/cf-conventions/cf-conventions-1.11/cf-conventions.html#packed-data).
The second issue is a non-conforming CF attribute. `scale_factor` should be of the unpacked type (some floating point type). If you change it to floating point, it works as intended.
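For reference, the packing scheme the convention describes can be sketched with plain NumPy (the variable names here are illustrative, not from the issue): data is stored as a small integer type, while `scale_factor` carries the unpacked floating-point type.

```python
import numpy as np

# Pack floating-point values into int16 with a float32 scale_factor,
# following the CF conventions on packed data.
true_values = np.array([0.1, 0.25, 1.5], dtype=np.float64)
scale_factor = np.float32(0.01)  # unpacked (floating) type, per CF
packed = np.round(true_values / scale_factor).astype(np.int16)

# Unpacking promotes the integers to the scale_factor's floating type.
unpacked = packed * scale_factor
print(packed.dtype, unpacked.dtype)
```

With a floating `scale_factor`, unpacking never has to cast a float result back into an integer array, which is exactly the cast that fails further down this thread.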
> The second issue is a non-conforming CF attribute. `scale_factor` should be of the unpacked type (some floating point type). If you change it to floating point, it works as intended.
We could cast and raise a warning. It should be OK to open a non-conforming file with xarray.
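A cast-and-warn approach could look roughly like the following sketch (the helper name is hypothetical, not xarray's actual code):

```python
import warnings

import numpy as np

def coerce_scale_factor(scale_factor):
    """Hypothetical helper: coerce a non-conforming integer scale_factor
    to float and warn, instead of failing later during decoding."""
    if np.issubdtype(np.asarray(scale_factor).dtype, np.integer):
        warnings.warn(
            "scale_factor should be of the unpacked (floating point) type "
            "per CF conventions; casting to float64.",
            stacklevel=2,
        )
        return np.float64(scale_factor)
    return scale_factor

print(coerce_scale_factor(1000))  # warns, returns 1000.0
```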
> The first issue is with Timedelta decoding. If you remove the units attribute the pipeline works.
Not sure if it helps but keeping the unit and removing the fill_value makes it work too.
> The second issue is a non-conforming CF attribute. `scale_factor` should be of the unpacked type (some floating point type). If you change it to floating point, it works as intended.
Right, I was not aware of that. Using 1000.0 as `scale_factor` does work, but it changes the unpacked data type (to float), which is kind of disturbing to me but seems to conform to the CF convention.
> We could cast and raise a warning. It should be OK to open a non-conforming file with xarray.
In my example I can open the non-conforming file. I just cannot write a non-conforming file and this is maybe not a bad thing.
> The first issue is with Timedelta decoding. If you remove the units attribute the pipeline works.

> Not sure if it helps but keeping the unit and removing the fill_value makes it work too.
Yes, I would have thought so. The CF mask coder is only applied when `_FillValue` is given. As time decoding happens after masking, that leads to an issue in this case. We possibly need to special-case time units in the CF mask coder. But aren't we doing that already?
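One way such special-casing could work, sketched as a standalone predicate (the function name and unit list are hypothetical, not xarray internals):

```python
def looks_like_cf_time(attrs):
    """Hypothetical check: CF datetime units contain 'since', and CF
    timedelta units are plain time-interval names. For such variables
    the mask coder could defer work to the time decoder instead of
    scaling the raw integers itself."""
    units = attrs.get("units", "")
    timedelta_units = {"days", "hours", "minutes", "seconds",
                       "milliseconds", "microseconds"}
    return " since " in units or units in timedelta_units

print(looks_like_cf_time({"units": "seconds since 1990-01-01 00:00:00"}))  # True
print(looks_like_cf_time({"units": "m s-1"}))  # False
```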
> We could cast and raise a warning. It should be OK to open a non-conforming file with xarray.

> In my example I can open the non-conforming file. I just cannot write a non-conforming file and this is maybe not a bad thing.
So, for the second case we already allow reading int64 packed into int8 (which is not CF conforming). But then it might be good to raise a more specific error on write here (non-conforming CF).
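A more specific write-time error might come from a dtype validation step before encoding, along these lines (a sketch; the function name and messages are hypothetical):

```python
import numpy as np

def validate_packed_encoding(data_dtype, encoded_dtype, scale_factor):
    """Hypothetical pre-write check for the CF packed-data rules:
    if the on-disk type is an integer, the in-memory (unpacked) type
    and the scale_factor must both be floating point."""
    if np.issubdtype(np.dtype(encoded_dtype), np.integer):
        if not np.issubdtype(np.dtype(data_dtype), np.floating):
            raise ValueError(
                f"CF packed data requires a floating unpacked type, got {data_dtype}"
            )
        if not np.issubdtype(np.asarray(scale_factor).dtype, np.floating):
            raise ValueError(
                "scale_factor must be of the unpacked (floating) type per CF "
                f"conventions, got {np.asarray(scale_factor).dtype}"
            )

# Conforming: float64 data packed into int16 with a floating scale_factor.
validate_packed_encoding("float64", "int16", np.float32(0.01))
```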
@Thomas-Z Is your issue related to #1621? Do you want your data converted to timedelta64 or keep the floating point representation?
My problem is more about the fact that we can no longer read these types of variables without setting `mask_and_scale` to `False` (and I broke some of my colleagues' CryoSat processing tools when I updated xarray :innocent:).
Decoding it as a timedelta64, with the option to disable that via `decode_timedelta=False`, seems OK to me, but I'm not the end user of the data.
I'm not a fan of having different or hard-to-predict decoding behavior depending on whether something is a coordinate or a variable, or whether it has a specific attribute. Simple rules (when possible) will not satisfy everyone, but there will be no surprises and we can adapt.
I'm having similar issues, but with reading a preexisting data file from Metop-C's ASCAT instrument. Maybe these files are non-conforming (I'm not sure), but they are official files from EUMETSAT. Unless I'm misunderstanding something, though, the file appears to follow the rules regarding packed data linked by @Thomas-Z: the data are packed as an `int32` and the scale factor is a `float64`.
Opening the file with `df = xr.open_dataset(fname)` succeeds, but I get the same error as above if I attempt to access values from `df.time.values`.
```python
> df.time.values
---------------------------------------------------------------------------
UFuncTypeError                            Traceback (most recent call last)
Cell In[35], line 1
----> 1 dat.time.values

File ~/anaconda3/envs/test-1.12.2-release/lib/python3.10/site-packages/xarray/core/dataarray.py:784, in DataArray.values(self)
    771 @property
    772 def values(self) -> np.ndarray:
    773     """
    774     The array's data converted to numpy.ndarray.
    775
        (...)
    782     to this array may be reflected in the DataArray as well.
    783     """
--> 784     return self.variable.values

File ~/anaconda3/envs/test-1.12.2-release/lib/python3.10/site-packages/xarray/core/variable.py:525, in Variable.values(self)
    522 @property
    523 def values(self):
    524     """The variable's data as a numpy.ndarray"""
--> 525     return _as_array_or_item(self._data)

File ~/anaconda3/envs/test-1.12.2-release/lib/python3.10/site-packages/xarray/core/variable.py:323, in _as_array_or_item(data)
    309 def _as_array_or_item(data):
    310     """Return the given values as a numpy array, or as an individual item if
    311     it's a 0d datetime64 or timedelta64 array.
    312
        (...)
    321     TODO: remove this (replace with np.asarray) once these issues are fixed
    322     """
--> 323     data = np.asarray(data)
    324     if data.ndim == 0:
    325         if data.dtype.kind == "M":

File ~/anaconda3/envs/test-1.12.2-release/lib/python3.10/site-packages/xarray/core/indexing.py:806, in MemoryCachedArray.__array__(self, dtype)
    805 def __array__(self, dtype: np.typing.DTypeLike = None) -> np.ndarray:
--> 806     return np.asarray(self.get_duck_array(), dtype=dtype)

File ~/anaconda3/envs/test-1.12.2-release/lib/python3.10/site-packages/xarray/core/indexing.py:809, in MemoryCachedArray.get_duck_array(self)
    808 def get_duck_array(self):
--> 809     self._ensure_cached()
    810     return self.array.get_duck_array()

File ~/anaconda3/envs/test-1.12.2-release/lib/python3.10/site-packages/xarray/core/indexing.py:803, in MemoryCachedArray._ensure_cached(self)
    802 def _ensure_cached(self):
--> 803     self.array = as_indexable(self.array.get_duck_array())

File ~/anaconda3/envs/test-1.12.2-release/lib/python3.10/site-packages/xarray/core/indexing.py:760, in CopyOnWriteArray.get_duck_array(self)
    759 def get_duck_array(self):
--> 760     return self.array.get_duck_array()

File ~/anaconda3/envs/test-1.12.2-release/lib/python3.10/site-packages/xarray/core/indexing.py:630, in LazilyIndexedArray.get_duck_array(self)
    625 # self.array[self.key] is now a numpy array when
    626 # self.array is a BackendArray subclass
    627 # and self.key is BasicIndexer((slice(None, None, None),))
    628 # so we need the explicit check for ExplicitlyIndexed
    629 if isinstance(array, ExplicitlyIndexed):
--> 630     array = array.get_duck_array()
    631 return _wrap_numpy_scalars(array)

File ~/anaconda3/envs/test-1.12.2-release/lib/python3.10/site-packages/xarray/coding/variables.py:81, in _ElementwiseFunctionArray.get_duck_array(self)
    80 def get_duck_array(self):
--> 81     return self.func(self.array.get_duck_array())

File ~/anaconda3/envs/test-1.12.2-release/lib/python3.10/site-packages/xarray/coding/variables.py:399, in _scale_offset_decoding(data, scale_factor, add_offset, dtype)
    397 data = data.astype(dtype=dtype, copy=True)
    398 if scale_factor is not None:
--> 399     data *= scale_factor
    400 if add_offset is not None:
    401     data += add_offset

UFuncTypeError: Cannot cast ufunc 'multiply' output from dtype('float64') to dtype('int64') with casting rule 'same_kind'
```
I can get around this by opening the file with `df = xr.open_dataset(fname, mask_and_scale=False)`, but that has obvious repercussions for the other DataArrays.
The `time` DataArray looks like this:
```
<xarray.DataArray 'time' (NUMROWS: 3264, NUMCELLS: 82)> Size: 2MB
[267648 values with dtype=int64]
Coordinates:
    lat      (NUMROWS, NUMCELLS) float64 2MB ...
    lon      (NUMROWS, NUMCELLS) float64 2MB ...
Dimensions without coordinates: NUMROWS, NUMCELLS
Attributes:
    valid_min:      0
    valid_max:      2147483647
    standard_name:  time
    long_name:      time
    units:          seconds since 1990-01-01 00:00:00
```
When read with `mask_and_scale=False`, the DataArray's attributes are:
```python
{'_FillValue': -2147483647,
 'missing_value': -2147483647,
 'valid_min': 0,
 'valid_max': 2147483647,
 'standard_name': 'time',
 'long_name': 'time',
 'scale_factor': 1.0,
 'add_offset': 0.0}
```
I can replicate the error by attempting an in-place operation on some of the time data after reading with `mask_and_scale=False, decode_times=False`:
```python
In [54]: df = xr.open_dataset(fname, mask_and_scale=False, decode_times=False)

In [55]: tmp = df.time.values[0:10, 0:10]

In [56]: tmp.dtype
Out[56]: dtype('int32')

In [57]: df.time.attrs['scale_factor'].dtype
Out[57]: dtype('float64')

In [58]: tmp *= dat2.time.attrs.get('scale_factor')
---------------------------------------------------------------------------
UFuncTypeError                            Traceback (most recent call last)
Cell In[58], line 1
----> 1 tmp *= dat2.time.attrs.get('scale_factor')

UFuncTypeError: Cannot cast ufunc 'multiply' output from dtype('float64') to dtype('int32') with casting rule 'same_kind'
```
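The failure boils down to NumPy's in-place casting rules and can be reproduced without any file (a minimal sketch with made-up values):

```python
import numpy as np

tmp = np.arange(10, dtype=np.int32)
scale = np.float64(1.0)

# The in-place multiply computes a float64 result and must cast it back
# into the int32 output array, which NumPy's default casting='same_kind'
# rule rejects (UFuncTypeError is a subclass of TypeError):
try:
    tmp *= scale
except TypeError:
    print("in-place multiply failed: cannot cast float64 result to int32")

# Casting the array to float first (what scale/offset decoding
# effectively needs to do) works:
decoded = tmp.astype(np.float64)
decoded *= scale
print(decoded.dtype)
```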
Am I doing something wrong? Is the file non-conformant? Is there a way to solve this issue without doing all of my own masking, scaling, and conversion to datetime?
@jsolbrig Sorry for the delay here.
The issue is with the on-disk data:

```python
'scale_factor': 1.0,
'add_offset': 0.0
```

These attributes will decode the int64 into float64 before decoding times. One solution to properly load your specific data is to remove the problematic attributes before decoding:

```python
df = xr.open_dataset(fname, decode_cf=False)
df.time.attrs.pop("scale_factor")
df.time.attrs.pop("add_offset")
df = xr.decode_cf(df)
```
### What happened?

Reading or writing netCDF variables containing `scale_factor` and/or `fill_value` attributes might raise the following error:

This problem might be related to the following changes: #7654.

### What did you expect to happen?

I'm expecting it to work like it did before xarray 2024.03.0!

### Minimal Complete Verifiable Example

### MVCE confirmation

### Relevant log output

### Anything else we need to know?

No response

### Environment