Closed: rabernat closed this issue 9 years ago
In fact I just found a netCDF issue on this topic! Apparently they don't think it should be supported. Unidata/netcdf4-python#442
@rabernat - Yes, this is all coming from the `netCDF4.netcdftime` module.

The workaround with xray is to use `ds = xray.open_dataset(filename, decode_times=False)` and then fix up the time variable "manually". You can use `xray.decode_cf()` or simply assign a new pandas time index to your time variable.
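A minimal sketch of that workaround, using an in-memory Dataset as a stand-in for the file (variable names and values here are hypothetical; with a real file you would start from `open_dataset(filename, decode_times=False)`):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Stand-in for a dataset whose raw time axis could not be decoded
# (with a real file: ds = xr.open_dataset(filename, decode_times=False))
ds = xr.Dataset(
    {'sst': ('time', np.array([10.0, 11.0, 12.0]))},
    coords={'time': ('time', np.array([0.0, 31.0, 59.0]),
                     {'units': 'days since 0000-01-01 00:00:00',
                      'calendar': 'noleap'})},
)

# Fix the time variable "manually" by assigning a pandas time index
# (here mapping the model output onto an arbitrary in-range start date)
ds['time'] = pd.date_range('2000-01-01', periods=ds.sizes['time'], freq='MS')
```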
As an aside, I also work with CESM output and this is a common problem with its netCDF output.
@jhamman consider using cf_units in xray :wink:
(See https://github.com/Unidata/netcdf4-python/issues/442#issuecomment-129059576)
The PR above fixes this issue. However, since my model years are in the range 100-200, I am still getting the warning

```
RuntimeWarning: Unable to decode time axis into full numpy.datetime64 objects, continuing using dummy netCDF4.datetime objects instead, reason: dates out of range
```

and eventually, when I try to access the time data, an error with a very long stack trace ending with

```
pandas/tslib.pyx in pandas.tslib.Timestamp.__new__ (pandas/tslib.c:7638)()
pandas/tslib.pyx in pandas.tslib.convert_to_tsobject (pandas/tslib.c:21232)()
pandas/tslib.pyx in pandas.tslib._check_dts_bounds (pandas/tslib.c:23332)()

OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 100-02-01 00:00:00
```
I see there is a check in conventions.py that the year has to lie between 1678 and 2262. What is the reason for this?
We try to cast all the time variables to a pandas time index. This gives xray the ability to use many of the fast and fancy timeseries tools that pandas has. One consequence is that non-standard calendars, such as the "noleap" calendar, must have dates inside the valid range of the standard nanosecond-resolution calendar (roughly years 1678 to 2262).

Does that make sense? Ideally, numpy and pandas would support custom calendars, but they don't, so at this point we're bound to their limits.
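For reference, that range comes directly from the bounds of pandas' nanosecond-resolution `Timestamp`:

```python
import pandas as pd

# The datetime64[ns] limits that constrain a pandas time index
print(pd.Timestamp.min)  # 1677-09-21 00:12:43.145224193
print(pd.Timestamp.max)  # 2262-04-11 23:47:16.854775807
```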
@jhamman Thanks for the clear explanation! One of the main uses for non-standard calendars would be climate model "control runs", which don't occur at any specific point in historical time but still have seasonal cycles, well-defined months, etc. It would be nice to have "group by" functionality for these datasets. But I do see how this is impossible with the current numpy datetime64 datatype. Perhaps the long term fix is to implement non-standard calendars within numpy itself.
> Perhaps the long term fix is to implement non-standard calendars within numpy itself.
I agree, although that sounds like quite an undertaking. Maybe raise an issue over at numpy and ask if they would be interested in a multi-calendar api? If numpy could make it work, then I'm sure pandas could as well.
In case anyone is still struggling with the CESM POP time units convention, with the new CF support of version 0.12 the problem is (almost) solved.

I have slightly different CESM POP netCDF output with time attributes `{'units': 'days since 0-1-1 00:00:00', 'calendar': '365_day'}` and, crucially, a dimension `d2` (without coordinates) that trips up the `decode_cf` function.

```python
import xarray as xr  # version >= 0.12
ds = xr.open_dataset('some_CESM_output_file.nc', decode_times=False)
ds = ds.drop_dims(['d2'])
ds = xr.decode_cf(ds, use_cftime=True)
```

Now the xarray Dataset has a `cftime.DatetimeNoLeap`-type time coordinate.
Could you provide the output of `ncdump -h` or `ds.info()` on an example file?
Thanks -- in looking at the metadata it seems there is nothing unusual about the `d2` dimension (in normal circumstances we should be able to decode N-D variables to dates, regardless of their type).

My feeling is that the issue here remains the fact that cftime dates do not support year zero (see the upstream issue @rabernat mentioned earlier: Unidata/netcdf4-python#442). That said, it's surprising that dropping the `time_bounds` variable seems to be a workaround for this issue, because the `time` variable (which remains in the dataset) still has units with a reference date of year zero.
If you don't mind, could you provide me with two more things?

```python
ds = xr.open_dataset('some_CESM_output_file.nc')
```
Opening the file as `ds = xr.open_dataset('some_CESM_output_file.nc', decode_times=False)`, the time coordinate `ds.time` is at first simply an array of floats:
```
<xarray.DataArray 'time' (time: 1)>
array([73020.])
Coordinates:
  * time     (time) float64 7.302e+04
Attributes:
    long_name:      time
    units:          days since 0-1-1 00:00:00
    bounds:         time_bnds
    calendar:       365_day
    standard_name:  time
    axis:           T
```
and after decoding, `xr.decode_cf(ds, use_cftime=True).time` returns

```
<xarray.DataArray 'time' (time: 1)>
array([cftime.DatetimeNoLeap(200, 1, 21, 0, 0, 0, 0, 3, 21)], dtype=object)
Coordinates:
  * time     (time) object 0200-01-21 00:00:00
Attributes:
    long_name:      time
    bounds:         time_bnds
    standard_name:  time
    axis:           T
```
The traceback from opening the file without `decode_times=False` complains about year 0 being outside the range of the Gregorian or Julian calendars.
Great that's helpful, thanks. I see what's happening now. There's a lot of tricky things going on, so bear with me.
Let's examine the output from `ds.info()` related to the time bounds and time variables:

```
float64 time_bound(time, d2) ;
        time_bound:long_name = boundaries for time-averaging interval ;
        time_bound:units = days since 0000-01-01 00:00:00 ;
float64 time(time) ;
        time:long_name = time ;
        time:units = days since 0-1-1 00:00:00 ;
        time:bounds = time_bnds ;
        time:calendar = 365_day ;
        time:standard_name = time ;
        time:axis = T ;
```
There are a few important things to note:

1. For both the `time_bound` and `time` variables, the units attribute contains a reference date with year zero.
2. `time` has a calendar attribute of `365_day`, while a calendar attribute is not specified for `time_bound`.
3. `time` has a `bounds` attribute that points to a variable named `time_bnds` instead of `time_bound`.

For non-real-world calendars (e.g. 365_day), reference dates in cftime should allow year zero. This was fixed upstream in https://github.com/Unidata/netcdf4-python/pull/470. That being said, because of (2), the calendar for `time_bound` is assumed to be a standard calendar; therefore you get this `ValueError` when decoding the times:

```
ValueError: zero not allowed as a reference year, does not exist in Julian or Gregorian calendars
```
Ultimately though, with https://github.com/pydata/xarray/pull/2571, we try to propagate the time-related attributes from the time coordinate to the associated bounds coordinate (so in normal circumstances we would use a 365_day calendar in this case as well). But, because of (3), this is not possible due to the fact that the `bounds` attribute on the `time` variable points to a variable name that does not exist.
In theory, another possible way to work around this would be to open the dataset with `decode_times=False`, add the appropriate calendar attribute to `time_bound`, and then decode the times:

```python
ds = xr.open_dataset('some_CESM_output_file.nc', decode_times=False)
ds.time_bound.attrs['calendar'] = ds.time.attrs['calendar']
ds = xr.decode_cf(ds, use_cftime=True)
```

Now, this may still not work depending on the values in the `time_bound` variable (i.e. if any are less than 365.), because cftime currently does not support year zero in date objects (even for non-real-world calendars). I think one could make the argument that this is inconsistent with allowing reference dates with year zero for those date types, so it would probably be worth opening an issue there to try and get that fixed upstream.
In conclusion, I'm afraid there is nothing we can do in xarray to automatically fix this situation. Issue (3) in the netCDF file is particularly unfortunate. If it weren't for that, I think all of these issues would be possible to work around, e.g. with https://github.com/pydata/xarray/pull/2571 here, or with fixes upstream.
> Now, this may still not work depending on the values in the `time_bound` variable (i.e. if any are less than 365.), because cftime currently does not support year zero in date objects (even for non-real-world calendars). I think one could make the argument that this is inconsistent with allowing reference dates with year zero for those date types, so it would probably be worth opening an issue there to try and get that fixed upstream.
I opened an issue in cftime regarding this: https://github.com/Unidata/cftime/issues/114.
It's important to be clear that the issues 2 and 3 that @spencerkclark pointed out are objectively errors in the metadata. We have worked very hard over many years to enable xarray to correctly parse CF-compliant dates with non-standard calendars. But xarray cannot and should not be expected to magically fix metadata that is inconsistent or incomplete.

You really need to bring these issues to the attention of whoever generated `some_CESM_output_file.nc`.
@rabernat, it is not clear to me that issue 2 is an objective error in the metadata. The CF conventions section on the `bounds` attribute states:

> Since a boundary variable is considered to be part of a coordinate variable’s metadata, it is not necessary to provide it with attributes such as `long_name` and `units`. Boundary variable attributes which determine the coordinate type (`units`, `standard_name`, `axis` and `positive`) or those which affect the interpretation of the array values (`units`, `calendar`, `leap_month`, `leap_year` and `month_lengths`) must always agree exactly with the same attributes of its associated coordinate, scalar coordinate or auxiliary coordinate variable. To avoid duplication, however, it is recommended that these are not provided to a boundary variable.
I conclude from this that software parsing CF metadata should have the variable identified by the `bounds` attribute inherit the attributes mentioned above from the variable with the `bounds` attribute. @spencerkclark describes this as a workaround. One could argue that, based on the CF conventions text, xarray would be justified in doing that automatically.
However, this is confounded by issue 3, that `time.attrs.bounds /= 'time_bound'`, which I agree is an error in the metadata. As a CESM-POP developer, I'm surprised to see that. Raw model output from CESM-POP has `time.attrs.bounds = 'time_bound'`. So it seems like something in a post-processing workflow has the net effect of changing `time.attrs.bounds`, but is preserving the name of the bounds variable. That is problematic.
If CESM-POP were to adhere more closely to the CF recommendation in this section, I think it would drop `time_bound.attrs.units`, not add `time_bound.attrs.calendar`. But I don't think that is what you are suggesting.
@klindsay28 -- thanks for the clarification. You're clearly right about 2, and I was misinformed. The problem is that 3 makes it impossible to follow the CF convention rules to overcome 2 (which xarray would try to do).
@AJueling, do you know the provenance of the file with `time.attrs.bounds /= 'time_bound'`? If that file is being produced by an NCAR or CESM supplied workflow, then I am willing to see if the workflow can be corrected to keep `time.attrs.bounds = 'time_bound'`. With this mismatch, it seems hopeless for xarray to automatically figure out how to handle this file as it was intended to be handled.
Thank you all for the clarification! I will get in touch with the person who ran the model and get back to you as soon as possible.
I'm also getting the same error:

```
ValueError: unable to decode time units 'months since 1955-01-01 00:00:00' with 'the default calendar'. Try opening your dataset with decode_times=False or installing cftime if it is not installed.
```
I am trying to use xray with some CESM POP model netCDF output, which supposedly follows CF-1.0 conventions. It is failing because the model's time units are 'days since 0000-01-01 00:00:00'. When calling `open_dataset`, I get the following error:

Full metadata for the time variable:

I guess this is a problem with the underlying netCDF4 `num2date` function?