kuchaale / X-regression

Multiple linear regression from Statsmodels library coupled with Xarray library
GNU General Public License v3.0
12 stars 3 forks source link

xarray datetime limitations #6

Closed wwicker closed 3 months ago

wwicker commented 6 years ago

Hi, xarray is using pandas DateTimeIndex for the time axis of xarrays. Due to the representation of time in nanoseconds the investigated period is limited to something like 550 years. There is much on this issue on the web. A widely recommended workaround is to use PeriodIndex instead. Unfortunately many xarray functionalities are affected by this, e.g. groupby and encoding.

I propose an adaptation like this in eofs_univ_ccmi_xarray.py and something equivalent in the other scripts.

line 96:

#use PeriodIndex instead of DatetimeIndex as time dimension due to 64 bit nanosecond limit
zm_u = zm_u.to_dataset() #that causes problems later on: use anomalies['U'] instead
zm_u = xr.decode_cf(zm_u)
periods = []
for date in zm_u['time']:
        raw_date = date.values.item()
        periods.append(pd.Period(year=raw_date.year, month=raw_date.month, day=raw_date.day, freq='M'))
zm_u['time'] = pd.PeriodIndex(periods, freq='M')
#period selection
#times = pd.date_range(str(s_year)+'-01-01', str(e_year)+'-12-31', name='time', freq = 'M')
times = pd.PeriodIndex(start=str(s_year)+'-01', end=str(e_year)+'-12', name='time', freq = 'M')
zm_u_sel = zm_u.sel(time = times, method='ffill') #nearest
#remove seasonality
months = xr.DataArray([p.month for p in zm_u_sel['time'].values],coords=[times],name='months')
climatology = zm_u_sel.groupby(months).mean('time')
anomalies = zm_u_sel.groupby(months) - climatology
anomalies = anomalies['U']

line 164:

nc_time = pcs_ds.indexes['time'].strftime('%Y-%m')
pcs_ds['time'] = nc_time
kuchaale commented 6 years ago

@wwicker, thanks for raising the issue related to timestamp limitations (limited to approximately 584 years). Your workaround may not be necessary if this xarray's issue would be solved. Otherwise, I would rather suggest the following workaround:

ds = xr.open_dataset(in_netcdf, decode_times=False)
ntime = ds.time.shape[0]
times = pd.period_range(str(s_year)+'-01-01', name='time', periods=ntime, freq='M')
ds['time'] = times
ds['month'] = ('time', times.month)
ds['year'] = ('time', times.year)
climatology = ds.groupby(ds.month).mean('time')
anomalies = ds.groupby(ds.month) - climatology
kuchaale commented 6 years ago

However, I found out that plotting and netcdf outputting is not possible due to object dtype of time coordinate.

lvankampenhout commented 6 years ago

+1 to this issue. I'm struggling big time with an 1800-year climate model dataset that I need to resample in order to make different annual means (June-May). Two remarks.

  1. Both Numpy and Pandas internally work with datetime64[ns] for storing the data. But whereas Numpy is perfectly able to work with deviant dates, e.g. np.datetime64('0500-01-01') works fine, Pandas is not: pd.to_datetime(np.datetime64('0500-01-01')) will fail. Can someone explain that one to me?
  2. From what I've read, the PeriodIndex seems like a viable alternative to Pandas timestamps whenever you don't need the nanosecond resolution (which I don't — monthly resolution suffices). As far as I can tell, xarray should support all operations out-of-the-box for PeriodIndex, which is not the case for cftime (notably, resampling is missing). But as noted above the support for PeriodIndex is broken, so high priority should IMO be given to addressing https://github.com/pydata/xarray/issues/1565 and https://github.com/pydata/xarray/issues/1270