pangeo-data / climpred

:earth_americas: Verification of weather and climate forecasts :earth_africa:
https://climpred.readthedocs.io
MIT License

alignment is trying to add 'cftime._cftime.datetime' and 'datetime.timedelta' resulting in a TypeError #393

Closed brianpm closed 4 years ago

brianpm commented 4 years ago

I suspect this is user error, but I didn't want to go too far before asking for some help.

Here's the bottom line: I get to the point of having a HindcastEnsemble object with observations added, and then I go to verify like this:

hindcast.verify('NOAA_Tmax', metric='mse', reference='persistence')

That dies with the message:

TypeError: unsupported operand type(s) for +: 'cftime._cftime.datetime' and 'datetime.timedelta'

I've tried different invocations of verify, but they all end up with this same error.

The traceback indicates that this is coming from utils.py in shift_cftime_index(xobj, time_string, n, freq) as part of the alignment procedure.

My case is CESM atmosphere output from short (20-day) hindcasts which are initialized every 5 days (only one member per initial time). The hindcast object's initialized ensemble entry has dimensions:

init: 73 lat: 360 lead: 21 lon: 720

With init being an array of cftime.datetime objects. I made init by using init = xr.CFTimeIndex(dtim, name='init'). The lead dimension is just integers with units of days, which I made like this:

lead = xr.DataArray(np.arange(len(dstmp['time'])), dims='lead', attrs={"units":"days"})

The observations have dimensions of

lat: 360 lon: 720 time: 731

That is 2-years of daily observations that start at the same time as the hindcast, and extend beyond the max-lead of the last initialization. I tried to make the time dimension be as similar to the hindcast one as possible by loading with decode_times = False and then doing:

xr.CFTimeIndex(cftime.num2date(v_tmin_ds['time'], v_tmin_ds['time'].units))

The only difference I can see is that the hindcast values are cftime.datetime while the observations end up being cftime.DatetimeGregorian objects.

Any thoughts about what is going on? Let me know if you need more info.

Output of climpred.show_versions()

INSTALLED VERSIONS

commit: None python: 3.7.6 | packaged by conda-forge | (default, Mar 23 2020, 23:03:20) [GCC 7.3.0] python-bits: 64 OS: Linux OS-release: 3.10.0-1127.8.2.el7.x86_64 machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: en_US.UTF-8

climpred: 2.1.0.post1 xarray: 0.15.1 pandas: 1.0.5 numpy: 1.18.5 scipy: 1.5.0 cftime: 1.1.3 dask: 2.18.1 distributed: 2.19.0 setuptools: 47.3.1.post20200616 pip: 20.1.1 conda: 4.8.3 pytest: None IPython: 7.13.0 sphinx: None

aaronspring commented 4 years ago

Hi brianpm, thanks for using climpred and raising the issue here.

I usually set the init dimension in hindcast and the time dimension in verification with xr.cftime_range(start=..., periods=..., freq=...). Please try this. We should probably make our assumption of relying on a time index rather than a time range more prominent in the docs: https://climpred.readthedocs.io/en/stable/setting-up-data.html.

We haven't really figured out which time requirements to use, see https://github.com/bradyrx/climpred/issues/304

Please also post the coordinates here: the init dimension of the hindcast and the time dimension of the verification. With those we can understand the problem more easily.

bradyrx commented 4 years ago

Hi Brian!

The only difference I can see is that the hindcast values are cftime.datetime while the observations end up being cftime.DatetimeGregorian objects.

This is probably the issue. xarray wants to align over identical calendar/datetime objects. I think you and I have talked before about the headache of calendars...

I was thinking of putting in a PR soon with a hard check that when you add an observation to HindcastEnsemble it should be comprised of datetimes of the identical type to the init dimension. Do you have the obs and forecasts somewhere on glade? This would be a good test case for me to structure a small PR around to deal with this.

The short answer is that you need to make sure that hind['init'] and verif['time'] have identical datetime styles. I need to look into 'datetime.timedelta' specifically. I've been checking for pandas/numpy datetimes and cftime, but there is a third timedelta type, as I've seen it referenced in the base xarray calendar management code.

bradyrx commented 4 years ago

Yes, in short (and maybe repeating Aaron here), for the obs you should do:

obs['time'] = xr.cftime_range(start='1990-01-01', freq='D', periods=obs.time.size, calendar='gregorian')

Where you replace '1990-01-01' with the starting time.

brianpm commented 4 years ago

Great. I'll try this today.

This data is on the CGD filesystem. @bradyrx, if you have access to that, I can point you to it, otherwise, I might need to move the same data to glade for you to look at it.

bradyrx commented 4 years ago

@brianpm , I don't think I have access to it. You're welcome to trim it down if it's big and move it to scratch. Whatever works! Just post the pointer here once it's moved. Thanks!

brianpm commented 4 years ago

Switching to using xr.cftime_range() for obs['time'] seems to work. At least, the verify step completes and produces a result. I didn't remove the leap day from my observations, but I guess that is consistent with just taking lead to be an integer number of days from initialization.

@bradyrx -- I put some sample data in scratch: /glade/scratch/brianpm/forRiley

I put my script there too, so you can see how I am putting together the data.

Thanks for the help. I'll keep messing around with the metrics now that I can get that far.

UPDATE: I spoke too soon. I didn't notice that my cftime_range was set up wrong for the hindcasts. Their initial time is every 5 days. Now I have what I think is the correct init in my hindcast, but then I was getting a warning saying

A common set of verification dates cannot be found for the initializations and verification data supplied.

I dropped leap days from my observations and used 'noleap' for the calendar (both model and obs), but this issue persists.

@bradyrx , I updated the .py script in scratch to reflect these changes. Maybe there's an obvious issue that I don't see.

Here are the coordinates:

>>> print(ds['init'])

<xarray.DataArray 'init' (init: 73)>
array([cftime.DatetimeNoLeap(2016-01-05 00:00:00),
       cftime.DatetimeNoLeap(2016-01-10 00:00:00),
       cftime.DatetimeNoLeap(2016-01-15 00:00:00),
       cftime.DatetimeNoLeap(2016-01-20 00:00:00),
       cftime.DatetimeNoLeap(2016-01-25 00:00:00),
       cftime.DatetimeNoLeap(2016-01-30 00:00:00),
       cftime.DatetimeNoLeap(2016-02-04 00:00:00),
       cftime.DatetimeNoLeap(2016-02-09 00:00:00),
       cftime.DatetimeNoLeap(2016-02-14 00:00:00),
       cftime.DatetimeNoLeap(2016-02-19 00:00:00),
       cftime.DatetimeNoLeap(2016-02-24 00:00:00),
       cftime.DatetimeNoLeap(2016-03-01 00:00:00),
....<snip> ...
       cftime.DatetimeNoLeap(2016-11-26 00:00:00),
       cftime.DatetimeNoLeap(2016-12-01 00:00:00),
       cftime.DatetimeNoLeap(2016-12-06 00:00:00),
       cftime.DatetimeNoLeap(2016-12-11 00:00:00),
       cftime.DatetimeNoLeap(2016-12-16 00:00:00),
       cftime.DatetimeNoLeap(2016-12-21 00:00:00),
       cftime.DatetimeNoLeap(2016-12-26 00:00:00),
       cftime.DatetimeNoLeap(2016-12-31 00:00:00)], dtype=object)
Coordinates:
  * init     (init) object 2016-01-05 00:00:00 ... 2016-12-31 00:00:00
>>> print(obs['time'])

<xarray.DataArray 'time' (time: 730)>
array([cftime.DatetimeNoLeap(2016-01-01 00:00:00),
       cftime.DatetimeNoLeap(2016-01-02 00:00:00),
       cftime.DatetimeNoLeap(2016-01-03 00:00:00), ...,
       cftime.DatetimeNoLeap(2017-12-29 00:00:00),
       cftime.DatetimeNoLeap(2017-12-30 00:00:00),
       cftime.DatetimeNoLeap(2017-12-31 00:00:00)], dtype=object)
Coordinates:
  * time     (time) object 2016-01-01 00:00:00 ... 2017-12-31 00:00:00
bradyrx commented 4 years ago

@brianpm, this is an annoying error message that is really unique to the S2S timescale. The default alignment style tries to find a common set of verification dates that verify at all leads. That's not really possible when you're initializing every 5 days. See https://climpred.readthedocs.io/en/stable/alignment.html. If you look at the visualization under same_verifs it will be clearer.

In the next release, we're going to make it so that the user has to declare an alignment style in .verify() so thought goes into every step of the process.

If you do .verify(alignment='maximize') it should work. I will double check with what you have on disk tomorrow. Let me know if that helps.

brianpm commented 4 years ago

When I switch to alignment='maximize' in this line:

ver = hindcast.verify('NOAA_Tmax', alignment='maximize', metric='mse')

I'm getting an error that seems to originate from xskillscore:

ValueError: dimension 'time' on 1th function argument to apply_ufunc with dask='parallelized' consists of multiple chunks, but is also a core dimension. To fix, rechunk into a single dask array chunk along this dimension, i.e., .chunk({'time': -1}), but beware that this may significantly increase memory usage.

The dask='parallelized' option seems to be hard-wired in xskillscore.

This seems like a different issue, so that's progress.

aaronspring commented 4 years ago

You probably need to rechunk the time dimension of your verification data. For the initialized data, the lead: 1 chunking is done by the algorithm anyway. We could introduce this automatic rechunking into xskillscore or climpred directly, but we decided not to, because then the user needs to understand a bit about chunking.
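A minimal sketch of that rechunking on a toy DataArray (dimension sizes here are placeholders):

```python
import numpy as np
import xarray as xr

# toy observations chunked along time, as open_mfdataset typically yields
obs = xr.DataArray(
    np.zeros((730, 4, 8)),
    dims=("time", "lat", "lon"),
).chunk({"time": 100})
print(len(obs.chunks[0]))  # 8 -- 'time' is split across multiple chunks

# collapse 'time' into a single chunk so it can serve as a core dimension
obs = obs.chunk({"time": -1})
print(obs.chunks[0])  # (730,)
```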

bradyrx commented 4 years ago

Yes this seems to be a recurring problem @aaronspring , we might want to reintroduce the auto-rechunking with a warning.

@brianpm, this is occurring because you're passing in a dask array. That would happen either automatically through any use of xr.open_mfdataset or of course if you pre-chunked it. Since comparisons happen over the init/time dimension, you can't have any chunking on that dimension. So before you initialize the HindcastEnsemble, you should chunk your datasets with something like:

ds.chunk({'init': -1, 'lat': 'auto', ...}) and obs.chunk({'time': -1, 'lat': 'auto', ...}). I think you might be able to do HindcastEnsemble.chunk({'time': -1, ...}).chunk({'init': -1, ...}), but it's probably easier to do it before instantiating the object. The reason the warning says 'time' even though it's probably your 'init' is that we rename 'init' to 'time' intermediately to align/compare the init/time dimensions in verification.

Regardless, there's a lot of quality-of-life small fixes I can make based on this issue thread.

brianpm commented 4 years ago

I forgot that my obs data array was a dask array at all. That makes sense. Using `obs.chunk({'time':-1})` was what was needed. Now I think I'm getting the expected output using `alignment='maximize'` and/or `alignment='same_inits'`. Thanks for all the hand-holding.
bradyrx commented 4 years ago

No problem @brianpm, glad it worked. We're excited to have people try it out, and again, this will inspire some quality-of-life modifications to make sure you don't have to go through this many steps in the future.