pangeo-data / climpred

:earth_americas: Verification of weather and climate forecasts :earth_africa:
https://climpred.readthedocs.io
MIT License

Logic to determine frequency of verification dataset fails when monthly data times are on different days of the month #858

Open gmacgilchrist opened 4 weeks ago

gmacgilchrist commented 4 weeks ago

Description of bug

For monthly average data, it is not uncommon for time indices to fall on the middle day of the month, which varies from month to month. This breaks the logic in return_time_series_freq, which only detects a monthly frequency if the day of the time index is the same for every month.

I encountered the issue while attempting to generate an uninitialized forecast. I think it was likewise causing silent issues in generating a persistence forecast, which was previously producing NaNs but works fine after implementing a hack (changing the time index of the verification dataset to match what's expected).

Code sample (reproducing the core logic of return_time_series_freq)

import cftime
import xarray as xr

# monthly spaced time axis whose day of the month varies
times = [
    cftime.DatetimeNoLeap(1, 1, 15),
    cftime.DatetimeNoLeap(1, 2, 14),
    cftime.DatetimeNoLeap(1, 3, 15),
]
ds = xr.Dataset(coords={"time": times})

for freq in ["day", "month", "year"]:
    # report the finest unit in which the first timestamp differs from any other
    first = getattr(ds.isel({"time": 0})["time"].dt, freq)
    if not (first == getattr(ds["time"].dt, freq)).all():
        print(freq)
        break

This prints "day", i.e. the inferred frequency is daily rather than monthly, which results in subsequent errors. To work around this, a user has to manipulate at least the verification dataset so that the time index has the same day in every month.
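One possible way to make the check tolerant of a varying day of the month is to compare the spacing between consecutive timestamps in whole months instead of comparing the day field. The following is only a minimal sketch under stated assumptions, not climpred's implementation: `infer_freq_tolerant` is a hypothetical helper, the timestamps are assumed to be sorted, and standard-library `datetime` objects stand in for cftime objects because both expose `.year` and `.month`.

```python
from datetime import datetime


def infer_freq_tolerant(times):
    """Guess 'day', 'month' or 'year' from sorted timestamps.

    Hypothetical helper, not part of climpred. ``times`` is any sorted
    sequence of objects with ``year`` and ``month`` attributes.
    """
    if len(times) < 2:
        raise ValueError("need at least two timestamps to infer a frequency")
    # Count months since year 0 so the day of the month is ignored entirely.
    month_index = [t.year * 12 + t.month for t in times]
    steps = {b - a for a, b in zip(month_index, month_index[1:])}
    if steps == {1}:
        return "month"
    if steps == {12}:
        return "year"
    # Anything else (several stamps per month, irregular spacing) is
    # treated as daily in this sketch.
    return "day"


# datetime stands in for cftime.DatetimeNoLeap here.
times = [datetime(1, 1, 15), datetime(1, 2, 14), datetime(1, 3, 15)]
print(infer_freq_tolerant(times))  # prints "month"
```

With the time axis from the code sample above, this sketch reports "month" even though the days are 15, 14 and 15. It deliberately lumps sub-daily and irregular data into "day", so a real fix would likely need finer handling.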

Would it be undesirable for the frequency of the verification dataset to be user specified in the same way as the units of the initialized dataset lead time need to be specified?

Output of climpred.show_versions()

```
INSTALLED VERSIONS
------------------
commit: None
python: 3.11.9 | packaged by conda-forge | (main, Apr 19 2024, 18:36:13) [GCC 12.3.0]
python-bits: 64
OS: Linux
OS-release: 4.18.0-553.5.1.el8_10.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
climpred: 2.4.0
xarray: 2023.12.0
pandas: 2.2.2
numpy: 1.26.4
scipy: 1.13.1
cftime: 1.6.3
netcdf4: None
nc_time_axis: 1.4.1
matplotlib: 3.8.2
cf_xarray: 0.9.2
xclim: 0.50.0
dask: 2024.5.0
distributed: 2024.5.0
setuptools: 69.5.1
pip: 24.0
conda: None
IPython: 8.25.0
sphinx: None
```

aaronspring commented 4 weeks ago

I usually go for "changing the time index of the verification dataset to match what's expected", i.e. fixing the time axis before using climpred. Mostly I move every timestamp to the beginning of the month so that the day is always 1.
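The workaround described here could look roughly like the sketch below. `to_month_start` is a hypothetical helper name, and the code assumes each timestamp has a `replace` method, which `datetime.datetime` provides and cftime objects are assumed to provide as well. Standard-library `datetime` is used as a stand-in so the example runs without extra dependencies.

```python
from datetime import datetime


def to_month_start(times):
    """Return copies of the timestamps with the day of the month set to 1.

    Hypothetical helper for illustration; it only relies on ``replace``.
    """
    return [t.replace(day=1) for t in times]


# Mid-month stamps whose day differs from month to month.
times = [datetime(1, 1, 15), datetime(1, 2, 14), datetime(1, 3, 15)]
shifted = to_month_start(times)
print([t.day for t in shifted])  # prints [1, 1, 1]
```

With an xarray dataset, the shifted values could then be written back with something like `ds.assign_coords(time=shifted)` before the data is passed to climpred.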

Not sure how difficult a change would be to implement but feel free.