spencerahill / aospy

Python package for automated analysis and management of gridded climate data

'time.units' being dropped if it differs across files being loaded #212

Open spencerahill opened 6 years ago

spencerahill commented 6 years ago

The data for a CMIP5 simulation I am trying to use is split into two netCDF files along the time dimension, as is common. But, annoyingly, each file encodes a different units attribute on its time coordinate. As a result, I think the attribute is being dropped when open_mfdataset is called, based on this partial traceback:

  File "/home/Spencer.Hill/py/aospy/aospy/data_loader.py", line 246, in load_variable
    ds, min_year, max_year = _prep_time_data(ds)
  File "/home/Spencer.Hill/py/aospy/aospy/data_loader.py", line 160, in _prep_time_data
    ds, min_year, max_year = times.numpy_datetime_workaround_encode_cf(ds)
  File "/home/Spencer.Hill/py/aospy/aospy/utils/times.py", line 272, in numpy_datetime_workaround_encode_cf
    units = time.attrs['units']
KeyError: 'units'

The place where the attribute is being dropped is the xr.open_mfdataset call in _load_data_from_disk. So it would appear we need to add some logic that checks whether the time units attrs are the same across all netCDF files to be loaded; if not, we'll need to somehow force them to have the same units. Or something else? @spencerkclark, what do you think?
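For concreteness, here's a minimal sketch of the kind of check I mean (a hypothetical helper, not existing aospy code; paths stands for the list of files we hand to xr.open_mfdataset):

import xarray as xr

def _assert_matching_time_units(paths):
    """Raise if the 'units' attr of 'time' differs across the files."""
    all_units = set()
    for path in paths:
        # decode_times=False keeps the raw 'units' attribute intact.
        with xr.open_dataset(path, decode_times=False) as ds:
            all_units.add(ds['time'].attrs.get('units'))
    if len(all_units) > 1:
        raise ValueError(
            'Conflicting time units across files: {}'.format(all_units))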

Here's an ncdump -h for each:

$ ncdump -h pr_Amon_CCSM4_amip4K_r1i1p1_197901-200512.nc
netcdf pr_Amon_CCSM4_amip4K_r1i1p1_197901-200512 {
dimensions:
    time = UNLIMITED ; // (324 currently)
    lat = 192 ;
    lon = 288 ;
    bnds = 2 ;
variables:
    double time(time) ;
        time:bounds = "time_bnds" ;
        time:units = "days since 1979-01-01 00:00:00" ;
        time:calendar = "noleap" ;
        time:axis = "T" ;
        time:long_name = "time" ;
        time:standard_name = "time" ;
...

and

$ ncdump -h pr_Amon_CCSM4_amip4K_r1i1p1_200601-200812.nc
netcdf pr_Amon_CCSM4_amip4K_r1i1p1_200601-200812 {
dimensions:
    time = UNLIMITED ; // (36 currently)
    lat = 192 ;
    lon = 288 ;
    bnds = 2 ;
variables:
    double time(time) ;
        time:bounds = "time_bnds" ;
        time:units = "days since 2004-01-01 00:00:00" ;
        time:calendar = "noleap" ;
        time:axis = "T" ;
        time:long_name = "time" ;
        time:standard_name = "time" ;
...

For those w/ GFDL access, the files in question are

/archive/pcmdi/repo/CMIP5/output/NCAR/CCSM4/amip4K/mon/atmos/Amon/r1i1p1/v20120711/pr/pr_Amon_CCSM4_amip4K_r1i1p1_197901-200512.nc
/archive/pcmdi/repo/CMIP5/output/NCAR/CCSM4/amip4K/mon/atmos/Amon/r1i1p1/v20120711/pr/pr_Amon_CCSM4_amip4K_r1i1p1_200601-200812.nc
spencerkclark commented 6 years ago

@spencerahill might you be able to address this with a preprocess_func argument to your DataLoader constructor (following the example in the DictDataLoader docstring)? A sketch of what I mean is below.
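Something along these lines (a sketch only; the file_map contents are made up, and fix_time_units is a placeholder for whatever rewriting of the units attribute ends up being needed):

from aospy.data_loader import DictDataLoader

def fix_time_units(ds, **kwargs):
    # Rewrite ds['time'].attrs['units'] here so all files agree.
    return ds

file_map = {'monthly': '/path/to/files/pr_Amon_CCSM4_amip4K_r1i1p1_*.nc'}
data_loader = DictDataLoader(file_map, preprocess_func=fix_time_units)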

spencerahill commented 6 years ago

@spencerkclark, thanks; yes, I had sort of forgotten about preprocess_func. I will use that as an interim solution.

But once I get that working, I'd like to adapt the logic as necessary into a permanent internal fix. If this particular CMIP5 run is encoded like this, it is likely that others are too (although none of the other 19 runs I'm working with are).

spencerahill commented 6 years ago

@spencerkclark here's what I came up with as the preprocess_func, which borrows heavily from utils.times.numpy_datetime_workaround_encode_cf, but applied to each netCDF file as opposed to the concatenated Dataset spanning all files:

# TIME_STR ('time') and TIME_BOUNDS_STR come from aospy.internal_names.
from aospy.internal_names import TIME_BOUNDS_STR, TIME_STR

def _ensure_matching_time_units(ds, **kwargs):
    """Force all datasets to have a matching time units attribute."""
    target_units_yr = 1979  # arbitrary
    time = ds[TIME_STR]
    units = time.attrs['units']
    units_yr = int(units.split(' since ')[1].split('-')[0])
    # The logic below would likely fail for many edge cases, so make sure
    # the data adheres precisely to my expectations before attempting this
    # workaround.
    do_shift_year = (units.startswith('days since') and
                     units.endswith('-01-01 00:00:00') and
                     time.attrs['calendar'] == 'noleap' and
                     units_yr != target_units_yr)
    if do_shift_year:
        # In a noleap calendar every year is exactly 365 days, so shifting
        # the reference year amounts to a constant offset.
        days_diff = 365.*(units_yr - target_units_yr)
        for var_str in [TIME_STR, TIME_BOUNDS_STR, 'time_bnds']:
            if var_str in ds:
                var = ds[var_str]
                var += days_diff
                var.attrs['units'] = units.replace(str(units_yr),
                                                   str(target_units_yr))
    return ds

There are some immediate fixes that could account for some of the edge cases. But before digging any deeper I'm wondering if you have any comments. This has addressed my immediate need already, so no rush; I know you have a lot on your plate right now.

spencerkclark commented 6 years ago

@spencerahill sorry for taking a while to get back to you on this! What you have above is a good start and captures the essence of what needs to be done. In a perfect world something like the following would work in general:

import netCDF4 as nc4
import xarray as xr

from aospy.internal_names import TIME_BOUNDS_STR, TIME_STR

def _ensure_matching_time_units(ds, **kwargs):
    target_units = 'days since 1979-01-01'
    for name in [TIME_STR, TIME_BOUNDS_STR]:
        if name in ds:
            units = ds[name].attrs['units']
            calendar = ds[name].attrs['calendar']
            # The scalar units and calendar args are broadcast by apply_ufunc.
            var = xr.apply_ufunc(nc4.num2date, ds[name], units, calendar)
            var = xr.apply_ufunc(nc4.date2num, var, target_units, calendar)
            var.attrs['units'] = target_units
            var.attrs['calendar'] = calendar
            ds[name] = var
    return ds

But there are complications associated with standard (!) calendars. The issue is that in DataArrays, xarray automatically converts datetime.datetime objects to np.datetime64 objects. These are incompatible with nc4.date2num (it requires datetime.datetime objects for standard calendars). A workaround is to do the num2date and date2num parts all in one step (so the arrays remain in numpy-world before being converted back to DataArrays):

def _rebase_times(values, input_units, calendar, output_units):
    dates = nc4.num2date(values, input_units, calendar)
    return nc4.date2num(dates, output_units, calendar)

def _ensure_matching_time_units(ds, **kwargs):
    target_units = 'days since 1979-01-01'
    for name in [TIME_STR, TIME_BOUNDS_STR]:
        if name in ds:
            units = ds[name].attrs['units']
            calendar = ds[name].attrs['calendar']
            var = xr.apply_ufunc(_rebase_times, ds[name], units, calendar, target_units)
            var.attrs['units'] = target_units
            var.attrs['calendar'] = calendar
            ds[name] = var
    return ds

Lastly, I've found that the units on TIME_BOUNDS_STR are not always in the format required for decoding; therefore I have found it convenient to always use the units and calendar attributes of the TIME_STR var for all decoding. This leaves us with the following solution:

def _rebase_times(values, input_units, calendar, output_units):
    dates = nc4.num2date(values, input_units, calendar)
    return nc4.date2num(dates, output_units, calendar)

def _ensure_matching_time_units(ds, **kwargs):
    target_units = 'days since 1979-01-01'
    units = ds[TIME_STR].attrs['units']
    calendar = ds[TIME_STR].attrs['calendar']
    for name in [TIME_STR, TIME_BOUNDS_STR]:
        if name in ds:
            var = xr.apply_ufunc(_rebase_times, ds[name], units, calendar, target_units)
            var.attrs['units'] = target_units
            var.attrs['calendar'] = calendar
            ds[name] = var
    return ds

So still not as clean as we might like, but at least it delegates the parsing of the date units and time rebasing to upstream libraries, and is calendar-type agnostic.
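As a quick hand-worked check of _rebase_times using the units from the two files in question (assuming the definitions in the snippet above):

import netCDF4 as nc4

# 0 and 365 days since 2004-01-01, rebased onto the 1979 epoch.
vals = _rebase_times([0.0, 365.0], 'days since 2004-01-01', 'noleap',
                     'days since 1979-01-01')
# noleap calendar: 25 years * 365 days = 9125, so vals == [9125., 9490.]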

Really the ideal way to address this is to decode the times before concatenating the files (this is indeed what xarray does in open_mfdataset if decode_times is set to True). If decode_times is set to False and conflicting units attributes are found, they are dropped, as sketched below. All this is to say, if we didn't have to worry about the year < 1678 workaround, this wouldn't be an issue.
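To make that concrete (a sketch, with paths standing for the two files above):

import xarray as xr

# Times are decoded per-file before concatenation, so differing
# 'units' attributes are harmless:
ds_decoded = xr.open_mfdataset(paths, decode_times=True)

# With decoding disabled, the conflicting 'units' attribute is dropped:
ds_raw = xr.open_mfdataset(paths, decode_times=False)
print('units' in ds_raw['time'].attrs)  # False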

spencerahill commented 6 years ago

@spencerkclark thanks for this excellent explanation and code snippets. I agree using the upstream, calendar-agnostic methods is the way to go. Will proceed (whenever it is I come back to this!) starting from your last _rebase_times and _ensure_matching_time_units.

> All this is to say, if we didn't have to worry about the year < 1678 workaround, this wouldn't be an issue.

This statement is true for so many parts of our codebase! If only somebody came up with an upstream fix... 😜

spencerahill commented 6 years ago

@spencerkclark I was thinking about returning to this today, but now that the xarray cftime/netcdftime fix appears to be coming in the not-so-distant future, I thought twice. What do you think?

More generally, would it be useful for us to start tracking the (many) things we want to do once that's in? Could be helpful in organizing our efforts but maybe not worth the time?

spencerkclark commented 6 years ago

@spencerahill assuming that it's not too tricky to add a version of num2date that returns only netcdftime.datetime objects, I agree it shouldn't be too far off. Thanks for the review there!

Yes, it would be helpful to create an issue to organize our efforts. I'll try and get to it over the next few days.

spencerahill commented 6 years ago

OK sounds good.

> Yes, it would be helpful to create an issue to organize our efforts. I'll try and get to it over the next few days.

Thanks, I appreciate that. I figure you're better equipped to do it than I am.

No rush -- after my blitz of activity today, I'll likely go dormant on this for another couple weeks or so. I'm finding lately that the best way to make time for this without overdoing it is to block out a day or two once every few weeks. (I still check notifications nightly, though, for bug reports etc.)