spencerahill / aospy

Python package for automated analysis and management of gridded climate data
Apache License 2.0

BOUNDS_STR and TIME_BOUNDS_STR #299

Closed jdossgollin closed 5 years ago

jdossgollin commented 5 years ago

Hi all, I'm working on the PRs (the xarray contrib guidelines are helpful -- am getting up to speed on that). In the meantime I have an issue working with data from NCEP-NCAR Reanalysis, i.e. the data at http://www.esrl.noaa.gov/psd/thredds/dodsC/Datasets/ncep.reanalysis.dailyavgs/pressure/air.2015.nc.

The data has a time_bnds variable but nothing that looks like an nv (the name given to the bounds dimension in the example in the online documentation):

<xarray.Dataset>
Dimensions:    (lat: 73, level: 17, lon: 144, nbnds: 2, time: 731)
Coordinates:
  * lon        (lon) float32 0.0 2.5 5.0 7.5 10.0 ... 350.0 352.5 355.0 357.5
  * level      (level) float32 1000.0 925.0 850.0 700.0 ... 50.0 30.0 20.0 10.0
  * lat        (lat) float32 90.0 87.5 85.0 82.5 ... -82.5 -85.0 -87.5 -90.0
  * time       (time) float64 1.885e+06 1.885e+06 ... 1.902e+06 1.902e+06
Dimensions without coordinates: nbnds
Data variables:
    time_bnds  (time, nbnds) float64 dask.array<shape=(731, 2), chunksize=(365, 2)>
    air        (time, level, lat, lon) float32 dask.array<shape=(731, 17, 73, 144), chunksize=(365, 17, 73, 144)>
Attributes:
    References:                      http://www.esrl.noaa.gov/psd/data/gridde...
    Conventions:                     COARDS
    title:                           mean daily NMC reanalysis (2014)
    history:                         created 2013/12 by Hoop (netCDF2.3)
    description:                     Data is from NMC initialized reanalysis\...
    platform:                        Model
    dataset_title:                   NCEP-NCAR Reanalysis 1
    DODS_EXTRA.Unlimited_Dimension:  time

I'm trying to perform some calculations with this data but am running into some problems, which seem to be coming from https://github.com/spencerahill/aospy/blob/develop/aospy/utils/times.py#L339-342. In particular, a quick read of the source code suggests that the issue is that ds[BOUNDS_STR] doesn't exist. From a search of the source code repo, there doesn't seem to be a place where ds[BOUNDS_STR] is created.
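For what it's worth, the distinction is easy to check directly in xarray: the bounds dimension exists in ds.dims but carries no coordinate, which would explain why the ds[BOUNDS_STR] lookup fails. A toy sketch of the NCEP layout (not the aospy code path, and the names/values here are illustrative):

```python
import numpy as np
import xarray as xr

# Toy dataset mimicking the NCEP layout: time_bnds has a second
# dimension ("nbnds") that carries no coordinate values.
ds = xr.Dataset(
    {"time_bnds": (("time", "nbnds"), np.zeros((3, 2)))},
    coords={"time": [0, 1, 2]},
)

print("nbnds" in ds.dims)    # True: the dimension exists
print("nbnds" in ds.coords)  # False: no coordinate, so ds["nbnds"] raises KeyError
```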

If I'm right, that should be a fixable problem -- if not, I'll provide an MWE (minimal working example).

Thanks very much for any thoughts!

/usr/local/miniconda3/envs/gcm-lfv/lib/python3.6/site-packages/aospy/model.py:235: FutureWarning: xarray.DataArray.__contains__ currently checks membership in DataArray.coords, but in xarray v0.11 will change to check membership in array values.
  (TIME_STR in renamed_attr)):
INFO:root:Getting input data: Var instance "temp" (Thu Nov  8 11:26:06 2018)
WARNING:root:Skipping aospy calculation `<aospy.Calc instance: temp, cmip5_dynamical, reanalysis_model, reanalysis>` due to error with the following traceback: 
Traceback (most recent call last):
  File "/usr/local/miniconda3/envs/gcm-lfv/lib/python3.6/site-packages/aospy/automate.py", line 251, in _compute_or_skip_on_error
    return calc.compute(**compute_kwargs)
  File "/usr/local/miniconda3/envs/gcm-lfv/lib/python3.6/site-packages/aospy/calc.py", line 584, in compute
    self.end_date),
  File "/usr/local/miniconda3/envs/gcm-lfv/lib/python3.6/site-packages/aospy/calc.py", line 429, in _get_all_data
    for n, var in enumerate(self.variables)]
  File "/usr/local/miniconda3/envs/gcm-lfv/lib/python3.6/site-packages/aospy/calc.py", line 429, in <listcomp>
    for n, var in enumerate(self.variables)]
  File "/usr/local/miniconda3/envs/gcm-lfv/lib/python3.6/site-packages/aospy/calc.py", line 381, in _get_input_data
    **self.data_loader_attrs)
  File "/usr/local/miniconda3/envs/gcm-lfv/lib/python3.6/site-packages/aospy/data_loader.py", line 241, in load_variable
    ds, min_year, max_year = _prep_time_data(ds)
  File "/usr/local/miniconda3/envs/gcm-lfv/lib/python3.6/site-packages/aospy/data_loader.py", line 161, in _prep_time_data
    ds = times.ensure_time_avg_has_cf_metadata(ds)
  File "/usr/local/miniconda3/envs/gcm-lfv/lib/python3.6/site-packages/aospy/utils/times.py", line 413, in ensure_time_avg_has_cf_metadata
    ds[TIME_WEIGHTS_STR] = time_weights.drop(BOUNDS_STR)
  File "/usr/local/miniconda3/envs/gcm-lfv/lib/python3.6/site-packages/xarray/core/dataarray.py", line 1403, in drop
    ds = self._to_temp_dataset().drop(labels, dim)
  File "/usr/local/miniconda3/envs/gcm-lfv/lib/python3.6/site-packages/xarray/core/dataset.py", line 2602, in drop
    return self._drop_vars(labels)
  File "/usr/local/miniconda3/envs/gcm-lfv/lib/python3.6/site-packages/xarray/core/dataset.py", line 2613, in _drop_vars
    self._assert_all_in_dataset(names)
  File "/usr/local/miniconda3/envs/gcm-lfv/lib/python3.6/site-packages/xarray/core/dataset.py", line 2581, in _assert_all_in_dataset
    raise ValueError('One or more of the specified variables '
ValueError: One or more of the specified variables cannot be found in this dataset
spencerkclark commented 5 years ago

Glad to hear the contributing guide has been helpful!

Thanks for all the info -- if I understand correctly, I think this is a case for another addition to #296. An addition of 'nbnds' here should hopefully do the trick.

jdossgollin commented 5 years ago

That looks like it should work -- will try it locally and assess.

More conceptually, I think that, as pointed out in #182, we're unlikely to include every variable name here, so some code change that (a) allowed the user to input that mapping, and/or (b) added more sophisticated default grid mappings -- defining a "WRF" standard, a "GFDL" standard, etc., and then letting the user choose from one of these defaults (or make a smart guess?) -- could be useful.

jdossgollin commented 5 years ago

In case anyone else is working with this data set, the following pre-processing function (for use with the data loader) does the trick:

from aospy.internal_names import BOUNDS_STR


def preprocess_reanalysis(data_set, **kwargs):
    """Fix time units and promote the bounds dimension to a coordinate."""
    if kwargs["intvl_in"] == "daily":
        data_set["time"].attrs["units"] = "hours since 1800-01-01 00:00:0.0"
    data_set = data_set.assign_coords(nbnds=data_set["nbnds"]).rename({"nbnds": BOUNDS_STR})
    return data_set

Although I will update nbnds in #296, the assign_coords is still necessary here.
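On a toy dataset with the same layout, the promote-then-rename step can be sketched like this (BOUNDS_STR is hard-coded here rather than imported from aospy, and the integer coordinate values are an assumption):

```python
import numpy as np
import xarray as xr

BOUNDS_STR = "bounds"  # stands in for aospy.internal_names.BOUNDS_STR

ds = xr.Dataset(
    {"time_bnds": (("time", "nbnds"), np.zeros((3, 2)))},
    coords={"time": [0, 1, 2]},
)

# Promote the bare dimension to a coordinate, then rename it to the
# name aospy expects internally.
ds = ds.assign_coords(nbnds=np.arange(ds.sizes["nbnds"]))
ds = ds.rename({"nbnds": BOUNDS_STR})

print(BOUNDS_STR in ds.coords)  # True
print("nbnds" in ds.dims)       # False
```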

spencerkclark commented 5 years ago

Perfect, thanks @jdossgollin. Just curious -- what does the traceback look like if you leave off assign_coords? We might consider that a bug.

spencerkclark commented 5 years ago

(a) allowed the user to input that mapping

This is what is implemented in #297 :)

jdossgollin commented 5 years ago

Perfect, thanks @jdossgollin. Just curious -- what does the traceback look like if you leave off assign_coords? We might consider that a bug.

I get the same ValueError as before (shown above) -- possibly because the .diff(BOUNDS_STR) only works if BOUNDS_STR is a coordinate of the data?

spencerkclark commented 5 years ago

I see, so in the ValueError the problem seems to be in this line. Indeed I think this is a bug. My gut tells me that this has to do with the fact that BOUNDS_STR is a dimension without coordinates in your dataset (which should be totally fine); therefore once squeeze is called in the preceding line, all evidence that BOUNDS_STR was once in there is gone (so we do not need to drop it).

Does wrapping the drop in an if statement fix things?

if BOUNDS_STR in time_weights.variables:
    time_weights = time_weights.drop(BOUNDS_STR)
ds[TIME_WEIGHTS_STR] = time_weights
jdossgollin commented 5 years ago

That yields AttributeError: 'DataArray' object has no attribute 'variables'. Perhaps you meant dims?

EDIT: making that change seems to make it work

if BOUNDS_STR in time_weights.dims:
    time_weights = time_weights.drop(BOUNDS_STR)
ds[TIME_WEIGHTS_STR] = time_weights
spencerkclark commented 5 years ago

Oops sorry, I think I meant time_weights.coords. squeeze should take BOUNDS_STR out of time_weights.dims; if it had associated coordinates then it would hang around as a scalar coordinate (so we would need to drop it).
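A quick illustration of that squeeze behavior (a minimal sketch, using a generic "bounds" name):

```python
import numpy as np
import xarray as xr

# Length-1 dim WITHOUT a coordinate: squeeze removes every trace of it.
no_coord = xr.DataArray(np.zeros((3, 1)), dims=("time", "bounds"))
print("bounds" in no_coord.squeeze().coords)  # False: nothing left to drop

# Length-1 dim WITH a coordinate: squeeze leaves a scalar coordinate
# behind, which is the case the drop is meant to clean up.
with_coord = xr.DataArray(
    np.zeros((3, 1)), dims=("time", "bounds"), coords={"bounds": [0]}
)
print("bounds" in with_coord.squeeze().coords)  # True
```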

jdossgollin commented 5 years ago

Hmm -- is there a better way to do this than what I put above? Sorry if I'm missing something obvious

spencerkclark commented 5 years ago

I just meant this:

if BOUNDS_STR in time_weights.coords:
    time_weights = time_weights.drop(BOUNDS_STR)
ds[TIME_WEIGHTS_STR] = time_weights

If I recall correctly the reason we have this logic is to remove BOUNDS_STR if it is a scalar coordinate. I think squeeze prevents it from being a dimension at this stage.
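Putting it together, the guarded drop handles the scalar-coordinate case cleanly; a minimal sketch (with a hard-coded BOUNDS_STR, and using drop_vars, the modern spelling of DataArray.drop):

```python
import numpy as np
import xarray as xr

BOUNDS_STR = "bounds"  # stands in for aospy.internal_names.BOUNDS_STR

# The scalar-coordinate case: squeeze leaves BOUNDS_STR in .coords.
time_weights = xr.DataArray(
    np.ones((3, 1)), dims=("time", BOUNDS_STR), coords={BOUNDS_STR: [0]}
).squeeze()

if BOUNDS_STR in time_weights.coords:
    time_weights = time_weights.drop_vars(BOUNDS_STR)

print(BOUNDS_STR in time_weights.coords)  # False: safely removed
```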

jdossgollin commented 5 years ago

Got it. I'll include this in upcoming PR.

spencerahill commented 5 years ago

(b) more sophisticated default grid mappings, which defines a "WRF" standard, a "GFDL" standard, etc and then allows the user to choose from one of these defaults (or makes a smart guess?) could be useful.

@jdossgollin definitely! This is what we want to implement down the road. I believe that has come up somewhere in our Issues in the past...regardless I think #297 is a useful intermediate step.

jdossgollin commented 5 years ago

Agreed -- this definitely makes it easy enough to implement everything I am trying to do now (or plan to do?)