intake / intake-esm

An intake plugin for parsing an Earth System Model (ESM) catalog and loading assets into xarray datasets.
https://intake-esm.readthedocs.io
Apache License 2.0

Problematic behavior for combining dataset attributes. #264

Open jbusecke opened 4 years ago

jbusecke commented 4 years ago

I have been trying to detrend CMIP6 data using the metadata like branch_time_in_parent from models with multiple members and gotten some really confusing results. I was not able to align e.g. the historical members with the control run using that data.

I dug a little deeper and it turns out some models do branch off members at different times. For the examples I looked at, each member had the correct date encoded in its dataset attributes, but upon combining them with intake-esm, errors can occur.

I wrote a 'shortish' example to demonstrate what I think is quite undesirable behavior:

First create some dummy datasets with a single attribute that is different

# create 3 dummy datasets
import xarray as xr
import numpy as np

for starttime in range(3):
    ds = xr.DataArray(np.random.rand(3)).to_dataset(name='data')
    ds.attrs['start']=starttime
    ds.to_zarr(f'test_{starttime}')

Create a matching catalog file

# create a corresponding catalog file
import pandas as pd
pd.DataFrame(
    {
        'zstore':['test_0', 'test_1', 'test_2'],
        "member_id":range(3),
        'activity_id':['test', 'test', 'test'],
        'institution_id':['test', 'test', 'test'],
        'source_id':['test', 'test', 'test'],
        'experiment_id':['test', 'test', 'test'],
        'grid_label':['test', 'test', 'test'],
        'table_id':['test', 'test', 'test'],
        'variable_id':['data', 'data', 'data'],
    }).to_csv('test.csv.gz')

As a collection file I am using a slightly modified version of the pangeo CMIP6 file (only the catalog path is changed):

{
  "esmcat_version": "0.1.0",
  "id": "glade-cmip6",
  "description": "This is an ESM collection for CMIP6 data accessible on the Princeton University disk storage system in /scratch/gpfs/GEOCLIM/synda/data/CMIP6",
  "catalog_file": "./test.csv.gz",
  "attributes": [
    { "column_name": "activity_id", "vocabulary": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_activity_id.json" },
    { "column_name": "source_id", "vocabulary": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_source_id.json" },
    { "column_name": "institution_id", "vocabulary": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_institution_id.json" },
    { "column_name": "experiment_id", "vocabulary": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_experiment_id.json" },
    { "column_name": "member_id", "vocabulary": "" },
    { "column_name": "table_id", "vocabulary": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_table_id.json" },
    { "column_name": "variable_id", "vocabulary": "" },
    { "column_name": "grid_label", "vocabulary": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_grid_label.json" },
    { "column_name": "version", "vocabulary": "" },
    { "column_name": "dcpp_start_year", "vocabulary": "" }
  ],
  "assets": { "column_name": "zstore", "format": "zarr" },
  "aggregation_control": {
    "variable_column_name": "variable_id",
    "groupby_attrs": [ "activity_id", "institution_id", "source_id", "experiment_id", "table_id", "grid_label" ],
    "aggregations": [
      { "type": "union", "attribute_name": "variable_id" },
      { "type": "join_new", "attribute_name": "member_id", "options": { "coords": "minimal", "compat": "override" } }
    ]
  }
}

Then I open with intake-esm and convert to a dataset dictionary

import intake
col = intake.open_esm_datastore('test.json')
dd = col.to_dataset_dict()
dd

which gives:

{'test.test.test.test.test.test': <xarray.Dataset>
 Dimensions:    (dim_0: 3, member_id: 3)
 Coordinates:
   * member_id  (member_id) int64 0 1 2
 Dimensions without coordinates: dim_0
 Data variables:
     data       (member_id, dim_0) float64 dask.array<chunksize=(1, 3), meta=np.ndarray>
 Attributes:
     start:                   2
     intake_esm_varname:      data
     intake_esm_dataset_key:  test.test.test.test.test.test}

It seems like the attribute start was just overwritten with the value of the last-read dataset. I wonder if there is a way to parse this out. Perhaps an option to promote each attribute that differs to a coordinate (with dimension member_id), so that this information is preserved in as much detail as possible?
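A generic version of that idea could look something like the sketch below. Note that promote_differing_attrs is a hypothetical helper, not part of intake-esm: it compares the global attributes across all member datasets and promotes any attribute whose value differs to a length-1 coordinate along member_id, so that a later concatenation preserves every member's value.

```python
import numpy as np
import xarray as xr


def promote_differing_attrs(datasets, dim="member_id"):
    """Promote global attributes that differ across `datasets` to
    coordinates along `dim`, so no information is lost on concat."""
    # attribute names present in any of the datasets
    keys = set().union(*(ds.attrs for ds in datasets))
    # attributes whose value is not identical across all datasets
    differing = [k for k in keys if len({ds.attrs.get(k) for ds in datasets}) > 1]
    out = []
    for ds in datasets:
        for k in differing:
            # promote the attribute to a length-1 coordinate along `dim`
            ds = ds.assign_coords({k: xr.DataArray(ds.attrs.get(k)).expand_dims(dim)})
            ds.attrs.pop(k, None)
        out.append(ds)
    return out


# three toy members that differ only in the 'start' attribute
members = []
for start in range(3):
    ds = xr.DataArray(np.random.rand(3)).to_dataset(name="data")
    ds.attrs["start"] = start
    members.append(ds)

promoted = promote_differing_attrs(members)
combined = xr.concat(promoted, dim="member_id")
print(combined["start"].values)  # each member keeps its own start value
```

With plain xr.concat this yields a start coordinate of [0 1 2] along member_id instead of a single overwritten attribute; intake-esm's join_new aggregation would presumably pick the coordinate up the same way.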

Output of intake_esm.__version__

``` '2020.6.11' ```
dcherian commented 4 years ago

My first thought was that I would handle it with a preprocess function that looked for specific attributes and added an extra start dimension based on the attribute.

But then, does it make sense to concatenate files with different start times along the same "time" axis like it's doing now?

jbusecke commented 4 years ago

My first thought was that I would handle it with a preprocess function that looked for specific attributes and added an extra start dimension based on the attribute.

Yeah, that's what I thought, but as a user, I did not know that there were even different values to begin with! On the intake-esm level, it would be easy to compare and determine if attrs are the same, which would have indicated to me that I was doing something wrong. I will solve this with a preprocess for now, but for that I need to go through the tedious step of checking all the raw files. I really think that it would be most logical and elegant to find a solution within intake-esm.

But then, does it make sense to concatenate files with different start times along the same "time" axis like it's doing now?

Umm, yeah, the naming was a bit stupid here I think. Let me explain my case in more detail. I am dealing with a bunch of historical members that are branched off a single piControl run. Not knowing anything about the branching details, I looked at the metadata from the combined dataset and it seemed pretty easy: I can use the encoded time when the run was branched off to align both runs, even if the control run has some abstract years (0-600 for instance) and the historical ones have 'real' years.

This worked out for some models, but for others I got results like this: [attached image]

This looked suspiciously as if only one of the members was aligned properly with the control run, and the others were branched off later. Looking at the metadata of each member confirmed that suspicion (I also checked with some publications for single models, and that is indeed what they do). So the attribute basically encodes 'time in the control run when this was branched off' (e.g. year 100/200/500 of the control run), while the actual time dimension should track the years 1850-2014, which represent the identical forcing for each historical member.

I am sure there are other use cases, based on these or other attributes, that need the full information from each member, so a general solution might be very helpful for other users too. Obviously this would clutter the coordinates of the resulting dataset, but perhaps we could enable this behavior via an option?
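To make the alignment use case above concrete, here is a toy sketch. All numbers are illustrative, and branch_time_in_parent is simplified to whole years of the control run's own calendar (real CMIP6 data stores it in days relative to parent_time_units and needs cftime handling):

```python
import numpy as np
import xarray as xr

# a piControl run with abstract years 0-599
control = xr.Dataset(
    {"tas": ("year", np.linspace(14.0, 15.0, 600))},
    coords={"year": np.arange(600)},
)

hist_start_year = 1850        # 'real' calendar of the historical member
branch_year_in_parent = 200   # simplified attrs['branch_time_in_parent']

# shift the control-run axis so the branch point lines up with 1850
offset = hist_start_year - branch_year_in_parent
control_aligned = control.assign_coords(year=control["year"] + offset)

# the control segment overlapping the historical period 1850-2014
overlap = control_aligned.sel(year=slice(1850, 2014))
print(int(overlap["year"][0]), int(overlap["year"][-1]))  # 1850 2014
```

If a second member carried branch_year_in_parent = 500 instead, the same code would produce a different offset, which is exactly why the per-member value must not be overwritten when the members are combined.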

jbusecke commented 4 years ago

I was able to work around it by introducing a preprocessing function

import xarray as xr


def parse_attrs_to_coords(
    ds,
    dim="member_id",
    attrs=['start'],
):
    """Convert global dataset attributes to coordinates, which will be concatenated along `dim` by intake-esm.

    Parameters
    ----------
    ds : xr.Dataset
        Input dataset
    dim : str, optional
        Scalar dimension to be added, by default 'member_id'
    attrs : list, optional
        List of global attributes to be converted, by default ['start']
    """

    def _clean(att):
        """Clean up the attributes, since they sometimes come as a raw value and other times as a 1-element list"""
        if isinstance(att, list):
            assert len(att) == 1
            out = att[0]
        else:
            out = att
        # Convert the scalar value to a data array with a length-1 `dim` dimension
        return xr.DataArray(out).expand_dims(dim)

    for attr in attrs:
        if attr in ds.attrs:
            ds = ds.assign_coords({attr: _clean(ds.attrs[attr])})
            del ds.attrs[attr]
    return ds

Using this I am able to run the code from above with a minor modification

#ds.attrs['start']=starttime
ds.attrs['start']=starttime+10

and get this as a result

dd_pp = col.to_dataset_dict(preprocess=parse_attrs_to_coords)
dd_pp
{'test.test.test.test.test.test': <xarray.Dataset>
 Dimensions:    (dim_0: 3, member_id: 3)
 Coordinates:
     start      (member_id) int64 10 11 12
   * member_id  (member_id) int64 0 1 2
 Dimensions without coordinates: dim_0
 Data variables:
     data       (member_id, dim_0) float64 dask.array<chunksize=(1, 3), meta=np.ndarray>
 Attributes:
     intake_esm_varname:      data
     intake_esm_dataset_key:  test.test.test.test.test.test}

I do, however, still think it would be good to achieve this within intake-esm, without requiring the manual entry of attrs. Thoughts?

dcherian commented 4 years ago

determine if attrs are the same, which would have indicated to me that I was doing something wrong.

We now have combine_attrs, which can be set to "identical" to raise an error when attributes differ. It's probably not a good default, because a single dataset may have an extra attribute or be missing one... But this should be easy to check against the existing catalog (do all datasets open successfully with combine_attrs="identical"?).

https://xarray.pydata.org/en/latest/generated/xarray.concat.html
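For reference, a minimal illustration of the difference, using plain xr.concat (intake-esm's internal combine step may behave slightly differently):

```python
import numpy as np
import xarray as xr

a = xr.DataArray(np.zeros(3), dims="x").to_dataset(name="data")
b = xr.DataArray(np.ones(3), dims="x").to_dataset(name="data")
a.attrs["start"] = 0
b.attrs["start"] = 1

# default combine_attrs="override": silently keeps the first dataset's attrs
merged = xr.concat([a, b], dim="member_id")
print(merged.attrs["start"])  # 0 -- b's value is silently dropped

# combine_attrs="identical": raises instead of dropping information
raised = False
try:
    xr.concat([a, b], dim="member_id", combine_attrs="identical")
except xr.MergeError:
    raised = True
print(raised)  # True
```

Catching that MergeError and re-emitting it as a warning, as suggested below, would at least surface the discrepancy to the user without breaking existing catalogs.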

jbusecke commented 4 years ago

It's probably not a good default because a single may have an extra attribute or be missing one...

Good point, but I wonder if on the intake-esm level it does make sense as a default. Maybe we could catch the error and produce a warning? I really think the user should not have to dig through the metadata to avoid the error described above, and intake-esm seems ideally positioned to avoid this?