Open jbusecke opened 4 years ago
My first thought was that I would handle it with a preprocess
function that looked for specific attributes and added an extra start
dimension based on the attribute.
But then, does it make sense to concatenate files with different start times along the same "time" axis like it's doing now?
My first thought was that I would handle it with a preprocess function that looked for specific attributes and added an extra start dimension based on the attribute.
Yeah that's what I thought, but as a user, I did not know that there were even different values, to begin with! On the intake-esm level, it would be easy to compare and determine if attrs are the same, which would have indicated to me that I was doing somthing wrong. I will solve this with a preprocess for now, but for that I need to go through the tedious step of checking all the raw files. I really think that it would be most logical and elegant to find a solution within intake-esm
.
But then, does it make sense to concatenate files with different start times along the same "time" axis like it's doing now?
Umm yeah the naming was a bit stupid here I think. Let me explain my case in more detail. I am dealing with a bunch of historical members, that are branched off a single piControl
run. Not knowing anything about the branching details, I looked at the metadata from the combined dataset and it seemed pretty easy. I can use the encoded time when run was branched off
to align both runs even if the control run has some abstract years (0-600 for instance), and the historical ones have 'real' years.
This worked out for some models, but I got things like this here:
This looked suspiciously as if only one of the members was aligned properly with the control run, and others are branched off later. Looking at the metadata of each member confirmed that suspicion (I also checked with some publications for single models and that is indeed what they do). So the attribute basically encodes 'time in the control run when this was branched of' (e.g. year 100/200/500 of the control run), while the actual time dimension should track the years 1850-2014, which represent the identical forcing for each historical member.
I am sure there are some other usecases based on these or other attributes that need the full information of each member, thus a general solution might be very helpful for other users too. Obviously this would clutter the coordinates of the resulting dataset, but perhaps we could enable this behavior via an option?
I was able to work around it by introducing a preprocessing function
def parse_attrs_to_coords(
ds,
dim="member_id",
attrs=['start'],
):
"""Convert global dataset attributes to coordinates, which will be concatenated along `dim` by intake-esm.
Parameters
----------
ds : xr.Dataset
Input dataset
dim : str, optional
Scalar dimension to be added. by default 'member_id'
attrs : list, optional
List of global attributes to be converted, by default ['branch_time_in_parent', 'branch_time_in_child', 'parent_time_units']
"""
def _clean(att):
"""Clean up the attributes, since they sometimes come as a raw value and other times as a 1-element list"""
if isinstance(att, list):
assert len(att) == 1
out = att[0]
else:
out = att
# Convert scalar values to data array with member_id dimension
return xr.DataArray(out).expand_dims(dim)
for attr in attrs:
if attr in ds.attrs.keys():
ds = ds.assign_coords({attr: _clean(ds.attrs[attr])})
del ds.attrs[attr]
return ds
Using this I am able to run the code from above with a minor modification
#ds.attrs['start']=starttime
ds.attrs['start']=starttime+10
and get this as a result
dd_pp = col.to_dataset_dict(preprocess=parse_attrs_to_coords)
dd_pp
{'test.test.test.test.test.test': <xarray.Dataset>
Dimensions: (dim_0: 3, member_id: 3)
Coordinates:
start (member_id) int64 10 11 12
* member_id (member_id) int64 0 1 2
Dimensions without coordinates: dim_0
Data variables:
data (member_id, dim_0) float64 dask.array<chunksize=(1, 3), meta=np.ndarray>
Attributes:
intake_esm_varname: data
intake_esm_dataset_key: test.test.test.test.test.test}
I do however still think it would be good to get achieve this within intake-esm without requiring the manual entry of attrs. Thoughts?
determine if attrs are the same, which would have indicated to me that I was doing somthing wrong.
We now have combine_attrs
that can be set to generate an error ("identical"
). It's probably not a good default because a single may have an extra attribute or be missing one... But this could be easy to check with existing catalog (do all datasets open successfully with combine_attrs="identical"
https://xarray.pydata.org/en/latest/generated/xarray.concat.html
It's probably not a good default because a single may have an extra attribute or be missing one...
Good point, but I wonder if on the intake-esm level it does make sense as a default. Maybe we could catch the error and produce a warning? I really think the user should not have to dig through the metadata to avoid the error described above, and intake-esm seems ideally positioned to avoid this?
I have been trying to detrend CMIP6 data using the metadata like
branch_time_in_parent
from models with multiple members and gotten some really confusing results. I was not able to align e.g. the historical members with the control run using that data.I dug a little deeper and it turns out some models do branch of members at different times. For the examples I looked at, each member had the correct date encoded in their dataset attributes, but upon combining them with intake-esm, errors can occur.
I wrote a 'shortish' example to demonstrate, what I think is a quite undesirable behavior:
First create some dummy datasets with a single attribute that is different
Create a matching catalog file
As a collection file I am using a slightly modified version (only the catalog path) of the pangeo CMIP6 file
Then I open with intake-esm and convert to a dataset dictionary
which gives:
It seems like the attribute
start
was just overwritten with the value of the last read dataset. I wonder if there is a way to parse this out. Perhaps an option to promote each attribute which differs to a coordinate (with dimensionmember_id
), so that this information is preserved in the most detail possible?Output of
intake_esm.__version__