Problems combining members with different runtime

jbusecke commented 4 years ago

I have had some trouble reading the 'IPSL' model with intake-esm locally (on the Princeton server tigressdata) when attempting to read all available o2 data for the ssp585 scenario.

import intake

# First find all models with *any* o2 data on tigress
url = "/tigress/GEOCLIM/LRGROUP/jbusecke/code/intake-esm-datastore/catalogs/tigressdata-cmip6.json"
col = intake.open_esm_datastore(url)
cat = col.search(source_id='IPSL-CM6A-LR', variable_id='o2', experiment_id='ssp585')#, member_id='r1i1p1f1'
ddict = cat.to_dataset_dict(cdf_kwargs={'chunks': {'time':6},'decode_times': False,})

This throws an error:

distributed.worker - WARNING - Compute Failed
Function:  execute_task
args:      ((, (, ['ScenarioMIP', 'IPSL', 'IPSL-CM6A-LR', 'ssp585', 'Omon', 'gn']),
   activity_id institution_id     source_id experiment_id  member_id table_id  \
0  ScenarioMIP           IPSL  IPSL-CM6A-LR        ssp585  r14i1p1f1     Omon
1  ScenarioMIP           IPSL  IPSL-CM6A-LR        ssp585   r1i1p1f1     Omon
2  ScenarioMIP           IPSL  IPSL-CM6A-LR        ssp585   r1i1p1f1     Omon
3  ScenarioMIP           IPSL  IPSL-CM6A-LR        ssp585   r1i1p1f1     Omon
4  ScenarioMIP           IPSL  IPSL-CM6A-LR        ssp585   r1i1p1f1     Omon
5  ScenarioMIP           IPSL  IPSL-CM6A-LR        ssp585   r2i1p1f1     Omon
6  ScenarioMIP           IPSL  IPSL-CM6A-LR        ssp585   r3i1p1f1     Omon
7  ScenarioMIP           IPSL  IPSL-CM6A-LR        ssp585   r4i1p1f1     Omon
8  ScenarioMIP           IPSL  IPSL-CM6A-LR        ssp585   r6i1p1f1     Omon
  variable_id grid_label dcpp_init_year version
kwargs:    {}
Exception: ValueError("cannot reindex or align along dimension 'time' because the index has duplicate values")

-----------------------------------------------------------
ValueError                                Traceback (most recent call last)
 in
      4 col = intake.open_esm_datastore(url)
      5 cat = col.search(source_id='IPSL-CM6A-LR', variable_id='o2', experiment_id='ssp585')#, member_id='r1i1p1f1'
----> 6 ddict = cat.to_dataset_dict(cdf_kwargs={'chunks': {'time':6},'decode_times': False,})# YUP this fails, when choosing all member_ids

/tigress/jbusecke/code/conda/envs/euc_dynamics/lib/python3.8/site-packages/intake_esm/core.py in to_dataset_dict(self, zarr_kwargs, cdf_kwargs, preprocess, aggregate, storage_options, progressbar)
    378         self.progressbar = progressbar
    379
--> 380         return self._open_dataset()
    381
    382     def _open_dataset(self):

/tigress/jbusecke/code/conda/envs/euc_dynamics/lib/python3.8/site-packages/intake_esm/core.py in _open_dataset(self)
    472         )
    473
--> 474         dsets = client.gather(futures)
    475         self._ds = {group_id: ds for (group_id, ds) in dsets}
    476

/tigress/jbusecke/code/conda/envs/euc_dynamics/lib/python3.8/site-packages/distributed/client.py in gather(self, futures, errors, direct, asynchronous)
   1886         else:
   1887             local_worker = None
-> 1888         return self.sync(
   1889             self._gather,
   1890             futures,

/tigress/jbusecke/code/conda/envs/euc_dynamics/lib/python3.8/site-packages/distributed/client.py in sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
    775             return future
    776         else:
--> 777             return sync(
    778                 self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
    779             )

/tigress/jbusecke/code/conda/envs/euc_dynamics/lib/python3.8/site-packages/distributed/utils.py in sync(loop, func, callback_timeout, *args, **kwargs)
    346     if error[0]:
    347         typ, exc, tb = error[0]
--> 348         raise exc.with_traceback(tb)
    349     else:
    350         return result[0]

/tigress/jbusecke/code/conda/envs/euc_dynamics/lib/python3.8/site-packages/distributed/utils.py in f()
    330             if callback_timeout is not None:
    331                 future = asyncio.wait_for(future, callback_timeout)
--> 332             result[0] = yield future
    333         except Exception as exc:
    334             error[0] = sys.exc_info()

/tigress/jbusecke/code/conda/envs/euc_dynamics/lib/python3.8/site-packages/tornado/gen.py in run(self)
    733
    734                     try:
--> 735                         value = future.result()
    736                     except Exception:
    737                         exc_info = sys.exc_info()

/tigress/jbusecke/code/conda/envs/euc_dynamics/lib/python3.8/site-packages/distributed/client.py in _gather(self, futures, errors, direct, local_worker)
   1751                             exc = CancelledError(key)
   1752                         else:
-> 1753                             raise exception.with_traceback(traceback)
   1754                         raise exc
   1755                     if errors == "skip":

/tigress/jbusecke/code/conda/envs/euc_dynamics/lib/python3.8/site-packages/intake_esm/core.py in _load_group_dataset(key, df, col_data, agg_columns, aggregation_dict, path_column_name, variable_column_name, use_format_column, mapper_dict, zarr_kwargs, cdf_kwargs, preprocess)
    549     )
    550
--> 551     ds = _aggregate(
    552         aggregation_dict,
    553         agg_columns,

/tigress/jbusecke/code/conda/envs/euc_dynamics/lib/python3.8/site-packages/intake_esm/merge_util.py in _aggregate(aggregation_dict, agg_columns, n_agg, v, lookup, mapper_dict, zarr_kwargs, cdf_kwargs, preprocess)
    174         return ds
    175
--> 176     return apply_aggregation(v)
    177
    178

/tigress/jbusecke/code/conda/envs/euc_dynamics/lib/python3.8/site-packages/intake_esm/merge_util.py in apply_aggregation(v, agg_column, key, level)
    119                 agg_options = {}
    120
--> 121             dsets = [
    122                 apply_aggregation(value, agg_column, key=key, level=level + 1)
    123                 for key, value in v.items()

/tigress/jbusecke/code/conda/envs/euc_dynamics/lib/python3.8/site-packages/intake_esm/merge_util.py in (.0)
    120
    121             dsets = [
--> 122                 apply_aggregation(value, agg_column, key=key, level=level + 1)
    123                 for key, value in v.items()
    124             ]

/tigress/jbusecke/code/conda/envs/euc_dynamics/lib/python3.8/site-packages/intake_esm/merge_util.py in apply_aggregation(v, agg_column, key, level)
    147                 )
    148                 varname = dsets[0].attrs['intake_esm_varname']
--> 149                 ds = join_new(
    150                     dsets,
    151                     dim_name=agg_column,

/tigress/jbusecke/code/conda/envs/euc_dynamics/lib/python3.8/site-packages/intake_esm/merge_util.py in join_new(dsets, dim_name, coord_value, varname, options)
     23     except Exception as e:
     24         logger.error(f'Failed to join datasets along new dimension.')
---> 25         raise e
     26
     27

/tigress/jbusecke/code/conda/envs/euc_dynamics/lib/python3.8/site-packages/intake_esm/merge_util.py in join_new(dsets, dim_name, coord_value, varname, options)
     20     try:
     21         concat_dim = xr.DataArray(coord_value, dims=(dim_name), name=dim_name)
---> 22         return xr.concat(dsets, dim=concat_dim, data_vars=varname, **options)
     23     except Exception as e:
     24         logger.error(f'Failed to join datasets along new dimension.')

/tigress/jbusecke/code/conda/envs/euc_dynamics/lib/python3.8/site-packages/xarray/core/concat.py in concat(objs, dim, data_vars, coords, compat, positions, fill_value, join)
    133             "objects, got %s" % type(first_obj)
    134         )
--> 135     return f(objs, dim, data_vars, coords, compat, positions, fill_value, join)
    136
    137

/tigress/jbusecke/code/conda/envs/euc_dynamics/lib/python3.8/site-packages/xarray/core/concat.py in _dataset_concat(datasets, dim, data_vars, coords, compat, positions, fill_value, join)
    316     # Make sure we're working on a copy (we'll be loading variables)
    317     datasets = [ds.copy() for ds in datasets]
--> 318     datasets = align(
    319         *datasets, join=join, copy=False, exclude=[dim], fill_value=fill_value
    320     )

/tigress/jbusecke/code/conda/envs/euc_dynamics/lib/python3.8/site-packages/xarray/core/alignment.py in align(join, copy, indexes, exclude, fill_value, *objects)
    335             new_obj = obj.copy(deep=copy)
    336         else:
--> 337             new_obj = obj.reindex(copy=copy, fill_value=fill_value, **valid_indexers)
    338         new_obj.encoding = obj.encoding
    339         result.append(new_obj)

/tigress/jbusecke/code/conda/envs/euc_dynamics/lib/python3.8/site-packages/xarray/core/dataset.py in reindex(self, indexers, method, tolerance, copy, fill_value, **indexers_kwargs)
   2488
   2489         """
-> 2490         return self._reindex(
   2491             indexers,
   2492             method,

/tigress/jbusecke/code/conda/envs/euc_dynamics/lib/python3.8/site-packages/xarray/core/dataset.py in _reindex(self, indexers, method, tolerance, copy, fill_value, sparse, **indexers_kwargs)
   2517             raise ValueError("invalid reindex dimensions: %s" % bad_dims)
   2518
-> 2519         variables, indexes = alignment.reindex_variables(
   2520             self.variables,
   2521             self.sizes,

/tigress/jbusecke/code/conda/envs/euc_dynamics/lib/python3.8/site-packages/xarray/core/alignment.py in reindex_variables(variables, sizes, indexes, indexers, method, tolerance, copy, fill_value, sparse)
    545
    546             if not index.is_unique:
--> 547                 raise ValueError(
    548                     "cannot reindex or align along dimension %r because the "
    549                     "index has duplicate values" % dim

ValueError: cannot reindex or align along dimension 'time' because the index has duplicate values

I think the problem is that one of the members (r1i1p1f1) has a longer runtime than the others.

If you only load that member it works:

cat = col.search(source_id='IPSL-CM6A-LR', variable_id='o2', experiment_id='ssp585',member_id='r1i1p1f1')
ddict = cat.to_dataset_dict(cdf_kwargs={'chunks': {'time':6},'decode_times': True,})
ddict

{'ScenarioMIP.IPSL.IPSL-CM6A-LR.ssp585.Omon.gn': <xarray.Dataset>
 Dimensions:         (axis_nbounds: 2, member_id: 1, nvertex: 4, olevel: 75, time: 3432, x: 362, y: 332)
 Coordinates:
     nav_lat         (y, x) float32 dask.array<chunksize=(332, 362), meta=np.ndarray>
     nav_lon         (y, x) float32 dask.array<chunksize=(332, 362), meta=np.ndarray>
   * time            (time) object 2015-01-16T12:00:00 ... 2300-12-16 12:00:00
   * olevel          (olevel) float32 0.50576 1.5558553 ... 5698.0605 5902.0576
   * member_id       (member_id) <U8 'r1i1p1f1'
 Dimensions without coordinates: axis_nbounds, nvertex, x, y
 Data variables:
     bounds_nav_lon  (time, y, x, nvertex) float32 dask.array<chunksize=(1032, 332, 362, 4), meta=np.ndarray>
     bounds_nav_lat  (time, y, x, nvertex) float32 dask.array<chunksize=(1032, 332, 362, 4), meta=np.ndarray>
     time_bounds     (time, axis_nbounds) object dask.array<chunksize=(6, 2), meta=np.ndarray>
     olevel_bounds   (time, olevel, axis_nbounds) float32 dask.array<chunksize=(1032, 75, 2), meta=np.ndarray>
     area            (time, y, x) float32 dask.array<chunksize=(1032, 332, 362), meta=np.ndarray>
     o2              (member_id, time, olevel, y, x) float32 dask.array<chunksize=(1, 6, 75, 332, 362), meta=np.ndarray>

But you can also see that it has 3000+ timesteps (running till 2300).
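As a rough cross-check of that length, assuming monthly output with exactly one time step per calendar month, the expected number of steps follows from the year range alone (`n_months` is just an illustrative helper name, not part of intake-esm or xarray):

```python
def n_months(first_year: int, last_year: int) -> int:
    """Monthly time steps from January of first_year through December of last_year."""
    return (last_year - first_year + 1) * 12


standard = n_months(2015, 2100)  # the usual ssp585 period
extended = n_months(2015, 2300)  # the extended r1i1p1f1 run
print(standard, extended)  # 1032 3432
```

The extended value agrees with the `time: 3432` dimension in the dataset shown above.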

For comparison, other members only have ~1000 timesteps (standard time until 2100):

cat = col.search(source_id='IPSL-CM6A-LR', variable_id='o2', experiment_id='ssp585', member_id='r2i1p1f1')
ddict = cat.to_dataset_dict(cdf_kwargs={'chunks': {'time':6},'decode_times': True,})
ddict

{'ScenarioMIP.IPSL.IPSL-CM6A-LR.ssp585.Omon.gn': <xarray.Dataset>
 Dimensions:         (axis_nbounds: 2, member_id: 1, nvertex: 4, olevel: 75, time: 1032, x: 362, y: 332)
 Coordinates:
     nav_lat         (y, x) float32 dask.array<chunksize=(332, 362), meta=np.ndarray>
     nav_lon         (y, x) float32 dask.array<chunksize=(332, 362), meta=np.ndarray>
   * time            (time) datetime64[ns] 2015-01-16T12:00:00 ... 2100-12-16T12:00:00
   * olevel          (olevel) float32 0.50576 1.5558553 ... 5698.0605 5902.0576
   * member_id       (member_id) <U8 'r2i1p1f1'
 Dimensions without coordinates: axis_nbounds, nvertex, x, y
 Data variables:
     bounds_nav_lon  (time, y, x, nvertex) float32 dask.array<chunksize=(1032, 332, 362, 4), meta=np.ndarray>
     bounds_nav_lat  (time, y, x, nvertex) float32 dask.array<chunksize=(1032, 332, 362, 4), meta=np.ndarray>
     time_bounds     (time, axis_nbounds) datetime64[ns] dask.array<chunksize=(6, 2), meta=np.ndarray>
     olevel_bounds   (time, olevel, axis_nbounds) float32 dask.array<chunksize=(1032, 75, 2), meta=np.ndarray>
     area            (time, y, x) float32 dask.array<chunksize=(1032, 332, 362), meta=np.ndarray>
     o2              (member_id, time, olevel, y, x) float32 dask.array<chunksize=(1, 6, 75, 332, 362), meta=np.ndarray>

It seems that intake-esm struggles with merging those datasets. In the cloud storage the extended member was removed, so the initial call works (cc @naomi-henderson). But I would be really interested in using these extended data.

Is there a way to make this work? I thought it might be possible to alter the attrs of the data sets with preprocessing, but had no luck in getting this to work. I think in principle these are mislabelled and should be a separate experiment ssp585-extended or similar.

Any suggestions for solving this?

andersy005 commented 4 years ago

> Is there a way to make this work?

Yes, this is doable. This seems to be related to the time alignment issue in https://github.com/NCAR/intake-esm/issues/197. Earlier versions of intake-esm (prior to v2019.10) used to do time alignment for use cases in which one ensemble member had an extended time axis. If nobody beats me to it, I will look into this in the coming weeks (starting next week).

andersy005 commented 4 years ago

> Any suggestions for solving this?

Here's the code that did time alignment in earlier versions of intake-esm:

https://github.com/NCAR/intake-esm/blob/b9298567f806d429d630fecc46d352c9ec782a02/intake_esm/aggregate.py#L242-L262

Cc @dcherian in case he has suggestions on how to address this issue.

dcherian commented 4 years ago

One of those datasets has duplicate time values. Fix that and things should work.

jbusecke commented 4 years ago

@dcherian, that was my first thought too, but I was not able to find duplicate times when loading each file individually, so I assumed the duplicates are somehow created internally. I will check again carefully...

dcherian commented 4 years ago

Did you use ds.indexes["time"].is_unique to check?

jbusecke commented 4 years ago

Oh no. I just compared len(ds.time) and len(np.unique(ds.time.data)) and they are the same length. Let me check with your method...

jbusecke commented 4 years ago

Yes, I can confirm that each dataset has unique time values. So presumably something inside intake-esm causes a duplication?

I also checked if I can read all other members (except the long one), and that works perfectly fine.
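For reference, a minimal sketch of the uniqueness check discussed above, run on made-up time stamps rather than the IPSL files. It only illustrates that every input can be unique on its own while a time-wise concatenation of overlapping inputs is not; it makes no claim about what intake-esm does internally. The names `file_a` and `file_b` are hypothetical.

```python
import pandas as pd

# Hypothetical time axes for two inputs that cover the same months;
# these are not the real IPSL time stamps.
file_a = pd.DatetimeIndex(["2015-01-16", "2015-02-15", "2015-03-16"])
file_b = pd.DatetimeIndex(["2015-01-16", "2015-02-15", "2015-03-16"])

# Each input passes the check on its own.
print(file_a.is_unique, file_b.is_unique)  # True True

# Appending one to the other along time yields repeated labels.
combined = file_a.append(file_b)
print(combined.is_unique)  # False

# List the labels that occur more than once.
dups = combined[combined.duplicated()]
print(dups)
```

For an xarray dataset the same index is available as `ds.indexes["time"]`, so the check has to be applied to the combined object, not only to the individual inputs.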

dcherian commented 4 years ago

I guess it could be a different dimension

dcherian commented 4 years ago

Or not. The error says "time" hmm

sherimickelson commented 4 years ago

I'm just learning intake-esm, so forgive me if this question is misdirected, but I was looking at your query results and noticed something that doesn't match up. I looked up the available o2 files from IPSL on ESGF for ssp585 (https://esgf-node.llnl.gov/search/cmip6/?source_id=IPSL-CM6A-LR&variable_id=o2&experiment_id=ssp585) and the results are different. While ensemble members 14, 2, 3, 4, and 6 each have one file to download (years 2015-2100), ensemble member 1 has 3 files to download (years 2015-2300). Ensemble member 1 doesn't match because it has 4 query entries listed in your error report:

0 ScenarioMIP IPSL IPSL-CM6A-LR ssp585 r14i1p1f1 Omon
1 ScenarioMIP IPSL IPSL-CM6A-LR ssp585 r1i1p1f1 Omon
2 ScenarioMIP IPSL IPSL-CM6A-LR ssp585 r1i1p1f1 Omon
3 ScenarioMIP IPSL IPSL-CM6A-LR ssp585 r1i1p1f1 Omon
4 ScenarioMIP IPSL IPSL-CM6A-LR ssp585 r1i1p1f1 Omon
5 ScenarioMIP IPSL IPSL-CM6A-LR ssp585 r2i1p1f1 Omon
6 ScenarioMIP IPSL IPSL-CM6A-LR ssp585 r3i1p1f1 Omon
7 ScenarioMIP IPSL IPSL-CM6A-LR ssp585 r4i1p1f1 Omon
8 ScenarioMIP IPSL IPSL-CM6A-LR ssp585 r6i1p1f1 Omon

Do you have a duplicate file that is being found, and is this causing duplicates in the time dimension? Or is the error within intake-esm?

jbusecke commented 4 years ago

@sherimickelson yes, that is right. It seems that this particular model ran an extended ssp585 scenario for member r1i1p1f1, but I am unable to 'pick that out' with the given functionality in intake-esm. Ideally, I would like intake-esm to concatenate each member in time first and then merge the members (producing missing values for 2100-2300 for all other members). Another solution would be to somehow not use the extended time (on the Pangeo cloud, these files were simply removed), but for my specific science application this long run is extremely interesting, so I would love to investigate it!
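A minimal sketch of that desired behaviour in plain xarray, assuming each member has already been concatenated in time and has a unique time index. The data are toy values with an integer time axis, and `toy_member` is a hypothetical helper; this is not how intake-esm implements its aggregation.

```python
import numpy as np
import xarray as xr


def toy_member(name, n_time):
    """Build a tiny dataset with an integer time axis of length n_time."""
    ds = xr.Dataset(
        {"o2": ("time", np.ones(n_time))},
        coords={"time": np.arange(n_time)},
    )
    # Add a length-1 member_id dimension labelled with the member name.
    return ds.expand_dims(member_id=[name])


short = toy_member("r2i1p1f1", 3)     # stands in for a run ending in 2100
extended = toy_member("r1i1p1f1", 5)  # stands in for the run ending in 2300

# join="outer" aligns on the union of the time axes and pads with NaN.
combined = xr.concat([short, extended], dim="member_id", join="outer")
print(dict(combined.sizes))
print(combined.o2.sel(member_id="r2i1p1f1").values)
```

The shorter member is padded with NaN over the time steps it lacks. The outer join relies on every input having a unique time index, which is exactly the condition the ValueError reports as violated.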

andersy005 commented 4 years ago

@jbusecke, what's the output of

cat = col.search(source_id='IPSL-CM6A-LR', variable_id='o2', experiment_id='ssp585')
cat.df['path'].tolist()

???

andersy005 commented 4 years ago

Also, is the CSV from /tigress/GEOCLIM/LRGROUP/jbusecke/code/intake-esm-datastore/catalogs/tigressdata-cmip6.json publicly accessible?

jbusecke commented 4 years ago

The output is:

['/tiger/scratch/gpfs/GEOCLIM/synda/data/CMIP6/ScenarioMIP/IPSL/IPSL-CM6A-LR/ssp585/r14i1p1f1/Omon/o2/gn/v20191121/o2_Omon_IPSL-CM6A-LR_ssp585_r14i1p1f1_gn_201501-210012.nc',
 '/tiger/scratch/gpfs/GEOCLIM/synda/data/CMIP6/ScenarioMIP/IPSL/IPSL-CM6A-LR/ssp585/r1i1p1f1/Omon/o2/gn/v20190119/o2_Omon_IPSL-CM6A-LR_ssp585_r1i1p1f1_gn_201501-210012.nc',
 '/tiger/scratch/gpfs/GEOCLIM/synda/data/CMIP6/ScenarioMIP/IPSL/IPSL-CM6A-LR/ssp585/r1i1p1f1/Omon/o2/gn/v20190903/o2_Omon_IPSL-CM6A-LR_ssp585_r1i1p1f1_gn_201501-210012.nc',
 '/tiger/scratch/gpfs/GEOCLIM/synda/data/CMIP6/ScenarioMIP/IPSL/IPSL-CM6A-LR/ssp585/r1i1p1f1/Omon/o2/gn/v20190903/o2_Omon_IPSL-CM6A-LR_ssp585_r1i1p1f1_gn_210101-220012.nc',
 '/tiger/scratch/gpfs/GEOCLIM/synda/data/CMIP6/ScenarioMIP/IPSL/IPSL-CM6A-LR/ssp585/r1i1p1f1/Omon/o2/gn/v20190903/o2_Omon_IPSL-CM6A-LR_ssp585_r1i1p1f1_gn_220101-230012.nc',
 '/tiger/scratch/gpfs/GEOCLIM/synda/data/CMIP6/ScenarioMIP/IPSL/IPSL-CM6A-LR/ssp585/r2i1p1f1/Omon/o2/gn/v20191121/o2_Omon_IPSL-CM6A-LR_ssp585_r2i1p1f1_gn_201501-210012.nc',
 '/tiger/scratch/gpfs/GEOCLIM/synda/data/CMIP6/ScenarioMIP/IPSL/IPSL-CM6A-LR/ssp585/r3i1p1f1/Omon/o2/gn/v20191121/o2_Omon_IPSL-CM6A-LR_ssp585_r3i1p1f1_gn_201501-210012.nc',
 '/tiger/scratch/gpfs/GEOCLIM/synda/data/CMIP6/ScenarioMIP/IPSL/IPSL-CM6A-LR/ssp585/r4i1p1f1/Omon/o2/gn/v20191122/o2_Omon_IPSL-CM6A-LR_ssp585_r4i1p1f1_gn_201501-210012.nc',
 '/tiger/scratch/gpfs/GEOCLIM/synda/data/CMIP6/ScenarioMIP/IPSL/IPSL-CM6A-LR/ssp585/r6i1p1f1/Omon/o2/gn/v20191121/o2_Omon_IPSL-CM6A-LR_ssp585_r6i1p1f1_gn_201501-210012.nc']

The csv is not publicly accessible at the moment, but I could send it to you if that works? Do you need access to the data to debug? Or just the csv?

jbusecke commented 4 years ago

Check your email for the files.

sherimickelson commented 4 years ago

@jbusecke It looks like your catalog includes both versions of this file:

'/tiger/scratch/gpfs/GEOCLIM/synda/data/CMIP6/ScenarioMIP/IPSL/IPSL-CM6A-LR/ssp585/r1i1p1f1/Omon/o2/gn/v20190119/o2_Omon_IPSL-CM6A-LR_ssp585_r1i1p1f1_gn_201501-210012.nc',
'/tiger/scratch/gpfs/GEOCLIM/synda/data/CMIP6/ScenarioMIP/IPSL/IPSL-CM6A-LR/ssp585/r1i1p1f1/Omon/o2/gn/v20190903/o2_Omon_IPSL-CM6A-LR_ssp585_r1i1p1f1_gn_201501-210012.nc',

It looks like that's where the duplication in time is happening. You might want to make sure it's not using version v20190119, because the other files in that time series are using v20190903.

andersy005 commented 4 years ago

@sherimickelson, thank you for catching the version issue.

@jbusecke, when you built the catalog, did you use the --pick-latest-version flag?

As Sheri pointed out, it appears that for some files, there's more than one version:

/tiger/scratch/gpfs/GEOCLIM/synda/data/CMIP6/ScenarioMIP/IPSL/IPSL-CM6A-LR/ssp585/r1i1p1f1/Omon/o2/gn/v20190119/o2_Omon_IPSL-CM6A-LR_ssp585_r1i1p1f1_gn_201501-210012.nc, ---> version=v20190119
/tiger/scratch/gpfs/GEOCLIM/synda/data/CMIP6/ScenarioMIP/IPSL/IPSL-CM6A-LR/ssp585/r1i1p1f1/Omon/o2/gn/v20190903/o2_Omon_IPSL-CM6A-LR_ssp585_r1i1p1f1_gn_201501-210012.nc, ---> version=v20190903
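One way to spot such cases in a path listing is sketched below, using only the standard library. It assumes the directory layout seen in this thread, where the version directory is the immediate parent of the file; `versions_by_file` is a hypothetical helper and the paths are shortened to their last components.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# Paths shortened for readability; the tail of each path is as listed in the thread.
paths = [
    "r1i1p1f1/Omon/o2/gn/v20190119/o2_Omon_IPSL-CM6A-LR_ssp585_r1i1p1f1_gn_201501-210012.nc",
    "r1i1p1f1/Omon/o2/gn/v20190903/o2_Omon_IPSL-CM6A-LR_ssp585_r1i1p1f1_gn_201501-210012.nc",
    "r1i1p1f1/Omon/o2/gn/v20190903/o2_Omon_IPSL-CM6A-LR_ssp585_r1i1p1f1_gn_210101-220012.nc",
    "r1i1p1f1/Omon/o2/gn/v20190903/o2_Omon_IPSL-CM6A-LR_ssp585_r1i1p1f1_gn_220101-230012.nc",
    "r2i1p1f1/Omon/o2/gn/v20191121/o2_Omon_IPSL-CM6A-LR_ssp585_r2i1p1f1_gn_201501-210012.nc",
]


def versions_by_file(paths):
    """Map each file name to the sorted version directories that contain it."""
    found = defaultdict(set)
    for p in map(PurePosixPath, paths):
        # Assumed layout: .../<version>/<file name>
        found[p.name].add(p.parent.name)
    return {name: sorted(versions) for name, versions in found.items()}


# File names that appear under more than one version directory.
duplicates = {
    name: versions
    for name, versions in versions_by_file(paths).items()
    if len(versions) > 1
}
print(duplicates)
```

Only the 201501-210012 file of r1i1p1f1 is reported, with versions v20190119 and v20190903.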

jbusecke commented 4 years ago

Ohh, that would make sense, I guess. But I did invoke the flag you mentioned. Here is the full command:

python cmip.py --root-path /tiger/scratch/gpfs/GEOCLIM/synda/data/CMIP6 --pick-latest-version --cmip-version 6 --csv-filepath ../catalogs/tigressdata-cmip6.csv.gz --depth 4

Should I rebuild the catalog once more?

andersy005 commented 4 years ago

> Should I rebuild the catalog once more?

Not yet. I am looking into the csv you just sent me to make sure that --pick-latest-version flag is working as expected.

sherimickelson commented 4 years ago

@andersy005 Does the latest flag only work on directories called "latest" like we have on glade? Or does it look and see which version number is higher? If it only does the first, that could be the issue because neither are in a "latest" directory. (I can't remember what the code does)

andersy005 commented 4 years ago

> Does the latest flag only work on directories called "latest" like we have on glade? Or does it look and see which version number is higher?

It looks for which version number is higher.

import dask
from dask.diagnostics import ProgressBar  # assumed source of ProgressBar


def _pick_latest_version(df):
    import itertools

    # Group rows on every column except 'path' and 'version'
    grpby = list(set(df.columns.tolist()) - {'path', 'version'})
    groups = df.groupby(grpby)

    @dask.delayed
    def _pick_latest_v(group):
        # Return the index labels of all but the highest version in the group
        idx = []
        if group.version.nunique() > 1:
            idx = group.sort_values(by=['version'], ascending=False).index[1:].values.tolist()
        return idx

    idx_to_remove = [_pick_latest_v(group) for _, group in groups]
    print('Getting latest version...\n')
    with ProgressBar():
        idx_to_remove = dask.compute(*idx_to_remove)

    # Flatten the per-group lists and drop those rows from the catalog
    idx_to_remove = list(itertools.chain(*idx_to_remove))
    df = df.drop(index=idx_to_remove)
    print('\nDone....\n')
    return df

jbusecke commented 4 years ago

Not sure if this is relevant, but if I specify member_id='r1i1p1f1', it still registers the duplicate files:

['/tiger/scratch/gpfs/GEOCLIM/synda/data/CMIP6/ScenarioMIP/IPSL/IPSL-CM6A-LR/ssp585/r1i1p1f1/Omon/o2/gn/v20190119/o2_Omon_IPSL-CM6A-LR_ssp585_r1i1p1f1_gn_201501-210012.nc', '/tiger/scratch/gpfs/GEOCLIM/synda/data/CMIP6/ScenarioMIP/IPSL/IPSL-CM6A-LR/ssp585/r1i1p1f1/Omon/o2/gn/v20190903/o2_Omon_IPSL-CM6A-LR_ssp585_r1i1p1f1_gn_201501-210012.nc', '/tiger/scratch/gpfs/GEOCLIM/synda/data/CMIP6/ScenarioMIP/IPSL/IPSL-CM6A-LR/ssp585/r1i1p1f1/Omon/o2/gn/v20190903/o2_Omon_IPSL-CM6A-LR_ssp585_r1i1p1f1_gn_210101-220012.nc', '/tiger/scratch/gpfs/GEOCLIM/synda/data/CMIP6/ScenarioMIP/IPSL/IPSL-CM6A-LR/ssp585/r1i1p1f1/Omon/o2/gn/v20190903/o2_Omon_IPSL-CM6A-LR_ssp585_r1i1p1f1_gn_220101-230012.nc']

yet it combines the files without complaining...:

<xarray.Dataset>
Dimensions:         (axis_nbounds: 2, member_id: 1, nvertex: 4, olevel: 75, time: 3432, x: 362, y: 332)
Coordinates:
    nav_lat         (y, x) float32 dask.array<chunksize=(332, 362), meta=np.ndarray>
    nav_lon         (y, x) float32 dask.array<chunksize=(332, 362), meta=np.ndarray>
  * olevel          (olevel) float32 0.50576 1.5558553 ... 5698.0605 5902.0576
  * time            (time) object 2015-01-16T12:00:00 ... 2300-12-16 12:00:00
  * member_id       (member_id) <U8 'r1i1p1f1'
Dimensions without coordinates: axis_nbounds, nvertex, x, y
Data variables:
    olevel_bounds   (time, olevel, axis_nbounds) float32 dask.array<chunksize=(1032, 75, 2), meta=np.ndarray>
    time_bounds     (time, axis_nbounds) object dask.array<chunksize=(6, 2), meta=np.ndarray>
    bounds_nav_lat  (time, y, x, nvertex) float32 dask.array<chunksize=(1032, 332, 362, 4), meta=np.ndarray>
    area            (time, y, x) float32 dask.array<chunksize=(1032, 332, 362), meta=np.ndarray>
    bounds_nav_lon  (time, y, x, nvertex) float32 dask.array<chunksize=(1032, 332, 362, 4), meta=np.ndarray>
    o2              (member_id, time, olevel, y, x) float32 dask.array<chunksize=(1, 6, 75, 332, 362), meta=np.ndarray>
Attributes:
    branch_time_in_parent:  60265.0
    parent_source_id:       IPSL-CM6A-LR
    frequency:              mon
    source_type:            AOGCM BGC
    realm:                  ocnBgchem
    branch_method:          standard
    sub_experiment_id:      none
    parent_variant_label:   r1i1p1f1
    description:            Future scenario with high radiative forcing by th...
    external_variables:     areacello volcello
    history:                none
    sub_experiment:         none
    branch_time_in_child:   0.0
    source:                 IPSL-CM6A-LR (2017):  atmos: LMDZ (NPv6, N96; 144...
    creation_date:          2019-08-22T22:26:32Z
    mip_era:                CMIP6
    intake_esm_varname:     o2
    forcing_index:          1
    name:                   /ccc/work/cont003/gencmip6/oboucher/IGCM_OUT/IPSL...
    grid:                   native ocean tri-polar grid with 105 k ocean cells
    parent_time_units:      days since 1850-01-01 00:00:00
    variable_id:            o2
    license:                CMIP6 model data produced by IPSL is licensed und...
    parent_activity_id:     CMIP
    nominal_resolution:     100 km
    experiment_id:          ssp585
    activity_id:            ScenarioMIP
    parent_mip_era:         CMIP6
    title:                  IPSL-CM6A-LR model output prepared for CMIP6 / Sc...
    product:                model-output
    further_info_url:       https://furtherinfo.es-doc.org/CMIP6.IPSL.IPSL-CM...
    institution_id:         IPSL
    variant_label:          r1i1p1f1
    data_specs_version:     01.00.28
    grid_label:             gn
    realization_index:      1
    tracking_id:            hdl:21.14100/dcd42bc5-cc58-4234-b58e-21e4b624ba04...
    institution:            Institut Pierre Simon Laplace, Paris 75252, France
    initialization_index:   1
    experiment:             update of RCP8.5 based on SSP5
    contact:                ipsl-cmip6@listes.ipsl.fr
    dr2xml_md5sum:          c4b76079137f2c3b9298396d121b21c1
    table_id:               Omon
    source_id:              IPSL-CM6A-LR
    CMIP6_CV_version:       cv=6.2.15.1
    parent_experiment_id:   historical
    Conventions:            CF-1.7 CMIP-6.2
    physics_index:          1
    dr2xml_version:         1.16
    model_version:          6.1.8
    EXPID:                  ssp585
    variant_info:           Each member starts from the corresponding member ...

dcherian commented 4 years ago

It will combine but check .indexes["time"].is_unique of the combined result...

dcherian commented 4 years ago

andersy005 commented 4 years ago

@jbusecke, as a temporary solution, try the following and see what happens:

url = "/tigress/GEOCLIM/LRGROUP/jbusecke/code/intake-esm-datastore/catalogs/tigressdata-cmip6.json"
col = intake.open_esm_datastore(url)
cat = col.search(source_id="IPSL-CM6A-LR", variable_id="o2", experiment_id="ssp585")
cat.df = cat.df.drop(index=[1]) # Drop the problematic file
ddict = cat.to_dataset_dict(cdf_kwargs={"chunks": {"time": 6}, "decode_times": True,})
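Dropping by index label depends on the row order of the search result. A sketch of an alternative that filters on the `version` column instead is shown below on a toy DataFrame. It assumes version strings of the form vYYYYMMDD, which sort correctly as text, and that all files of one member should come from that member's newest version; the `time_range` column here is purely illustrative.

```python
import pandas as pd

# Toy catalog rows mirroring the versions seen in this thread.
df = pd.DataFrame(
    {
        "member_id": ["r14i1p1f1", "r1i1p1f1", "r1i1p1f1", "r1i1p1f1", "r1i1p1f1"],
        "version": ["v20191121", "v20190119", "v20190903", "v20190903", "v20190903"],
        "time_range": [
            "201501-210012",
            "201501-210012",
            "201501-210012",
            "210101-220012",
            "220101-230012",
        ],
    }
)

# Newest version found for each member, broadcast back to every row.
newest = df.groupby("member_id")["version"].transform("max")

# Keep only the rows that belong to that newest version.
filtered = df[df["version"] == newest]
print(filtered)
```

With the real catalog, the same filter could be applied to `cat.df` before calling `to_dataset_dict`, in the same spirit as the `drop` call above.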

jbusecke commented 4 years ago

@dcherian this is what I did:

import numpy as np

members = cat.df['member_id'].unique()
for member in members:
    print(f'#####################{member}##################')
    print(member)
    cat_single = col.search(source_id='IPSL-CM6A-LR', variable_id='o2', experiment_id='ssp585',member_id=member)
    ddict = cat_single.to_dataset_dict(cdf_kwargs={'chunks': {'time':6},'decode_times': True,})
    assert len(ddict.keys()) == 1
    _,ds = ddict.popitem()
    print(ds.indexes["time"].is_unique)
    print(f"{len(ds.time)}/{len(np.unique(ds.time.data))}")
    print(cat_single.df['path'].tolist())
    print(ds)

I get:

#####################r14i1p1f1##################
r14i1p1f1

--> The keys in the returned dictionary of datasets are constructed as follows:
    'activity_id.institution_id.source_id.experiment_id.table_id.grid_label'

--> There is/are 1 group(s)
True
1032/1032
['/tiger/scratch/gpfs/GEOCLIM/synda/data/CMIP6/ScenarioMIP/IPSL/IPSL-CM6A-LR/ssp585/r14i1p1f1/Omon/o2/gn/v20191121/o2_Omon_IPSL-CM6A-LR_ssp585_r14i1p1f1_gn_201501-210012.nc']
<xarray.Dataset>
Dimensions:         (axis_nbounds: 2, member_id: 1, nvertex: 4, olevel: 75, time: 1032, x: 362, y: 332)
Coordinates:
    nav_lat         (y, x) float32 dask.array<chunksize=(332, 362), meta=np.ndarray>
    nav_lon         (y, x) float32 dask.array<chunksize=(332, 362), meta=np.ndarray>
  * olevel          (olevel) float32 0.50576 1.5558553 ... 5698.0605 5902.0576
  * time            (time) datetime64[ns] 2015-01-16T12:00:00 ... 2100-12-16T12:00:00
  * member_id       (member_id) <U9 'r14i1p1f1'
Dimensions without coordinates: axis_nbounds, nvertex, x, y
Data variables:
    olevel_bounds   (time, olevel, axis_nbounds) float32 dask.array<chunksize=(1032, 75, 2), meta=np.ndarray>
    time_bounds     (time, axis_nbounds) datetime64[ns] dask.array<chunksize=(6, 2), meta=np.ndarray>
    bounds_nav_lat  (time, y, x, nvertex) float32 dask.array<chunksize=(1032, 332, 362, 4), meta=np.ndarray>
    area            (time, y, x) float32 dask.array<chunksize=(1032, 332, 362), meta=np.ndarray>
    bounds_nav_lon  (time, y, x, nvertex) float32 dask.array<chunksize=(1032, 332, 362, 4), meta=np.ndarray>
    o2              (member_id, time, olevel, y, x) float32 dask.array<chunksize=(1, 6, 75, 332, 362), meta=np.ndarray>
Attributes:
    name:                   /ccc/work/cont003/gencmip6/oboucher/IGCM_OUT/IPSL...
    Conventions:            CF-1.7 CMIP-6.2
    creation_date:          2019-10-13T08:15:27Z
    tracking_id:            hdl:21.14100/fefe8ea3-d0b3-4828-ba16-b66fad793928
    description:            Future scenario with high radiative forcing by th...
    title:                  IPSL-CM6A-LR model output prepared for CMIP6 / Sc...
    activity_id:            ScenarioMIP
    contact:                ipsl-cmip6@listes.ipsl.fr
    data_specs_version:     01.00.28
    dr2xml_version:         1.16
    experiment_id:          ssp585
    experiment:             update of RCP8.5 based on SSP5
    external_variables:     areacello volcello
    forcing_index:          1
    frequency:              mon
    further_info_url:       https://furtherinfo.es-doc.org/CMIP6.IPSL.IPSL-CM...
    grid:                   native ocean tri-polar grid with 105 k ocean cells
    grid_label:             gn
    nominal_resolution:     100 km
    history:                none
    initialization_index:   1
    institution_id:         IPSL
    institution:            Institut Pierre Simon Laplace, Paris 75252, France
    license:                CMIP6 model data produced by IPSL is licensed und...
    mip_era:                CMIP6
    parent_mip_era:         CMIP6
    parent_source_id:       IPSL-CM6A-LR
    parent_time_units:      days since 1850-01-01 00:00:00
    parent_variant_label:   r14i1p1f1
    branch_method:          standard
    branch_time_in_parent:  60265.0
    branch_time_in_child:   0.0
    physics_index:          1
    product:                model-output
    realization_index:      14
    realm:                  ocnBgchem
    source:                 IPSL-CM6A-LR (2017):  atmos: LMDZ (NPv6, N96; 144...
    source_id:              IPSL-CM6A-LR
    source_type:            AOGCM BGC
    sub_experiment_id:      none
    sub_experiment:         none
    table_id:               Omon
    variable_id:            o2
    variant_info:           Each member starts from the corresponding member ...
    variant_label:          r14i1p1f1
    EXPID:                  ssp585
    CMIP6_CV_version:       cv=6.2.15.1
    dr2xml_md5sum:          b6f602401512e82e2d7cadc2c6f36c2a
    model_version:          6.1.10
    parent_experiment_id:   historical
    parent_activity_id:     CMIP
    intake_esm_varname:     o2
#####################r1i1p1f1##################
r1i1p1f1

--> The keys in the returned dictionary of datasets are constructed as follows:
    'activity_id.institution_id.source_id.experiment_id.table_id.grid_label'

--> There is/are 1 group(s)

/tigress/jbusecke/code/conda/envs/euc_dynamics/lib/python3.8/site-packages/xarray/coding/times.py:426: SerializationWarning: Unable to decode time axis into full numpy.datetime64 objects, continuing using cftime.datetime objects instead, reason: dates out of range
  dtype = _decode_cf_datetime_dtype(data, units, calendar, self.use_cftime)
/tigress/jbusecke/code/conda/envs/euc_dynamics/lib/python3.8/site-packages/xarray/coding/times.py:426: SerializationWarning: Unable to decode time axis into full numpy.datetime64 objects, continuing using cftime.datetime objects instead, reason: dates out of range
  dtype = _decode_cf_datetime_dtype(data, units, calendar, self.use_cftime)
/tigress/jbusecke/code/conda/envs/euc_dynamics/lib/python3.8/site-packages/numpy/core/_asarray.py:85: SerializationWarning: Unable to decode time axis into full numpy.datetime64 objects, continuing using cftime.datetime objects instead, reason: dates out of range
  return array(a, dtype, copy=False, order=order)

True
3432/3432
['/tiger/scratch/gpfs/GEOCLIM/synda/data/CMIP6/ScenarioMIP/IPSL/IPSL-CM6A-LR/ssp585/r1i1p1f1/Omon/o2/gn/v20190119/o2_Omon_IPSL-CM6A-LR_ssp585_r1i1p1f1_gn_201501-210012.nc', '/tiger/scratch/gpfs/GEOCLIM/synda/data/CMIP6/ScenarioMIP/IPSL/IPSL-CM6A-LR/ssp585/r1i1p1f1/Omon/o2/gn/v20190903/o2_Omon_IPSL-CM6A-LR_ssp585_r1i1p1f1_gn_201501-210012.nc', '/tiger/scratch/gpfs/GEOCLIM/synda/data/CMIP6/ScenarioMIP/IPSL/IPSL-CM6A-LR/ssp585/r1i1p1f1/Omon/o2/gn/v20190903/o2_Omon_IPSL-CM6A-LR_ssp585_r1i1p1f1_gn_210101-220012.nc', '/tiger/scratch/gpfs/GEOCLIM/synda/data/CMIP6/ScenarioMIP/IPSL/IPSL-CM6A-LR/ssp585/r1i1p1f1/Omon/o2/gn/v20190903/o2_Omon_IPSL-CM6A-LR_ssp585_r1i1p1f1_gn_220101-230012.nc']
<xarray.Dataset>
Dimensions:         (axis_nbounds: 2, member_id: 1, nvertex: 4, olevel: 75, time: 3432, x: 362, y: 332)
Coordinates:
    nav_lat         (y, x) float32 dask.array<chunksize=(332, 362), meta=np.ndarray>
    nav_lon         (y, x) float32 dask.array<chunksize=(332, 362), meta=np.ndarray>
  * olevel          (olevel) float32 0.50576 1.5558553 ... 5698.0605 5902.0576
  * time            (time) object 2015-01-16T12:00:00 ... 2300-12-16 12:00:00
  * member_id       (member_id) <U8 'r1i1p1f1'
Dimensions without coordinates: axis_nbounds, nvertex, x, y
Data variables:
    olevel_bounds   (time, olevel, axis_nbounds) float32 dask.array<chunksize=(1032, 75, 2), meta=np.ndarray>
    time_bounds     (time, axis_nbounds) object dask.array<chunksize=(6, 2), meta=np.ndarray>
    bounds_nav_lat  (time, y, x, nvertex) float32 dask.array<chunksize=(1032, 332, 362, 4), meta=np.ndarray>
    area            (time, y, x) float32 dask.array<chunksize=(1032, 332, 362), meta=np.ndarray>
    bounds_nav_lon  (time, y, x, nvertex) float32 dask.array<chunksize=(1032, 332, 362, 4), meta=np.ndarray>
    o2              (member_id, time, olevel, y, x) float32 dask.array<chunksize=(1, 6, 75, 332, 362), meta=np.ndarray>
Attributes:
    branch_time_in_parent:  60265.0
    parent_source_id:       IPSL-CM6A-LR
    frequency:              mon
    source_type:            AOGCM BGC
    realm:                  ocnBgchem
    branch_method:          standard
    sub_experiment_id:      none
    parent_variant_label:   r1i1p1f1
    description:            Future scenario with high radiative forcing by th...
    external_variables:     areacello volcello
    history:                none
    sub_experiment:         none
    branch_time_in_child:   0.0
    source:                 IPSL-CM6A-LR (2017):  atmos: LMDZ (NPv6, N96; 144...
    creation_date:          2019-08-22T22:26:32Z
    mip_era:                CMIP6
    intake_esm_varname:     o2
    forcing_index:          1
    name:                   /ccc/work/cont003/gencmip6/oboucher/IGCM_OUT/IPSL...
    grid:                   native ocean tri-polar grid with 105 k ocean cells
    parent_time_units:      days since 1850-01-01 00:00:00
    variable_id:            o2
    license:                CMIP6 model data produced by IPSL is licensed und...
    parent_activity_id:     CMIP
    nominal_resolution:     100 km
    experiment_id:          ssp585
    activity_id:            ScenarioMIP
    parent_mip_era:         CMIP6
    title:                  IPSL-CM6A-LR model output prepared for CMIP6 / Sc...
    product:                model-output
    further_info_url:       https://furtherinfo.es-doc.org/CMIP6.IPSL.IPSL-CM...
    institution_id:         IPSL
    variant_label:          r1i1p1f1
    data_specs_version:     01.00.28
    grid_label:             gn
    realization_index:      1
    tracking_id:            hdl:21.14100/dcd42bc5-cc58-4234-b58e-21e4b624ba04...
    institution:            Institut Pierre Simon Laplace, Paris 75252, France
    initialization_index:   1
    experiment:             update of RCP8.5 based on SSP5
    contact:                ipsl-cmip6@listes.ipsl.fr
    dr2xml_md5sum:          c4b76079137f2c3b9298396d121b21c1
    table_id:               Omon
    source_id:              IPSL-CM6A-LR
    CMIP6_CV_version:       cv=6.2.15.1
    parent_experiment_id:   historical
    Conventions:            CF-1.7 CMIP-6.2
    physics_index:          1
    dr2xml_version:         1.16
    model_version:          6.1.8
    EXPID:                  ssp585
    variant_info:           Each member starts from the corresponding member ...
#####################r2i1p1f1##################
r2i1p1f1

--> The keys in the returned dictionary of datasets are constructed as follows:
    'activity_id.institution_id.source_id.experiment_id.table_id.grid_label'

--> There is/are 1 group(s)
True
1032/1032
['/tiger/scratch/gpfs/GEOCLIM/synda/data/CMIP6/ScenarioMIP/IPSL/IPSL-CM6A-LR/ssp585/r2i1p1f1/Omon/o2/gn/v20191121/o2_Omon_IPSL-CM6A-LR_ssp585_r2i1p1f1_gn_201501-210012.nc']
<xarray.Dataset>
Dimensions:         (axis_nbounds: 2, member_id: 1, nvertex: 4, olevel: 75, time: 1032, x: 362, y: 332)
Coordinates:
    nav_lat         (y, x) float32 dask.array<chunksize=(332, 362), meta=np.ndarray>
    nav_lon         (y, x) float32 dask.array<chunksize=(332, 362), meta=np.ndarray>
  * olevel          (olevel) float32 0.50576 1.5558553 ... 5698.0605 5902.0576
  * time            (time) datetime64[ns] 2015-01-16T12:00:00 ... 2100-12-16T12:00:00
  * member_id       (member_id) <U8 'r2i1p1f1'
Dimensions without coordinates: axis_nbounds, nvertex, x, y
Data variables:
    olevel_bounds   (time, olevel, axis_nbounds) float32 dask.array<chunksize=(1032, 75, 2), meta=np.ndarray>
    time_bounds     (time, axis_nbounds) datetime64[ns] dask.array<chunksize=(6, 2), meta=np.ndarray>
    bounds_nav_lat  (time, y, x, nvertex) float32 dask.array<chunksize=(1032, 332, 362, 4), meta=np.ndarray>
    area            (time, y, x) float32 dask.array<chunksize=(1032, 332, 362), meta=np.ndarray>
    bounds_nav_lon  (time, y, x, nvertex) float32 dask.array<chunksize=(1032, 332, 362, 4), meta=np.ndarray>
    o2              (member_id, time, olevel, y, x) float32 dask.array<chunksize=(1, 6, 75, 332, 362), meta=np.ndarray>
Attributes:
    name:                   /ccc/work/cont003/gencmip6/lurtont/IGCM_OUT/IPSLC...
    Conventions:            CF-1.7 CMIP-6.2
    creation_date:          2019-10-18T06:38:14Z
    tracking_id:            hdl:21.14100/d025e86c-aa6a-4d52-88ca-bc649f48233f
    description:            Future scenario with high radiative forcing by th...
    title:                  IPSL-CM6A-LR model output prepared for CMIP6 / Sc...
    activity_id:            ScenarioMIP
    contact:                ipsl-cmip6@listes.ipsl.fr
    data_specs_version:     01.00.28
    dr2xml_version:         1.16
    experiment_id:          ssp585
    experiment:             update of RCP8.5 based on SSP5
    external_variables:     areacello volcello
    forcing_index:          1
    frequency:              mon
    further_info_url:       https://furtherinfo.es-doc.org/CMIP6.IPSL.IPSL-CM...
    grid:                   native ocean tri-polar grid with 105 k ocean cells
    grid_label:             gn
    nominal_resolution:     100 km
    history:                none
    initialization_index:   1
    institution_id:         IPSL
    institution:            Institut Pierre Simon Laplace, Paris 75252, France
    license:                CMIP6 model data produced by IPSL is licensed und...
    mip_era:                CMIP6
    parent_mip_era:         CMIP6
    parent_source_id:       IPSL-CM6A-LR
    parent_time_units:      days since 1850-01-01 00:00:00
    parent_variant_label:   r2i1p1f1
    branch_method:          standard
    branch_time_in_parent:  60265.0
    branch_time_in_child:   0.0
    physics_index:          1
    product:                model-output
    realization_index:      2
    realm:                  ocnBgchem
    source:                 IPSL-CM6A-LR (2017):  atmos: LMDZ (NPv6, N96; 144...
    source_id:              IPSL-CM6A-LR
    source_type:            AOGCM BGC
    sub_experiment_id:      none
    sub_experiment:         none
    table_id:               Omon
    variable_id:            o2
    variant_info:           Each member starts from the corresponding member ...
    variant_label:          r2i1p1f1
    EXPID:                  ssp585
    CMIP6_CV_version:       cv=6.2.15.1
    dr2xml_md5sum:          b6f602401512e82e2d7cadc2c6f36c2a
    model_version:          6.1.10
    parent_experiment_id:   historical
    parent_activity_id:     CMIP
    intake_esm_varname:     o2
#####################r3i1p1f1##################
r3i1p1f1

--> The keys in the returned dictionary of datasets are constructed as follows:
    'activity_id.institution_id.source_id.experiment_id.table_id.grid_label'

--> There is/are 1 group(s)
True
1032/1032
['/tiger/scratch/gpfs/GEOCLIM/synda/data/CMIP6/ScenarioMIP/IPSL/IPSL-CM6A-LR/ssp585/r3i1p1f1/Omon/o2/gn/v20191121/o2_Omon_IPSL-CM6A-LR_ssp585_r3i1p1f1_gn_201501-210012.nc']
<xarray.Dataset>
Dimensions:         (axis_nbounds: 2, member_id: 1, nvertex: 4, olevel: 75, time: 1032, x: 362, y: 332)
Coordinates:
    nav_lat         (y, x) float32 dask.array<chunksize=(332, 362), meta=np.ndarray>
    nav_lon         (y, x) float32 dask.array<chunksize=(332, 362), meta=np.ndarray>
  * olevel          (olevel) float32 0.50576 1.5558553 ... 5698.0605 5902.0576
  * time            (time) datetime64[ns] 2015-01-16T12:00:00 ... 2100-12-16T12:00:00
  * member_id       (member_id) <U8 'r3i1p1f1'
Dimensions without coordinates: axis_nbounds, nvertex, x, y
Data variables:
    olevel_bounds   (time, olevel, axis_nbounds) float32 dask.array<chunksize=(1032, 75, 2), meta=np.ndarray>
    time_bounds     (time, axis_nbounds) datetime64[ns] dask.array<chunksize=(6, 2), meta=np.ndarray>
    bounds_nav_lat  (time, y, x, nvertex) float32 dask.array<chunksize=(1032, 332, 362, 4), meta=np.ndarray>
    area            (time, y, x) float32 dask.array<chunksize=(1032, 332, 362), meta=np.ndarray>
    bounds_nav_lon  (time, y, x, nvertex) float32 dask.array<chunksize=(1032, 332, 362, 4), meta=np.ndarray>
    o2              (member_id, time, olevel, y, x) float32 dask.array<chunksize=(1, 6, 75, 332, 362), meta=np.ndarray>
Attributes:
    name:                   /ccc/work/cont003/gencmip6/lurtont/IGCM_OUT/IPSLC...
    Conventions:            CF-1.7 CMIP-6.2
    creation_date:          2019-10-22T12:30:57Z
    tracking_id:            hdl:21.14100/edb14fa8-7e14-466c-9c58-012c75147c94
    description:            Future scenario with high radiative forcing by th...
    title:                  IPSL-CM6A-LR model output prepared for CMIP6 / Sc...
    activity_id:            ScenarioMIP
    contact:                ipsl-cmip6@listes.ipsl.fr
    data_specs_version:     01.00.28
    dr2xml_version:         1.16
    experiment_id:          ssp585
    experiment:             update of RCP8.5 based on SSP5
    external_variables:     areacello volcello
    forcing_index:          1
    frequency:              mon
    further_info_url:       https://furtherinfo.es-doc.org/CMIP6.IPSL.IPSL-CM...
    grid:                   native ocean tri-polar grid with 105 k ocean cells
    grid_label:             gn
    nominal_resolution:     100 km
    history:                none
    initialization_index:   1
    institution_id:         IPSL
    institution:            Institut Pierre Simon Laplace, Paris 75252, France
    license:                CMIP6 model data produced by IPSL is licensed und...
    mip_era:                CMIP6
    parent_mip_era:         CMIP6
    parent_source_id:       IPSL-CM6A-LR
    parent_time_units:      days since 1850-01-01 00:00:00
    parent_variant_label:   r3i1p1f1
    branch_method:          standard
    branch_time_in_parent:  60265.0
    branch_time_in_child:   0.0
    physics_index:          1
    product:                model-output
    realization_index:      3
    realm:                  ocnBgchem
    source:                 IPSL-CM6A-LR (2017):  atmos: LMDZ (NPv6, N96; 144...
    source_id:              IPSL-CM6A-LR
    source_type:            AOGCM BGC
    sub_experiment_id:      none
    sub_experiment:         none
    table_id:               Omon
    variable_id:            o2
    variant_info:           Each member starts from the corresponding member ...
    variant_label:          r3i1p1f1
    EXPID:                  ssp585
    CMIP6_CV_version:       cv=6.2.15.1
    dr2xml_md5sum:          b6f602401512e82e2d7cadc2c6f36c2a
    model_version:          6.1.10
    parent_experiment_id:   historical
    parent_activity_id:     CMIP
    intake_esm_varname:     o2
#####################r4i1p1f1##################
r4i1p1f1

--> The keys in the returned dictionary of datasets are constructed as follows:
    'activity_id.institution_id.source_id.experiment_id.table_id.grid_label'

--> There is/are 1 group(s)
True
1032/1032
['/tiger/scratch/gpfs/GEOCLIM/synda/data/CMIP6/ScenarioMIP/IPSL/IPSL-CM6A-LR/ssp585/r4i1p1f1/Omon/o2/gn/v20191122/o2_Omon_IPSL-CM6A-LR_ssp585_r4i1p1f1_gn_201501-210012.nc']
<xarray.Dataset>
Dimensions:         (axis_nbounds: 2, member_id: 1, nvertex: 4, olevel: 75, time: 1032, x: 362, y: 332)
Coordinates:
    nav_lat         (y, x) float32 dask.array<chunksize=(332, 362), meta=np.ndarray>
    nav_lon         (y, x) float32 dask.array<chunksize=(332, 362), meta=np.ndarray>
  * olevel          (olevel) float32 0.50576 1.5558553 ... 5698.0605 5902.0576
  * time            (time) datetime64[ns] 2015-01-16T12:00:00 ... 2100-12-16T12:00:00
  * member_id       (member_id) <U8 'r4i1p1f1'
Dimensions without coordinates: axis_nbounds, nvertex, x, y
Data variables:
    olevel_bounds   (time, olevel, axis_nbounds) float32 dask.array<chunksize=(1032, 75, 2), meta=np.ndarray>
    time_bounds     (time, axis_nbounds) datetime64[ns] dask.array<chunksize=(6, 2), meta=np.ndarray>
    bounds_nav_lat  (time, y, x, nvertex) float32 dask.array<chunksize=(1032, 332, 362, 4), meta=np.ndarray>
    area            (time, y, x) float32 dask.array<chunksize=(1032, 332, 362), meta=np.ndarray>
    bounds_nav_lon  (time, y, x, nvertex) float32 dask.array<chunksize=(1032, 332, 362, 4), meta=np.ndarray>
    o2              (member_id, time, olevel, y, x) float32 dask.array<chunksize=(1, 6, 75, 332, 362), meta=np.ndarray>
Attributes:
    name:                   /ccc/work/cont003/gencmip6/dupontel/IGCM_OUT/IPSL...
    Conventions:            CF-1.7 CMIP-6.2
    creation_date:          2019-10-22T10:48:30Z
    tracking_id:            hdl:21.14100/2a4fb73b-f64c-46f5-81e2-4014c9505c26
    description:            Future scenario with high radiative forcing by th...
    title:                  IPSL-CM6A-LR model output prepared for CMIP6 / Sc...
    activity_id:            ScenarioMIP
    contact:                ipsl-cmip6@listes.ipsl.fr
    data_specs_version:     01.00.28
    dr2xml_version:         1.16
    experiment_id:          ssp585
    experiment:             update of RCP8.5 based on SSP5
    external_variables:     areacello volcello
    forcing_index:          1
    frequency:              mon
    further_info_url:       https://furtherinfo.es-doc.org/CMIP6.IPSL.IPSL-CM...
    grid:                   native ocean tri-polar grid with 105 k ocean cells
    grid_label:             gn
    nominal_resolution:     100 km
    history:                none
    initialization_index:   1
    institution_id:         IPSL
    institution:            Institut Pierre Simon Laplace, Paris 75252, France
    license:                CMIP6 model data produced by IPSL is licensed und...
    mip_era:                CMIP6
    parent_mip_era:         CMIP6
    parent_source_id:       IPSL-CM6A-LR
    parent_time_units:      days since 1850-01-01 00:00:00
    parent_variant_label:   r4i1p1f1
    branch_method:          standard
    branch_time_in_parent:  60265.0
    branch_time_in_child:   0.0
    physics_index:          1
    product:                model-output
    realization_index:      4
    realm:                  ocnBgchem
    source:                 IPSL-CM6A-LR (2017):  atmos: LMDZ (NPv6, N96; 144...
    source_id:              IPSL-CM6A-LR
    source_type:            AOGCM BGC
    sub_experiment_id:      none
    sub_experiment:         none
    table_id:               Omon
    variable_id:            o2
    variant_info:           Each member starts from the corresponding member ...
    variant_label:          r4i1p1f1
    EXPID:                  ssp585
    CMIP6_CV_version:       cv=6.2.15.1
    dr2xml_md5sum:          b6f602401512e82e2d7cadc2c6f36c2a
    model_version:          6.1.10
    parent_experiment_id:   historical
    parent_activity_id:     CMIP
    intake_esm_varname:     o2
#####################r6i1p1f1##################
r6i1p1f1

--> The keys in the returned dictionary of datasets are constructed as follows:
    'activity_id.institution_id.source_id.experiment_id.table_id.grid_label'

--> There is/are 1 group(s)
True
1032/1032
['/tiger/scratch/gpfs/GEOCLIM/synda/data/CMIP6/ScenarioMIP/IPSL/IPSL-CM6A-LR/ssp585/r6i1p1f1/Omon/o2/gn/v20191121/o2_Omon_IPSL-CM6A-LR_ssp585_r6i1p1f1_gn_201501-210012.nc']
<xarray.Dataset>
Dimensions:         (axis_nbounds: 2, member_id: 1, nvertex: 4, olevel: 75, time: 1032, x: 362, y: 332)
Coordinates:
    nav_lat         (y, x) float32 dask.array<chunksize=(332, 362), meta=np.ndarray>
    nav_lon         (y, x) float32 dask.array<chunksize=(332, 362), meta=np.ndarray>
  * olevel          (olevel) float32 0.50576 1.5558553 ... 5698.0605 5902.0576
  * time            (time) datetime64[ns] 2015-01-16T12:00:00 ... 2100-12-16T12:00:00
  * member_id       (member_id) <U8 'r6i1p1f1'
Dimensions without coordinates: axis_nbounds, nvertex, x, y
Data variables:
    olevel_bounds   (time, olevel, axis_nbounds) float32 dask.array<chunksize=(1032, 75, 2), meta=np.ndarray>
    time_bounds     (time, axis_nbounds) datetime64[ns] dask.array<chunksize=(6, 2), meta=np.ndarray>
    bounds_nav_lat  (time, y, x, nvertex) float32 dask.array<chunksize=(1032, 332, 362, 4), meta=np.ndarray>
    area            (time, y, x) float32 dask.array<chunksize=(1032, 332, 362), meta=np.ndarray>
    bounds_nav_lon  (time, y, x, nvertex) float32 dask.array<chunksize=(1032, 332, 362, 4), meta=np.ndarray>
    o2              (member_id, time, olevel, y, x) float32 dask.array<chunksize=(1, 6, 75, 332, 362), meta=np.ndarray>
Attributes:
    name:                   /ccc/work/cont003/gencmip6/lurtont/IGCM_OUT/IPSLC...
    Conventions:            CF-1.7 CMIP-6.2
    creation_date:          2019-10-22T10:17:28Z
    tracking_id:            hdl:21.14100/b7e9c317-cc62-4a89-b463-7c456231987d
    description:            Future scenario with high radiative forcing by th...
    title:                  IPSL-CM6A-LR model output prepared for CMIP6 / Sc...
    activity_id:            ScenarioMIP
    contact:                ipsl-cmip6@listes.ipsl.fr
    data_specs_version:     01.00.28
    dr2xml_version:         1.16
    experiment_id:          ssp585
    experiment:             update of RCP8.5 based on SSP5
    external_variables:     areacello volcello
    forcing_index:          1
    frequency:              mon
    further_info_url:       https://furtherinfo.es-doc.org/CMIP6.IPSL.IPSL-CM...
    grid:                   native ocean tri-polar grid with 105 k ocean cells
    grid_label:             gn
    nominal_resolution:     100 km
    history:                none
    initialization_index:   1
    institution_id:         IPSL
    institution:            Institut Pierre Simon Laplace, Paris 75252, France
    license:                CMIP6 model data produced by IPSL is licensed und...
    mip_era:                CMIP6
    parent_mip_era:         CMIP6
    parent_source_id:       IPSL-CM6A-LR
    parent_time_units:      days since 1850-01-01 00:00:00
    parent_variant_label:   r6i1p1f1
    branch_method:          standard
    branch_time_in_parent:  60265.0
    branch_time_in_child:   0.0
    physics_index:          1
    product:                model-output
    realization_index:      6
    realm:                  ocnBgchem
    source:                 IPSL-CM6A-LR (2017):  atmos: LMDZ (NPv6, N96; 144...
    source_id:              IPSL-CM6A-LR
    source_type:            AOGCM BGC
    sub_experiment_id:      none
    sub_experiment:         none
    table_id:               Omon
    variable_id:            o2
    variant_info:           Each member starts from the corresponding member ...
    variant_label:          r6i1p1f1
    EXPID:                  ssp585
    CMIP6_CV_version:       cv=6.2.15.1
    dr2xml_md5sum:          b6f602401512e82e2d7cadc2c6f36c2a
    model_version:          6.1.10
    parent_experiment_id:   historical
    parent_activity_id:     CMIP
    intake_esm_varname:     o2

It really doesn't seem like there are duplicate times in these, unless I am missing something.

dcherian commented 4 years ago

@andersy005 @jbusecke can we get on a quick video call? Julius, wanna send out a zoom invite?

jbusecke commented 4 years ago

will do in 5!

andersy005 commented 4 years ago

@andersy005 @jbusecke can we get on a quick video call? Julius, wanna send out a zoom invite?

Sure.

@jbusecke, @sherimickelson

I just found out that there's a bug in the _pick_latest_version(df) function. Since pandas drops rows whose group keys contain missing values (NaN) when doing groupby() (see https://github.com/pandas-dev/pandas/issues/3729), the following code (in _pick_latest_version(df)) ends up returning 0 groups, because the dcpp_init_year column has missing values.

grpby = list(set(df.columns.tolist()) - {'path', 'version'})
groups = df.groupby(grpby)

As a result, the subsequent code in _pick_latest_version(df) doesn't actually work as expected:(
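To make the groupby behaviour concrete, here is a minimal sketch on a toy catalog. The DataFrame contents and the safe_keys workaround are assumptions for illustration only; this is not the actual intake-esm code or its eventual fix.

```python
import numpy as np
import pandas as pd

# Toy catalog (made up): two versions of one member plus a second member.
# dcpp_init_year is NaN in every row, as it is for non-DCPP experiments.
df = pd.DataFrame(
    {
        "member_id": ["r1i1p1f1", "r1i1p1f1", "r2i1p1f1"],
        "version": ["v20190119", "v20190903", "v20191121"],
        "dcpp_init_year": [np.nan, np.nan, np.nan],
        "path": ["a.nc", "b.nc", "c.nc"],
    }
)

keys = sorted(set(df.columns) - {"path", "version"})

# By default, groupby drops every row whose key contains NaN,
# so grouping on the all-NaN column leaves no groups at all.
n_naive = len(df.groupby(keys).size())

# Hypothetical workaround: group only on columns that hold at least one value,
# then keep the row with the highest version string in each group.
safe_keys = [k for k in keys if df[k].notna().any()]
latest = df.sort_values("version").groupby(safe_keys).tail(1)

print(n_naive)
print(latest["version"].tolist())
```

Newer pandas releases also accept groupby(..., dropna=False), which may be a simpler option where it is available.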

jbusecke commented 4 years ago

Check your email for the invite.

andersy005 commented 4 years ago

As an update, we found out that the issue was stemming from differences in calendar units used in the netCDF files. These differences were causing xarray to fail since it was trying to mix time values decoded with pandas and cftime together.

Solution: Specify use_cftime=True parameter:

cat = col.search(source_id='IPSL-CM6A-LR', variable_id='o2', experiment_id='ssp585',member_id='r1i1p1f1')
ddict = cat.to_dataset_dict(cdf_kwargs={'chunks': {'time':6},'decode_times': True, 'use_cftime': True})

@jbusecke, let me know whether the solution I suggested above is accurate.
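As a rough illustration of why only some members end up with cftime objects: in the dataset reprs above, r1i1p1f1 runs to 2300 and has an object time axis, while the members ending in 2100 are datetime64[ns]. The sketch below uses only the standard library to locate the upper bound of a signed 64-bit nanosecond counter since 1970-01-01, which is how datetime64[ns] stores time. The helper name fits_datetime64_ns is made up, and it checks the upper bound only.

```python
from datetime import datetime, timedelta

# datetime64[ns] stores signed 64-bit nanoseconds since 1970-01-01.
INT64_MAX = 2**63 - 1

# datetime resolves microseconds only, so truncate the nanosecond count.
ns_limit = datetime(1970, 1, 1) + timedelta(microseconds=INT64_MAX // 1000)


def fits_datetime64_ns(stamp):
    """Return True if `stamp` is not past the datetime64[ns] upper bound."""
    return stamp <= ns_limit


print(ns_limit)                                         # falls in the year 2262
print(fits_datetime64_ns(datetime(2100, 12, 16, 12)))   # True
print(fits_datetime64_ns(datetime(2300, 12, 16, 12)))   # False
```

So a run that extends to 2300 cannot be decoded to datetime64[ns], which matches the "dates out of range" warning above, and forcing use_cftime=True puts every member on the same kind of time axis.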

jbusecke commented 4 years ago

This still blows my notebook up (literally the whole thing, not just the kernel):

Note that I am aggregating all members.

The read in works as expected, but the plotting is still causing issues.

cat = col.search(source_id='IPSL-CM6A-LR', variable_id='o2', experiment_id='ssp585')
ddict = cat.to_dataset_dict(cdf_kwargs={'chunks': {'time':6},'decode_times': True, 'use_cftime': True})
ds = ddict['ScenarioMIP.IPSL.IPSL-CM6A-LR.ssp585.Omon.gn']
ds.o2.isel(time=-1, olevel=0).plot(col='member_id')

Could you try this modified example on the NCAR netCDFs, by any chance? I am curious whether this is an oddity of the files (they are gigantic for IPSL) or of our system in Princeton.

dcherian commented 4 years ago

What does ds.o2 look like?

jbusecke commented 4 years ago

I can check that tomorrow. I can't afford another crash, since I have something else running. It looked fine, though. Perhaps it has to do with the chunks; I will try to test tomorrow.

andersy005 commented 4 years ago

Could you try this modified example on the NCAR netCDFs, by any chance? I am curious whether this is an oddity of the files (they are gigantic for IPSL) or of our system in Princeton.

I will give this a try on Cheyenne and will let you know how it goes.

andersy005 commented 4 years ago

What does ds.o2 look like?

Here's what I am getting:

andersy005 commented 4 years ago

When I tried executing ds.o2.isel(time=-1, olevel=0).plot(col='member_id')

I got the following error (I don't understand what's going on):

```python
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
in
----> 1 ds.o2.isel(time=-1, olevel=0).plot(col='member_id')

/glade/work/abanihi/softwares/miniconda3/envs/lens-conversion/lib/python3.8/site-packages/xarray/plot/plot.py in __call__(self, **kwargs)
    444
    445     def __call__(self, **kwargs):
--> 446         return plot(self._da, **kwargs)
    447
    448     @functools.wraps(hist)

/glade/work/abanihi/softwares/miniconda3/envs/lens-conversion/lib/python3.8/site-packages/xarray/plot/plot.py in plot(darray, row, col, col_wrap, ax, hue, rtol, subplot_kws, **kwargs)
    198     kwargs["ax"] = ax
    199
--> 200     return plotfunc(darray, **kwargs)
    201
    202

/glade/work/abanihi/softwares/miniconda3/envs/lens-conversion/lib/python3.8/site-packages/xarray/plot/plot.py in newplotfunc(darray, x, y, figsize, size, aspect, ax, row, col, col_wrap, xincrease, yincrease, add_colorbar, add_labels, vmin, vmax, cmap, center, robust, extend, levels, infer_intervals, colors, subplot_kws, cbar_ax, cbar_kwargs, xscale, yscale, xticks, yticks, xlim, ylim, norm, **kwargs)
    631             # Need the decorated plotting function
    632             allargs["plotfunc"] = globals()[plotfunc.__name__]
--> 633             return _easy_facetgrid(darray, kind="dataarray", **allargs)
    634
    635         plt = import_matplotlib_pyplot()

/glade/work/abanihi/softwares/miniconda3/envs/lens-conversion/lib/python3.8/site-packages/xarray/plot/facetgrid.py in _easy_facetgrid(data, plotfunc, kind, x, y, row, col, col_wrap, sharex, sharey, aspect, size, subplot_kws, ax, figsize, **kwargs)
    623         raise ValueError("cannot provide both `figsize` and `size` arguments")
    624
--> 625     g = FacetGrid(
    626         data=data,
    627         col=col,

/glade/work/abanihi/softwares/miniconda3/envs/lens-conversion/lib/python3.8/site-packages/xarray/plot/facetgrid.py in __init__(self, data, col, row, col_wrap, sharex, sharey, figsize, aspect, size, subplot_kws)
    117
    118         # Handle corner case of nonunique coordinates
--> 119         rep_col = col is not None and not data[col].to_index().is_unique
    120         rep_row = row is not None and not data[row].to_index().is_unique
    121         if rep_col or rep_row:

/glade/work/abanihi/softwares/miniconda3/envs/lens-conversion/lib/python3.8/site-packages/xarray/core/dataarray.py in to_index(self)
    570         arrays.
    571         """
--> 572         return self.variable.to_index()
    573
    574     @property

/glade/work/abanihi/softwares/miniconda3/envs/lens-conversion/lib/python3.8/site-packages/xarray/core/variable.py in to_index(self)
    468     def to_index(self):
    469         """Convert this variable to a pandas.Index"""
--> 470         return self.to_index_variable().to_index()
    471
    472     def to_dict(self, data=True):

/glade/work/abanihi/softwares/miniconda3/envs/lens-conversion/lib/python3.8/site-packages/xarray/core/variable.py in to_index_variable(self)
    460     def to_index_variable(self):
    461         """Return this variable as an xarray.IndexVariable"""
--> 462         return IndexVariable(
    463             self.dims, self._data, self._attrs, encoding=self._encoding, fastpath=True
    464         )

/glade/work/abanihi/softwares/miniconda3/envs/lens-conversion/lib/python3.8/site-packages/xarray/core/variable.py in __init__(self, dims, data, attrs, encoding, fastpath)
   2086         super().__init__(dims, data, attrs, encoding, fastpath)
   2087         if self.ndim != 1:
-> 2088             raise ValueError("%s objects must be 1-dimensional" % type(self).__name__)
   2089
   2090         # Unlike in Variable, always eagerly load values into memory

ValueError: IndexVariable objects must be 1-dimensional
```

The DataArray in question looks like this:

[Screenshot: Screen Shot 2020-05-04 at 7 36 27 PM]

Any way, I made some changes to the plotting command and I got everything to work (my kernel didn't die :)):

jbusecke commented 4 years ago

When I tried executing ds.o2.isel(time=-1, olevel=0).plot(col='member_id')

I believe this is only a single member dataset. My kernel dies when I try to do this plot command with several aggregated members. I assume that something happens during the aggregation of the (different length) members?

dcherian commented 4 years ago

I can look into it if you can provide an example notebook. Aren't these datasets all on glade too?

jbusecke commented 4 years ago

This should reproduce it if you have an intake catalog set up that has the full data (I was hoping this is the case on glade). The cloud data is truncated.

import xarray as xr
import intake

col = intake.open_esm_datastore(...)
cat = col.search(source_id='IPSL-CM6A-LR', variable_id='o2', experiment_id='ssp585')
ddict = cat.to_dataset_dict(cdf_kwargs={'chunks': {'time':6},'decode_times': True, 'use_cftime': True})
ds = ddict['ScenarioMIP.IPSL.IPSL-CM6A-LR.ssp585.Omon.gn']
ds.o2.isel(time=-1, olevel=0).plot(col='member_id')

jbusecke commented 4 years ago

I have found an intermediate solution for this by modifying the preprocess function to chop off any time values that go beyond 2100:

def preprocess(ds):
    ds = ds.copy()
    if 'ssp' in ds.attrs['experiment_id']:
        ds = ds.sel(time=slice(None, '2100'))
    return ds

If you replace the to_dataset_dict line above with ddict = cat.to_dataset_dict(cdf_kwargs={'chunks': {'time':6},'decode_times': True, 'use_cftime': True}, preprocess=preprocess), this should work without crashing. Hopefully this problem can be fixed upstream, since this workaround basically discards the data beyond 2100. I thought I would post it for anyone who might have the same problem.

jbusecke commented 4 years ago

I am running into this problem again and again. Since the upstream fix does not seem to be straightforward, I was wondering if we could alleviate the situation with some additional functionality here.

In most cases, the problem is caused by 1-2 members that are significantly shorter or longer than the others. I would be ok to ditch these for now and continue the analysis with fewer members until the problem is fixed upstream.

Is there a way to evaluate the dimension shape of all datasets before they are combined along a specific dimension (e.g. member_id), and to have an option like drop_time_mismatch='member_id' that eliminates the ones that do not agree with the majority size (the length of the time dimension found most often in the pool of members)?

jbusecke commented 4 years ago

Just blew up another notebook kernel of mine. I'll try to come up with a manual fix for now.

I was thinking along these lines:

Get full df from esm-datastore
Groupby each row except for member_id (I am working on zarr stores that have already been concatenated in time)
Check dimensionality for each store, pick the most common set of dimensions
Discard other members from dataframe
Replace the dataframe and read in as usual.

I can report back once I get this working. Any comments are much appreciated.
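The majority-vote step in that plan could look roughly like the sketch below. It is plain Python over a dict of time lengths rather than the esm-datastore dataframe, and the function name drop_time_mismatch is borrowed from the option name proposed earlier in this thread; it is hypothetical, not an existing intake-esm feature. The lengths are the ones printed in the dataset reprs above.

```python
from collections import Counter

# Time-axis length per member, read off the dataset reprs above.
time_len = {
    "r1i1p1f1": 3432,
    "r2i1p1f1": 1032,
    "r3i1p1f1": 1032,
    "r4i1p1f1": 1032,
    "r6i1p1f1": 1032,
}


def drop_time_mismatch(lengths):
    """Keep only the members whose time length equals the most common one."""
    majority, _ = Counter(lengths.values()).most_common(1)[0]
    return [member for member, n in lengths.items() if n == majority]


kept = drop_time_mismatch(time_len)
print(kept)
```

With these numbers only r1i1p1f1 is discarded. The kept member_ids could then be used to filter the catalog dataframe before reading as usual.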

dcherian commented 4 years ago

Dask will fix this: https://github.com/dask/dask/pull/6514

dcherian commented 4 years ago

Should be fixed in the dask release this Friday. Julius, please reopen if you run into it again.

jbusecke commented 4 years ago

Dooooope! Thanks so much. I'll check on it this week for sure.

intake / intake-esm

Problems combining members with different runtime #225