Closed jbusecke closed 4 years ago
Is there a way to make this work?
Yes, this is doable. This seems to be related to the time alignment issue in https://github.com/NCAR/intake-esm/issues/197. Earlier versions of intake-esm (prior to v2019.10) used to do time alignment for use cases in which one ensemble member had an extended time axis. If nobody beats me to it, I will look into this in the coming weeks (starting next week).
Any suggestions for solving this?
Here's the code that did time alignment in earlier versions of intake-esm:
Cc @dcherian in case he has suggestions on how to address this issue.
One of those datasets has duplicate time values. Fix that and things should work.
@dcherian, that was my first thought too, but I was not able to find duplicate times when loading each file individually, so I assumed the duplicates are created internally somehow. I will check again carefully...
Did you use ds.indexes["time"].is_unique to check?
Oh no. I just compared len(ds.time) and len(np.unique(ds.time.data)) and they are the same length. Let me check with your method...
Yes, I can confirm that each dataset has unique time values. So presumably something inside intake-esm causes a duplication?
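The two checks mentioned above answer the same question. A minimal sketch on a made-up pandas index (the dates are illustrative, not taken from the IPSL files) suggests that the length comparison against `np.unique` and `is_unique` agree, while `duplicated()` additionally points at the repeated entries:

```python
import numpy as np
import pandas as pd

# Hypothetical monthly time axis with one repeated timestamp.
time = pd.DatetimeIndex(["2015-01-16", "2015-02-15", "2015-02-15", "2015-03-16"])

# Check 1: compare the raw length with the number of distinct values.
same_length = len(time) == len(np.unique(time.values))

# Check 2: ask the index directly.
is_unique = time.is_unique

# Locate the repeats (second and later occurrences only).
repeated = time[time.duplicated()]

print(same_length, is_unique, list(repeated))
```

On an xarray dataset the same index is available as `ds.indexes["time"]`, as used elsewhere in this thread.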
I also checked if I can read all other members (except the long one), and that works perfectly fine.
I guess it could be a different dimension
Or not. The error says "time" hmm
I'm just learning intake-esm, so forgive me if this question is misdirected, but I was looking at your query results and I noticed something that doesn't match up. I looked up the available o2 files from IPSL on ESGF for ssp585 (https://esgf-node.llnl.gov/search/cmip6/?source_id=IPSL-CM6A-LR&variable_id=o2&experiment_id=ssp585) and the results are different. While ensemble members 14, 2, 3, 4, and 6 all have one file to download (years 2015-2100), ensemble member 1 has 3 files to download (years 2015-2300). Ensemble member 1 doesn't match because it has 4 query entries listed in your error report:
0 ScenarioMIP IPSL IPSL-CM6A-LR ssp585 r14i1p1f1 Omon
1 ScenarioMIP IPSL IPSL-CM6A-LR ssp585 r1i1p1f1 Omon
2 ScenarioMIP IPSL IPSL-CM6A-LR ssp585 r1i1p1f1 Omon
3 ScenarioMIP IPSL IPSL-CM6A-LR ssp585 r1i1p1f1 Omon
4 ScenarioMIP IPSL IPSL-CM6A-LR ssp585 r1i1p1f1 Omon
5 ScenarioMIP IPSL IPSL-CM6A-LR ssp585 r2i1p1f1 Omon
6 ScenarioMIP IPSL IPSL-CM6A-LR ssp585 r3i1p1f1 Omon
7 ScenarioMIP IPSL IPSL-CM6A-LR ssp585 r4i1p1f1 Omon
8 ScenarioMIP IPSL IPSL-CM6A-LR ssp585 r6i1p1f1 Omon
Do you have a duplicate file that is being found and this is causing duplicates in the time dimension? Or is the error within intake-esm?
@sherimickelson yes, that is right. It seems that this particular model ran an extended ssp585 scenario for member r1i1p1f1, but I am unable to 'pick that out' with the given functionality in intake-esm.
Ideally I would like intake-esm in this case to concat the members in time first and then merge them (producing missing values for 2100-2300 for all other members).
Another solution would be to somehow not use the extended time (on the pangeo cloud, these files were just removed), but for my specific science application this long run is extremely interesting, so I would love to investigate it!
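The "concat in time first, then merge" behaviour described above can be sketched with plain xarray. This is only an illustration on tiny synthetic datasets, not what intake-esm does internally; `fake_file` and the member names "a" and "b" are made up for the example:

```python
import numpy as np
import pandas as pd
import xarray as xr


def fake_file(member, start, periods):
    """Build a tiny stand-in for one netCDF file (values are arbitrary)."""
    time = pd.date_range(start, periods=periods, freq="MS")
    ds = xr.Dataset({"o2": ("time", np.ones(periods))}, coords={"time": time})
    return ds.expand_dims(member_id=[member])


# Member "a" has two consecutive files, member "b" only covers the first period.
files = {
    "a": [fake_file("a", "2015-01-01", 3), fake_file("a", "2015-04-01", 3)],
    "b": [fake_file("b", "2015-01-01", 3)],
}

# Step 1: concatenate each member's files along time.
per_member = [xr.concat(parts, dim="time") for parts in files.values()]

# Step 2: stack the members; the outer join pads the short member with NaN.
combined = xr.concat(per_member, dim="member_id", join="outer")

print(combined.o2.sel(member_id="b").values)
```

With `join="outer"` the time axis of the result is the union of all members' time axes, so the shorter members carry missing values over the extended period.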
@jbusecke, what's the output of
cat = col.search(source_id='IPSL-CM6A-LR', variable_id='o2', experiment_id='ssp585')
cat.df['path'].tolist()
?
Also, is the CSV from /tigress/GEOCLIM/LRGROUP/jbusecke/code/intake-esm-datastore/catalogs/tigressdata-cmip6.json publicly accessible?
The output is:
['/tiger/scratch/gpfs/GEOCLIM/synda/data/CMIP6/ScenarioMIP/IPSL/IPSL-CM6A-LR/ssp585/r14i1p1f1/Omon/o2/gn/v20191121/o2_Omon_IPSL-CM6A-LR_ssp585_r14i1p1f1_gn_201501-210012.nc',
'/tiger/scratch/gpfs/GEOCLIM/synda/data/CMIP6/ScenarioMIP/IPSL/IPSL-CM6A-LR/ssp585/r1i1p1f1/Omon/o2/gn/v20190119/o2_Omon_IPSL-CM6A-LR_ssp585_r1i1p1f1_gn_201501-210012.nc',
'/tiger/scratch/gpfs/GEOCLIM/synda/data/CMIP6/ScenarioMIP/IPSL/IPSL-CM6A-LR/ssp585/r1i1p1f1/Omon/o2/gn/v20190903/o2_Omon_IPSL-CM6A-LR_ssp585_r1i1p1f1_gn_201501-210012.nc',
'/tiger/scratch/gpfs/GEOCLIM/synda/data/CMIP6/ScenarioMIP/IPSL/IPSL-CM6A-LR/ssp585/r1i1p1f1/Omon/o2/gn/v20190903/o2_Omon_IPSL-CM6A-LR_ssp585_r1i1p1f1_gn_210101-220012.nc',
'/tiger/scratch/gpfs/GEOCLIM/synda/data/CMIP6/ScenarioMIP/IPSL/IPSL-CM6A-LR/ssp585/r1i1p1f1/Omon/o2/gn/v20190903/o2_Omon_IPSL-CM6A-LR_ssp585_r1i1p1f1_gn_220101-230012.nc',
'/tiger/scratch/gpfs/GEOCLIM/synda/data/CMIP6/ScenarioMIP/IPSL/IPSL-CM6A-LR/ssp585/r2i1p1f1/Omon/o2/gn/v20191121/o2_Omon_IPSL-CM6A-LR_ssp585_r2i1p1f1_gn_201501-210012.nc',
'/tiger/scratch/gpfs/GEOCLIM/synda/data/CMIP6/ScenarioMIP/IPSL/IPSL-CM6A-LR/ssp585/r3i1p1f1/Omon/o2/gn/v20191121/o2_Omon_IPSL-CM6A-LR_ssp585_r3i1p1f1_gn_201501-210012.nc',
'/tiger/scratch/gpfs/GEOCLIM/synda/data/CMIP6/ScenarioMIP/IPSL/IPSL-CM6A-LR/ssp585/r4i1p1f1/Omon/o2/gn/v20191122/o2_Omon_IPSL-CM6A-LR_ssp585_r4i1p1f1_gn_201501-210012.nc',
'/tiger/scratch/gpfs/GEOCLIM/synda/data/CMIP6/ScenarioMIP/IPSL/IPSL-CM6A-LR/ssp585/r6i1p1f1/Omon/o2/gn/v20191121/o2_Omon_IPSL-CM6A-LR_ssp585_r6i1p1f1_gn_201501-210012.nc']
The csv is not publicly accessible at the moment, but I could send it to you if that works? Do you need access to the data to debug? Or just the csv?
Check your email for the files.
@jbusecke It looks like your catalog includes both versions of this file:
'/tiger/scratch/gpfs/GEOCLIM/synda/data/CMIP6/ScenarioMIP/IPSL/IPSL-CM6A-LR/ssp585/r1i1p1f1/Omon/o2/gn/v20190119/o2_Omon_IPSL-CM6A-LR_ssp585_r1i1p1f1_gn_201501-210012.nc',
'/tiger/scratch/gpfs/GEOCLIM/synda/data/CMIP6/ScenarioMIP/IPSL/IPSL-CM6A-LR/ssp585/r1i1p1f1/Omon/o2/gn/v20190903/o2_Omon_IPSL-CM6A-LR_ssp585_r1i1p1f1_gn_201501-210012.nc'
It looks like that's where the duplication in time is happening. You might want to make sure it's not using version v20190119, because the other files in that time series are using v20190903.
@sherimickelson, thank you for catching the version issue.
@jbusecke, when you built the catalog, did you use the --pick-latest-version flag?
As Sheri pointed out, it appears that for some files, there's more than one version:
/tiger/scratch/gpfs/GEOCLIM/synda/data/CMIP6/ScenarioMIP/IPSL/IPSL-CM6A-LR/ssp585/r1i1p1f1/Omon/o2/gn/v20190119/o2_Omon_IPSL-CM6A-LR_ssp585_r1i1p1f1_gn_201501-210012.nc ---> version=v20190119
/tiger/scratch/gpfs/GEOCLIM/synda/data/CMIP6/ScenarioMIP/IPSL/IPSL-CM6A-LR/ssp585/r1i1p1f1/Omon/o2/gn/v20190903/o2_Omon_IPSL-CM6A-LR_ssp585_r1i1p1f1_gn_201501-210012.nc ---> version=v20190903
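Since the version is the directory directly above the file name, files that exist under more than one version can be spotted from the path list alone. A small sketch, assuming every path follows that `<version>/<file name>` layout; `find_multi_version` is a hypothetical helper (not part of intake-esm) and the paths are shortened stand-ins:

```python
from collections import defaultdict
from pathlib import PurePosixPath


def find_multi_version(paths):
    """Return {file name: sorted versions} for names found under more than one version."""
    versions = defaultdict(set)
    for p in map(PurePosixPath, paths):
        # Assumes the layout .../<version>/<file name>.
        versions[p.name].add(p.parent.name)
    return {name: sorted(v) for name, v in versions.items() if len(v) > 1}


paths = [
    "r1i1p1f1/Omon/o2/gn/v20190119/o2_Omon_IPSL-CM6A-LR_ssp585_r1i1p1f1_gn_201501-210012.nc",
    "r1i1p1f1/Omon/o2/gn/v20190903/o2_Omon_IPSL-CM6A-LR_ssp585_r1i1p1f1_gn_201501-210012.nc",
    "r1i1p1f1/Omon/o2/gn/v20190903/o2_Omon_IPSL-CM6A-LR_ssp585_r1i1p1f1_gn_210101-220012.nc",
]

dupes = find_multi_version(paths)
print(dupes)
```

Running it on the output of `cat.df['path'].tolist()` would give a quick list of candidates to check against the catalog.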
Ohh that would make sense I guess. But I did invoke the flag you mentioned. Here is the full command:
python cmip.py --root-path /tiger/scratch/gpfs/GEOCLIM/synda/data/CMIP6 --pick-latest-version --cmip-version 6 --csv-filepath ../catalogs/tigressdata-cmip6.csv.gz --depth 4
Should I rebuild the catalog once more?
Not yet. I am looking into the csv you just sent me to make sure that the --pick-latest-version flag is working as expected.
@andersy005 Does the latest flag only work on directories called "latest" like we have on glade? Or does it look and see which version number is higher? If it only does the first, that could be the issue because neither are in a "latest" directory. (I can't remember what the code does)
It looks for which version number is higher.
import dask
from dask.diagnostics import ProgressBar


def _pick_latest_version(df):
    import itertools

    # Group on every column except 'path' and 'version'.
    grpby = list(set(df.columns.tolist()) - {'path', 'version'})
    groups = df.groupby(grpby)

    @dask.delayed
    def _pick_latest_v(group):
        # Collect the index labels of all but the newest version in this group.
        idx = []
        if group.version.nunique() > 1:
            idx = group.sort_values(by=['version'], ascending=False).index[1:].values.tolist()
        return idx

    idx_to_remove = [_pick_latest_v(group) for _, group in groups]
    print('Getting latest version...\n')
    with ProgressBar():
        idx_to_remove = dask.compute(*idx_to_remove)
    idx_to_remove = list(itertools.chain(*idx_to_remove))
    df = df.drop(index=idx_to_remove)
    print('\nDone....\n')
    return df
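For comparison, the same "keep only the newest version per group" idea can be written without dask. This is a sketch on a made-up three-row catalog, not the catalog builder's actual code; the `extra` column is hypothetical and only there to show how missing values in a key column behave:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "member_id": ["r1i1p1f1", "r1i1p1f1", "r2i1p1f1"],
        "time_range": ["201501-210012", "201501-210012", "201501-210012"],
        "extra": [np.nan, np.nan, np.nan],  # hypothetical column with no values
        "version": ["v20190119", "v20190903", "v20191121"],
        "path": ["old.nc", "new.nc", "other.nc"],
    }
)

keys = [c for c in df.columns if c not in ("path", "version")]

# Sort so the newest version comes last, then keep one row per key combination.
latest = df.sort_values("version").drop_duplicates(subset=keys, keep="last")

# For comparison: by default, groupby skips rows whose key contains NaN.
n_groups_default = len(df.groupby(keys).size())

print(latest["path"].tolist(), n_groups_default)
```

One design point: `drop_duplicates` treats missing values in the key columns as equal, whereas `groupby` drops rows with missing keys unless told otherwise. Whether that difference plays any role for this particular catalog is not established in this thread; it is only something that may be worth ruling out.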
Not sure if this is relevant, but if I specify member_id='r1i1p1f1', it still registers the duplicate files:
['/tiger/scratch/gpfs/GEOCLIM/synda/data/CMIP6/ScenarioMIP/IPSL/IPSL-CM6A-LR/ssp585/r1i1p1f1/Omon/o2/gn/v20190119/o2_Omon_IPSL-CM6A-LR_ssp585_r1i1p1f1_gn_201501-210012.nc',
'/tiger/scratch/gpfs/GEOCLIM/synda/data/CMIP6/ScenarioMIP/IPSL/IPSL-CM6A-LR/ssp585/r1i1p1f1/Omon/o2/gn/v20190903/o2_Omon_IPSL-CM6A-LR_ssp585_r1i1p1f1_gn_201501-210012.nc',
'/tiger/scratch/gpfs/GEOCLIM/synda/data/CMIP6/ScenarioMIP/IPSL/IPSL-CM6A-LR/ssp585/r1i1p1f1/Omon/o2/gn/v20190903/o2_Omon_IPSL-CM6A-LR_ssp585_r1i1p1f1_gn_210101-220012.nc',
'/tiger/scratch/gpfs/GEOCLIM/synda/data/CMIP6/ScenarioMIP/IPSL/IPSL-CM6A-LR/ssp585/r1i1p1f1/Omon/o2/gn/v20190903/o2_Omon_IPSL-CM6A-LR_ssp585_r1i1p1f1_gn_220101-230012.nc']
yet it combines the files without complaining:
<xarray.Dataset>
Dimensions: (axis_nbounds: 2, member_id: 1, nvertex: 4, olevel: 75, time: 3432, x: 362, y: 332)
Coordinates:
nav_lat (y, x) float32 dask.array<chunksize=(332, 362), meta=np.ndarray>
nav_lon (y, x) float32 dask.array<chunksize=(332, 362), meta=np.ndarray>
* olevel (olevel) float32 0.50576 1.5558553 ... 5698.0605 5902.0576
* time (time) object 2015-01-16T12:00:00 ... 2300-12-16 12:00:00
* member_id (member_id) <U8 'r1i1p1f1'
Dimensions without coordinates: axis_nbounds, nvertex, x, y
Data variables:
olevel_bounds (time, olevel, axis_nbounds) float32 dask.array<chunksize=(1032, 75, 2), meta=np.ndarray>
time_bounds (time, axis_nbounds) object dask.array<chunksize=(6, 2), meta=np.ndarray>
bounds_nav_lat (time, y, x, nvertex) float32 dask.array<chunksize=(1032, 332, 362, 4), meta=np.ndarray>
area (time, y, x) float32 dask.array<chunksize=(1032, 332, 362), meta=np.ndarray>
bounds_nav_lon (time, y, x, nvertex) float32 dask.array<chunksize=(1032, 332, 362, 4), meta=np.ndarray>
o2 (member_id, time, olevel, y, x) float32 dask.array<chunksize=(1, 6, 75, 332, 362), meta=np.ndarray>
Attributes:
branch_time_in_parent: 60265.0
parent_source_id: IPSL-CM6A-LR
frequency: mon
source_type: AOGCM BGC
realm: ocnBgchem
branch_method: standard
sub_experiment_id: none
parent_variant_label: r1i1p1f1
description: Future scenario with high radiative forcing by th...
external_variables: areacello volcello
history: none
sub_experiment: none
branch_time_in_child: 0.0
source: IPSL-CM6A-LR (2017): atmos: LMDZ (NPv6, N96; 144...
creation_date: 2019-08-22T22:26:32Z
mip_era: CMIP6
intake_esm_varname: o2
forcing_index: 1
name: /ccc/work/cont003/gencmip6/oboucher/IGCM_OUT/IPSL...
grid: native ocean tri-polar grid with 105 k ocean cells
parent_time_units: days since 1850-01-01 00:00:00
variable_id: o2
license: CMIP6 model data produced by IPSL is licensed und...
parent_activity_id: CMIP
nominal_resolution: 100 km
experiment_id: ssp585
activity_id: ScenarioMIP
parent_mip_era: CMIP6
title: IPSL-CM6A-LR model output prepared for CMIP6 / Sc...
product: model-output
further_info_url: https://furtherinfo.es-doc.org/CMIP6.IPSL.IPSL-CM...
institution_id: IPSL
variant_label: r1i1p1f1
data_specs_version: 01.00.28
grid_label: gn
realization_index: 1
tracking_id: hdl:21.14100/dcd42bc5-cc58-4234-b58e-21e4b624ba04...
institution: Institut Pierre Simon Laplace, Paris 75252, France
initialization_index: 1
experiment: update of RCP8.5 based on SSP5
contact: ipsl-cmip6@listes.ipsl.fr
dr2xml_md5sum: c4b76079137f2c3b9298396d121b21c1
table_id: Omon
source_id: IPSL-CM6A-LR
CMIP6_CV_version: cv=6.2.15.1
parent_experiment_id: historical
Conventions: CF-1.7 CMIP-6.2
physics_index: 1
dr2xml_version: 1.16
model_version: 6.1.8
EXPID: ssp585
variant_info: Each member starts from the corresponding member ...
It will combine, but check .indexes["time"].is_unique of the combined result...
@jbusecke, as a temporary solution, try the following and see what happens:
url = "/tigress/GEOCLIM/LRGROUP/jbusecke/code/intake-esm-datastore/catalogs/tigressdata-cmip6.json"
col = intake.open_esm_datastore(url)
cat = col.search(source_id="IPSL-CM6A-LR", variable_id="o2", experiment_id="ssp585")
cat.df = cat.df.drop(index=[1]) # Drop the problematic file
ddict = cat.to_dataset_dict(cdf_kwargs={"chunks": {"time": 6}, "decode_times": True,})
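The hard-coded `index=[1]` depends on the row order of this particular search result. A slightly more defensive variant, sketched here on a stand-in DataFrame rather than the real `cat.df` (the paths are shortened placeholders), selects the stale row by its version directory instead:

```python
import pandas as pd

# Stand-in for cat.df with only the column needed here.
df = pd.DataFrame(
    {
        "path": [
            "r14i1p1f1/Omon/o2/gn/v20191121/a.nc",
            "r1i1p1f1/Omon/o2/gn/v20190119/b.nc",
            "r1i1p1f1/Omon/o2/gn/v20190903/b.nc",
        ]
    }
)

# Flag rows that live under the superseded version directory.
stale = df["path"].str.contains("/v20190119/", regex=False)
cleaned = df[~stale]

print(cleaned["path"].tolist())
```

The original index labels are preserved, so the result can be inspected against the unfiltered search before it is used.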
@dcherian this is what I did:
members = cat.df['member_id'].unique()
for member in members:
    print(f'#####################{member}##################')
    print(member)
    cat_single = col.search(source_id='IPSL-CM6A-LR', variable_id='o2', experiment_id='ssp585', member_id=member)
    ddict = cat_single.to_dataset_dict(cdf_kwargs={'chunks': {'time': 6}, 'decode_times': True})
    assert len(ddict.keys()) == 1
    _, ds = ddict.popitem()
    print(ds.indexes["time"].is_unique)
    print(f"{len(ds.time)}/{len(np.unique(ds.time.data))}")
    print(cat_single.df['path'].tolist())
    print(ds)
I get:
#####################r14i1p1f1##################
r14i1p1f1
--> The keys in the returned dictionary of datasets are constructed as follows:
'activity_id.institution_id.source_id.experiment_id.table_id.grid_label'
--> There is/are 1 group(s)
True
1032/1032
['/tiger/scratch/gpfs/GEOCLIM/synda/data/CMIP6/ScenarioMIP/IPSL/IPSL-CM6A-LR/ssp585/r14i1p1f1/Omon/o2/gn/v20191121/o2_Omon_IPSL-CM6A-LR_ssp585_r14i1p1f1_gn_201501-210012.nc']
<xarray.Dataset>
Dimensions: (axis_nbounds: 2, member_id: 1, nvertex: 4, olevel: 75, time: 1032, x: 362, y: 332)
Coordinates:
nav_lat (y, x) float32 dask.array<chunksize=(332, 362), meta=np.ndarray>
nav_lon (y, x) float32 dask.array<chunksize=(332, 362), meta=np.ndarray>
* olevel (olevel) float32 0.50576 1.5558553 ... 5698.0605 5902.0576
* time (time) datetime64[ns] 2015-01-16T12:00:00 ... 2100-12-16T12:00:00
* member_id (member_id) <U9 'r14i1p1f1'
Dimensions without coordinates: axis_nbounds, nvertex, x, y
Data variables:
olevel_bounds (time, olevel, axis_nbounds) float32 dask.array<chunksize=(1032, 75, 2), meta=np.ndarray>
time_bounds (time, axis_nbounds) datetime64[ns] dask.array<chunksize=(6, 2), meta=np.ndarray>
bounds_nav_lat (time, y, x, nvertex) float32 dask.array<chunksize=(1032, 332, 362, 4), meta=np.ndarray>
area (time, y, x) float32 dask.array<chunksize=(1032, 332, 362), meta=np.ndarray>
bounds_nav_lon (time, y, x, nvertex) float32 dask.array<chunksize=(1032, 332, 362, 4), meta=np.ndarray>
o2 (member_id, time, olevel, y, x) float32 dask.array<chunksize=(1, 6, 75, 332, 362), meta=np.ndarray>
Attributes:
name: /ccc/work/cont003/gencmip6/oboucher/IGCM_OUT/IPSL...
Conventions: CF-1.7 CMIP-6.2
creation_date: 2019-10-13T08:15:27Z
tracking_id: hdl:21.14100/fefe8ea3-d0b3-4828-ba16-b66fad793928
description: Future scenario with high radiative forcing by th...
title: IPSL-CM6A-LR model output prepared for CMIP6 / Sc...
activity_id: ScenarioMIP
contact: ipsl-cmip6@listes.ipsl.fr
data_specs_version: 01.00.28
dr2xml_version: 1.16
experiment_id: ssp585
experiment: update of RCP8.5 based on SSP5
external_variables: areacello volcello
forcing_index: 1
frequency: mon
further_info_url: https://furtherinfo.es-doc.org/CMIP6.IPSL.IPSL-CM...
grid: native ocean tri-polar grid with 105 k ocean cells
grid_label: gn
nominal_resolution: 100 km
history: none
initialization_index: 1
institution_id: IPSL
institution: Institut Pierre Simon Laplace, Paris 75252, France
license: CMIP6 model data produced by IPSL is licensed und...
mip_era: CMIP6
parent_mip_era: CMIP6
parent_source_id: IPSL-CM6A-LR
parent_time_units: days since 1850-01-01 00:00:00
parent_variant_label: r14i1p1f1
branch_method: standard
branch_time_in_parent: 60265.0
branch_time_in_child: 0.0
physics_index: 1
product: model-output
realization_index: 14
realm: ocnBgchem
source: IPSL-CM6A-LR (2017): atmos: LMDZ (NPv6, N96; 144...
source_id: IPSL-CM6A-LR
source_type: AOGCM BGC
sub_experiment_id: none
sub_experiment: none
table_id: Omon
variable_id: o2
variant_info: Each member starts from the corresponding member ...
variant_label: r14i1p1f1
EXPID: ssp585
CMIP6_CV_version: cv=6.2.15.1
dr2xml_md5sum: b6f602401512e82e2d7cadc2c6f36c2a
model_version: 6.1.10
parent_experiment_id: historical
parent_activity_id: CMIP
intake_esm_varname: o2
#####################r1i1p1f1##################
r1i1p1f1
--> The keys in the returned dictionary of datasets are constructed as follows:
'activity_id.institution_id.source_id.experiment_id.table_id.grid_label'
--> There is/are 1 group(s)
/tigress/jbusecke/code/conda/envs/euc_dynamics/lib/python3.8/site-packages/xarray/coding/times.py:426: SerializationWarning: Unable to decode time axis into full numpy.datetime64 objects, continuing using cftime.datetime objects instead, reason: dates out of range
dtype = _decode_cf_datetime_dtype(data, units, calendar, self.use_cftime)
/tigress/jbusecke/code/conda/envs/euc_dynamics/lib/python3.8/site-packages/xarray/coding/times.py:426: SerializationWarning: Unable to decode time axis into full numpy.datetime64 objects, continuing using cftime.datetime objects instead, reason: dates out of range
dtype = _decode_cf_datetime_dtype(data, units, calendar, self.use_cftime)
/tigress/jbusecke/code/conda/envs/euc_dynamics/lib/python3.8/site-packages/numpy/core/_asarray.py:85: SerializationWarning: Unable to decode time axis into full numpy.datetime64 objects, continuing using cftime.datetime objects instead, reason: dates out of range
return array(a, dtype, copy=False, order=order)
True
3432/3432
['/tiger/scratch/gpfs/GEOCLIM/synda/data/CMIP6/ScenarioMIP/IPSL/IPSL-CM6A-LR/ssp585/r1i1p1f1/Omon/o2/gn/v20190119/o2_Omon_IPSL-CM6A-LR_ssp585_r1i1p1f1_gn_201501-210012.nc', '/tiger/scratch/gpfs/GEOCLIM/synda/data/CMIP6/ScenarioMIP/IPSL/IPSL-CM6A-LR/ssp585/r1i1p1f1/Omon/o2/gn/v20190903/o2_Omon_IPSL-CM6A-LR_ssp585_r1i1p1f1_gn_201501-210012.nc', '/tiger/scratch/gpfs/GEOCLIM/synda/data/CMIP6/ScenarioMIP/IPSL/IPSL-CM6A-LR/ssp585/r1i1p1f1/Omon/o2/gn/v20190903/o2_Omon_IPSL-CM6A-LR_ssp585_r1i1p1f1_gn_210101-220012.nc', '/tiger/scratch/gpfs/GEOCLIM/synda/data/CMIP6/ScenarioMIP/IPSL/IPSL-CM6A-LR/ssp585/r1i1p1f1/Omon/o2/gn/v20190903/o2_Omon_IPSL-CM6A-LR_ssp585_r1i1p1f1_gn_220101-230012.nc']
<xarray.Dataset>
Dimensions: (axis_nbounds: 2, member_id: 1, nvertex: 4, olevel: 75, time: 3432, x: 362, y: 332)
Coordinates:
nav_lat (y, x) float32 dask.array<chunksize=(332, 362), meta=np.ndarray>
nav_lon (y, x) float32 dask.array<chunksize=(332, 362), meta=np.ndarray>
* olevel (olevel) float32 0.50576 1.5558553 ... 5698.0605 5902.0576
* time (time) object 2015-01-16T12:00:00 ... 2300-12-16 12:00:00
* member_id (member_id) <U8 'r1i1p1f1'
Dimensions without coordinates: axis_nbounds, nvertex, x, y
Data variables:
olevel_bounds (time, olevel, axis_nbounds) float32 dask.array<chunksize=(1032, 75, 2), meta=np.ndarray>
time_bounds (time, axis_nbounds) object dask.array<chunksize=(6, 2), meta=np.ndarray>
bounds_nav_lat (time, y, x, nvertex) float32 dask.array<chunksize=(1032, 332, 362, 4), meta=np.ndarray>
area (time, y, x) float32 dask.array<chunksize=(1032, 332, 362), meta=np.ndarray>
bounds_nav_lon (time, y, x, nvertex) float32 dask.array<chunksize=(1032, 332, 362, 4), meta=np.ndarray>
o2 (member_id, time, olevel, y, x) float32 dask.array<chunksize=(1, 6, 75, 332, 362), meta=np.ndarray>
Attributes:
branch_time_in_parent: 60265.0
parent_source_id: IPSL-CM6A-LR
frequency: mon
source_type: AOGCM BGC
realm: ocnBgchem
branch_method: standard
sub_experiment_id: none
parent_variant_label: r1i1p1f1
description: Future scenario with high radiative forcing by th...
external_variables: areacello volcello
history: none
sub_experiment: none
branch_time_in_child: 0.0
source: IPSL-CM6A-LR (2017): atmos: LMDZ (NPv6, N96; 144...
creation_date: 2019-08-22T22:26:32Z
mip_era: CMIP6
intake_esm_varname: o2
forcing_index: 1
name: /ccc/work/cont003/gencmip6/oboucher/IGCM_OUT/IPSL...
grid: native ocean tri-polar grid with 105 k ocean cells
parent_time_units: days since 1850-01-01 00:00:00
variable_id: o2
license: CMIP6 model data produced by IPSL is licensed und...
parent_activity_id: CMIP
nominal_resolution: 100 km
experiment_id: ssp585
activity_id: ScenarioMIP
parent_mip_era: CMIP6
title: IPSL-CM6A-LR model output prepared for CMIP6 / Sc...
product: model-output
further_info_url: https://furtherinfo.es-doc.org/CMIP6.IPSL.IPSL-CM...
institution_id: IPSL
variant_label: r1i1p1f1
data_specs_version: 01.00.28
grid_label: gn
realization_index: 1
tracking_id: hdl:21.14100/dcd42bc5-cc58-4234-b58e-21e4b624ba04...
institution: Institut Pierre Simon Laplace, Paris 75252, France
initialization_index: 1
experiment: update of RCP8.5 based on SSP5
contact: ipsl-cmip6@listes.ipsl.fr
dr2xml_md5sum: c4b76079137f2c3b9298396d121b21c1
table_id: Omon
source_id: IPSL-CM6A-LR
CMIP6_CV_version: cv=6.2.15.1
parent_experiment_id: historical
Conventions: CF-1.7 CMIP-6.2
physics_index: 1
dr2xml_version: 1.16
model_version: 6.1.8
EXPID: ssp585
variant_info: Each member starts from the corresponding member ...
#####################r2i1p1f1##################
r2i1p1f1
--> The keys in the returned dictionary of datasets are constructed as follows:
'activity_id.institution_id.source_id.experiment_id.table_id.grid_label'
--> There is/are 1 group(s)
True
1032/1032
['/tiger/scratch/gpfs/GEOCLIM/synda/data/CMIP6/ScenarioMIP/IPSL/IPSL-CM6A-LR/ssp585/r2i1p1f1/Omon/o2/gn/v20191121/o2_Omon_IPSL-CM6A-LR_ssp585_r2i1p1f1_gn_201501-210012.nc']
<xarray.Dataset>
Dimensions: (axis_nbounds: 2, member_id: 1, nvertex: 4, olevel: 75, time: 1032, x: 362, y: 332)
Coordinates:
nav_lat (y, x) float32 dask.array<chunksize=(332, 362), meta=np.ndarray>
nav_lon (y, x) float32 dask.array<chunksize=(332, 362), meta=np.ndarray>
* olevel (olevel) float32 0.50576 1.5558553 ... 5698.0605 5902.0576
* time (time) datetime64[ns] 2015-01-16T12:00:00 ... 2100-12-16T12:00:00
* member_id (member_id) <U8 'r2i1p1f1'
Dimensions without coordinates: axis_nbounds, nvertex, x, y
Data variables:
olevel_bounds (time, olevel, axis_nbounds) float32 dask.array<chunksize=(1032, 75, 2), meta=np.ndarray>
time_bounds (time, axis_nbounds) datetime64[ns] dask.array<chunksize=(6, 2), meta=np.ndarray>
bounds_nav_lat (time, y, x, nvertex) float32 dask.array<chunksize=(1032, 332, 362, 4), meta=np.ndarray>
area (time, y, x) float32 dask.array<chunksize=(1032, 332, 362), meta=np.ndarray>
bounds_nav_lon (time, y, x, nvertex) float32 dask.array<chunksize=(1032, 332, 362, 4), meta=np.ndarray>
o2 (member_id, time, olevel, y, x) float32 dask.array<chunksize=(1, 6, 75, 332, 362), meta=np.ndarray>
Attributes:
name: /ccc/work/cont003/gencmip6/lurtont/IGCM_OUT/IPSLC...
Conventions: CF-1.7 CMIP-6.2
creation_date: 2019-10-18T06:38:14Z
tracking_id: hdl:21.14100/d025e86c-aa6a-4d52-88ca-bc649f48233f
description: Future scenario with high radiative forcing by th...
title: IPSL-CM6A-LR model output prepared for CMIP6 / Sc...
activity_id: ScenarioMIP
contact: ipsl-cmip6@listes.ipsl.fr
data_specs_version: 01.00.28
dr2xml_version: 1.16
experiment_id: ssp585
experiment: update of RCP8.5 based on SSP5
external_variables: areacello volcello
forcing_index: 1
frequency: mon
further_info_url: https://furtherinfo.es-doc.org/CMIP6.IPSL.IPSL-CM...
grid: native ocean tri-polar grid with 105 k ocean cells
grid_label: gn
nominal_resolution: 100 km
history: none
initialization_index: 1
institution_id: IPSL
institution: Institut Pierre Simon Laplace, Paris 75252, France
license: CMIP6 model data produced by IPSL is licensed und...
mip_era: CMIP6
parent_mip_era: CMIP6
parent_source_id: IPSL-CM6A-LR
parent_time_units: days since 1850-01-01 00:00:00
parent_variant_label: r2i1p1f1
branch_method: standard
branch_time_in_parent: 60265.0
branch_time_in_child: 0.0
physics_index: 1
product: model-output
realization_index: 2
realm: ocnBgchem
source: IPSL-CM6A-LR (2017): atmos: LMDZ (NPv6, N96; 144...
source_id: IPSL-CM6A-LR
source_type: AOGCM BGC
sub_experiment_id: none
sub_experiment: none
table_id: Omon
variable_id: o2
variant_info: Each member starts from the corresponding member ...
variant_label: r2i1p1f1
EXPID: ssp585
CMIP6_CV_version: cv=6.2.15.1
dr2xml_md5sum: b6f602401512e82e2d7cadc2c6f36c2a
model_version: 6.1.10
parent_experiment_id: historical
parent_activity_id: CMIP
intake_esm_varname: o2
#####################r3i1p1f1##################
r3i1p1f1
--> The keys in the returned dictionary of datasets are constructed as follows:
'activity_id.institution_id.source_id.experiment_id.table_id.grid_label'
--> There is/are 1 group(s)
True
1032/1032
['/tiger/scratch/gpfs/GEOCLIM/synda/data/CMIP6/ScenarioMIP/IPSL/IPSL-CM6A-LR/ssp585/r3i1p1f1/Omon/o2/gn/v20191121/o2_Omon_IPSL-CM6A-LR_ssp585_r3i1p1f1_gn_201501-210012.nc']
<xarray.Dataset>
Dimensions: (axis_nbounds: 2, member_id: 1, nvertex: 4, olevel: 75, time: 1032, x: 362, y: 332)
Coordinates:
nav_lat (y, x) float32 dask.array<chunksize=(332, 362), meta=np.ndarray>
nav_lon (y, x) float32 dask.array<chunksize=(332, 362), meta=np.ndarray>
* olevel (olevel) float32 0.50576 1.5558553 ... 5698.0605 5902.0576
* time (time) datetime64[ns] 2015-01-16T12:00:00 ... 2100-12-16T12:00:00
* member_id (member_id) <U8 'r3i1p1f1'
Dimensions without coordinates: axis_nbounds, nvertex, x, y
Data variables:
olevel_bounds (time, olevel, axis_nbounds) float32 dask.array<chunksize=(1032, 75, 2), meta=np.ndarray>
time_bounds (time, axis_nbounds) datetime64[ns] dask.array<chunksize=(6, 2), meta=np.ndarray>
bounds_nav_lat (time, y, x, nvertex) float32 dask.array<chunksize=(1032, 332, 362, 4), meta=np.ndarray>
area (time, y, x) float32 dask.array<chunksize=(1032, 332, 362), meta=np.ndarray>
bounds_nav_lon (time, y, x, nvertex) float32 dask.array<chunksize=(1032, 332, 362, 4), meta=np.ndarray>
o2 (member_id, time, olevel, y, x) float32 dask.array<chunksize=(1, 6, 75, 332, 362), meta=np.ndarray>
Attributes:
name: /ccc/work/cont003/gencmip6/lurtont/IGCM_OUT/IPSLC...
Conventions: CF-1.7 CMIP-6.2
creation_date: 2019-10-22T12:30:57Z
tracking_id: hdl:21.14100/edb14fa8-7e14-466c-9c58-012c75147c94
description: Future scenario with high radiative forcing by th...
title: IPSL-CM6A-LR model output prepared for CMIP6 / Sc...
activity_id: ScenarioMIP
contact: ipsl-cmip6@listes.ipsl.fr
data_specs_version: 01.00.28
dr2xml_version: 1.16
experiment_id: ssp585
experiment: update of RCP8.5 based on SSP5
external_variables: areacello volcello
forcing_index: 1
frequency: mon
further_info_url: https://furtherinfo.es-doc.org/CMIP6.IPSL.IPSL-CM...
grid: native ocean tri-polar grid with 105 k ocean cells
grid_label: gn
nominal_resolution: 100 km
history: none
initialization_index: 1
institution_id: IPSL
institution: Institut Pierre Simon Laplace, Paris 75252, France
license: CMIP6 model data produced by IPSL is licensed und...
mip_era: CMIP6
parent_mip_era: CMIP6
parent_source_id: IPSL-CM6A-LR
parent_time_units: days since 1850-01-01 00:00:00
parent_variant_label: r3i1p1f1
branch_method: standard
branch_time_in_parent: 60265.0
branch_time_in_child: 0.0
physics_index: 1
product: model-output
realization_index: 3
realm: ocnBgchem
source: IPSL-CM6A-LR (2017): atmos: LMDZ (NPv6, N96; 144...
source_id: IPSL-CM6A-LR
source_type: AOGCM BGC
sub_experiment_id: none
sub_experiment: none
table_id: Omon
variable_id: o2
variant_info: Each member starts from the corresponding member ...
variant_label: r3i1p1f1
EXPID: ssp585
CMIP6_CV_version: cv=6.2.15.1
dr2xml_md5sum: b6f602401512e82e2d7cadc2c6f36c2a
model_version: 6.1.10
parent_experiment_id: historical
parent_activity_id: CMIP
intake_esm_varname: o2
#####################r4i1p1f1##################
r4i1p1f1
--> The keys in the returned dictionary of datasets are constructed as follows:
'activity_id.institution_id.source_id.experiment_id.table_id.grid_label'
--> There is/are 1 group(s)
True
1032/1032
['/tiger/scratch/gpfs/GEOCLIM/synda/data/CMIP6/ScenarioMIP/IPSL/IPSL-CM6A-LR/ssp585/r4i1p1f1/Omon/o2/gn/v20191122/o2_Omon_IPSL-CM6A-LR_ssp585_r4i1p1f1_gn_201501-210012.nc']
<xarray.Dataset>
Dimensions: (axis_nbounds: 2, member_id: 1, nvertex: 4, olevel: 75, time: 1032, x: 362, y: 332)
Coordinates:
nav_lat (y, x) float32 dask.array<chunksize=(332, 362), meta=np.ndarray>
nav_lon (y, x) float32 dask.array<chunksize=(332, 362), meta=np.ndarray>
* olevel (olevel) float32 0.50576 1.5558553 ... 5698.0605 5902.0576
* time (time) datetime64[ns] 2015-01-16T12:00:00 ... 2100-12-16T12:00:00
* member_id (member_id) <U8 'r4i1p1f1'
Dimensions without coordinates: axis_nbounds, nvertex, x, y
Data variables:
olevel_bounds (time, olevel, axis_nbounds) float32 dask.array<chunksize=(1032, 75, 2), meta=np.ndarray>
time_bounds (time, axis_nbounds) datetime64[ns] dask.array<chunksize=(6, 2), meta=np.ndarray>
bounds_nav_lat (time, y, x, nvertex) float32 dask.array<chunksize=(1032, 332, 362, 4), meta=np.ndarray>
area (time, y, x) float32 dask.array<chunksize=(1032, 332, 362), meta=np.ndarray>
bounds_nav_lon (time, y, x, nvertex) float32 dask.array<chunksize=(1032, 332, 362, 4), meta=np.ndarray>
o2 (member_id, time, olevel, y, x) float32 dask.array<chunksize=(1, 6, 75, 332, 362), meta=np.ndarray>
Attributes:
name: /ccc/work/cont003/gencmip6/dupontel/IGCM_OUT/IPSL...
Conventions: CF-1.7 CMIP-6.2
creation_date: 2019-10-22T10:48:30Z
tracking_id: hdl:21.14100/2a4fb73b-f64c-46f5-81e2-4014c9505c26
description: Future scenario with high radiative forcing by th...
title: IPSL-CM6A-LR model output prepared for CMIP6 / Sc...
activity_id: ScenarioMIP
contact: ipsl-cmip6@listes.ipsl.fr
data_specs_version: 01.00.28
dr2xml_version: 1.16
experiment_id: ssp585
experiment: update of RCP8.5 based on SSP5
external_variables: areacello volcello
forcing_index: 1
frequency: mon
further_info_url: https://furtherinfo.es-doc.org/CMIP6.IPSL.IPSL-CM...
grid: native ocean tri-polar grid with 105 k ocean cells
grid_label: gn
nominal_resolution: 100 km
history: none
initialization_index: 1
institution_id: IPSL
institution: Institut Pierre Simon Laplace, Paris 75252, France
license: CMIP6 model data produced by IPSL is licensed und...
mip_era: CMIP6
parent_mip_era: CMIP6
parent_source_id: IPSL-CM6A-LR
parent_time_units: days since 1850-01-01 00:00:00
parent_variant_label: r4i1p1f1
branch_method: standard
branch_time_in_parent: 60265.0
branch_time_in_child: 0.0
physics_index: 1
product: model-output
realization_index: 4
realm: ocnBgchem
source: IPSL-CM6A-LR (2017): atmos: LMDZ (NPv6, N96; 144...
source_id: IPSL-CM6A-LR
source_type: AOGCM BGC
sub_experiment_id: none
sub_experiment: none
table_id: Omon
variable_id: o2
variant_info: Each member starts from the corresponding member ...
variant_label: r4i1p1f1
EXPID: ssp585
CMIP6_CV_version: cv=6.2.15.1
dr2xml_md5sum: b6f602401512e82e2d7cadc2c6f36c2a
model_version: 6.1.10
parent_experiment_id: historical
parent_activity_id: CMIP
intake_esm_varname: o2
#####################r6i1p1f1##################
r6i1p1f1
--> The keys in the returned dictionary of datasets are constructed as follows:
'activity_id.institution_id.source_id.experiment_id.table_id.grid_label'
--> There is/are 1 group(s)
True
1032/1032
['/tiger/scratch/gpfs/GEOCLIM/synda/data/CMIP6/ScenarioMIP/IPSL/IPSL-CM6A-LR/ssp585/r6i1p1f1/Omon/o2/gn/v20191121/o2_Omon_IPSL-CM6A-LR_ssp585_r6i1p1f1_gn_201501-210012.nc']
<xarray.Dataset>
Dimensions: (axis_nbounds: 2, member_id: 1, nvertex: 4, olevel: 75, time: 1032, x: 362, y: 332)
Coordinates:
nav_lat (y, x) float32 dask.array<chunksize=(332, 362), meta=np.ndarray>
nav_lon (y, x) float32 dask.array<chunksize=(332, 362), meta=np.ndarray>
* olevel (olevel) float32 0.50576 1.5558553 ... 5698.0605 5902.0576
* time (time) datetime64[ns] 2015-01-16T12:00:00 ... 2100-12-16T12:00:00
* member_id (member_id) <U8 'r6i1p1f1'
Dimensions without coordinates: axis_nbounds, nvertex, x, y
Data variables:
olevel_bounds (time, olevel, axis_nbounds) float32 dask.array<chunksize=(1032, 75, 2), meta=np.ndarray>
time_bounds (time, axis_nbounds) datetime64[ns] dask.array<chunksize=(6, 2), meta=np.ndarray>
bounds_nav_lat (time, y, x, nvertex) float32 dask.array<chunksize=(1032, 332, 362, 4), meta=np.ndarray>
area (time, y, x) float32 dask.array<chunksize=(1032, 332, 362), meta=np.ndarray>
bounds_nav_lon (time, y, x, nvertex) float32 dask.array<chunksize=(1032, 332, 362, 4), meta=np.ndarray>
o2 (member_id, time, olevel, y, x) float32 dask.array<chunksize=(1, 6, 75, 332, 362), meta=np.ndarray>
Attributes:
name: /ccc/work/cont003/gencmip6/lurtont/IGCM_OUT/IPSLC...
Conventions: CF-1.7 CMIP-6.2
creation_date: 2019-10-22T10:17:28Z
tracking_id: hdl:21.14100/b7e9c317-cc62-4a89-b463-7c456231987d
description: Future scenario with high radiative forcing by th...
title: IPSL-CM6A-LR model output prepared for CMIP6 / Sc...
activity_id: ScenarioMIP
contact: ipsl-cmip6@listes.ipsl.fr
data_specs_version: 01.00.28
dr2xml_version: 1.16
experiment_id: ssp585
experiment: update of RCP8.5 based on SSP5
external_variables: areacello volcello
forcing_index: 1
frequency: mon
further_info_url: https://furtherinfo.es-doc.org/CMIP6.IPSL.IPSL-CM...
grid: native ocean tri-polar grid with 105 k ocean cells
grid_label: gn
nominal_resolution: 100 km
history: none
initialization_index: 1
institution_id: IPSL
institution: Institut Pierre Simon Laplace, Paris 75252, France
license: CMIP6 model data produced by IPSL is licensed und...
mip_era: CMIP6
parent_mip_era: CMIP6
parent_source_id: IPSL-CM6A-LR
parent_time_units: days since 1850-01-01 00:00:00
parent_variant_label: r6i1p1f1
branch_method: standard
branch_time_in_parent: 60265.0
branch_time_in_child: 0.0
physics_index: 1
product: model-output
realization_index: 6
realm: ocnBgchem
source: IPSL-CM6A-LR (2017): atmos: LMDZ (NPv6, N96; 144...
source_id: IPSL-CM6A-LR
source_type: AOGCM BGC
sub_experiment_id: none
sub_experiment: none
table_id: Omon
variable_id: o2
variant_info: Each member starts from the corresponding member ...
variant_label: r6i1p1f1
EXPID: ssp585
CMIP6_CV_version: cv=6.2.15.1
dr2xml_md5sum: b6f602401512e82e2d7cadc2c6f36c2a
model_version: 6.1.10
parent_experiment_id: historical
parent_activity_id: CMIP
intake_esm_varname: o2
It really doesn't seem like there are duplicate times in these, unless I am missing something.
@andersy005 @jbusecke can we get on a quick video call? Julius, wanna send out a zoom invite?
will do in 5!
Sure.
@jbusecke, @sherimickelson
I just found out that there's a bug in the _pick_latest_version(df) function. Since pandas excludes rows with missing values (NaN) in the grouping keys when doing groupby() (see https://github.com/pandas-dev/pandas/issues/3729), the following code (in _pick_latest_version(df)) ends up returning 0 groups since the dcpp_init_year column has missing values.
grpby = list(set(df.columns.tolist()) - {'path', 'version'})
groups = df.groupby(grpby)
As a result, the subsequent code in _pick_latest_version(df) doesn't actually work as expected :(
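To make the groupby behavior concrete, here is a minimal sketch with a made-up catalog table. The rows, the paths, and the valid_keys workaround are assumptions for illustration only, not the actual intake-esm catalog or the fix that went into intake-esm.

```python
import numpy as np
import pandas as pd

# Hypothetical catalog: two versions of one file, plus a column that is all NaN.
df = pd.DataFrame(
    {
        "member_id": ["r1i1p1f1", "r1i1p1f1", "r2i1p1f1"],
        "dcpp_init_year": [np.nan, np.nan, np.nan],
        "version": ["v20190119", "v20191121", "v20191121"],
        "path": ["a.nc", "b.nc", "c.nc"],
    }
)

grpby = sorted(set(df.columns) - {"path", "version"})

# Rows with NaN in a grouping key are excluded by default, so no groups remain.
n_groups = df.groupby(grpby).ngroups

# One possible workaround: group only on keys that have no missing values,
# then keep the row with the highest version string in each group.
valid_keys = [key for key in grpby if df[key].notna().all()]
latest = df.sort_values("version").groupby(valid_keys).tail(1)
latest_paths = sorted(latest["path"])

print(n_groups, valid_keys, latest_paths)
```

Dropping the all-NaN column from the grouping keys is only one way around this. Filling the missing values with a sentinel before grouping would be another.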
Check your email for the invite.
As an update, we found out that the issue was stemming from differences in calendar units used in the netCDF files. These differences were causing xarray to fail since it was trying to mix time values decoded with pandas and cftime together.
Solution: Specify the use_cftime=True parameter:
cat = col.search(source_id='IPSL-CM6A-LR', variable_id='o2', experiment_id='ssp585',member_id='r1i1p1f1')
ddict = cat.to_dataset_dict(cdf_kwargs={'chunks': {'time':6},'decode_times': True, 'use_cftime': True})
@jbusecke, let me know whether the solution I suggested above is accurate.
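A possible reason why only the extended member ends up decoded differently (this is my reading, not something confirmed in the thread): nanosecond-precision datetime64 stores time as a signed 64-bit count of nanoseconds since 1970-01-01, so it cannot represent dates past the year 2262, while r1i1p1f1 runs until 2300. The arithmetic can be checked with the standard library alone. The helper name fits_in_datetime64_ns is made up for illustration and only looks at the upper bound.

```python
from datetime import datetime, timedelta

INT64_MAX = 2**63 - 1  # largest signed 64-bit integer
EPOCH = datetime(1970, 1, 1)

# Latest instant a nanosecond counter can reach, truncated to microseconds
# because datetime.timedelta has no nanosecond resolution.
last_ns_timestamp = EPOCH + timedelta(microseconds=INT64_MAX // 1000)


def fits_in_datetime64_ns(year):
    """Rough check (hypothetical helper): is the whole year below the upper bound?"""
    return year < last_ns_timestamp.year


print(last_ns_timestamp.year)
print(fits_in_datetime64_ns(2100), fits_in_datetime64_ns(2300))
```

Under that assumption, members ending in 2100 can be decoded to datetime64[ns] while the one ending in 2300 cannot, which would explain why forcing use_cftime=True for all files makes the time axes compatible.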
This still blows my notebook up (literally the whole thing, not just the kernel):
Note that I am aggregating all members.
Reading the data in works as expected, but the plotting still causes issues.
cat = col.search(source_id='IPSL-CM6A-LR', variable_id='o2', experiment_id='ssp585')
ddict = cat.to_dataset_dict(cdf_kwargs={'chunks': {'time':6},'decode_times': True, 'use_cftime': True})
ds = ddict['ScenarioMIP.IPSL.IPSL-CM6A-LR.ssp585.Omon.gn']
ds.o2.isel(time=-1, olevel=0).plot(col='member_id')
Could you try this modified example on the NCAR netCDFs, by any chance? I am curious if this is an oddity with the files (they are gigantic for IPSL), or our system in Princeton.
What does ds.o2 look like?
I can check that tomorrow. Can't afford another crash, since I have something else running. It looked fine though. Perhaps it has to do with the chunks, I will try to test tomorrow.
I will give this a try on Cheyenne, and will let you know how it goes
What does ds.o2 look like?
Here's what I am getting:
When I tried executing ds.o2.isel(time=-1, olevel=0).plot(col='member_id')
Anyway, I made some changes to the plotting command and I got everything to work (my kernel didn't die :)):
I believe this is only a single member dataset. My kernel dies when I try to do this plot command with several aggregated members. I assume that something happens during the aggregation of the (different length) members?
I can look into it if you can provide an example notebook. Aren't these datasets all on glade too?
This should reproduce it if you have an intake catalog set up that has the full data (I was hoping this is the case on glade). The cloud data is truncated.
import xarray as xr
import intake
col = intake.open_esm_datastore(...)
cat = col.search(source_id='IPSL-CM6A-LR', variable_id='o2', experiment_id='ssp585')
ddict = cat.to_dataset_dict(cdf_kwargs={'chunks': {'time':6},'decode_times': True, 'use_cftime': True})
ds = ddict['ScenarioMIP.IPSL.IPSL-CM6A-LR.ssp585.Omon.gn']
ds.o2.isel(time=-1, olevel=0).plot(col='member_id')
I have found an intermediate solution for this by modifying the preprocess function to chop off any time values that go beyond 2100:
def preprocess(ds):
    ds = ds.copy()
    if 'ssp' in ds.attrs['experiment_id']:
        ds = ds.sel(time=slice(None, '2100'))
    return ds
If you replace the ddict = ... line above with ddict = cat.to_dataset_dict(cdf_kwargs={'chunks': {'time':6},'decode_times': True, 'use_cftime': True}, preprocess=preprocess) this should work without crashing.
Hopefully this problem can be fixed upstream, since this basically discards those data. But I thought I'd post it for anyone who might have the same problem.
I am running into this problem again and again. Since the upstream fix does not seem to be straightforward, I was wondering if we could alleviate the situation with some additional functionality here.
In most cases, the problem is caused by 1-2 members that are significantly shorter or longer than the others. I would be ok to ditch these for now and continue the analysis with fewer members until the problem is fixed upstream.
Is there a way to evaluate the dimension shape of all datasets before they are combined along a specific dimension (e.g. member_id) and have an option like drop_time_mismatch='member_id' to eliminate the ones that do not agree with the majority size (the length of the time dimension found most often in the pool of members)?
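The majority-size idea can be sketched in plain Python. Note that drop_time_mismatch here is a hypothetical standalone helper, not an existing intake-esm option, and the time lengths below are illustrative values loosely based on the numbers in this thread.

```python
from collections import Counter


def drop_time_mismatch(time_lengths):
    """Keep only members whose time length equals the most common length.

    time_lengths maps a member label to the length of its time axis.
    """
    if not time_lengths:
        return {}
    majority_length, _ = Counter(time_lengths.values()).most_common(1)[0]
    return {
        member: length
        for member, length in time_lengths.items()
        if length == majority_length
    }


# Illustrative monthly time lengths per member (assumed, not read from files).
lengths = {
    "r1i1p1f1": 3432,  # extended member, 2015 to 2300
    "r2i1p1f1": 1032,
    "r3i1p1f1": 1032,
    "r4i1p1f1": 1032,
    "r6i1p1f1": 1032,
}
kept = drop_time_mismatch(lengths)
print(sorted(kept))
```

With real data, the lengths could presumably be collected per member from ds.sizes['time'] before the datasets are concatenated, and only the kept members passed on to the combine step.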
Just blew up another notebook kernel of mine. I'll try to come up with a manual fix for now.
I was thinking along these lines:
I can report back once I get this working. Any comments are much appreciated.
Dask will fix this: https://github.com/dask/dask/pull/6514
Should be fixed in the dask release this Friday. Julius, please reopen if you run into it again.
Dooooope! Thanks so much. I'll check on it this week for sure.
I have had some trouble reading the 'IPSL' model with intake-esm locally (on the Princeton server tigressdata) when attempting to read all available o2 data for the ssp585 scenario. This throws an error:
I think the problem is that one of the members (r1i1p1f1) has a longer runtime than the others. If you only load that member it works:
But you can also see that it has 3000+ timesteps (running till 2300).
For comparison, other members only have ~1000 timesteps (standard time until 2100):
It seems that intake-esm struggles with merging those datasets. In the cloud storage the extended member was removed, so the initial call works (cc @naomi-henderson). But I would be really interested in using these extended data.
Is there a way to make this work? I thought it might be possible to alter the attrs of the datasets with preprocessing, but had no luck in getting this to work. I think in principle these are mislabelled and should be a separate experiment ssp585-extended or similar.
Any suggestions for solving this?