Hi @naomi-henderson,
Thank you for reporting these annoyances/bugs and proposing straightforward fixes! I will work on them sometime today and will ping you once the fixes are ready.
@andersy005, it looks like the version issue was my own problem - so please leave the name version, not version_id. Thanks!
@naomi-henderson,
activity_id and institution_id were set before
With #65, it became clear that the assumption we made about the directory structure didn't work for everybody. So, we changed how some of the attributes, including activity_id, institution_id, etc., are obtained. This change consists of specifying these attributes in the YAML file used to build the catalog:
name: GLADE-CMIP6
collection_type: cmip6
data_sources:
  CESM2-WACCM-AerChemMIP:
    locations:
      - name: CESM2-WACCM-AerChemMIP-catalog
        loc_type: posix
        direct_access: True
        urlpath: /glade/collections/cmip/CMIP6/AerChemMIP/NCAR/CESM2-WACCM
        exclude_dirs: ['*/files/*', '*/latest/*']
        file_extension: .nc
    extra_attributes:
      mip_era: CMIP6
      activity_id: AerChemMIP
      institution_id: NCAR
Here's a full version of the YAML file with entries for the CMIP6 data @ NCAR: https://github.com/NCAR/intake-esm-datastore/blob/master/collection-input/glade-cmip6-collection.yml
When you get time, can you post the YAML file you are using to build the catalogue?
When table_id is AERmonZ it becomes AERmon, and Oday and CFday become day. When grid_label is gr1, gr1z, grz, or gr2 it becomes gr, and gnz becomes gn
I can confirm that this is happening on my end too:
In [1]: import intake
In [2]: col = intake.open_esm_metadatastore(collection_name="GLADE-CMIP6")
In [3]: col.df.head()
Out[3]:
resource resource_type direct_access activity_id experiment_id ... source_id table_id time_range variable_id version
355093 CESM2-PAMIP:CESM2-PAMIP-catalog:posix:/glade/c... posix True PAMIP pdSST-futAntSIC ... CESM2 6hrPlev 200006010000-200106010000 pr v20190614
355092 CESM2-PAMIP:CESM2-PAMIP-catalog:posix:/glade/c... posix True PAMIP pdSST-futAntSIC ... CESM2 6hrPlev 200006010000-200106010000 psl v20190614
355089 CESM2-PAMIP:CESM2-PAMIP-catalog:posix:/glade/c... posix True PAMIP pdSST-futAntSIC ... CESM2 6hrPlev 200006010000-200106010000 sfcWind v20190614
355090 CESM2-PAMIP:CESM2-PAMIP-catalog:posix:/glade/c... posix True PAMIP pdSST-futAntSIC ... CESM2 6hrPlev 200006010000-200106010000 tas v20190614
355091 CESM2-PAMIP:CESM2-PAMIP-catalog:posix:/glade/c... posix True PAMIP pdSST-futAntSIC ... CESM2 6hrPlev 200006010000-200106010000 zg1000 v20190614
[5 rows x 17 columns]
In [4]: col.df.table_id.unique()
Out[4]:
array(['6hrPlev', 'AERday', 'AERmon', 'Amon', 'CFday', 'Eday', 'Emon',
'LImon', 'Lmon', 'SIday', 'day', 'fx', 'Omon', 'SImon', 'CFmon',
'ImonAnt', 'ImonGre', '3hr', '6hrLev', 'Efx', 'Eyr', 'IfxGre',
'Oday', 'Ofx', 'Oyr', 'E1hr', 'E3hr', 'CFsubhr'], dtype=object)
In [5]: col.df.source_id.unique()
Out[5]:
array(['CESM2', 'CanESM5', 'IPSL-CM6A-LR', 'MIROC6', 'MRI-ESM2-0',
'GISS-E2-1-G', 'historical', 'GISS-E2-1-H', 'CNRM-CM6-1',
'CESM2-WACCM', 'AWI-CM-1-1-MR', 'BCC-CSM2-MR', 'BCC-ESM1',
'FGOALS-f3-L', 'E3SM-1-0', 'EC-Earth3-LR', '1pctCO2',
'abrupt-4xCO2', 'amip', 'piControl', 'GFDL-AM4', 'GFDL-CM4',
'SAM0-UNICON', 'CNRM-ESM2-1', 'UKESM1-0-LL', 'EC-Earth3'],
dtype=object)
In [6]: col.df.grid_label.unique()
Out[6]: array(['gn', 'gr', 'gm'], dtype=object)
In [7]: col.df.activity_id.unique()
Out[7]:
array(['PAMIP', 'CMIP', 'ScenarioMIP', 'AerChemMIP', 'CFMIP', 'LS3MIP',
'LUMIP'], dtype=object)
In [8]: col.df.institution_id.unique()
Out[8]:
array(['NCAR', 'CCCma', 'IPSL', 'MIROC', 'MRI', 'NASA-GISS',
'CNRM-CERFACS', 'AWI', 'BCC', 'CAS', 'E3SM-Project',
'EC-Earth-Consortium', 'NOAA-GFDL', 'SNU', 'MOHC'], dtype=object)
and the issue is stemming from these lines, as you pointed out. We used to have these lines:
f_split = filepath.split('/')
fileparts['activity_id'] = f_split[-10]
fileparts['institution_id'] = f_split[-9]
fileparts['table_id'] = f_split[-5]
fileparts['grid_label'] = f_split[-3]
fileparts['version_id'] = f_split[-2]
but as I pointed out, with #62 it's hard to guarantee that everybody's directory structure is the same. That is why we replaced them with regular expressions, allowing us not to depend on the directory structure.
I am now looking into a fix for this issue, but it will likely keep the regular expressions in place.
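For illustration, here is a minimal sketch of what filename-based parsing can look like, assuming the standard CMIP6 filename template <variable_id>_<table_id>_<source_id>_<experiment_id>_<member_id>_<grid_label>[_<time_range>].nc (this pattern is an illustration, not intake-esm's actual regex):

import re

CMIP6_FILENAME = re.compile(
    r"(?P<variable_id>[^_]+)_(?P<table_id>[^_]+)_(?P<source_id>[^_]+)_"
    r"(?P<experiment_id>[^_]+)_(?P<member_id>[^_]+)_(?P<grid_label>[^_.]+)"
    r"(?:_(?P<time_range>[^.]+))?\.nc"
)

m = CMIP6_FILENAME.match("pr_day_CESM2_historical_r1i1p1f1_gn_18500101-20141231.nc")
print(m.groupdict()["grid_label"])  # -> 'gn'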
@andersy005, hmmm, too bad - that means I have to regenerate the YAML file each time ESGF adds a new activity_id or institution_id to the database (which is still happening). I am automatically downloading new data to add to our local collection and then using intake-esm to generate a local catalog, so I would need to check the YAML file for a proper section prior to making the local catalog. I am trying to get an end-to-end solution for downloading new data from ESGF, checking its integrity, and then converting to zarr, so I am trying to keep it as 'fool'-proof as possible (me being the 'fool' in question).
For now I think I will just generate the keys I need on the fly based on my directory structure and then re-write the local catalog each time. When I get a chance, I will probably write a piece of code that auto-generates a new YAML prior to using intake-esm.
thanks again for this terrific tool!
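A minimal sketch of that reset-and-rewrite step, assuming the catalog csv path mentioned later in this thread and an ESGF-style path layout (the index positions follow the old cmip.py lines quoted above; the column name is the one the collection uses):

import pandas as pd

# load the persisted collection, re-derive the two keys from the file paths,
# and overwrite the csv in place
csv_path = "~/.intake_esm/collection/cmip6/AR6_PANGEO.cmip6.csv"
df = pd.read_csv(csv_path)

parts = df["file_fullpath"].str.split("/")
df["activity_id"] = parts.str[-10]      # same positions the old cmip.py used
df["institution_id"] = parts.str[-9]

df.to_csv(csv_path, index=False)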
@andersy005, one more thing while you are looking at the use of regular expressions. It turns out that not all versions are r'v\d{4}\d{2}\d{2}'. The CMIP6.CMIP.CAMS.CAMS-CSM1-0 folks, for example, used v1. Maybe they will fix it, but there are a disturbing number of idiosyncrasies in the CMIP6 data ... so it was more reliable to get the version from the path. Just my 2 cents
that means I have to regenerate the YAML file each time ESGF adds a new activity_id or institution_id to the database (which is still happening).
This applies to CMIP6 data @ NCAR. The data keeps appearing on the filesystem. The YAML file I linked to was semi auto-generated a few weeks ago, but I am pretty certain we've had new datasets on disk since then. We are also interested in ways to automate this process of generating the YAML file since the datasets on disk are not static.
When I get a chance, I will probably write a piece of code that auto-generates a new YAML prior to using intake-esm.
If you are interested in how I automated the YAML file generation, I just uploaded the notebook with details: https://gist.github.com/andersy005/5cc53f9285ae6c0abb2ee573250b4ba9 Let me know if you find it useful, and we can work together on standardizing this functionality.
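A hedged sketch of auto-generating the YAML from a DRS-style directory tree (the real notebook is linked above; this template and the variable names in it are made up for illustration, not the actual template):

from pathlib import Path
from jinja2 import Template

template = Template("""\
name: {{ name }}
collection_type: cmip6
data_sources:
{%- for src in sources %}
  {{ src.key }}:
    locations:
      - name: {{ src.key }}-catalog
        loc_type: posix
        direct_access: True
        urlpath: {{ src.path }}
        exclude_dirs: ['*/files/*', '*/latest/*']
        file_extension: .nc
    extra_attributes:
      mip_era: CMIP6
      activity_id: {{ src.activity }}
      institution_id: {{ src.institution }}
{%- endfor %}
""")

# CMIP6 DRS layout: <root>/<activity_id>/<institution_id>/<source_id>/...
root = Path("/glade/collections/cmip/CMIP6")
sources = [
    {"key": f"{source.name}-{activity.name}", "path": str(source),
     "activity": activity.name, "institution": institution.name}
    for activity in sorted(root.iterdir()) if activity.is_dir()
    for institution in sorted(activity.iterdir()) if institution.is_dir()
    for source in sorted(institution.iterdir()) if source.is_dir()
]
print(template.render(name="GLADE-CMIP6", sources=sources))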
Maybe they will fix it, but there are a disturbing number of idiosyncrasies in the CMIP6 data
I concur. It's not always guaranteed that everyone is following the official data reference syntax (DRS).
so it was more reliable to get the version from the path. Just my 2 cents
I personally am not a big fan of regular expressions. I like their flexibility, but their use comes at a cost too. I will revisit the previous implementations and see whether regular expressions are really worthwhile. If not, I will see if we can get everything to work without them.
@andersy005 , thanks for your very useful suggestions and your notebook for generating the YAML. Could you also post your Jinja2 template file?
nevermind, I found it, thanks!
@naomi-henderson,
I removed all regular expression matching in #113 except the version one (I updated it). So far, it seems to be working:
In [10]: col.df.grid_label.unique()
Out[10]:
array(['gn', 'gr', 'grz', 'gnz', 'gra', 'grg', 'gr1', 'gr2', 'gr1z',
'gr2z', 'gnMVSyfC84507-000912', 'gm'], dtype=object)
In [11]: col.df.version.unique()
Out[11]:
array(['v20190614', 'v20190528', 'v20190430', 'v20190429', 'v20190306',
'v20190326', 'v20180914', 'v20181109', 'v20180803', 'v20190311',
'v20181214', 'v20181212', 'v20190308', 'v20190603', 'v20180830',
'v20181015', 'v20190403', 'v20190313', 'v20190522', 'v20190415',
'v20190119', 'v20190125', 'v20190514', 'v20190606', 'v20190531',
'v20190408', 'v20190419', 'v20190302', 'v20190304', 'v20190507',
'v20181218', 'v20181122', 'v20190226', 'v20181012', 'v20181016',
'v20190121', 'v20190315', 'v20190116', 'v20181126', 'v20181213',
'v20181114', 'v20181127', 'v20181009', 'v20190221', 'v20190613',
'v20190611', 'v20190530', 'v20190202', 'v20181217', 'v20181227',
'v20181129', 'v20181202', 'v20181211', 'v1', 'v20190422',
'v20190508', 'v20181108', 'v20190206', 'v20180608', 'v20190103',
'v20190605', 'v20180727', 'v20190305', 'v20190118', 'v20181005',
'v20180802', 'v20181123', 'v20181022', 'v20180808', 'v2',
'v20190222', 'v20180905', 'v20181017', 'v20180920', 'v20181002',
'v20180827', 'v20180824', 'v20190410', 'v20190425', 'v20190220',
'v20190401', 'v20190227', 'v20190320', 'v20190218', 'v20190319',
'v20190723', 'v20180807', 'v20180301', 'v20180701', 'v20180319',
'v20190201', 'v20190323', 'v20190314', 'v20180626', 'v20180705',
'v20181203', 'v20180917', 'v20180814', 'v20181018', 'v20181116',
'v20181026', 'v20181205', 'v20181206', 'v20181115', 'v20190406',
'v20190404', 'v20190623', 'v20190219', 'v20190328', 'v20190307',
'v20190503', 'v20190510', 'v20190620', 'v20190617', 'v20190411',
'v20181031', 'v20190405', 'v20181106', 'v20190502', 'v20181119',
'v20181102', 'v20180828', 'v20181107', 'v20190208', 'v20190604',
'v20190624', 'v20180829'], dtype=object)
In [12]: col.df.table_id.unique()
Out[12]:
array(['6hrPlev', 'AERday', 'AERmonZ', 'Amon', 'CFday', 'Eday', 'Emon',
'EmonZ', 'LImon', 'Lmon', 'SIday', 'day', 'fx', 'AERmon', 'Omon',
'SImon', 'CFmon', 'ImonAnt', 'ImonGre', '3hr', '6hrLev',
'6hrPlevPt', 'EdayZ', 'Efx', 'Eyr', 'IfxGre', 'Oday', 'Ofx', 'Oyr',
'E1hr', 'E3hr', 'CFsubhr'], dtype=object)
In [13]: col.df.institution_id.unique()
Out[13]:
array(['NCAR', 'CCCma', 'IPSL', 'MIROC', 'MRI', 'NASA-GISS',
'CNRM-CERFACS', 'AWI', 'BCC', 'CAMS', 'CAS', 'E3SM-Project',
'EC-Earth-Consortium', 'NOAA-GFDL', 'SNU', 'MOHC'], dtype=object)
It turns out that not all versions are r'v\d{4}\d{2}\d{2}'. The CMIP6.CMIP.CAMS.CAMS-CSM1-0 folks, for example, used v1
I updated the version regular expression to version_regex = r'v\d{4}\d{2}\d{2}|v\d{1}'
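A quick check that the updated pattern accepts both forms:

import re

version_regex = r'v\d{4}\d{2}\d{2}|v\d{1}'  # the pattern quoted above

for v in ["v20190614", "v1", "v2"]:
    print(v, bool(re.fullmatch(version_regex, v)))
# v20190614 True / v1 True / v2 True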
In [25]: col.search(version=['v1']).query_results[['institution_id', 'variable_id', 'grid_label', 'version']].head(10)
Out[25]:
institution_id variable_id grid_label version
29182 CAMS ps gn v1
29183 CAMS ts gn v1
@naomi-henderson,
Hopefully, #113 fixes all the issues you pointed out. Thank you for the bug report. It's incredibly useful to receive bug reports. Also, it's great to know that intake-esm is useful beyond NCAR!
Fantastic, @andersy005 ! You even got the 'v?' versions working.
I tried to include institution_id and activity_id in my template, but it is complicated by the fact that, on our local machine, we have stored the CMIP6 data with the same directory structure on many drives. So I was using multiple location sections (one for each 8TB drive) in the YAML file. Each drive has whatever activity_ids and institution_ids happen to be stored on that drive. I use the intake-esm collection to then generate a single master directory with links to all of the files for our data server.
You can see that making a separate entry for each combination of [institution_id, activity_id, location] creates a large number of entries (and produces that many tqdm progress bars!) and generally seems to complicate the very simple task of determining the institution_id and activity_id. So I am still using the multiple locations, resetting those 2 keys after generating the collection, and then overwriting the ~/.intake_esm/collection/cmip6/AR6_PANGEO.cmip6.csv file.
Fortunately, when we upload to the cloud I can avoid this issue by storing the zarr files in a common directory (as you do on the glade system at NCAR); I will be able to have a single location section and will not have to do the reset/overwrite step.
Any suggestions? Here is my yaml file:
name: AR6_PANGEO
collection_type: cmip6
data_sources:
  fletcher.ldeo.columbia.edu:
    locations:
      - name: dm10_AR6-Omon5
        loc_type: posix
        direct_access: True
        urlpath: /dm10/naomi/AR6-Omon5
        exclude_dirs: ['*/files/*', '*/latest/*']
        file_extension: .nc
      - name: dm11_AR6-Omon6
        loc_type: posix
        direct_access: True
        urlpath: /dm11/naomi/AR6-Omon6
        exclude_dirs: ['*/files/*', '*/latest/*']
        file_extension: .nc
      - name: dm12_AR6-Omon7
        loc_type: posix
        direct_access: True
        urlpath: /dm12/naomi/AR6-Omon7
        exclude_dirs: ['*/files/*', '*/latest/*']
        file_extension: .nc
      - name: dm13_AR6-Amon2
        loc_type: posix
        direct_access: True
        urlpath: /dm13/naomi/AR6-Amon2
        exclude_dirs: ['*/files/*', '*/latest/*']
        file_extension: .nc
      - name: dm13_AR6-Omon7-2
        loc_type: posix
        direct_access: True
        urlpath: /dm13/naomi/AR6-Omon7-2
        exclude_dirs: ['*/files/*', '*/latest/*']
        file_extension: .nc
      - name: dm14_AR6-AERmon2
        loc_type: posix
        direct_access: True
        urlpath: /dm14/naomi/AR6-AERmon2
        exclude_dirs: ['*/files/*', '*/latest/*']
        file_extension: .nc
      - name: dm15_AR6-day2
        loc_type: posix
        direct_access: True
        urlpath: /dm15/naomi/AR6-day2
        exclude_dirs: ['*/files/*', '*/latest/*']
        file_extension: .nc
      - name: dm16_AR6-Omon7-3
        loc_type: posix
        direct_access: True
        urlpath: /dm16/naomi/AR6-Omon7-3
        exclude_dirs: ['*/files/*', '*/latest/*']
        file_extension: .nc
      - name: dm1_AR6-mon-other
        loc_type: posix
        direct_access: True
        urlpath: /dm1/naomi/AR6-mon-other
        exclude_dirs: ['*/files/*', '*/latest/*']
        file_extension: .nc
      - name: dm2_AR6-AERmon
        loc_type: posix
        direct_access: True
        urlpath: /dm2/naomi/AR6-AERmon
        exclude_dirs: ['*/files/*', '*/latest/*']
        file_extension: .nc
      - name: dm3_AR6-Amon
        loc_type: posix
        direct_access: True
        urlpath: /dm3/naomi/AR6-Amon
        exclude_dirs: ['*/files/*', '*/latest/*']
        file_extension: .nc
      - name: dm4_AR6-day
        loc_type: posix
        direct_access: True
        urlpath: /dm4/naomi/AR6-day
        exclude_dirs: ['*/files/*', '*/latest/*']
        file_extension: .nc
      - name: dm5_AR6-other
        loc_type: posix
        direct_access: True
        urlpath: /dm5/naomi/AR6-other
        exclude_dirs: ['*/files/*', '*/latest/*']
        file_extension: .nc
      - name: dm6_AR6-Omon1
        loc_type: posix
        direct_access: True
        urlpath: /dm6/naomi/AR6-Omon1
        exclude_dirs: ['*/files/*', '*/latest/*']
        file_extension: .nc
      - name: dm7_AR6-Omon2
        loc_type: posix
        direct_access: True
        urlpath: /dm7/naomi/AR6-Omon2
        exclude_dirs: ['*/files/*', '*/latest/*']
        file_extension: .nc
      - name: dm8_AR6-Omon3
        loc_type: posix
        direct_access: True
        urlpath: /dm8/naomi/AR6-Omon3
        exclude_dirs: ['*/files/*', '*/latest/*']
        file_extension: .nc
      - name: dm9_AR6-Omon4
        loc_type: posix
        direct_access: True
        urlpath: /dm9/naomi/AR6-Omon4
        exclude_dirs: ['*/files/*', '*/latest/*']
        file_extension: .nc
    extra_attributes:
      mip_era: CMIP6
@naomi-henderson,
creates a large number of entries (and produces that many tqdm progress bars!)
If we make the tqdm progress bar optional (basically allow users to opt in or opt out in case they have a massive YAML file), would making a separate entry for each combination of [institution_id, activity_id, location] still be a problem?
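A minimal sketch of such an opt-in progress bar (the helper and its progress_bar flag are hypothetical; the actual option surfaced later in this thread as a 'progress-bar' config entry):

from tqdm.auto import tqdm

def iter_locations(locations, progress_bar=True):
    # wrap the iterable in tqdm only when the user opts in
    return tqdm(locations) if progress_bar else iter(locations)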
@andersy005, but this is 20 * 18 * (number of locations) entries, which in my case is 5,766. I haven't benchmarked it (perhaps I should) - but do the many separate data searches have so little overhead that this is a feasible option?
Would it be possible instead to set institution_id and activity_id in cmip.py as before (assuming the ESGF directory structure), but let these keys be ignored by those who prefer to use a yaml file to reset them?
Or ... just thinking out loud ... dictionaries would allow us to get institution_id from source_id and activity_id from experiment_id in most cases. There is one experiment_id contained in two activity_ids (piClim-aer is in both AerChemMIP and RFMIP), but this is due to a mistake in the activity_drs vs activity_id keys, and only RFMIP is really the correct activity_id for piClim-aer.
Would it be possible instead to set institution_id and activity_id in cmip.py as before (assuming the ESGF directory structure), but let these keys be ignored by those who prefer to use a yaml file to reset them?
After spending time fixing the regular expressions issue, I am in favor of this option. I will merge #113, revert to the ESGF directory structure as the default, and let users with a different directory structure override these attributes via YAML.
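A minimal sketch of that precedence - path-derived defaults with the YAML's extra_attributes winning (the function name is hypothetical; the indices are the ones from the old cmip.py lines quoted earlier):

def resolve_attrs(filepath, extra_attributes=None):
    f_split = filepath.split('/')
    fileparts = {
        'activity_id': f_split[-10],
        'institution_id': f_split[-9],
        'table_id': f_split[-5],
        'grid_label': f_split[-3],
        'version': f_split[-2],
    }
    # YAML-specified attributes override the path-derived defaults
    fileparts.update(extra_attributes or {})
    return fileparts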
@naomi-henderson
Or ... just thinking out loud ... dictionaries would allow us to get institution_id from source_id and activity_id from experiment_id in most cases.
Since the majority of this information (source_id, institution_id, etc.), if not all of it, can be retrieved from https://github.com/WCRP-CMIP/CMIP6_CVs, the dictionary approach is also doable.
Now, I am in a dilemma over which approach to choose between the two that you proposed :)
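A hedged sketch of the dictionary approach, assuming the controlled-vocabulary JSON layout in WCRP-CMIP/CMIP6_CVs (where each source_id entry lists its institution_id; the analogous experiment_id file maps to activity_id):

import json
import urllib.request

url = ("https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/"
       "master/CMIP6_source_id.json")
with urllib.request.urlopen(url) as f:
    source_ids = json.load(f)["source_id"]

# build a source_id -> institution_id lookup table
institution_lookup = {name: entry["institution_id"][0]
                      for name, entry in source_ids.items()}
print(institution_lookup.get("CESM2"))  # expected: 'NCAR'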
@naomi-henderson,
When you get time, can you post a snippet of what the new YAML file content would look like for you if we revert to depending solely on the ESGF directory structure? I would like to see the commonalities between your YAML file and my version in order to determine what changes need to be made to the existing codebase to support both use cases.
I would just use the yaml posted in https://github.com/NCAR/intake-esm/issues/111#issuecomment-515724876 - one section for each drive
@naomi-henderson, @aaronspring
name: GLADE-CMIP6
collection_type: cmip6
data_sources:
  GLADE-DATA:
    locations:
      - name: CMIP-AP
        loc_type: posix
        direct_access: True
        urlpath: /glade/collections/cmip/CMIP6
        exclude_dirs: ['*/files/*', '*/latest/*']
        file_extension: .nc
I am going to merge it soon. Please give it a try and let me know whether it works for you.
@andersy005, I have just updated after your latest commit. I am having a few issues, including a KeyError: 'progress-bar' until I added a config.set({'progress-bar': False}), and a KeyError: 'direct-access' that was preventing the netcdf collection from being saved until I commented out a line in collection.py/_persist_db_file/self._ds.to_netcdf which sets the encoding for the boolean direct_access. But these are minor problems, easily fixed.
What is the advantage of the new netcdf db file over the old csv? And why, in particular, the switch from dataframes to datasets? All of my code uses dataframe methods, not dataset methods. For example, I get all of the possible values of activity_id by using collection.df.activity_id.unique(). Of course I can use to_dataframe() to convert, but I just wondered what motivated this? The csv/dataframe works better with mixed datatypes than netcdf/datasets, no?
Anyway, I will continue to work through my code to get it to work again. I see that the version parsing is working well, but we now get a grid_label = 'gn3RaXbM42915' from
.../CMIP/SNU/SAM0-UNICON/historical/r1i1p1f1/day/tas/gn/v20190323/tas_day_SAM0-UNICON_historical_r1i1p1f1_gn3RaXbM42915.nc
because the directory parsing to get grid_label has now been changed to file-name parsing - giving an incorrect value.
Thanks a million for all of the hard work, I really am trying to keep up but am distracted by actually using all of the new methods!
@andersy005: quick question (perhaps @aaronspring has figured this out already?)
When I do a search on the new type of collection, I would like to use dataframe methods such as drop_duplicates. With the old csv/dataframe, I used to use query_results following a search:
#OLD VERSION:
col.search(variable_id=['hfls'], table_id='Amon').query_results.drop_duplicates(subset=["file_basename","version"],keep='first')
In the new netcdf/dataset version, query_results is not an option, so I am using the chained .get_results().to_dataframe() instead, which seems pretty convoluted:
#NEW VERSION
col.search(variable_id=['hfls'], table_id='Amon').get_results().to_dataframe().drop_duplicates(subset=["file_basename","version"],keep='first')
How are we really meant to be doing this in the new netcdf/dataset version?
As you can tell, I do not just use intake_esm to generate a catalog for intake in order to get the datasets. I need to do all kinds of checks on the CMIP6 netcdf files in order to clean up our local collection - and am heavily using dataframe methods to accomplish this. If anyone is interested, I am also developing a long list of exceptions/problems with the netcdf files and how to fix them or when to exclude them.
What is the advantage of the new netcdf db file over the old csv? And why, in particular, the switch from dataframes to datasets?
@naomi-henderson, the motivation for switching from dataframe (csv) to dataset (netcdf) can be summarized as follows:
1) When persisting the dataframe as .csv, you lose all the information about the data types of columns. As a result, when the dataframe was loaded at another time, pandas had to do dtype inference, which sometimes wasn't consistent. For instance, a boolean column would be loaded as float.
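A small illustration of the round-trip problem (hedged: the exact inferred dtype depends on the data; here a boolean column with one missing value comes back as object rather than bool):

import io
import pandas as pd

df = pd.DataFrame({"direct_access": [True, False, None]})
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
print(pd.read_csv(buf)["direct_access"].dtype)  # object, not bool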
2) There was some useful information that intake-esm needed to know about the collection at runtime - for instance, the collection type. With a dataframe, we couldn't save this information as part of the csv; the workaround was to encode some of it as part of the csv filename. With a netCDF file, we can attach all kinds of attributes to the dataset:
In [4]: col.ds
Out[4]:
<xarray.Dataset>
Dimensions: (index: 615296)
Coordinates:
* index (index) int64 0 1 2 3 4 ... 615292 615293 615294 615295
Data variables:
resource (index) object ...
resource_type (index) object ...
direct_access (index) bool True True True True ... True True True True
activity (index) object ...
ensemble_member (index) object ...
experiment (index) object ...
file_basename (index) object ...
file_fullpath (index) object ...
frequency (index) object ...
institute (index) object ...
mip_table (index) object ...
model (index) object ...
modeling_realm (index) object ...
product (index) object ...
temporal_subset (index) object ...
variable (index) object ...
version (index) object ...
Attributes:
created_at: 2019-08-07T18:05:09.371259
intake_esm_version: 2019.5.11.post153
intake_version: 0.5.2
intake_xarray_version: 0.3.1
collection_spec: {"name": "GLADE-CMIP5", "collection_type": "cmip5...
name: GLADE-CMIP5
collection_type: cmip5
When we open this netCDF file, it's a matter of looking into the global attributes of the dataset to find all sorts of information. Some of this info, such as collection_type, is used internally by intake-esm. The rest of the global attributes are useful for debugging and provenance purposes.
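For example (the file name here is hypothetical; the attribute keys are the ones shown above):

import xarray as xr

ds = xr.open_dataset("GLADE-CMIP5.nc")  # the persisted collection
print(ds.attrs["collection_type"])      # 'cmip5', used internally
print(ds.attrs["created_at"])           # provenance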
I am having a few issues, including a KeyError: 'progress-bar' until I added a config.set({'progress-bar': False}), and a KeyError: 'direct-access' that was preventing the netcdf collection from being saved until I commented out a line in collection.py/_persist_db_file/self._ds.to_netcdf which sets the encoding for the boolean direct_access. But these are minor problems, easily fixed.
I recommend deleting the old YAML config files residing in ~/.intake_esm/ for the new changes to take effect without conflicting with the old configurations.
In the new netcdf/dataset version, query_results is not an option, so I am using the chained .get_results().to_dataframe() instead, which seems pretty convoluted:
In the previous versions, we had two different ways of accessing the dataframe:
col = intake.open_esm_metadatastore(.......)
col.df # The entire collection
# Search
cat = col.search(......)
cat.query_results # Dataframe containing search results
For consistency, the .query_results attribute was replaced with the .ds attribute:
col = intake.open_esm_metadatastore(.......)
col.ds # The entire collection
# Search
cat = col.search(......)
cat.ds # Dataset containing search results
Therefore, the following will work for you:
#NEW VERSION
col.search(variable_id=['hfls'], table_id='Amon').ds.to_dataframe().drop_duplicates(subset=["file_basename","version"],keep='first')
@andersy005 Thanks! That is what I needed to know - and deleting the old yaml files definitely helps
A few more comments: It would be nice to keep the old file_dirname key that we used to have in the CMIP6Collection - it saves me re-generating it from the other keys (I use it to name the zarr stores).
As you can tell, I do not just use intake_esm to generate a catalog for intake in order to get the datasets. I need to do all kinds of checks on the CMIP6 netcdf files in order to clean up our local collection - and am heavily using dataframe methods to accomplish this. If anyone is interested, I am also developing a long list of exceptions/problems with the netcdf files and how to fix them or when to exclude them.
I'm sorry for making intake-esm a moving target in the last few weeks. I am hoping that things will stabilize soon.
One thing I can do to help is implement a df property that will basically allow you to use the previous .df attribute. In the background, intake-esm would still use the dataset internally, but as a user you can interface with intake-esm via the .df attribute.
It would be nice to keep the old file_dirname key that we used to have in the CMIP6Collection - it saves me re-generating it from the other keys (I use it to name the zarr stores).
And while you are there, could we generate grid_label from the path? That is what is important, not the actual file name (see #111 (comment))
Definitely. I will open a new PR soon to address all these issues.
@naomi-henderson, Thank you for reporting all these issues. Feel free to ping me whenever I break something or you run into any other brick walls :) Your feedback is appreciated!
Thanks, @andersy005, it would be convenient to use .df instead of .ds.to_dataframe(), but not if it will cause future confusion. I can keep converting to dataframe and, when I write new code, will try to use dataset methods.
Thanks for your patience and understanding while we try to keep up with the latest advances!
I just re-introduced the .df attribute:
In [1]: import intake
In [2]: col = intake.open_esm_metadatastore(collection_name="GLADE-CMIP6")
In [3]: col.df
Out[3]:
resource resource_type ... variable_id version
index ...
0 GLADE-DATA:PAMIP:posix:/glade/collections/cmip... posix ... pr v20190614
1 GLADE-DATA:PAMIP:posix:/glade/collections/cmip... posix ... psl v20190614
2 GLADE-DATA:PAMIP:posix:/glade/collections/cmip... posix ... sfcWind v20190614
3 GLADE-DATA:PAMIP:posix:/glade/collections/cmip... posix ... tas v20190614
4 GLADE-DATA:PAMIP:posix:/glade/collections/cmip... posix ... zg1000 v20190614
... ... ... ... ... ...
418858 GLADE-DATA:CMIP:posix:/glade/collections/cmip/... posix ... sisnconc v20190429
418859 GLADE-DATA:CMIP:posix:/glade/collections/cmip/... posix ... sisnmass v20190429
418860 GLADE-DATA:CMIP:posix:/glade/collections/cmip/... posix ... sisnthick v20190429
418861 GLADE-DATA:CMIP:posix:/glade/collections/cmip/... posix ... sispeed v20190429
418862 GLADE-DATA:CMIP:posix:/glade/collections/cmip/... posix ... siv v20190429
[418863 rows x 16 columns]
In [5]: cat = col.search(variable_id='pr')
In [6]: cat.df
Out[6]:
resource resource_type ... variable_id version
index ...
0 GLADE-DATA:PAMIP:posix:/glade/collections/cmip... posix ... pr v20190614
18 GLADE-DATA:PAMIP:posix:/glade/collections/cmip... posix ... pr v20190614
81 GLADE-DATA:PAMIP:posix:/glade/collections/cmip... posix ... pr v20190614
93 GLADE-DATA:PAMIP:posix:/glade/collections/cmip... posix ... pr v20190528
111 GLADE-DATA:PAMIP:posix:/glade/collections/cmip... posix ... pr v20190528
... ... ... ... ... ...
417683 GLADE-DATA:ScenarioMIP:posix:/glade/collection... posix ... pr v20190119
417713 GLADE-DATA:CMIP:posix:/glade/collections/cmip/... posix ... pr v20190125
418191 GLADE-DATA:CMIP:posix:/glade/collections/cmip/... posix ... pr v20190603
418192 GLADE-DATA:CMIP:posix:/glade/collections/cmip/... posix ... pr v20190603
418496 GLADE-DATA:CMIP:posix:/glade/collections/cmip/... posix ... pr v20190429
[3871 rows x 16 columns]
You will notice that I didn't re-introduce query_results. For consistency, I just added .df in place of query_results:
In [5]: cat = col.search(variable_id='pr')
In [6]: cat.df
@naomi-henderson,
it would be convenient to use .df instead of .ds.to_dataframe(), but not if it will cause future confusion. I can keep converting to dataframe and, when I write new code, will try to use dataset methods.
with #127
col.search(variable_id=['hfls'], table_id='Amon')\
.df.drop_duplicates(subset=["file_basename","version"],keep='first')
should work. Let me know if it doesn't work as expected.
If you have a minute, can you take a look at #127 and let me know if there's anything missing? I'd like to merge it once you've given it a green light.
Thanks a million for all of the hard work, I really am trying to keep up but am distracted by actually using all of the new methods!
I will try my best to keep intake-esm stable moving forward :) and will do a better job of documenting changes in the future. Thank you for your collaboration!
@andersy005, the changes look good except for grid_label:
.../CMIP/EC-Earth-Consortium/EC-Earth3/historical/r24i1p1f1/Omon/so/gn/v20190411/so_Omon_EC-Earth3_historical_r24i1p1f1_gn_185001-185012.nc
has variable_id = 'so'; the grid label turns out to be 'rtium' (it split 'Consortium' at 'so')
Good catch. My approach isn't robust enough and I now expect it to fail for other cases as well. Instead of splitting at "so", I am going to update it to split at "/so/":
In [10]: a = ".../CMIP/EC-Earth-Consortium/EC-Earth3/historical/r24i1p1f1/Omon/so/gn/v20190411/so_Omon_EC-Earth3_historica
...: l_r24i1p1f1_gn_185001-185012.nc"
In [11]: variable = "so"
In [12]: a.split("/so/")
Out[12]:
['.../CMIP/EC-Earth-Consortium/EC-Earth3/historical/r24i1p1f1/Omon',
'gn/v20190411/so_Omon_EC-Earth3_historical_r24i1p1f1_gn_185001-185012.nc']
In [13]: a.split("so")
Out[13]:
['.../CMIP/EC-Earth-Con',
'rtium/EC-Earth3/historical/r24i1p1f1/Omon/',
'/gn/v20190411/',
'_Omon_EC-Earth3_historical_r24i1p1f1_gn_185001-185012.nc']
In [24]: fileparts['source_id'] = source_id
In [25]: fileparts['variable_id'] = variable_id
In [26]: fileparts
Out[26]: {'source_id': 'EC-Earth3', 'variable_id': 'so'}
In [27]: parent.split(f"/{fileparts['source_id']}/")
Out[27]: ['.../CMIP/EC-Earth-Consortium', 'historical/r24i1p1f1/Omon/so/gn/v20190411']
In [28]: parent.split(f"/{fileparts['variable_id']}/")[1].strip('/').split('/')[0]
Out[28]: 'gn'
yes, that works!
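Putting In [27] and In [28] together, a consolidated sketch of that splitting strategy (the function name is hypothetical):

def extract_grid_label(parent_dir, source_id, variable_id):
    # split at /<source_id>/ first so a variable name like 'so' cannot match
    # inside an institution name such as 'EC-Earth-Consortium', then split at
    # /<variable_id>/ and take the next path component: the grid label
    member = parent_dir.split(f"/{source_id}/")[-1]
    return member.split(f"/{variable_id}/")[-1].strip("/").split("/")[0]

path = (".../CMIP/EC-Earth-Consortium/EC-Earth3/historical/"
        "r24i1p1f1/Omon/so/gn/v20190411")
print(extract_grid_label(path, "EC-Earth3", "so"))  # -> 'gn'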
@naomi-henderson,
In one of the comments, you pointed out that you use the unique() method from pandas:
All of my code uses dataframe methods, not dataset methods. For example, I get all of the possible values of activity_id by using collection.df.activity_id.unique().
I just implemented two new methods (nunique() and unique()) that try to mimic pandas' methods in #128:
In [1]: import intake
In [2]: col = intake.open_esm_metadatastore(collection_name="GLADE-CMIP5")
In [3]: col.nunique()
Out[3]:
resource 3
resource_type 1
direct_access 1
activity 1
ensemble_member 218
experiment 51
file_basename 312093
file_fullpath 615853
frequency 6
institute 25
mip_table 15
model 53
modeling_realm 7
product 3
temporal_subset 9121
variable 454
version 489
dtype: int64
In [4]: col.unique(columns=['frequency', 'modeling_realm'])
Out[4]:
{'frequency': {'count': 6, 'values': ['mon', 'day', '6hr', 'yr', '3hr', 'fx']},
'modeling_realm': {'count': 7,
'values': ['atmos',
'land',
'ocean',
'seaIce',
'ocnBgchem',
'landIce',
'aerosol']}}
collection.df.activity_id.unique() can now be replaced with collection.unique(columns='activity_id')
@andersy005 fantastic! That will be very convenient, thank you
You are welcome! If you have ideas for other useful utility functions/methods, let me know.
Hi all, especially @andersy005. Sorry this took so long to report, but I have been using the old intake-esm until recently and just noticed these small annoyances in cmip.py.
ISSUES: There are a few issues with the CMIP6Collection after the latest re-factor.
- table_id and grid_label values are sometimes changed to incorrect values
- version key values are all disappearing for me when saving the collection
- activity_id and institution_id were set before
were set beforeVERSION:
DETAILS:
1) When table_id is AERmonZ it becomes AERmon, and Oday and CFday become day. When grid_label is gr1, gr1z, grz, or gr2 it becomes gr, and gnz becomes gn.
2) The version key is set correctly in cmip.py:CMIP6Collection/_get_file_attrs, but when saving to the intake collection csv file, the version key values are all removed (blank). The trouble seems to be a conflict with the name version in intake, not in intake-esm. Using another name, e.g. version_id, fixes the problem.
, fixes the problem.PROPOSED FIX: The following lines added to
cmip.py/CMIP6Collection/_get_file_attrs
:and the following deleted: