intake / intake-esm

An intake plugin for parsing an Earth System Model (ESM) catalog and loading assets into xarray datasets.
https://intake-esm.readthedocs.io
Apache License 2.0

CMIP6 cleanup needed after code refactoring #111

Closed · naomi-henderson closed this issue 4 years ago

naomi-henderson commented 4 years ago

Hi all, especially @andersy005. Sorry this took so long to report, but I had been using the old intake-esm until recently and only just noticed these small annoyances in cmip.py

ISSUES: There are a few issues with the CMIP6Collection after the latest refactor.

  1. regular expression matching is not working for table_id and grid_label
  2. the version key values all disappear for me when saving the collection
  3. activity_id and institution_id are no longer set (they were set before)

VERSION:

# Name                    Version                   Build  Channel
intake-esm                2019.5.11.post112          pypi_0    pypi

DETAILS:

  1. When table_id is AERmonZ, it becomes AERmon, and Oday and CFday both become day. When grid_label is gr1, gr1z, grz, or gr2, it becomes gr, and gnz becomes gn

  2. The version key is set correctly in cmip.py: CMIP6Collection/_get_file_attrs, but when saving to the intake collection csv file, the version key values are all removed (blank). The trouble seems to be a name conflict with version inside intake itself, not in intake-esm. Using another name, e.g. version_id, fixes the problem.

PROPOSED FIX: add the following lines to cmip.py/CMIP6Collection/_get_file_attrs:

        # Indices assume the ESGF DRS directory layout:
        # .../activity_id/institution_id/source_id/experiment_id/
        #     member_id/table_id/variable_id/grid_label/version/<filename>
        f_split = filepath.split('/')
        fileparts['activity_id'] = f_split[-10]
        fileparts['institution_id'] = f_split[-9]
        fileparts['table_id'] = f_split[-5]
        fileparts['grid_label'] = f_split[-3]
        fileparts['version_id'] = f_split[-2]

and delete the following:

        table_id = CMIP6Collection._extract_attr_with_regex(filepath, regex=table_id_regex)
        grid_label = CMIP6Collection._extract_attr_with_regex(file_basename, regex=grid_label_regex)
        version = CMIP6Collection._extract_attr_with_regex(filepath, regex=version_regex) or 'v0'

        fileparts['table_id'] = table_id
        fileparts['grid_label'] = grid_label
        fileparts['version'] = version
andersy005 commented 4 years ago

Hi @naomi-henderson,

Thank you for reporting these annoyances/bugs and proposing straightforward fixes! I will work on them sometime today and will ping you once the fixes are ready.

naomi-henderson commented 4 years ago

@andersy005, it looks like the version issue was my own problem, so please keep the name version, not version_id. Thanks!

andersy005 commented 4 years ago

@naomi-henderson,

activity_id and institution_id are no longer set (they were set before)

With #65, it became clear that the assumption we made about the directory structure didn't work for everybody. So, we changed how some of the attributes, including activity_id, institution_id, etc., are obtained. The change consists of specifying these attributes in the YAML file used to build the catalog:

name: GLADE-CMIP6
collection_type: cmip6
data_sources:
  CESM2-WACCM-AerChemMIP:
    locations:
      - name: CESM2-WACCM-AerChemMIP-catalog
        loc_type: posix
        direct_access: True 
        urlpath: /glade/collections/cmip/CMIP6/AerChemMIP/NCAR/CESM2-WACCM
        exclude_dirs: ['*/files/*', '*/latest/*']
        file_extension: .nc

    extra_attributes:
      mip_era: CMIP6
      activity_id: AerChemMIP
      institution_id: NCAR

Here's a full version of the YAML file with entries for the CMIP6 data @ NCAR: https://github.com/NCAR/intake-esm-datastore/blob/master/collection-input/glade-cmip6-collection.yml

When you get time, can you post the YAML file you are using to build the catalogue?

andersy005 commented 4 years ago

When table_id is AERmonZ, it becomes AERmon, and Oday and CFday both become day. When grid_label is gr1, gr1z, grz, or gr2, it becomes gr, and gnz becomes gn

I can confirm that this is happening on my end too:

In [1]: import intake
In [2]: col = intake.open_esm_metadatastore(collection_name="GLADE-CMIP6")

In [3]: col.df.head()
Out[3]:
                                                 resource resource_type  direct_access activity_id    experiment_id  ... source_id table_id                 time_range variable_id    version
355093  CESM2-PAMIP:CESM2-PAMIP-catalog:posix:/glade/c...         posix           True       PAMIP  pdSST-futAntSIC  ...     CESM2  6hrPlev  200006010000-200106010000          pr  v20190614
355092  CESM2-PAMIP:CESM2-PAMIP-catalog:posix:/glade/c...         posix           True       PAMIP  pdSST-futAntSIC  ...     CESM2  6hrPlev  200006010000-200106010000         psl  v20190614
355089  CESM2-PAMIP:CESM2-PAMIP-catalog:posix:/glade/c...         posix           True       PAMIP  pdSST-futAntSIC  ...     CESM2  6hrPlev  200006010000-200106010000     sfcWind  v20190614
355090  CESM2-PAMIP:CESM2-PAMIP-catalog:posix:/glade/c...         posix           True       PAMIP  pdSST-futAntSIC  ...     CESM2  6hrPlev  200006010000-200106010000         tas  v20190614
355091  CESM2-PAMIP:CESM2-PAMIP-catalog:posix:/glade/c...         posix           True       PAMIP  pdSST-futAntSIC  ...     CESM2  6hrPlev  200006010000-200106010000      zg1000  v20190614

[5 rows x 17 columns]

In [4]: col.df.table_id.unique()
Out[4]:
array(['6hrPlev', 'AERday', 'AERmon', 'Amon', 'CFday', 'Eday', 'Emon',
       'LImon', 'Lmon', 'SIday', 'day', 'fx', 'Omon', 'SImon', 'CFmon',
       'ImonAnt', 'ImonGre', '3hr', '6hrLev', 'Efx', 'Eyr', 'IfxGre',
       'Oday', 'Ofx', 'Oyr', 'E1hr', 'E3hr', 'CFsubhr'], dtype=object)

In [5]: col.df.source_id.unique()
Out[5]:
array(['CESM2', 'CanESM5', 'IPSL-CM6A-LR', 'MIROC6', 'MRI-ESM2-0',
       'GISS-E2-1-G', 'historical', 'GISS-E2-1-H', 'CNRM-CM6-1',
       'CESM2-WACCM', 'AWI-CM-1-1-MR', 'BCC-CSM2-MR', 'BCC-ESM1',
       'FGOALS-f3-L', 'E3SM-1-0', 'EC-Earth3-LR', '1pctCO2',
       'abrupt-4xCO2', 'amip', 'piControl', 'GFDL-AM4', 'GFDL-CM4',
       'SAM0-UNICON', 'CNRM-ESM2-1', 'UKESM1-0-LL', 'EC-Earth3'],
      dtype=object)

In [6]: col.df.grid_label.unique()
Out[6]: array(['gn', 'gr', 'gm'], dtype=object)

In [7]: col.df.activity_id.unique()
Out[7]:
array(['PAMIP', 'CMIP', 'ScenarioMIP', 'AerChemMIP', 'CFMIP', 'LS3MIP',
       'LUMIP'], dtype=object)

In [8]: col.df.institution_id.unique()
Out[8]:
array(['NCAR', 'CCCma', 'IPSL', 'MIROC', 'MRI', 'NASA-GISS',
       'CNRM-CERFACS', 'AWI', 'BCC', 'CAS', 'E3SM-Project',
       'EC-Earth-Consortium', 'NOAA-GFDL', 'SNU', 'MOHC'], dtype=object)

and the issue stems from these lines:

https://github.com/NCAR/intake-esm/blob/70d90b724ca51dbb5852a7b570c0aaa4f3684fbd/intake_esm/cmip.py#L101-L103

as you pointed out.
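For the record, here is a minimal sketch of the failure mode (the patterns below are hypothetical stand-ins, not the exact regexes in cmip.py). re.search returns the leftmost match, and within an alternation the first alternative that matches wins, so a shorter id that is a prefix of a longer one shadows it:

    import re

    # Hypothetical stand-ins for table_id_regex / grid_label_regex
    table_id_regex = r'AERmon|AERmonZ|day|Oday|CFday'   # shorter ids listed first
    grid_label_regex = r'gn|gr'

    print(re.search(table_id_regex, 'pr_AERmonZ_CESM2_gn.nc').group())
    # 'AERmon' -- the 'AERmon' alternative matches before 'AERmonZ' is tried
    print(re.search(grid_label_regex, 'tos_Omon_CESM2_gr1z.nc').group())
    # 'gr' -- 'gr1z' is never matched in full

Sorting the alternatives longest-first, or anchoring the pattern on the surrounding '_' and '/' delimiters, avoids this.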

We used to have these lines:

        f_split = filepath.split('/')
        fileparts['activity_id'] = f_split[-10]
        fileparts['institution_id'] = f_split[-9]
        fileparts['table_id'] = f_split[-5]
        fileparts['grid_label'] = f_split[-3]
        fileparts['version_id'] = f_split[-2]

but, as I pointed out, with #62 it's hard to guarantee that everybody's directory structure is the same. That is why we replaced them with regular expressions, which let us avoid depending on the directory structure.

I am now looking into a fix for this issue, but it will likely keep the regular expressions in place.

naomi-henderson commented 4 years ago

@andersy005, hmmm, too bad - that means I have to regenerate the YAML file each time ESGF adds a new activity_id or institution_id to the database (which is still happening). I am automatically downloading new data to add to our local collection and then using intake-esm to generate a local catalog, so I would need to check the YAML file for a proper section prior to making the local catalog. I am trying to build an end-to-end solution for downloading new data from ESGF, checking its integrity, and then converting to zarr, so I am trying to keep it as 'fool'-proof as possible (me being the 'fool' in question)

For now I think I will just generate the keys I need on the fly based on my directory structure and then re-write the local catalog each time. When I get a chance I will probably write a piece of code which auto-generates a new YAML prior to using intake-esm.

thanks again for this terrific tool!

naomi-henderson commented 4 years ago

@andersy005, one more thing while you are looking at the use of regular expressions. It turns out that not all versions match r'v\d{4}\d{2}\d{2}'. The CMIP6.CMIP.CAMS.CAMS-CSM1-0 folks, for example, used v1. Maybe they will fix it, but there are a disturbing number of idiosyncrasies in the CMIP6 data ... so it was more reliable to get the version from the path. Just my 2 cents

andersy005 commented 4 years ago

that means I have to regenerate the YAML file each time ESGF adds a new activity_id or institution_id to the database (which is still happening).

This applies to the CMIP6 data @ NCAR as well. The data keeps appearing on the filesystem. The YAML file I linked to was semi-auto-generated a few weeks ago, but I am pretty certain we've had new datasets land on disk since then. We are also interested in ways to automate the process of generating the YAML file, since the datasets on disk are not static.

When I get a chance I will probably write a piece of code which auto-generates a new YAML prior to using intake-esm.

If you are interested in how I automated the YAML file generation, I just uploaded the notebook with details: https://gist.github.com/andersy005/5cc53f9285ae6c0abb2ee573250b4ba9 Let me know if you find it useful, and we can work together on standardizing this functionality.

andersy005 commented 4 years ago

Maybe they will fix it, but there are a disturbing number of idiosyncrasies in the CMIP6 data

I concur. It's not always guaranteed that everyone is following the official data reference syntax (DRS).

so it was more reliable to get the version from the path. Just my 2 cents

I personally am not a big fan of regular expressions. I like their flexibility, but their use comes at a cost too. I will revisit previous implementations and see if regular expressions are really worthwhile. If not, I will see if we can get everything to work without them.

naomi-henderson commented 4 years ago

@andersy005 , thanks for your very useful suggestions and your notebook for generating the YAML. Could you also post your Jinja2 template file?

naomi-henderson commented 4 years ago

Never mind, I found it, thanks!

andersy005 commented 4 years ago

@naomi-henderson,

I removed all regular expression matching in #113 except the version one (I updated it). So far, it seems to be working:

In [10]: col.df.grid_label.unique()                                                                                           
Out[10]: 
array(['gn', 'gr', 'grz', 'gnz', 'gra', 'grg', 'gr1', 'gr2', 'gr1z',
       'gr2z', 'gnMVSyfC84507-000912', 'gm'], dtype=object)

In [11]: col.df.version.unique()                                                                                              
Out[11]: 
array(['v20190614', 'v20190528', 'v20190430', 'v20190429', 'v20190306',
       'v20190326', 'v20180914', 'v20181109', 'v20180803', 'v20190311',
       'v20181214', 'v20181212', 'v20190308', 'v20190603', 'v20180830',
       'v20181015', 'v20190403', 'v20190313', 'v20190522', 'v20190415',
       'v20190119', 'v20190125', 'v20190514', 'v20190606', 'v20190531',
       'v20190408', 'v20190419', 'v20190302', 'v20190304', 'v20190507',
       'v20181218', 'v20181122', 'v20190226', 'v20181012', 'v20181016',
       'v20190121', 'v20190315', 'v20190116', 'v20181126', 'v20181213',
       'v20181114', 'v20181127', 'v20181009', 'v20190221', 'v20190613',
       'v20190611', 'v20190530', 'v20190202', 'v20181217', 'v20181227',
       'v20181129', 'v20181202', 'v20181211', 'v1', 'v20190422',
       'v20190508', 'v20181108', 'v20190206', 'v20180608', 'v20190103',
       'v20190605', 'v20180727', 'v20190305', 'v20190118', 'v20181005',
       'v20180802', 'v20181123', 'v20181022', 'v20180808', 'v2',
       'v20190222', 'v20180905', 'v20181017', 'v20180920', 'v20181002',
       'v20180827', 'v20180824', 'v20190410', 'v20190425', 'v20190220',
       'v20190401', 'v20190227', 'v20190320', 'v20190218', 'v20190319',
       'v20190723', 'v20180807', 'v20180301', 'v20180701', 'v20180319',
       'v20190201', 'v20190323', 'v20190314', 'v20180626', 'v20180705',
       'v20181203', 'v20180917', 'v20180814', 'v20181018', 'v20181116',
       'v20181026', 'v20181205', 'v20181206', 'v20181115', 'v20190406',
       'v20190404', 'v20190623', 'v20190219', 'v20190328', 'v20190307',
       'v20190503', 'v20190510', 'v20190620', 'v20190617', 'v20190411',
       'v20181031', 'v20190405', 'v20181106', 'v20190502', 'v20181119',
       'v20181102', 'v20180828', 'v20181107', 'v20190208', 'v20190604',
       'v20190624', 'v20180829'], dtype=object)

In [12]: col.df.table_id.unique()                                                                                             
Out[12]: 
array(['6hrPlev', 'AERday', 'AERmonZ', 'Amon', 'CFday', 'Eday', 'Emon',
       'EmonZ', 'LImon', 'Lmon', 'SIday', 'day', 'fx', 'AERmon', 'Omon',
       'SImon', 'CFmon', 'ImonAnt', 'ImonGre', '3hr', '6hrLev',
       '6hrPlevPt', 'EdayZ', 'Efx', 'Eyr', 'IfxGre', 'Oday', 'Ofx', 'Oyr',
       'E1hr', 'E3hr', 'CFsubhr'], dtype=object)

In [13]: col.df.institution_id.unique()                                                                                       
Out[13]: 
array(['NCAR', 'CCCma', 'IPSL', 'MIROC', 'MRI', 'NASA-GISS',
       'CNRM-CERFACS', 'AWI', 'BCC', 'CAMS', 'CAS', 'E3SM-Project',
       'EC-Earth-Consortium', 'NOAA-GFDL', 'SNU', 'MOHC'], dtype=object)

It turns out that not all versions match r'v\d{4}\d{2}\d{2}'. The CMIP6.CMIP.CAMS.CAMS-CSM1-0 folks, for example, used v1

I updated the version regular expression to version_regex = r'v\d{4}\d{2}\d{2}|v\d{1}'
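A quick check of the updated pattern on two illustrative (made-up) paths:

    import re

    version_regex = r'v\d{4}\d{2}\d{2}|v\d{1}'
    paths = [
        '/data/CMIP6/CMIP/NCAR/CESM2/historical/r1i1p1f1/Amon/tas/gn/v20190614/tas.nc',
        '/data/CMIP6/CMIP/CAMS/CAMS-CSM1-0/historical/r1i1p1f1/Amon/tas/gn/v1/tas.nc',
    ]
    for path in paths:
        match = re.search(version_regex, path)
        print(match.group() if match else 'v0')   # 'v0' fallback, as in cmip.py
    # v20190614
    # v1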

In [25]: col.search(version=['v1']).query_results[['institution_id', 'variable_id', 'grid_label', 'version']].head(10)         
Out[25]: 
      institution_id variable_id grid_label version
29182           CAMS          ps         gn      v1
29183           CAMS          ts         gn      v1
andersy005 commented 4 years ago

@naomi-henderson,

Hopefully, #113 fixes all the issues you pointed out. Thank you for the bug report - it's incredibly useful. Also, it's great to know that intake-esm is useful beyond NCAR!

n-henderson commented 4 years ago

Fantastic, @andersy005 ! You even got the 'v?' versions working.

I tried to include institution_id and activity_id in my template, but it is complicated by the fact that, on our local machine, we have stored the CMIP6 data with the same directory structure across many drives. So I was using multiple location sections (one for each 8TB drive) in the yaml file. Each drive has whatever activity_ids and institution_ids happen to be stored on it. I then use the intake-esm collection to generate a single master directory with links to all of the files for our data server.

You can see that making a separate entry for each combination of [institution_id, activity_id, location] creates a large number of entries (and produces that many tqdm progress bars!) and generally seems to complicate the very simple task of determining the institution_id and activity_id. So I am still using the multiple locations, resetting those 2 keys after generating the collection, and then overwriting the ~/.intake_esm/collection/cmip6/AR6_PANGEO.cmip6.csv file.

Fortunately, when we upload to the cloud I can avoid this issue by storing the zarr files in a common directory (as you do on the glade system at NCAR); I will be able to have a single location section and will not have to do the reset/overwrite step.

Any suggestions? Here is my yaml file:

name: AR6_PANGEO
collection_type: cmip6
data_sources:
  fletcher.ldeo.columbia.edu:
    locations:
      - name: dm10_AR6-Omon5
        loc_type: posix
        direct_access: True 
        urlpath: /dm10/naomi/AR6-Omon5
        exclude_dirs: ['*/files/*', '*/latest/*']
        file_extension: .nc

      - name: dm11_AR6-Omon6
        loc_type: posix
        direct_access: True 
        urlpath: /dm11/naomi/AR6-Omon6
        exclude_dirs: ['*/files/*', '*/latest/*']
        file_extension: .nc

      - name: dm12_AR6-Omon7
        loc_type: posix
        direct_access: True 
        urlpath: /dm12/naomi/AR6-Omon7
        exclude_dirs: ['*/files/*', '*/latest/*']
        file_extension: .nc

      - name: dm13_AR6-Amon2
        loc_type: posix
        direct_access: True 
        urlpath: /dm13/naomi/AR6-Amon2
        exclude_dirs: ['*/files/*', '*/latest/*']
        file_extension: .nc

      - name: dm13_AR6-Omon7-2
        loc_type: posix
        direct_access: True 
        urlpath: /dm13/naomi/AR6-Omon7-2
        exclude_dirs: ['*/files/*', '*/latest/*']
        file_extension: .nc

      - name: dm14_AR6-AERmon2
        loc_type: posix
        direct_access: True 
        urlpath: /dm14/naomi/AR6-AERmon2
        exclude_dirs: ['*/files/*', '*/latest/*']
        file_extension: .nc

      - name: dm15_AR6-day2
        loc_type: posix
        direct_access: True 
        urlpath: /dm15/naomi/AR6-day2
        exclude_dirs: ['*/files/*', '*/latest/*']
        file_extension: .nc

      - name: dm16_AR6-Omon7-3
        loc_type: posix
        direct_access: True 
        urlpath: /dm16/naomi/AR6-Omon7-3
        exclude_dirs: ['*/files/*', '*/latest/*']
        file_extension: .nc

      - name: dm1_AR6-mon-other
        loc_type: posix
        direct_access: True 
        urlpath: /dm1/naomi/AR6-mon-other
        exclude_dirs: ['*/files/*', '*/latest/*']
        file_extension: .nc

      - name: dm2_AR6-AERmon
        loc_type: posix
        direct_access: True 
        urlpath: /dm2/naomi/AR6-AERmon
        exclude_dirs: ['*/files/*', '*/latest/*']
        file_extension: .nc

      - name: dm3_AR6-Amon
        loc_type: posix
        direct_access: True 
        urlpath: /dm3/naomi/AR6-Amon
        exclude_dirs: ['*/files/*', '*/latest/*']
        file_extension: .nc

      - name: dm4_AR6-day
        loc_type: posix
        direct_access: True 
        urlpath: /dm4/naomi/AR6-day
        exclude_dirs: ['*/files/*', '*/latest/*']
        file_extension: .nc

      - name: dm5_AR6-other
        loc_type: posix
        direct_access: True 
        urlpath: /dm5/naomi/AR6-other
        exclude_dirs: ['*/files/*', '*/latest/*']
        file_extension: .nc

      - name: dm6_AR6-Omon1
        loc_type: posix
        direct_access: True 
        urlpath: /dm6/naomi/AR6-Omon1
        exclude_dirs: ['*/files/*', '*/latest/*']
        file_extension: .nc

      - name: dm7_AR6-Omon2
        loc_type: posix
        direct_access: True 
        urlpath: /dm7/naomi/AR6-Omon2
        exclude_dirs: ['*/files/*', '*/latest/*']
        file_extension: .nc

      - name: dm8_AR6-Omon3
        loc_type: posix
        direct_access: True 
        urlpath: /dm8/naomi/AR6-Omon3
        exclude_dirs: ['*/files/*', '*/latest/*']
        file_extension: .nc

      - name: dm9_AR6-Omon4
        loc_type: posix
        direct_access: True 
        urlpath: /dm9/naomi/AR6-Omon4
        exclude_dirs: ['*/files/*', '*/latest/*']
        file_extension: .nc

    extra_attributes:
      mip_era: CMIP6
andersy005 commented 4 years ago

@naomi-henderson,

creates a large number of entries (and produces that many tqdm progress bars!)

If we make the tqdm progress bar optional (basically allow users to opt-in or opt-out in case they have a massive YAML file), would making a separate entry for each combination of [institution_id, activity_id, location] still be a problem?

naomi-henderson commented 4 years ago

@andersy005, but this is 20*18*(number of locations) entries, which in my case is 5,766. I haven't benchmarked it (perhaps I should), but do the many separate data searches have so little overhead that this is a feasible option?

Would it be possible instead to set institution_id and activity_id in cmip.py as before (assuming the ESGF directory structure), but let these keys be ignored by those who prefer to use a yaml file to reset them?

Or ... just thinking out loud ... dictionaries would allow us to get institution_id from source_id and activity_id from experiment_id in most cases. There is one experiment_id contained in two activity_ids (piClim-aer is in both AerChemMIP and RFMIP), but this is due to a mistake in the activity_drs vs activity_id keys, and only RFMIP is really the correct activity_id for piClim-aer.

andersy005 commented 4 years ago

Would it be possible instead to set institution_id and activity_id in cmip.py as before (assuming the ESGF directory structure), but let these keys be ignored by those who prefer to use a yaml file to reset them?

After spending time fixing the regular expression issues, I am in favor of this option. I will merge #113, revert to the ESGF directory structure as the default, and let users with a different directory structure override these attributes via YAML.
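A rough sketch of that precedence (a hypothetical helper, not the actual implementation): derive the attributes from the ESGF layout by default, then let the YAML extra_attributes win when provided.

    def get_attrs(filepath, extra_attributes=None):
        # Default: assume the ESGF DRS layout
        # .../activity_id/institution_id/source_id/experiment_id/
        #     member_id/table_id/variable_id/grid_label/version/<filename>
        parts = filepath.split('/')
        attrs = {
            'activity_id': parts[-10],
            'institution_id': parts[-9],
            'table_id': parts[-5],
            'grid_label': parts[-3],
            'version': parts[-2],
        }
        # Users with a non-standard tree override via the YAML section
        attrs.update(extra_attributes or {})
        return attrs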

andersy005 commented 4 years ago

@naomi-henderson

Or ... just thinking outload ... dictionaries would allow us to get institution_id from source_id and get activity_id from experiment_id in most cases.

Since the majority of this information (source_id, institution_id, etc.), if not all of it, can be retrieved from https://github.com/WCRP-CMIP/CMIP6_CVs, the dictionary approach is also doable.
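For example, a lookup could be built straight from the controlled vocabularies (a sketch assuming the current JSON layout of CMIP6_source_id.json: a top-level 'source_id' object whose entries carry an 'institution_id' list):

    import json
    import urllib.request

    url = ('https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/'
           'master/CMIP6_source_id.json')
    with urllib.request.urlopen(url) as response:
        cv = json.load(response)

    # source_id -> institution_id (take the first entry; most have exactly one)
    institution_for_source = {
        source: meta['institution_id'][0]
        for source, meta in cv['source_id'].items()
    }
    print(institution_for_source.get('CESM2'))   # expected: 'NCAR'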

Now, I am in a dilemma over which approach to choose between the two that you proposed :)

andersy005 commented 4 years ago

@naomi-henderson,

When you get time, can you post a snippet of what the new YAML file content would look like for you if we revert to depending solely on the ESGF directory structure? I would like to see the commonalities between your YAML file and my version in order to determine what changes need to be made to the existing codebase to support both use cases.

naomi-henderson commented 4 years ago

I would just use the yaml posted in https://github.com/NCAR/intake-esm/issues/111#issuecomment-515724876 - one section for each drive.

andersy005 commented 4 years ago

@naomi-henderson, @aaronspring

#126 reverts to depending on the ESGF directory structure when parsing CMIP5/CMIP6 attributes. With this change, specifying the root directory should suffice:

name: GLADE-CMIP6
collection_type: cmip6
data_sources:
  GLADE-DATA:
    locations:
      - name: CMIP-AP
        loc_type: posix
        direct_access: True 
        urlpath: /glade/collections/cmip/CMIP6
        exclude_dirs: ['*/files/*', '*/latest/*']
        file_extension: .nc

I am going to merge it soon. Please give it a try and let me know whether it works for you.

n-henderson commented 4 years ago

@andersy005, I have just updated after your latest commit. I am having a few issues: a KeyError: 'progress-bar' until I added config.set({'progress-bar': False}), and a KeyError: 'direct-access' that prevented the netcdf collection from being saved until I commented out a line in collection.py/_persist_db_file (the self._ds.to_netcdf call that sets the encoding for the boolean direct_access). But these are minor problems, easily fixed.

What is the advantage of the new netcdf db file over the old csv? And why, in particular, the switch from dataframes to datasets? All of my code uses dataframe methods, not dataset methods. For example, I get all of the possible values of activity_id by using collection.df.activity_id.unique(). Of course I can use to_dataframe() to convert, but I just wondered what motivated this. The csv/dataframe works better with mixed datatypes than netcdf/datasets, no?

Anyway, I will continue to work through my codes to get them to work again. I see that the version parsing is working well, but we now get a grid_label = 'gn3RaXbM42915' from

.../CMIP/SNU/SAM0-UNICON/historical/r1i1p1f1/day/tas/gn/v20190323/tas_day_SAM0-UNICON_historical_r1i1p1f1_gn3RaXbM42915.nc

because the grid_label parsing has been changed from directory parsing to file-name parsing, giving an incorrect value.

Thanks a million for all of the hard work, I really am trying to keep up but am distracted by actually using all of the new methods!

naomi-henderson commented 4 years ago

@andersy005 : quick question (perhaps @aaronspring has figured this out already?)

When I do a search on the new type of collection, I would like to use dataframe methods such as drop_duplicates. With the old csv/dataframe, I used to use query_results following a search:

#OLD VERSION:
col.search(variable_id=['hfls'], table_id='Amon').query_results.drop_duplicates(subset=["file_basename","version"],keep='first')

In the new netcdf/dataset version, query_results is not an option, so I am using the chained .get_results().to_dataframe() instead, which seems pretty convoluted:

#NEW VERSION
col.search(variable_id=['hfls'], table_id='Amon').get_results().to_dataframe().drop_duplicates(subset=["file_basename","version"],keep='first')

How are we really meant to be doing this in the new netcdf/dataset version?

As you can tell, I do not just use intake_esm to generate a catalog for intake in order to get the datasets. I am needing to do all kinds of checks on the CMIP6 netcdf files in order to clean up our local collection - and am heavily using the dataframes methods to accomplish this. If anyone is interested, I am also developing a long list of exceptions/problems with the netcdf files and how to fix or when to exclude.

andersy005 commented 4 years ago

What is the advantage of the new netcdf db file over the old csv? And why, in particular, the switch from dataframes to datasets?

@naomi-henderson, the motivation for switching from csv (dataframe) to netCDF (dataset) can be summarized as follows:

1) When persisting the dataframe as .csv, you lose all the information about the data types of columns. As a result, when the dataframe was loaded at another time, pandas had to do dtype inference, which sometimes wasn't consistent. For instance, a boolean column could be loaded as float; a small illustration of the dtype loss follows below.

2) There was some useful information that intake-esm needed to know about the collection at runtime, for instance the collection type. With a dataframe, we couldn't save this information as part of the csv; the workaround was to encode some of it in the csv filename. With a netCDF file, we can attach all kinds of attributes to the dataset:

In [4]: col.ds                                                                                                                                     
Out[4]: 
<xarray.Dataset>
Dimensions:          (index: 615296)
Coordinates:
  * index            (index) int64 0 1 2 3 4 ... 615292 615293 615294 615295
Data variables:
    resource         (index) object ...
    resource_type    (index) object ...
    direct_access    (index) bool True True True True ... True True True True
    activity         (index) object ...
    ensemble_member  (index) object ...
    experiment       (index) object ...
    file_basename    (index) object ...
    file_fullpath    (index) object ...
    frequency        (index) object ...
    institute        (index) object ...
    mip_table        (index) object ...
    model            (index) object ...
    modeling_realm   (index) object ...
    product          (index) object ...
    temporal_subset  (index) object ...
    variable         (index) object ...
    version          (index) object ...
Attributes:
    created_at:             2019-08-07T18:05:09.371259
    intake_esm_version:     2019.5.11.post153
    intake_version:         0.5.2
    intake_xarray_version:  0.3.1
    collection_spec:        {"name": "GLADE-CMIP5", "collection_type": "cmip5...
    name:                   GLADE-CMIP5
    collection_type:        cmip5

When we open this netCDF file, it's a matter of looking into the global attributes of the dataset to find all sorts of information. Some of this info, such as collection_type, is used internally by intake-esm. The rest of the global attributes are useful for debugging and provenance purposes.
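Two small illustrations. First, the dtype-loss problem from 1), with a made-up column rather than the actual collection schema:

    import io
    import pandas as pd

    df = pd.DataFrame({'member': ['001', '002']})   # strings with leading zeros
    print(df['member'].dtype)                       # object

    roundtrip = pd.read_csv(io.StringIO(df.to_csv(index=False)))
    print(roundtrip['member'].dtype, roundtrip['member'].tolist())
    # int64 [1, 2] -- the dtype was re-inferred and the leading zeros are gone

And second, reading the metadata back from the persisted netCDF file (hypothetical filename; the real files live under ~/.intake_esm):

    import xarray as xr

    ds = xr.open_dataset('GLADE-CMIP5.nc')
    print(ds.attrs['collection_type'])      # 'cmip5' -- used internally
    print(ds.attrs['intake_esm_version'])   # e.g. '2019.5.11.post153' -- provenance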

I am having a few issues: a KeyError: 'progress-bar' until I added config.set({'progress-bar': False}), and a KeyError: 'direct-access' that prevented the netcdf collection from being saved until I commented out a line in collection.py/_persist_db_file (the self._ds.to_netcdf call that sets the encoding for the boolean direct_access). But these are minor problems, easily fixed.

I recommend deleting the old YAML config files residing in ~/.intake_esm/ for new changes to take effect without conflicting with the old configurations.

In the new netcdf/dataset version, query_results is not an option, so I am using the chained .get_results().to_dataframe() instead, which seems pretty convoluted:

In the previous versions, we had two different ways of accessing the dataframe:

col = intake.open_esm_metadatastore(.......)
col.df # The entire collection

# Search
cat = col.search(......)
cat.query_results # Dataframe containing search results

For consistency, the .query_results attribute was replaced with the .ds attribute:

col = intake.open_esm_metadatastore(.......)
col.ds # The entire collection

# Search
cat = col.search(......)
cat.ds # Dataset containing search results

Therefore, the following will work for you:

#NEW VERSION
col.search(variable_id=['hfls'], table_id='Amon').ds.to_dataframe().drop_duplicates(subset=["file_basename","version"],keep='first')
naomi-henderson commented 4 years ago

@andersy005 Thanks! That is what I needed to know - and deleting the old yaml files definitely helps

A few more comments:

andersy005 commented 4 years ago

As you can tell, I do not just use intake_esm to generate a catalog for intake in order to get the datasets. I am needing to do all kinds of checks on the CMIP6 netcdf files in order to clean up our local collection - and am heavily using the dataframes methods to accomplish this. If anyone is interested, I am also developing a long list of exceptions/problems with the netcdf files and how to fix or when to exclude.

I'm sorry for making intake-esm a moving target in the last few weeks. I am hoping that things will stabilize soon.

One thing I can do to help is implement a df property that will basically allow you to use the previous .df attribute. In the background, intake-esm would still use the dataset internally, but as a user you could interface with intake-esm via .df.
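Something along these lines (a minimal sketch over a simplified stand-in class, not the actual intake-esm code):

    import pandas as pd
    import xarray as xr

    class Collection:
        def __init__(self, ds: xr.Dataset):
            self.ds = ds   # the dataset remains the internal representation

        @property
        def df(self) -> pd.DataFrame:
            # Users get the familiar dataframe view on demand
            return self.ds.to_dataframe()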

It would be nice to keep the old file_dirname key that we used to have in the CMIP6Collection - it saves me re-generating it from the other keys (I use it to name the zarr stores)

And while you are there, could we generate grid_label from the path? That is what is important, not the actual file name (see #111 (comment))

Definitely. I will open a new PR soon to address all these issues.

@naomi-henderson, Thank you for reporting all these issues. Feel free to ping me whenever I break something or you run into any other brick walls :) Your feedback is appreciated!

naomi-henderson commented 4 years ago

Thanks, @andersy005, it would be convenient to use .df instead of .ds.to_dataframe(), but not if it will cause future confusion. I can keep converting to a dataframe, and when I write new code I will try to use dataset methods.

Thanks for your patience and understanding while we try to keep up with the latest advances!

andersy005 commented 4 years ago

I just re-introduced the .df:

In [1]: import intake                                                                                                     

In [2]: col = intake.open_esm_metadatastore(collection_name="GLADE-CMIP6")                                                

In [3]: col.df                                                                                                            
Out[3]: 
                                                 resource resource_type  ...  variable_id    version
index                                                                    ...                        
0       GLADE-DATA:PAMIP:posix:/glade/collections/cmip...         posix  ...           pr  v20190614
1       GLADE-DATA:PAMIP:posix:/glade/collections/cmip...         posix  ...          psl  v20190614
2       GLADE-DATA:PAMIP:posix:/glade/collections/cmip...         posix  ...      sfcWind  v20190614
3       GLADE-DATA:PAMIP:posix:/glade/collections/cmip...         posix  ...          tas  v20190614
4       GLADE-DATA:PAMIP:posix:/glade/collections/cmip...         posix  ...       zg1000  v20190614
...                                                   ...           ...  ...          ...        ...
418858  GLADE-DATA:CMIP:posix:/glade/collections/cmip/...         posix  ...     sisnconc  v20190429
418859  GLADE-DATA:CMIP:posix:/glade/collections/cmip/...         posix  ...     sisnmass  v20190429
418860  GLADE-DATA:CMIP:posix:/glade/collections/cmip/...         posix  ...    sisnthick  v20190429
418861  GLADE-DATA:CMIP:posix:/glade/collections/cmip/...         posix  ...      sispeed  v20190429
418862  GLADE-DATA:CMIP:posix:/glade/collections/cmip/...         posix  ...          siv  v20190429

[418863 rows x 16 columns]

In [5]: cat = col.search(variable_id='pr')                                                                                

In [6]: cat.df                                                                                                            
Out[6]: 
                                                 resource resource_type  ...  variable_id    version
index                                                                    ...                        
0       GLADE-DATA:PAMIP:posix:/glade/collections/cmip...         posix  ...           pr  v20190614
18      GLADE-DATA:PAMIP:posix:/glade/collections/cmip...         posix  ...           pr  v20190614
81      GLADE-DATA:PAMIP:posix:/glade/collections/cmip...         posix  ...           pr  v20190614
93      GLADE-DATA:PAMIP:posix:/glade/collections/cmip...         posix  ...           pr  v20190528
111     GLADE-DATA:PAMIP:posix:/glade/collections/cmip...         posix  ...           pr  v20190528
...                                                   ...           ...  ...          ...        ...
417683  GLADE-DATA:ScenarioMIP:posix:/glade/collection...         posix  ...           pr  v20190119
417713  GLADE-DATA:CMIP:posix:/glade/collections/cmip/...         posix  ...           pr  v20190125
418191  GLADE-DATA:CMIP:posix:/glade/collections/cmip/...         posix  ...           pr  v20190603
418192  GLADE-DATA:CMIP:posix:/glade/collections/cmip/...         posix  ...           pr  v20190603
418496  GLADE-DATA:CMIP:posix:/glade/collections/cmip/...         posix  ...           pr  v20190429

[3871 rows x 16 columns]

You will notice that I didn't re-introduce query_results. For consistency, search results are also exposed via .df:

In [5]: cat = col.search(variable_id='pr')                                                                                

In [6]: cat.df     
andersy005 commented 4 years ago

@naomi-henderson,

it would be convenient to use .df instead of .ds.to_dataframe(), but not if it will cause future confusion. I can keep converting to a dataframe, and when I write new code I will try to use dataset methods.

with #127

col.search(variable_id=['hfls'], table_id='Amon')\
     .df.drop_duplicates(subset=["file_basename","version"],keep='first')

should work. Let me know if it doesn't work as expected.

If you have a minute, can you take a look at #127 and let me know if there's anything missing? I'd like to merge it once you've given it a green light.

andersy005 commented 4 years ago

Thanks a million for all of the hard work, I really am trying to keep up but am distracted by actually using all of the new methods!

I will try my best to keep intake-esm stable moving forward :), and will do a better job of documenting changes in the future. Thank you for your collaboration!

naomi-henderson commented 4 years ago

@andersy005 , the changes look good except for grid_label:

.../CMIP/EC-Earth-Consortium/EC-Earth3/historical/r24i1p1f1/Omon/so/gn/v20190411/so_Omon_EC-Earth3_historical_r24i1p1f1_gn_185001-185012.nc

has variable_id = 'so', and the grid_label turns out to be 'rtium' (the split on 'so' hit 'Consortium')

andersy005 commented 4 years ago

Good catch. My approach isn't robust enough, and I now expect it to fail for other cases as well. Instead of splitting at "so", I am going to update it to split at "/so/":


In [10]: a = ".../CMIP/EC-Earth-Consortium/EC-Earth3/historical/r24i1p1f1/Omon/so/gn/v20190411/so_Omon_EC-Earth3_historica
    ...: l_r24i1p1f1_gn_185001-185012.nc"                                                                                 

In [11]: variable = "so"                                                                                                  

In [12]: a.split("/so/")                                                                                                  
Out[12]: 
['.../CMIP/EC-Earth-Consortium/EC-Earth3/historical/r24i1p1f1/Omon',
 'gn/v20190411/so_Omon_EC-Earth3_historical_r24i1p1f1_gn_185001-185012.nc']

In [13]: a.split("so")                                                                                                    
Out[13]: 
['.../CMIP/EC-Earth-Con',
 'rtium/EC-Earth3/historical/r24i1p1f1/Omon/',
 '/gn/v20190411/',
 '_Omon_EC-Earth3_historical_r24i1p1f1_gn_185001-185012.nc']
andersy005 commented 4 years ago
In [24]: fileparts['source_id'] = source_id                                                                               

In [25]: fileparts['variable_id'] = variable_id                                                                           

In [26]: fileparts                                                                                                        
Out[26]: {'source_id': 'EC-Earth3', 'variable_id': 'so'}

In [27]: parent.split(f"/{fileparts['source_id']}/")                                                                      
Out[27]: ['.../CMIP/EC-Earth-Consortium', 'historical/r24i1p1f1/Omon/so/gn/v20190411']

In [28]: parent.split(f"/{fileparts['variable_id']}/")[1].strip('/').split('/')[0]                                        
Out[28]: 'gn'
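Putting the above together as a self-contained snippet (illustrative path):

    import os

    filepath = ('/data/CMIP6/CMIP/EC-Earth-Consortium/EC-Earth3/historical/'
                'r24i1p1f1/Omon/so/gn/v20190411/'
                'so_Omon_EC-Earth3_historical_r24i1p1f1_gn_185001-185012.nc')
    variable_id = 'so'

    parent = os.path.dirname(filepath)
    # Split the directory (not the file name) at '/<variable_id>/', so variable
    # names embedded in institution names such as 'EC-Earth-Consortium' are safe
    grid_label = parent.split(f'/{variable_id}/')[1].strip('/').split('/')[0]
    print(grid_label)   # gn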
naomi-henderson commented 4 years ago

yes, that works!

andersy005 commented 4 years ago

@naomi-henderson,

In one of the comments you pointed out that you use the unique() method from pandas:

All of my code uses dataframe methods, not dataset methods. For example, I get all of the possible values of activity_id, by using collection.df.activity_id.unique().

I just implemented two new methods (nunique() and unique()) that try to mimic pandas' methods in #128:

In [1]: import intake                                                                                                                              

In [2]: col = intake.open_esm_metadatastore(collection_name="GLADE-CMIP5")                                                                         

In [3]: col.nunique()                                                                                                                              
Out[3]: 
resource                3
resource_type           1
direct_access           1
activity                1
ensemble_member       218
experiment             51
file_basename      312093
file_fullpath      615853
frequency               6
institute              25
mip_table              15
model                  53
modeling_realm          7
product                 3
temporal_subset      9121
variable              454
version               489
dtype: int64

In [4]: col.unique(columns=['frequency', 'modeling_realm'])                                                                                        
Out[4]: 
{'frequency': {'count': 6, 'values': ['mon', 'day', '6hr', 'yr', '3hr', 'fx']},
 'modeling_realm': {'count': 7,
  'values': ['atmos',
   'land',
   'ocean',
   'seaIce',
   'ocnBgchem',
   'landIce',
   'aerosol']}}

collection.df.activity_id.unique() can be replaced with collection.unique(columns='activity_id').
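For reference, the gist of the two methods can be sketched like this (not the actual #128 implementation):

    import pandas as pd
    import xarray as xr

    def nunique(ds: xr.Dataset) -> pd.Series:
        # Distinct-value counts per column, like DataFrame.nunique()
        return pd.Series(
            {name: len(pd.unique(da.values)) for name, da in ds.data_vars.items()}
        )

    def unique(ds: xr.Dataset, columns) -> dict:
        # Unique values per requested column, mirroring the output shown above
        if isinstance(columns, str):
            columns = [columns]
        out = {}
        for col in columns:
            values = pd.unique(ds[col].values).tolist()
            out[col] = {'count': len(values), 'values': values}
        return out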

naomi-henderson commented 4 years ago

@andersy005 fantastic! That will be very convenient, thank you

andersy005 commented 4 years ago

You are welcome! If you have ideas for other useful utility functions/methods, let me know.