pangeo-data / esgf2xarray

utilities for loading esgf archives as xarray datasets
Apache License 2.0

compare search functionality with intake-esm #1

Open rabernat opened 5 years ago

rabernat commented 5 years ago

https://github.com/NCAR/intake-esm https://intake-esm.readthedocs.io/en/latest/index.html

We want to make sure that the dataframes returned by both search modules are compatible with the aggregation functions. (Consistent column names, necessary fields, etc.)

andersy005 commented 5 years ago

For the time being, we are using the following columns

    - ensemble
    - experiment
    - file_basename
    - file_fullpath
    - frequency
    - institution
    - model
    - realm
    - files_dirname
    - variable
    - version

There's an example notebook here demonstrating the structure of the dataframe used as "database" of existing files.
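As a rough illustration (not from the notebook itself), such a file "database" might look like a pandas dataframe with one row per file, using the column names listed above; the values below are purely hypothetical:

```python
import pandas as pd

# Hypothetical sketch of the intake-esm file "database": one row per file,
# with the columns listed above (all values here are illustrative only).
catalog = pd.DataFrame(
    [
        {
            "ensemble": "r1i1p1",
            "experiment": "historical",
            "file_basename": "ts_Amon_CESM1_historical_r1i1p1_185001-200512.nc",
            "file_fullpath": "/data/ts_Amon_CESM1_historical_r1i1p1_185001-200512.nc",
            "frequency": "mon",
            "institution": "NCAR",
            "model": "CESM1",
            "realm": "atmos",
            "files_dirname": "/data",
            "variable": "ts",
            "version": "v20190101",
        }
    ]
)

# Searching the catalog is then plain pandas filtering.
subset = catalog[(catalog.variable == "ts") & (catalog.experiment == "historical")]
print(subset.file_basename.tolist())
```

The point of agreeing on column names is that downstream aggregation code can filter and group on these columns without caring which search module produced the dataframe.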

> We want to make sure that the dataframes returned by both search modules are compatible with the aggregation functions. (Consistent column names, necessary fields, etc.)

@matt-long and I are interested in this discussion. It appears that the list of columns we are using is a superset of the columns used in esgf2xarray. We can make sure that our column names exactly match what is in esgf2xarray. For instance, the `institution_id` column corresponds to `institution` in our case.
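Reconciling the two schemas could be as simple as a column-rename pass; a minimal sketch, assuming (as the thread suggests) that `institution_id` maps to `institution` and that other renames would be added as they are identified:

```python
import pandas as pd

# Hypothetical mapping from ESGF-style facet names to the intake-esm column
# names; only institution_id -> institution is mentioned in this thread,
# the rest of the frame is illustrative.
COLUMN_SYNONYMS = {"institution_id": "institution"}

df = pd.DataFrame({"institution_id": ["NASA-GISS"], "variable": ["ts"]})
df = df.rename(columns=COLUMN_SYNONYMS)
print(df.columns.tolist())  # ['institution', 'variable']
```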

@rabernat, when you say "necessary fields", are you referring to required fields for merging and concatenating multiple files into one xarray dataset?

I was thrilled to see the functionality in https://github.com/pangeo-data/esgf2xarray/blob/master/esgf2zarr/aggregate.py. This is something that we wanted to do in intake-esm but we didn't know exactly how we were going to do it until now.

andersy005 commented 5 years ago

@matt-long, @rabernat I am also interested to hear your thoughts on how to consolidate efforts and avoid duplication of work in both esgf2xarray and intake-esm?

rabernat commented 5 years ago

First let me say that this code here is super experimental and mostly written for us to learn how things work and get some stuff done quickly. I hope this package can essentially disappear as the functionality it provides is absorbed into other packages.

It seems like we have a couple of common needs:

In esgf2xarray, we can do this without intake. Could you help me understand the role intake is playing here? If we already have a database of our files, and we know how to open them, why do we need intake any more? What value does it add? (This is not a rhetorical question; I really want to know.) To me, one shortcoming of intake for esgf-type data is that it doesn't support nested catalogs. However there are also many advantages.

> It appears that the list of columns we are using is a superset of columns being used in esgf2xarray. We can make sure that our columns names are an exact match of what is in esgf2xarray. For instance institution_id column corresponds to institution in our case.

The columns that come out of the search routine here are just a direct translation of the ESGF search results. Maybe some of these are synonyms. What is important is to identify which set of columns is needed to accurately catalog the objects we are interested in.
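Once that required set is agreed on, either package could validate a search-result dataframe before handing it to the aggregation functions. A minimal sketch; the exact required set is precisely what this issue is trying to pin down, so the names below are assumptions:

```python
# Hypothetical minimal column set needed by the aggregation step
# (assumed for illustration; not settled in this thread).
REQUIRED = {"model", "experiment", "ensemble", "variable", "file_fullpath"}

def check_catalog(columns):
    """Raise if a search-result dataframe lacks columns the aggregation needs."""
    missing = REQUIRED - set(columns)
    if missing:
        raise ValueError(f"catalog is missing required columns: {sorted(missing)}")
    return True

print(check_catalog(["model", "experiment", "ensemble", "variable",
                     "file_fullpath", "version"]))  # True
```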

> I was thrilled to see the functionality in https://github.com/pangeo-data/esgf2xarray/blob/master/esgf2zarr/aggregate.py. This is something that we wanted to do in intake-esm but we didn't know exactly how we were going to do it until now.

My hope is that this sort of complex, nested aggregation can eventually be accomplished via NCML. This requires https://github.com/pydata/xarray/issues/2697 getting solved. @dopplershift is potentially interested in working on implementing that feature in xarray.

The more you can assume about the consistency / homogeneity of the files you're trying to combine, the faster you can make it go. But as @naomi-henderson has often pointed out to me, CMIP tends to destroy any such assumptions!
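The speed/safety trade-off can be seen in `xarray.concat`'s own knobs: with homogeneity assumptions you can skip coordinate comparison entirely, while the default checks compatibility. A sketch with toy datasets (not esgf2xarray's actual aggregation code):

```python
import numpy as np
import xarray as xr

# Two toy "member files" that happen to be perfectly consistent.
ds1 = xr.Dataset({"ts": ("time", np.arange(3.0))}, coords={"time": [0, 1, 2]})
ds2 = xr.Dataset({"ts": ("time", np.arange(3.0))}, coords={"time": [3, 4, 5]})

# Fast, assumption-heavy concat: skip coordinate comparison and take the
# first object's values where dims overlap. Breaks silently if the files
# are not actually homogeneous -- which, per the thread, CMIP often isn't.
fast = xr.concat([ds1, ds2], dim="time",
                 join="override", coords="minimal", compat="override")

# Safe default: verifies coordinate compatibility (slower, but robust).
safe = xr.concat([ds1, ds2], dim="time")
print(fast.sizes["time"], safe.sizes["time"])
```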

andersy005 commented 5 years ago

> In intake-esm, the files live on disk and are discovered by crawling the directory structure (are the files actually opened, or can you get everything you need from the filenames themselves?)

Everything you need is inferred from the directory structure, as described in the CMIP Data Reference Syntax (DRS). However, we discovered some anomalies in the directory structure of the data hosted at NCAR, so we introduced some fixes to address these edge cases.
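As an illustration of the idea (not intake-esm's actual crawler), a DRS-conformant path can be split into facets without ever opening the file. The facet order below follows the CMIP5-style DRS; real archives deviate, which is exactly why the edge-case fixes were needed:

```python
import os

# Hypothetical CMIP5-style DRS facet order (assumed for illustration):
# activity/product/institution/model/experiment/frequency/realm/table/
# ensemble/version/variable
DRS_FACETS = ["activity", "product", "institution", "model", "experiment",
              "frequency", "realm", "table", "ensemble", "version", "variable"]

def parse_drs(path, root):
    """Infer catalog metadata from a file's position in a DRS tree."""
    parts = os.path.relpath(os.path.dirname(path), root).split(os.sep)
    entry = dict(zip(DRS_FACETS, parts))
    entry["file_basename"] = os.path.basename(path)
    entry["file_fullpath"] = path
    return entry

entry = parse_drs(
    "/root/cmip5/output1/NCAR/CESM1/historical/mon/atmos/Amon/r1i1p1/v1/ts/ts.nc",
    "/root",
)
print(entry["model"], entry["variable"])
```

A crawler would apply this to every file found under the archive root and stack the resulting dicts into the catalog dataframe.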

> Could you help me understand the role intake is playing here? If we already have a database of our files, and we know how to open them, why do we need intake any more? What value does it add?

Something that came up during discussions with @matt-long is the idea of extending the search function in intake-esm to support remote data holdings such that if you provide something like this:

```python
intake_esm.search(mip_era='CMIP6', activity_drs='CMIP', variable='ts',
                  table_id='Amon', institution_id='NASA-GISS', experiment_id='amip')
```

intake-esm would then do the following: