rabernat opened this issue 5 years ago
For the time being, we are using the following columns:
- ensemble
- experiment
- file_basename
- file_fullpath
- frequency
- institution
- model
- realm
- files_dirname
- variable
- version
There's an example notebook here demonstrating the structure of the dataframe used as a "database" of existing files.
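For reference, a minimal sketch of one row of such a catalog dataframe (all values are hypothetical):

```python
import pandas as pd

# A hypothetical one-row catalog; real catalogs are built by crawling the filesystem.
df = pd.DataFrame(
    [
        {
            "ensemble": "r1i1p1",
            "experiment": "historical",
            "file_basename": "ts_Amon_CESM1_historical_r1i1p1_185001-200512.nc",
            "file_fullpath": "/data/cmip5/ts_Amon_CESM1_historical_r1i1p1_185001-200512.nc",
            "frequency": "mon",
            "institution": "NCAR",
            "model": "CESM1",
            "realm": "atmos",
            "files_dirname": "/data/cmip5/",
            "variable": "ts",
            "version": "v20190101",
        }
    ]
)
```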
We want to make sure that the dataframes returned by both search modules are compatible with the aggregation functions. (Consistent column names, necessary fields, etc.)
@matt-long and I are interested in this discussion. It appears that the list of columns we are using is a superset of the columns used in esgf2xarray. We can make sure that our column names are an exact match of what is in esgf2xarray. For instance, the institution_id column corresponds to institution in our case.
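Aligning the two schemas could then be a simple column rename on one side; a sketch with the one confirmed pair (any further mappings would be guesses):

```python
# Map esgf2xarray-style column names onto intake-esm's names.
# Only institution_id -> institution is confirmed above.
column_map = {"institution_id": "institution"}
df = df.rename(columns=column_map)
```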
@rabernat, when you say "necessary fields", are you referring to required fields for merging and concatenating multiple files into one xarray dataset?
I was thrilled to see the functionality in https://github.com/pangeo-data/esgf2xarray/blob/master/esgf2zarr/aggregate.py. This is something that we wanted to do in intake-esm, but we didn't know exactly how we were going to do it until now.
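For concreteness, a minimal sketch (not the actual esgf2zarr code) of that kind of aggregation, assuming the catalog columns listed above: merge the variables within each ensemble member, then concatenate members along a new dimension.

```python
import pandas as pd
import xarray as xr

def aggregate(df: pd.DataFrame) -> xr.Dataset:
    """Combine all files in a catalog dataframe into one dataset."""
    member_dsets = []
    for _, group in df.groupby("ensemble"):
        # merge the different variables belonging to one ensemble member
        dsets = [xr.open_dataset(path, chunks={}) for path in group["file_fullpath"]]
        member_dsets.append(xr.merge(dsets))
    # stack the members along a new "ensemble" dimension
    return xr.concat(member_dsets, dim="ensemble")
```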
@matt-long, @rabernat, I am also interested in hearing your thoughts on how to consolidate efforts and avoid duplication of work between esgf2xarray and intake-esm.
First let me say that this code here is super experimental and mostly written for us to learn how things work and get some stuff done quickly. I hope this package can essentially disappear as the functionality it provides is absorbed into other packages.
It seems like we have a couple of common needs:
In esgf2xarray, we can do this without intake. Could you help me understand the role intake is playing here? If we already have a database of our files, and we know how to open them, why do we need intake any more? What value does it add? (This is not a rhetorical question; I really want to know.) To me, one shortcoming of intake for esgf-type data is that it doesn't support nested catalogs. However there are also many advantages.
> It appears that the list of columns we are using is a superset of the columns used in esgf2xarray. We can make sure that our column names are an exact match of what is in esgf2xarray. For instance, the institution_id column corresponds to institution in our case.
The columns that come out of the search routine here are just a direct translation of the ESGF search results. Maybe some of these are synonyms. What is important is to identify the set of columns needed to accurately catalog the objects we are interested in.
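One concrete way to pin down that set would be a shared validation step run on every search result before aggregation (the column set below is only a guess based on the list at the top of this thread):

```python
# Hypothetical guard: before handing a search result to the aggregation
# code, verify it carries the fields that code depends on.
REQUIRED_COLUMNS = {"model", "experiment", "ensemble", "variable", "file_fullpath"}

def validate_catalog(df):
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"catalog is missing required columns: {sorted(missing)}")
```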
> I was thrilled to see the functionality in https://github.com/pangeo-data/esgf2xarray/blob/master/esgf2zarr/aggregate.py. This is something that we wanted to do in intake-esm, but we didn't know exactly how we were going to do it until now.
My hope is that this sort of complex, nested aggregation can eventually be accomplished via NCML. This requires https://github.com/pydata/xarray/issues/2697 getting solved. @dopplershift is potentially interested in working on implementing that feature in xarray.
The more you can assume about the consistency / homogeneity of the files you're trying to combine, the faster you can make it go. But as @naomi-henderson has often pointed out to me, CMIP tends to destroy any such assumptions!
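As an illustration of that trade-off: when you can assume coordinates agree across files, xarray can skip most of the compatibility checking (the options below are standard open_mfdataset arguments; whether they are safe for a given set of CMIP files is exactly the question):

```python
import glob
import xarray as xr

files = sorted(glob.glob("/data/cmip/ts_Amon_*.nc"))  # hypothetical file set

# Fast path: trust that non-dimension coordinates agree and skip comparing them.
ds = xr.open_mfdataset(
    files,
    combine="nested",
    concat_dim="time",
    coords="minimal",    # don't load and compare extra coordinates
    data_vars="minimal",
    compat="override",   # take conflicting values from the first file
    parallel=True,
)
```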
In intake-esm, the files live on disk and are discovered by crawling the directory structure (are the files actually opened, or can you get everything you need from the filenames themselves?)
Everything you need is inferred from the directory structure, as described in the CMIP Data Reference Syntax. However, we discovered some anomalies in the directory structure of data hosted at NCAR, so we introduced some fixes to address these edge cases.
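Roughly, the inference amounts to mapping path components onto catalog fields; a sketch under an assumed field order (not intake-esm's actual crawler, which also handles the NCAR anomalies):

```python
import os

# Assumed CMIP5-style DRS ordering of directory levels; the real
# template and its edge-case handling are more involved.
DRS_FIELDS = ["institution", "model", "experiment", "frequency",
              "realm", "ensemble", "version", "variable"]

def parse_drs_path(path, root):
    """Infer catalog fields from a file's position in the DRS tree."""
    parts = os.path.relpath(path, root).split(os.sep)
    attrs = dict(zip(DRS_FIELDS, parts[:-1]))
    attrs["file_basename"] = os.path.basename(path)
    attrs["file_fullpath"] = path
    return attrs
```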
> Could you help me understand the role intake is playing here? If we already have a database of our files, and we know how to open them, why do we need intake any more? What value does it add?
When we started working on this, there was hope of getting NCAR's data managers involved and persuading them to handle the generation and maintenance of central data catalogs (listings of available data files) for us. The original plan was then to use intake-server. This server would take catalog files as input and make them available at some central point. Once this was available, users would just need to point an intake client at this server to discover the objects they are interested in.
Another advantage of an intake-server is that, as a user, you wouldn't have to worry about keeping the generated pandas dataframes (catalogs/file listings) up to date, since this would be done for you in the catalogs provided via intake-server by the data managers.
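A minimal sketch of that envisioned client-side workflow (the server URL is hypothetical):

```python
import intake

# Point an intake client at a centrally maintained intake-server;
# the catalogs are generated and kept up to date server-side.
cat = intake.open_catalog("intake://catalogs.example.org:5000")
print(list(cat))  # discover the available data sources
```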
Unfortunately, we never got an opportunity to sit down with the data managers and other interested parties. So I think it would be fair to say that intake is not really useful in the current context, in which every user has to generate their own catalogs prior to running any query against them to find what is available.
Something that came up during discussions with @matt-long is the idea of extending the search function in intake-esm to support remote data holdings, such that if you provide something like this:
```python
intake_esm.search(mip_era='CMIP6', activity_drs='CMIP', variable="ts",
                  table_id='Amon', institution_id='NASA-GISS', experiment_id='amip')
```
intake-esm would then do the following:
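One plausible implementation would translate the keyword arguments into a query against a remote index such as the public ESGF search API (the endpoint and parameters below follow that API; the helper name and defaults are assumptions):

```python
import pandas as pd
import requests

# Public ESGF search endpoint (any index node would work).
ESGF_SEARCH = "https://esgf-node.llnl.gov/esg-search/search"

def remote_search(**facets):
    """Hypothetical helper: run a facet query against ESGF and
    return the matching file records as a catalog dataframe."""
    params = {"type": "File", "format": "application/solr+json", "limit": 1000}
    params.update(facets)
    response = requests.get(ESGF_SEARCH, params=params)
    response.raise_for_status()
    docs = response.json()["response"]["docs"]
    return pd.DataFrame(docs)

# e.g. remote_search(mip_era="CMIP6", variable="ts", experiment_id="amip")
```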
- https://github.com/NCAR/intake-esm
- https://intake-esm.readthedocs.io/en/latest/index.html