rabernat opened this issue 5 years ago
For the time being, we are using the following columns:
- ensemble
- experiment
- file_basename
- file_fullpath
- frequency
- institution
- model
- realm
- files_dirname
- variable
- version
There's an example notebook here demonstrating the structure of the dataframe used as a "database" of existing files.
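For reference, a minimal sketch of one row of such a catalog dataframe (all values are hypothetical):

```python
import pandas as pd

# A hypothetical one-row catalog; real catalogs are built by crawling the filesystem.
df = pd.DataFrame(
    [
        {
            "ensemble": "r1i1p1",
            "experiment": "historical",
            "file_basename": "ts_Amon_CESM1_historical_r1i1p1_185001-200512.nc",
            "file_fullpath": "/data/cmip5/ts_Amon_CESM1_historical_r1i1p1_185001-200512.nc",
            "frequency": "mon",
            "institution": "NCAR",
            "model": "CESM1",
            "realm": "atmos",
            "files_dirname": "/data/cmip5/",
            "variable": "ts",
            "version": "v20190101",
        }
    ]
)
```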
We want to make sure that the dataframes returned by both search modules are compatible with the aggregation functions. (Consistent column names, necessary fields, etc.)
@matt-long and I are interested in this discussion. It appears that the list of columns we are using is a superset of the columns used in esgf2xarray. We can make sure that our column names are an exact match of what is in esgf2xarray. For instance, the institution_id column corresponds to institution in our case.
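Aligning the two schemas could then be a simple column rename on one side; a sketch with the one confirmed pair (any further mappings would be guesses):

```python
# Map esgf2xarray-style column names onto intake-esm's names.
# Only institution_id -> institution is confirmed above.
column_map = {"institution_id": "institution"}
df = df.rename(columns=column_map)
```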
@rabernat, when you say "necessary fields", are you referring to required fields for merging and concatenating multiple files into one xarray dataset?
I was thrilled to see the functionality in https://github.com/pangeo-data/esgf2xarray/blob/master/esgf2zarr/aggregate.py. This is something that we wanted to do in intake-esm, but we didn't know exactly how we were going to do it until now.
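For concreteness, a minimal sketch (not the actual esgf2zarr code) of that kind of aggregation, assuming the catalog columns listed above: merge the variables within each ensemble member, then concatenate members along a new dimension.

```python
import pandas as pd
import xarray as xr

def aggregate(df: pd.DataFrame) -> xr.Dataset:
    """Combine all files in a catalog dataframe into one dataset."""
    member_dsets = []
    for _, group in df.groupby("ensemble"):
        # merge the different variables belonging to one ensemble member
        dsets = [xr.open_dataset(path, chunks={}) for path in group["file_fullpath"]]
        member_dsets.append(xr.merge(dsets))
    # stack the members along a new "ensemble" dimension
    return xr.concat(member_dsets, dim="ensemble")
```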
@matt-long, @rabernat, I am also interested in hearing your thoughts on how to consolidate efforts and avoid duplication of work between esgf2xarray and intake-esm.
First let me say that this code here is super experimental and mostly written for us to learn how things work and get some stuff done quickly. I hope this package can essentially disappear as the functionality it provides is absorbed into other packages.
It seems like we have a couple of common needs:
In esgf2xarray, we can do this without intake. Could you help me understand the role intake is playing here? If we already have a database of our files, and we know how to open them, why do we need intake any more? What value does it add? (This is not a rhetorical question; I really want to know.) To me, one shortcoming of intake for esgf-type data is that it doesn't support nested catalogs. However there are also many advantages.
> It appears that the list of columns we are using is a superset of the columns used in esgf2xarray. We can make sure that our column names are an exact match of what is in esgf2xarray. For instance, the institution_id column corresponds to institution in our case.
The columns that come out of the search routine here are just a direct translation of the ESGF search results. Maybe some of these are synonyms. What is important is to identify the set of columns needed to accurately catalog the objects we are interested in.
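One concrete way to pin down that set would be a shared validation step run on every search result before aggregation (the column set below is only a guess based on the list at the top of this thread):

```python
# Hypothetical guard: before handing a search result to the aggregation
# code, verify it carries the fields that code depends on.
REQUIRED_COLUMNS = {"model", "experiment", "ensemble", "variable", "file_fullpath"}

def validate_catalog(df):
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"catalog is missing required columns: {sorted(missing)}")
```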
> I was thrilled to see the functionality in https://github.com/pangeo-data/esgf2xarray/blob/master/esgf2zarr/aggregate.py. This is something that we wanted to do in intake-esm, but we didn't know exactly how we were going to do it until now.
My hope is that this sort of complex, nested aggregation can eventually be accomplished via NCML. This requires https://github.com/pydata/xarray/issues/2697 getting solved. @dopplershift is potentially interested in working on implementing that feature in xarray.
The more you can assume about the consistency / homogeneity of the files you're trying to combine, the faster you can make it go. But as @naomi-henderson has often pointed out to me, CMIP tends to destroy any such assumptions!
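As an illustration of that trade-off: when you can assume coordinates agree across files, xarray can skip most of the compatibility checking (the options below are standard open_mfdataset arguments; whether they are safe for a given set of CMIP files is exactly the question):

```python
import glob
import xarray as xr

files = sorted(glob.glob("/data/cmip/ts_Amon_*.nc"))  # hypothetical file set

# Fast path: trust that non-dimension coordinates agree and skip comparing them.
ds = xr.open_mfdataset(
    files,
    combine="nested",
    concat_dim="time",
    coords="minimal",    # don't load and compare extra coordinates
    data_vars="minimal",
    compat="override",   # take conflicting values from the first file
    parallel=True,
)
```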
In intake-esm, the files live on disk and are discovered by crawling the directory structure (are the files actually opened, or can you get everything you need from the filenames themselves?)
Everything you need is inferred from the directory structure, as described in the CMIP Data Reference Syntax. However, we discovered some anomalies in the directory structure of data hosted at NCAR, so we introduced some fixes to address these edge cases.
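Roughly, the inference amounts to mapping path components onto catalog fields; a sketch under an assumed field order (not intake-esm's actual crawler, which also handles the NCAR anomalies):

```python
import os

# Assumed CMIP5-style DRS ordering of directory levels; the real
# template and its edge-case handling are more involved.
DRS_FIELDS = ["institution", "model", "experiment", "frequency",
              "realm", "ensemble", "version", "variable"]

def parse_drs_path(path, root):
    """Infer catalog fields from a file's position in the DRS tree."""
    parts = os.path.relpath(path, root).split(os.sep)
    attrs = dict(zip(DRS_FIELDS, parts[:-1]))
    attrs["file_basename"] = os.path.basename(path)
    attrs["file_fullpath"] = path
    return attrs
```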
> Could you help me understand the role intake is playing here? If we already have a database of our files, and we know how to open them, why do we need intake any more? What value does it add?
When we started working on this, there was hope of getting NCAR's data managers involved and persuading them to handle the generation and maintenance of central data catalogs (listings of available data files) for us. The original plan was then to use intake-server. This server would take catalog files as input and make them available at some central point. Once this was available, users would just need to point an intake client at this server to discover the objects they are interested in.
Another advantage of an intake-server is that, as a user, you wouldn't have to worry about keeping the generated pandas dataframes (catalogs/file listings) up to date, since this would be done for you in the catalogs provided via intake-server by the data managers.
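A minimal sketch of that envisioned client-side workflow (the server URL is hypothetical):

```python
import intake

# Point an intake client at a centrally maintained intake-server;
# the catalogs are generated and kept up to date server-side.
cat = intake.open_catalog("intake://catalogs.example.org:5000")
print(list(cat))  # discover the available data sources
```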
Unfortunately, we never got an opportunity to sit down with the data managers and other interested parties. So I think it would be fair to say that intake is not really useful in the current context, in which every user has to generate their own catalogs prior to running any query against them to find what is available.
Something that came up during discussions with @matt-long is the idea of extending the search function in intake-esm to support remote data holdings, such that if you provide something like this:
```python
intake_esm.search(mip_era='CMIP6', activity_drs='CMIP', variable="ts",
                  table_id='Amon', institution_id='NASA-GISS', experiment_id='amip')
```
intake-esm would then do the following:
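One plausible implementation would translate the keyword arguments into a query against a remote index such as the public ESGF search API (the endpoint and parameters below follow that API; the helper name and defaults are assumptions):

```python
import pandas as pd
import requests

# Public ESGF search endpoint (any index node would work).
ESGF_SEARCH = "https://esgf-node.llnl.gov/esg-search/search"

def remote_search(**facets):
    """Hypothetical helper: run a facet query against ESGF and
    return the matching file records as a catalog dataframe."""
    params = {"type": "File", "format": "application/solr+json", "limit": 1000}
    params.update(facets)
    response = requests.get(ESGF_SEARCH, params=params)
    response.raise_for_status()
    docs = response.json()["response"]["docs"]
    return pd.DataFrame(docs)

# e.g. remote_search(mip_era="CMIP6", variable="ts", experiment_id="amip")
```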
- https://github.com/NCAR/intake-esm
- https://intake-esm.readthedocs.io/en/latest/index.html