Discussion of directory structure and catalog options

rabernat commented 4 years ago

I'm opening this issue to follow up on the discussion we had at today's meeting. It would be great to align on conventions regarding how these cloud data are organized and cataloged. This will allow users to move freely between the different stores with minimal friction.

@RuthPetrie mentioned that CEDA was trying to figure out the optimal directory structure for their data. Our directory structure was documented by @charlesbluca and @naomi-henderson here: https://pangeo-data.github.io/pangeo-cmip6-cloud/overview.html#directory-structure

I also made the point that I think it's better if we not force users to rely on assumptions about directory structure in their analysis code. It's better to think of the directory structure as ephemeral and changeable. The reasons are:

We don't want to get locked in to a specific storage location. We want to have the flexibility to move data (across buckets, clouds, etc.) in the future.
If we rely just on directories, in order to discover what data is actually available, users will need to either
1. list the bucket (expensive, slow, might be impossible without credentials)
2. have a bunch of try / fail logic to deal with data they expect to be there but is missing

Instead, I advocated for having all data requests go through a catalog. This doesn't have to be heavy-handed or complex. At its core, a catalog is a mapping between a dataset ID and a storage path. Working with NCAR and @andersy005, we defined the ESM collection spec: https://github.com/NCAR/esm-collection-spec/blob/master/collection-spec/collection-spec.md This is how all of our cloud data is currently cataloged. The ESM collection spec uses a very simple CSV file that anyone can open and parse.
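
As a minimal sketch of the "dataset ID to storage path" idea, the snippet below parses a small CSV and looks a dataset up by its attributes rather than by a guessed path. The column names, the `zstore` column, and the `gs://example-bucket/...` paths are illustrative assumptions, not the contents of any real catalog; in the ESM collection spec the actual columns are declared by the collection's descriptor.

```python
import csv
import io

# Hypothetical catalog excerpt: column names and paths are assumptions
# made up for illustration, not taken from a real catalog.
CATALOG_CSV = """activity_id,source_id,experiment_id,member_id,table_id,variable_id,grid_label,zstore
CMIP,EXAMPLE-MODEL,historical,r1i1p1f1,Amon,tas,gn,gs://example-bucket/a/b/c/
CMIP,EXAMPLE-MODEL,historical,r1i1p1f1,Amon,pr,gn,gs://example-bucket/x/y/z/
"""


def load_catalog(text):
    """Return the catalog rows as a list of dicts, one per dataset."""
    return list(csv.DictReader(io.StringIO(text)))


def find_stores(rows, **facets):
    """Return the storage paths of all rows matching the given column values."""
    return [
        row["zstore"]
        for row in rows
        if all(row.get(key) == value for key, value in facets.items())
    ]


rows = load_catalog(CATALOG_CSV)
stores = find_stores(rows, variable_id="tas", experiment_id="historical")
missing = find_stores(rows, variable_id="does-not-exist")
```

A lookup that matches nothing simply returns an empty list, so the user learns what is available without listing the bucket or probing paths. If the data move, only the `zstore` column changes and analysis code stays the same.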

Work is underway to align with STAC (see https://github.com/NCAR/esm-collection-spec/issues/21), although this has stalled a bit due to lack of effort. We should definitely try to revive this as I believe strongly that STAC is the future for cloud data catalogs.

Whatever we choose, it's very important that we align going forward.

cc @pangeo-forge/cmip6

agstephens commented 4 years ago

@rabernat, @RuthPetrie, @naomi-henderson, sorry I couldn't make the meeting on Friday. We have decided to split our buckets and objects like this:

Advice on Caringo naming was looked at, and we moved to a bucket/object split, where we map the DRS to it directly except that the fourth "." is a "/". E.g.:

http://cmip6-zarr-o.s3.jc.rl.ac.uk/CMIP6.AerChemMIP.NIMS-KMA.UKESM1-0-LL/hist-piNTCF.r3i1p1f2.Amon.evspsbl.gn.v20200224.zarr

Further discussion is here: https://github.com/cedadev/cmip6-object-store/issues/1
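
The mapping described above can be sketched in a few lines. This is an illustration written from the example URL in this comment, not the code in the cmip6-object-store repository; the function name `drs_to_url` and its `split_at` parameter are hypothetical.

```python
# Endpoint copied from the example URL above.
BASE = "http://cmip6-zarr-o.s3.jc.rl.ac.uk"


def drs_to_url(dataset_id, base=BASE, split_at=4):
    """Map a DRS dataset identifier to a bucket/object URL.

    The first `split_at` dot-separated components form the bucket name;
    the remaining components form the object name, with ".zarr" appended.
    """
    parts = dataset_id.split(".")
    if len(parts) <= split_at:
        raise ValueError(
            f"expected more than {split_at} components: {dataset_id!r}"
        )
    bucket = ".".join(parts[:split_at])
    obj = ".".join(parts[split_at:]) + ".zarr"
    return f"{base}/{bucket}/{obj}"


url = drs_to_url(
    "CMIP6.AerChemMIP.NIMS-KMA.UKESM1-0-LL."
    "hist-piNTCF.r3i1p1f2.Amon.evspsbl.gn.v20200224"
)
```

For the identifier shown, this reproduces the example URL above. Note that the function only constructs a URL; it says nothing about whether the object actually exists.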

rabernat commented 4 years ago

Thanks @agstephens!

As long as you maintain a compatible catalog format (ESM collection spec / its STAC evolution), you should be able to name the files however you want!

agstephens commented 4 years ago

Hi @rabernat, Indeed any intake/stac/opensearch catalog would provide the mapping. However, we also quite like the idea of users being able to build the URL intuitively from the DRS dataset identifier, e.g.:

https://github.com/cedadev/cmip6-object-store/blob/master/cmip6_object_store/cmip6_zarr/utils.py#L76

rabernat commented 4 years ago

However, we also quite like the idea of users being able to build the URL intuitively from the DRS dataset identifier, e.g.

If you go this route, I would love to hear your response to my points 1 and 2 above? How do you mitigate those issues?

I just looked over https://github.com/cedadev/cmip6-object-store in detail, the catalog module in particular. What is the format for this catalog? Is it ESM collection spec? I would definitely discourage you from defining a new bespoke catalog format / API. Instead, I encourage you to consider trying to align with the efforts around standardization of catalogs and python tools for accessing them. Yes, this is slower and more work at the outset. But I feel that the benefits in the long term are immense, and the downsides of a fragmented approach are severe. Perhaps I am misinterpreting and you are in fact using a standard catalog format...I couldn't tell immediately from the code.

We would be happy to work with you and help migrate your catalog to a standard format such as ESM collection spec, or, even better, STAC with the new esm-collection extension (see https://github.com/NCAR/esm-collection-spec/pull/30).

That said, beyond the catalog question, I see a lot of very useful things in your repo that I feel would be helpful to our broader efforts. For example, your ZarrWriter class (https://github.com/cedadev/cmip6-object-store/blob/master/cmip6_object_store/cmip6_zarr/zarr_writer.py) has a nice hook for updating the catalog after the data are successfully written.

@naomi-henderson, @dgergel, we should look through their repo closely together and see which aspects we might want to incorporate into our pipeline. Ideally we can move towards using a shared codebase.

rabernat commented 4 years ago

@charlesbluca and I met today and discussed this.

It seems like there are two different approaches being used for cataloging:

We at LDEO use a CSV file to enumerate all the different datasets in the bucket
@agstephens and @aradhakrishnanGFDL are relying on paths and a path convention to determine what data is in the bucket

It seems like we can have both.

I would propose that we develop a tool to index a bucket, generating a CSV, STAC catalog, whatever based on the contents that it finds. Essentially, we would use the bucket itself as our database. For this to work, we need two things:

All the data in a particular bucket (or sub-path) is considered "good" (to be included in the catalog), or else we need some flag for bad / non-QC'd data
We need a clear mapping from paths in the cloud to the CMIP6 controlled vocabulary (e.g. INSTITUTION_ID, ACTIVITY_ID, etc.)
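
To make the second requirement concrete, here is a rough sketch of a path-to-vocabulary mapping for one possible layout (the bucket/object style shown earlier in this thread). The facet order follows the CMIP6 DRS dataset identifier as I understand it; the function name `path_to_facets` is hypothetical, and other buckets would need their own rule.

```python
# Facet names in CMIP6 DRS dataset-identifier order (an assumption of this sketch).
FACETS = (
    "mip_era", "activity_id", "institution_id", "source_id",
    "experiment_id", "member_id", "table_id", "variable_id",
    "grid_label", "version",
)


def path_to_facets(path, suffix=".zarr"):
    """Split a 'bucket/object.zarr' style path into a dict of CMIP6 facets.

    Returns None when the path does not match the expected layout, so an
    indexer can skip (or flag) objects that it cannot interpret.
    """
    if not path.endswith(suffix):
        return None
    stem = path[: len(path) - len(suffix)].strip("/")
    # The "/" between bucket and object stands in for one "." of the DRS id.
    values = stem.replace("/", ".").split(".")
    if len(values) != len(FACETS):
        return None
    return dict(zip(FACETS, values))


facets = path_to_facets(
    "CMIP6.AerChemMIP.NIMS-KMA.UKESM1-0-LL/"
    "hist-piNTCF.r3i1p1f2.Amon.evspsbl.gn.v20200224.zarr"
)
unparsed = path_to_facets("README.txt")
```

An indexing tool could apply such a function to every key it lists and write the resulting dicts out as CSV rows, which is also where a flag for non-QC'd data could be attached.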

A good path forward may be to try to collaborate on such a tool.

If we had this tool, we could use it for many things:

Automatically building STAC catalogs, bigquery indexes, etc
Keeping different repos in sync (e.g. cp from GCS to S3)
Staging data from GFDL netCDF to Zarr

Do folks think this is a good idea?

naomi-henderson commented 4 years ago

This sounds good to me. I would propose the following modification for the first need:

All the data in a particular bucket (or sub-path) is considered "as-is" and included in the catalog. The Quality Control is left to pre-processing - which can be responsible for eliminating or fixing bad datasets.

This is particularly important once we start further automating the collection/concatenation procedure, since checking the datasets against the official CMIP6 errata pages is very quirky and difficult to automate.

charlesbluca commented 4 years ago

I think a tool of this nature would go a long way in helping automate the various tasks related to copying/syncing/cataloging data. One thing to consider if using an index like this for synchronization purposes is that it should include either modification time or checksum of files, which would have some implications on the overall runtime of an indexing operation.
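
A small sketch of why the checksum (or modification time) matters for synchronization: with only object names, a changed object looks identical in both indexes. The function name `plan_sync` and the listings below are made up for illustration; a real index would come from one of the listing tools.

```python
def plan_sync(source_index, target_index):
    """Compare two {object_key: checksum} indexes.

    Returns (to_copy, to_delete): keys that are missing from the target or
    whose checksum differs, and keys that exist only in the target.
    """
    to_copy = sorted(
        key
        for key, checksum in source_index.items()
        if key not in target_index or target_index[key] != checksum
    )
    to_delete = sorted(key for key in target_index if key not in source_index)
    return to_copy, to_delete


# Hypothetical listings: keys and checksums are invented for this example.
gcs = {"a.zarr/.zmetadata": "111", "b.zarr/.zmetadata": "222", "c.zarr/.zmetadata": "333"}
s3 = {"a.zarr/.zmetadata": "111", "b.zarr/.zmetadata": "999", "d.zarr/.zmetadata": "444"}

to_copy, to_delete = plan_sync(gcs, s3)
```

Here `b.zarr/.zmetadata` is present on both sides but is still scheduled for copying, which a name-only index would miss.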

Some tools we discussed that could perform the basic indexing task:

Google Cloud/AWS CLI commands
S3P (not sure if this can be used for Google Cloud buckets)
rclone

I'm interested in seeing how the performance of these tools stacks up, so I think I'm going to put together some basic cron jobs to see how they compare when listing s3:cmip6-pds.

aradhakrishnanGFDL commented 4 years ago

Hi @rabernat and @charlesbluca. Nice, my line of thought is similar to the proposed idea. I support the idea of a generalized tool to build catalogs. If I am understanding it correctly, the following work I've been doing with an intern (who will also be presenting some of this at AGU 2020) may be relevant?

https://github.com/aradhakrishnanGFDL/CatalogBuilder/

The catalog builder is a pretty simple package for generating intake-esm CSVs, both on our local spooky dark repository (UDA) and in S3. (One example is the CSV we provided for the diff-ing; I believe I was informed that the nc-file granularity in the CSV was necessary to get intake-esm to work on netCDF, as compared to Zarr stores.) A valid CMIP DRS structure, or at least a CMIP-compliant file name (for netCDF), is the current assumption for the catalog builder. We are at the preliminary stages of testing it.

After hearing about STAC and the S3 bucket organization Ag pointed to, I have been thinking how to make such a catalog builder more generalized and extensible, though we're currently sticking to intake-esm. The "clear mapping from paths in the cloud to the CMIP6 CV" as you point out will be key.

agstephens commented 3 years ago

@rabernat: sorry for the delay in responding to you...

However, we also quite like the idea of users being able to build the URL intuitively from the DRS dataset identifier, e.g.

If you go this route, I would love to hear your response to my points 1 and 2 above? How do you mitigate those issues?

Having reviewed your points 1 and 2 - I think that they stand up well. You are right that the user/client needs layers of exception handling in order for that to work well.
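
For illustration only, the kind of client-side exception handling this implies might look like the sketch below. The `open_store` argument is a placeholder for whatever client actually opens a store; the function name `open_first_available` and the example.org URLs are hypothetical.

```python
def open_first_available(candidate_urls, open_store):
    """Try each guessed URL in turn; return (url, store) for the first success.

    `open_store` is any callable that raises OSError (or a subclass such as
    FileNotFoundError or PermissionError) when a store cannot be opened.
    """
    errors = {}
    for url in candidate_urls:
        try:
            return url, open_store(url)
        except OSError as exc:
            errors[url] = exc
    raise LookupError(f"none of the guessed URLs could be opened: {list(errors)}")


# Stub opener standing in for a real client; only one URL "exists".
def fake_open(url):
    if url != "https://example.org/bucket/v2.zarr":
        raise FileNotFoundError(url)
    return {"url": url}


url, store = open_first_available(
    ["https://example.org/bucket/v3.zarr", "https://example.org/bucket/v2.zarr"],
    fake_open,
)
```

Every caller that builds URLs by convention needs some version of this loop, plus a decision about which candidates to guess, whereas a catalog lookup answers the same question in one step.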

Let me add a separate comment about my catalogue module and catalogues in general...

agstephens commented 3 years ago

@rabernat @aradhakrishnanGFDL @naomi-henderson @charlesbluca: here is some clarification and some thoughts about cataloguing our CMIP6 holdings:

My catalogue.py module (https://github.com/cedadev/cmip6-object-store/blob/master/cmip6_object_store/cmip6_zarr/catalogue.py):
- is not really a catalog in the sense of a user catalog.
- is just a pickled dictionary - I created it to safely log errors from 100s of parallel processes that are running on our batch cluster - and they need to put their success or failure results into a common location
- calling it catalogue has just caused confusion - it is not meant to be a user catalog at all :-)

So, what do I think a catalog should look like?
- ideally a canonical form that has an internal data model that can easily be presented as:
  - intake-esm
  - STAC (esm-collection)
  - OpenSearch
  - other....
- so maybe we want to have multiple input adaptors such as:
  - S3Reader
  - FileSystemReader
  - DictReader
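
One way to picture this is a small internal record type with readers on the input side and presenters on the output side. The sketch below is only an assumption about what such a design could look like: `DatasetRecord`, its two fields, and the presenter functions are hypothetical, and only a `DictReader` adaptor is shown.

```python
import csv
import io
from dataclasses import asdict, dataclass
from typing import Iterator, Protocol


@dataclass(frozen=True)
class DatasetRecord:
    """Canonical internal record; the two fields are a deliberately minimal guess."""
    dataset_id: str
    path: str


class Reader(Protocol):
    """Any input adaptor only has to yield DatasetRecord objects."""
    def records(self) -> Iterator[DatasetRecord]: ...


class DictReader:
    """Input adaptor reading from an in-memory {dataset_id: path} mapping."""
    def __init__(self, mapping):
        self._mapping = mapping

    def records(self):
        for dataset_id, path in self._mapping.items():
            yield DatasetRecord(dataset_id, path)


def to_csv(reader: Reader) -> str:
    """Output presenter: render the records as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["dataset_id", "path"], lineterminator="\n"
    )
    writer.writeheader()
    for record in reader.records():
        writer.writerow(asdict(record))
    return buffer.getvalue()


def to_dicts(reader: Reader) -> list:
    """Output presenter: plain dicts, e.g. as input to a JSON-based catalog writer."""
    return [asdict(record) for record in reader.records()]


reader = DictReader({"CMIP6.a.b": "s3://example-bucket/a"})
csv_text = to_csv(reader)
items = to_dicts(reader)
```

An `S3Reader` or `FileSystemReader` would implement the same `records()` method, so presenters for intake-esm, STAC, or OpenSearch would not need to know where the records came from.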

I am very keen for us to conform to a common approach, and a common codebase.

What do you all think about the idea of defining a data model as a starting point? And if we did that, what is the most useful/intuitive format for storing the content (which might get big)?

@aradhakrishnanGFDL , maybe your https://github.com/aradhakrishnanGFDL/CatalogBuilder is a good starting point towards this.

And to those in the US - thanks for giving us renewed hope!

agstephens commented 3 years ago

@charlesbluca I can see that #9 makes the case for having a common canonical format. I like that idea.

philipkershaw commented 3 years ago

@rabernat Catching up with this thread. Firstly, in summary, it would be great to co-ordinate and collaborate on a profile for STAC:

https://github.com/NCAR/esm-collection-spec/issues/21

For ESGF, we want to move away from the custom API we have and adopt a standard which a community/communities can agree on. As @agstephens mentioned we have worked with OpenSearch and also a number of other standards in this area over the years. We have prototyped CMIP6 catalogues with OpenSearch and are using it in production with various Earth Observation datasets. I see anecdotally groups in this domain are moving away from OpenSearch to STAC. However, with all these things it takes time and thought to make sure we get as good a solution as we can.

On the separate point:

@agstephens and @aradhakrishnanGFDL are relying on paths and a path convention to determine what data is in the bucket

For what it's worth, I don't think Ag is necessarily advocating that. I think the point is that a meaningful path can have value to the user even if it is not used programmatically in potentially expensive operations walking the storage system (be it object store or anything else).

To some extent this is by the by if we have a good search API we can agree on :) So as I said above it would be great to work with those interested on something with STAC. We could discuss in the next regular CMIP6 cloud catch-up call.

pangeo-forge / cmip6-pipeline

Discussion of directory structure and catalog options #7