@rabernat opened this issue 4 years ago
@rabernat, @RuthPetrie, @naomi-henderson, sorry I couldn't make the meeting on Friday. We have decided to split our buckets and objects like this:
Advice on Caringo naming was looked at, and we moved to a `<bucket>/<object>` split, where we map the DRS to it directly except that the fourth "." becomes a "/" (e.g. see the sketch below).
Further discussion is here: https://github.com/cedadev/cmip6-object-store/issues/1
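A minimal sketch of that mapping as described, assuming a standard dot-separated DRS dataset identifier; the facet values in the example are placeholders, not a real dataset:

```python
# Sketch of the described scheme: replace the fourth "." in the DRS dataset
# identifier with a "/", so the first four facets name the bucket and the
# remaining facets form the object path. Facet values here are placeholders.
def drs_to_object_path(drs_id: str) -> str:
    parts = drs_id.split(".")
    bucket = ".".join(parts[:4])
    key = ".".join(parts[4:])
    return f"{bucket}/{key}"

print(drs_to_object_path(
    "CMIP6.CMIP.MOHC.UKESM1-0-LL.historical.r1i1p1f2.Amon.tas.gn.v20190406"
))
# -> CMIP6.CMIP.MOHC.UKESM1-0-LL/historical.r1i1p1f2.Amon.tas.gn.v20190406
```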
Thanks @agstephens!
As long as you maintain a compatible catalog format (ESM collection spec / its STAC evolution), you should be able to name the files however you want!
Hi @rabernat. Indeed, any intake/STAC/OpenSearch catalog would provide the mapping. However, we also quite like the idea of users being able to build the URL intuitively from the DRS dataset identifier, e.g.:
https://github.com/cedadev/cmip6-object-store/blob/master/cmip6_object_store/cmip6_zarr/utils.py#L76
> However, we also quite like the idea of users being able to build the URL intuitively from the DRS dataset identifier, e.g.

If you go this route, I would love to hear your response to my points 1 and 2 above. How do you mitigate those issues?
I just looked over https://github.com/cedadev/cmip6-object-store in detail, the catalog module in particular. What is the format for this catalog? Is it ESM collection spec? I would definitely discourage you from defining a new bespoke catalog format / API. Instead, I encourage you to consider trying to align with the efforts around standardization of catalogs and python tools for accessing them. Yes, this is slower and more work at the outset. But I feel that the benefits in the long term are immense, and the downsides of a fragmented approach are severe. Perhaps I am misinterpreting and you are in fact using a standard catalog format...I couldn't tell immediately from the code.
We would be happy to work with you and help migrate your catalog to a standard format such as ESM collection spec, or, even better, STAC with the new esm-collection extension (see https://github.com/NCAR/esm-collection-spec/pull/30).
That said, beyond the catalog question, I see a lot of very useful things in your repo that I feel would be helpful to our broader efforts. For example, your ZarrWriter class (https://github.com/cedadev/cmip6-object-store/blob/master/cmip6_object_store/cmip6_zarr/zarr_writer.py) has a nice hook for updating the catalog after the data are successfully written.
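For those who haven't read the code, the general pattern (a paraphrase, not CEDA's actual implementation) is something like:

```python
# Paraphrased pattern, not CEDA's actual code: write the Zarr store, and only
# invoke the catalog-update hook once the write has succeeded.
import xarray as xr

def write_with_catalog_hook(ds: xr.Dataset, store_url: str, update_catalog):
    ds.to_zarr(store_url, mode="w", consolidated=True)
    # Reaching this line means to_zarr() did not raise, so the store is complete.
    update_catalog(store_url)
```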
@naomi-henderson, @dgergel, we should look through their repo closely together and see which aspects we might want to incorporate into our pipeline. Ideally we can move towards using a shared codebase.
@charlesbluca and I met today and discussed this.
It seems like there are two different approaches being used for cataloging:

1. relying on paths and a path convention to determine what data is in a bucket;
2. maintaining an explicit catalog (e.g. an ESM collection spec CSV) that maps dataset IDs to storage paths.
It seems like we can have both.
I would propose that we develop a tool to index a bucket, generating a CSV, STAC catalog, or whatever else based on the contents that it finds. Essentially, we would use the bucket itself as our database. For this to work, we need two things:

1. A clear mapping from paths in the cloud to the CMIP6 controlled vocabulary (INSTITUTION_ID, ACTIVITY_ID, etc.).
2. An efficient way to list the contents of a bucket.

A good path forward may be to try to collaborate on such a tool; see the sketch below.
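To make the idea concrete, here is a rough sketch of what such an indexer might look like, assuming an anonymously readable S3 bucket laid out as `<bucket>/CMIP6/<activity_id>/.../<version>`; the layout and column names are assumptions, not an agreed design:

```python
# Rough sketch of the proposed bucket indexer: walk the bucket and emit a CSV.
# Assumes a <bucket>/CMIP6/<9 DRS facets> layout; adapt the glob depth to the
# actual structure of the store being indexed.
import csv
import s3fs

FACETS = ["activity_id", "institution_id", "source_id", "experiment_id",
          "member_id", "table_id", "variable_id", "grid_label", "version"]

def index_bucket(bucket: str, out_csv: str) -> None:
    fs = s3fs.S3FileSystem(anon=True)
    with open(out_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(FACETS + ["zstore"])
        for path in fs.glob(f"{bucket}/CMIP6/" + "/".join(["*"] * len(FACETS))):
            parts = path.split("/")[2:]  # drop "<bucket>/CMIP6"
            if len(parts) == len(FACETS):
                writer.writerow(parts + [f"s3://{path}/"])

index_bucket("cmip6-pds", "cmip6_index.csv")
```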
If we had this tool, we could use it for many things.
Do folks think this is a good idea?
This sounds good to me. I would propose one modification to the first need: the mapping should also capture dataset version and errata information. This is particularly important once we start further automating the collection/concatenation procedure, since checking the datasets against the official CMIP6 errata pages is very quirky and difficult to automate.
I think a tool of this nature would go a long way in helping automate the various tasks related to copying/syncing/cataloging data. One thing to consider, if using an index like this for synchronization purposes, is that it should include either the modification time or a checksum for each file, which would have implications for the overall runtime of an indexing operation.
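For example (a sketch, assuming `s3fs`; the object key below is hypothetical, and note that S3 ETags are only true content checksums for non-multipart uploads):

```python
# Sketch: fetching modification time and an ETag for an object so the index
# can support sync/diff decisions. The key below is a hypothetical example.
import s3fs

fs = s3fs.S3FileSystem(anon=True)
info = fs.info("cmip6-pds/CMIP6/some/path/to/an/object")
mtime = info.get("LastModified")  # datetime of last modification
etag = info.get("ETag")           # checksum-like; unreliable for multipart uploads
```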
We also discussed some tools that could perform the basic indexing task. I'm interested in seeing how the performance of these tools stacks up, so I think I'm going to put together some basic cron jobs to see how they compare when listing `s3://cmip6-pds`.
Hi @rabernat and @charlesbluca. Nice! My line of thinking is similar to the proposed idea, and I support the idea of a generalized tool to build catalogs. If I am understanding it correctly, the following work I've been doing with an intern (who will also be presenting some of this at AGU 2020) may be relevant:
https://github.com/aradhakrishnanGFDL/CatalogBuilder/
The catalog builder is a pretty simple package for generating intake-esm CSVs (such as the CSV we provided for the diffing; I believe the per-netCDF-file granularity in the CSV was necessary to get intake-esm to work with netCDF, as compared to Zarr stores), both on our local "spooky dark repository" (UDA) and in S3. A valid CMIP DRS structure, or at least a CMIP-compliant file name (for netCDF), is the current assumption for the catalog builder. We are at the preliminary stages of testing it.
After hearing about STAC and the S3 bucket organization Ag pointed to, I have been thinking about how to make such a catalog builder more generalized and extensible, though we're currently sticking to intake-esm. The "clear mapping from paths in the cloud to the CMIP6 CV" that you point out will be key.
@rabernat: sorry for the delay in responding to you...
> However, we also quite like the idea of users being able to build the URL intuitively from the DRS dataset identifier, e.g.
>
> If you go this route, I would love to hear your response to my points 1 and 2 above. How do you mitigate those issues?
Having reviewed your points 1 and 2, I think they stand up well. You are right that the user/client needs layers of exception handling in order for that approach to work well.
Let me add a separate comment about my `catalogue` module and catalogues in general...
@rabernat @aradhakrishnanGFDL @naomi-henderson @charlesbluca: here is some clarification and some thoughts about cataloguing our CMIP6 holdings:
My `catalogue.py` module (https://github.com/cedadev/cmip6-object-store/blob/master/cmip6_object_store/cmip6_zarr/catalogue.py): the name `catalogue` has just caused confusion. It is not meant to be a user catalog at all :-)

So, what do I think a catalog should look like? Something along the lines of `intake-esm`.
I am very keen for us to conform to a common approach, and a common codebase.
What do you all think about the idea of defining a data model as a starting point? And if we did that, what is the most useful/intuitive format for storing the content (which might get big)?
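As a straw man for that discussion (the field names below are assumptions, not an agreed schema), the record type might look like:

```python
# Straw-man data model: one record per dataset, holding the DRS facets plus
# storage location and sync metadata. Field names are assumptions only.
from dataclasses import dataclass

@dataclass
class DatasetRecord:
    activity_id: str
    institution_id: str
    source_id: str
    experiment_id: str
    member_id: str
    table_id: str
    variable_id: str
    grid_label: str
    version: str
    store_url: str           # e.g. an s3:// or gs:// URL to the Zarr store
    last_modified: str = ""  # optional, for sync/diff between stores
    checksum: str = ""
```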
@aradhakrishnanGFDL, maybe your https://github.com/aradhakrishnanGFDL/CatalogBuilder is a good starting point towards this.
And to those in the US - thanks for giving us renewed hope!
@charlesbluca I can see that #9 makes the case for having a common canonical format. I like that idea.
@rabernat Catching up with this thread. Firstly, in summary: it would be great to coordinate and collaborate on a profile for STAC.
For ESGF, we want to move away from the custom API we have and adopt a standard that a community (or communities) can agree on. As @agstephens mentioned, we have worked with OpenSearch and a number of other standards in this area over the years. We have prototyped CMIP6 catalogues with OpenSearch and are using it in production with various Earth Observation datasets. Anecdotally, I see groups in this domain moving away from OpenSearch to STAC. However, with all these things it takes time and thought to make sure we get as good a solution as we can.
On the separate point:
> @agstephens and @aradhakrishnanGFDL are relying on paths and a path convention to determine what data is in the bucket
For what it's worth, I don't think Ag is necessarily advocating that. I think the point is that a meaningful path can have value to the user even if it is not used programmatically in potentially expensive operations walking the storage system (be it an object store or anything else).
To some extent this is by the by if we have a good search API that we can agree on :) So, as I said above, it would be great to work with those interested on something with STAC. We could discuss it in the next regular CMIP6 cloud catch-up call.
I'm opening this issue to follow up on the discussion we had at today's meeting. It would be great to align on conventions regarding how these cloud data are organized and cataloged. This will allow users to move freely between the different stores with minimal friction.
@RuthPetrie mentioned that CEDA was trying to figure out the optimal directory structure for their data. Our directory structure was documented by @charlesbluca and @naomi-henderson here: https://pangeo-data.github.io/pangeo-cmip6-cloud/overview.html#directory-structure
I also made the point that I think it's better if we don't force users to rely on assumptions about directory structure in their analysis code. It's better to think of the directory structure as ephemeral and changeable. The reasons are:
Instead, I advocated for having all data requests go through a catalog. This doesn't have to be heavy-handed or complex. At its core, a catalog is a mapping between a dataset ID and a storage path. Working with NCAR and @andersy005, we defined the ESM collection spec: https://github.com/NCAR/esm-collection-spec/blob/master/collection-spec/collection-spec.md. This is how all of our cloud data is currently cataloged. The ESM collection spec uses a very simple CSV file that anyone can open and parse.
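For reference, reading one of these catalogs takes only a few lines with intake-esm; the catalog URL below is the public Pangeo CMIP6 collection, and the search facet values are just an example:

```python
# Example of consuming an ESM collection spec catalog with intake-esm.
import intake

cat = intake.open_esm_datastore(
    "https://storage.googleapis.com/cmip6/pangeo-cmip6.json"
)
subset = cat.search(experiment_id="historical", table_id="Amon", variable_id="tas")
print(subset.df[["source_id", "member_id", "zstore"]].head())
```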
Work is underway to align with STAC (see https://github.com/NCAR/esm-collection-spec/issues/21), although this has stalled a bit due to lack of effort. We should definitely try to revive this as I believe strongly that STAC is the future for cloud data catalogs.
Whatever we choose, it's very important that we align going forward.
cc @pangeo-forge/cmip6