pangeo-forge / pangeo-forge-recipes

Python library for building Pangeo Forge recipes.
https://pangeo-forge.readthedocs.io/
Apache License 2.0

CMIP6 archive storage plan for zarr stores #156

Open dgergel opened 3 years ago

dgergel commented 3 years ago

At our last CMIP6-in-the-cloud collaboration meeting (myself, @naomi-henderson and @cisaacstern), we discussed the current situation for CMIP6 archiving in the cloud. For zarr stores, the CMIP6 archive has been on GCS and is currently being backed up to AWS. However, no one is using the AWS zarr stores (to our knowledge), as that version has not been publicized. As we test the CMIP6 pangeo-forge recipe, datasets are being put onto OSN (a second location). Of course, the NetCDF CMIP6 archive is now on AWS in collaboration with ESGF and GFDL. This is all getting a bit messy and we need to figure out a long-term storage option that is clear to the public and viable in terms of size for the CMIP6 zarr stores - will the archive be split between GCS and OSN? Will we switch from GCS to AWS?
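For context, both zarr mirrors are read the same way apart from the bucket. A minimal sketch (the store paths below are illustrative, not exact keys - check the catalogs for real ones):

```python
import fsspec
import xarray as xr

# Google Cloud mirror (illustrative store path)
ds_gcs = xr.open_zarr(
    fsspec.get_mapper(
        "gs://cmip6/CMIP6/CMIP/NCAR/CESM2/historical/r1i1p1f1/Amon/tas/gn/",
        token="anon",
    ),
    consolidated=True,
)

# AWS mirror (bucket name assumed; see the AWS open data registry for the real one)
ds_aws = xr.open_zarr(
    fsspec.get_mapper(
        "s3://cmip6-pds/CMIP6/CMIP/NCAR/CESM2/historical/r1i1p1f1/Amon/tas/gn/",
        anon=True,
    ),
    consolidated=True,
)
```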

cc @naomi-henderson @cisaacstern @rabernat @aradhakrishnanGFDL @agstephens

rabernat commented 3 years ago

Thanks for sharing this concern Diana. I really value your input. I'm not sure I see things the same way, so I'll share my viewpoint very verbosely and ask some clarification questions about yours.

My view is that all the different cloud CMIP6 datasets (GCS Zarr, AWS Zarr, and AWS NetCDF) are valuable and have a distinct audience. Specifically, they are useful to people / companies who, for whatever reason, already do computing in that cloud. CEDA's data will be coming online soon, and hopefully we will also one day have CMIP6 in Azure! We should encourage as many mirrors as possible of the CMIP6 data, provided we have a reliable way to keep them in sync / up to date. That way, users can easily merge CMIP6 with whatever other business / research they have going on in that cloud.

How do you know that no one is using the AWS Zarr stores? That is surprising information to me. I have not seen any logs from either AWS or GCP. The main documentation site for the Pangeo CMIP6 Zarr project - https://pangeo-data.github.io/pangeo-cmip6-cloud/ - clearly describes both GCP and AWS datasets on equal terms, and points to up-to-date catalogs for both. Presumably the cloud providers, who are footing the bill for these public datasets, have access to that information. AWS publicized both the NetCDF and Zarr data via a blog post. The official AWS site for the data - https://registry.opendata.aws/cmip6/ - includes Zarr examples.
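For anyone following along, both catalogs can be queried programmatically with intake-esm. A minimal sketch (catalog URLs as documented on the pangeo-cmip6-cloud site - verify there in case they have moved):

```python
import intake  # with intake-esm installed

# One ESM collection catalog per cloud
cat_gcs = intake.open_esm_datastore("https://storage.googleapis.com/cmip6/pangeo-cmip6.json")
cat_aws = intake.open_esm_datastore("https://cmip6-pds.s3.amazonaws.com/pangeo-cmip6.json")

# The same query works against either mirror
subset = cat_gcs.search(source_id="CESM2", variable_id="tas", table_id="Amon")
print(subset.df.head())
```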

This is all getting a bit messy and we need to figure out a long-term storage option that is clear to the public and viable in terms of size for the CMIP6 zarr stores

I don't quite see what is "messy" about it. We have accurate catalogs of the data and a website (https://pangeo-data.github.io/pangeo-cmip6-cloud/) which tries to explain how it is organized. Could you provide a more specific description of the negative consequences of the current setup?

As for "a long-term storage option", my understanding of the current public dataset agreements is that the cloud providers are committed to long-term hosting of these datasets. As long as they are paying the bill, they ultimately make the decision about how much data to host for how long. But AWS recently committed to a large increase in the CMIP6 allocation. Are you suggesting we need to negotiate a more firm commitment from the cloud providers? What exactly would qualify as "long-term"? What size commitment is necessary to be "viable"? Of course, the true long-term steward of CMIP6 is ESGF, from whom we can theoretically always re-obtain the data. (Although I recognize it would be very hard / time consuming to replicate exactly what we have now if we lost all the Zarr data tomorrow.)

I do agree strongly that we could do a better job communicating to the public about how to discover and use the data. This is true across Pangeo Forge. Right now lots of effort is focused on building up Pangeo Forge, rather than outreach. Maybe we could work with the providers to do some more blog posts? (See my earlier blog post about CMIP6 in the cloud.) In my own opinion, the core problem here is cataloging: we desperately need to replace https://catalog.pangeo.io/ with a newer, more scalable catalog system, ideally based on STAC. Then we can catalog all our cloud data - CMIP6, the legacy catalog, the new pangeo forge datasets - in a uniform way. This is why I put so much energy into the STAC discussions.

Finally, about OSN. Here we have to really distinguish between public datasets like CMIP6, which are formally hosted by the cloud providers themselves (AWS, Google Cloud), and random datasets produced by Pangeo Forge that could come from anywhere. OSN is an option we are using when WE (Pangeo / LDEO) have to foot the bill for the storage. We do not plan to put CMIP6 data in OSN. In general, OSN is not going to be quite as performant as S3 or GCS; however, it is basically free to us, because it is sponsored by NSF, and has no egress costs. Once it is mature, Pangeo Forge is going to be stashing data in many different clouds, including OSN, with the choice of whether to host a dataset ultimately made by whoever is paying the bill for the storage. As long as we have a good catalog, it's not a problem for data to live in different locations. We can also use Pangeo Forge to populate public dataset buckets.
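To make the OSN point concrete: because it speaks the S3 API, the only practical difference for users is a custom endpoint. A sketch, with the endpoint URL and bucket/key as pure placeholders:

```python
import fsspec
import xarray as xr

# OSN is S3-compatible; only the endpoint differs from AWS proper.
# Endpoint and path below are placeholders, not real locations.
fs = fsspec.filesystem(
    "s3",
    anon=True,
    client_kwargs={"endpoint_url": "https://example-pod.osn.example.org"},
)
ds = xr.open_zarr(fs.get_mapper("some-bucket/some-dataset.zarr"), consolidated=True)
```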

So in summary, I personally feel that most of the challenges around the current situation can be resolved through adoption of a uniform catalog standard across all our different cloud holdings and a pretty website for search / browse of those catalogs.

naomi-henderson commented 3 years ago

Yes, @rabernat ! I agree that the uniform catalog standard is needed to pull all of these efforts together.

One of my big concerns, which uniform catalogs would very much simplify, is supporting those who need additional datasets (e.g., datasets not deemed 'high priority' enough to keep in the cloud, or datasets recently published but not yet added to a cloud repo) to complete their analysis. The uniform standard would allow us to seamlessly combine both public cloud and local holdings for searching, comparing and identifying differences. Most of the CMIP6 datasets will never make it to the cloud (sub-hourly data, various special-purpose experiments, etc). Hopefully this effort will be so successful that the community will make it possible to put the whole of the 'CMIP7' archive in the cloud.

The pretty website could then be the go-to place for both browsing/searching and for helping understand the inter-relationship between the various public cloud collections.

naomi-henderson commented 3 years ago

To clarify @dgergel , as I recall, @cisaacstern would be using the Pangeo OSN only for testing the recipes for the datasets you have identified for him to try. Once we get this working, we can use the recipes (either in a bakery or standalone) to add datasets to the GC zarr store.
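For readers who haven't seen one, a recipe for this kind of dataset is just a few lines against the pangeo-forge-recipes API (a sketch - the source URLs here are placeholders, not real ESGF paths):

```python
from pangeo_forge_recipes.patterns import pattern_from_file_sequence
from pangeo_forge_recipes.recipes import XarrayZarrRecipe

# Placeholder source files; a real recipe would list actual netcdf URLs
urls = [
    f"https://example.org/cmip6/tas_Amon_CESM2_historical_{year}.nc"
    for year in range(1850, 2015)
]

# One file per year of monthly data, concatenated along time
pattern = pattern_from_file_sequence(urls, concat_dim="time", nitems_per_file=12)
recipe = XarrayZarrRecipe(pattern, target_chunks={"time": 120})
```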

Prior to the GFDL/AWS collection, I was also sensing a general confusion about sources of CMIP6 data, but mostly between the ESGF holdings, NCAR holdings, and GC/AWS holdings! I used to answer those questions by distinguishing between the dataset formats - netcdf vs zarr - but that distinction no longer holds since - a wonderful development, miracle of miracles! - GFDL has put some of the original netcdf files in AWS!

martindurant commented 3 years ago

The pretty website could then be the go-to place for both browsing/searching and for helping understand the inter-relationship between the various public cloud collections.

This sounds like a cataloguing task itself. Obviously I advocate for Intake, but I do think that, in general, a programmatic approach is best rather than having to browse a website. You should be able to do the same search/browse anywhere, so that you can directly pull what you need for analysis. Of course, that's not to say that you can't also have a pretty website interface to the same cataloguing system.
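Today's catalog is already usable that way. For example (assuming the master catalog YAML in the pangeo-data/pangeo-datastore repo, which backs catalog.pangeo.io - adjust the URL if it has moved):

```python
import intake

cat = intake.open_catalog(
    "https://raw.githubusercontent.com/pangeo-data/pangeo-datastore/master/intake-catalogs/master.yaml"
)

# Browse programmatically; each entry is a sub-catalog or data source
for name in cat:
    print(name)
```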

rabernat commented 3 years ago

Obviously I advocate for Intake

Martin, you always say this. But can you clarify your position a little bit? Are you really arguing that ESGF / NASA / the global geospatial community should be adopting intake as its catalog of record for billions of geospatial assets? Do you really think that intake is up to this task?

Surely you understand that:

in general, a programmatic approach is best rather than having to browse a website

No one ever said otherwise. Just saying that we ALSO need a website.

martindurant commented 3 years ago

I am talking about a place to explain the relationships between the different massive catalogues, as well as we are able to track them - not the sets of datasets themselves. The ability to view into other catalogue structures is unusual and very useful, and we are considering a system that is less archive, more fluid, so being able to change the descriptions on the fly is necessary. In this picture, Intake (or a similarly descriptive layer) is only a thin shim, and the great majority of the details are still in STAC format. Intake is up to this task.
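Concretely, the thin shim I have in mind is something like intake-stac sitting on top of the STAC documents (a sketch; the catalog URL is hypothetical):

```python
import intake  # with intake-stac installed

# A hypothetical STAC root; all the real metadata lives in the STAC
# documents, and intake is only the access layer on top
cat = intake.open_stac_catalog("https://example.org/pangeo-forge/catalog.json")
print(list(cat))
```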

rabernat commented 3 years ago

would be using the Pangeo OSN only for testing the recipes for the datasets you have identified for him to try. Once we get this working, we can use the recipes (either in a bakery or standalone) to add datasets to the GC zarr store.

I think we will keep some datasets in OSN long term (not just for testing purposes) - for example NOAA OISST, the SWOT AdAC data, etc. Just not datasets that belong to a cloud-provider-sponsored public dataset program.

TomAugspurger commented 3 years ago

I am talking about a place to explain the relationships between the different massive catalogues, as well as we are able to track them - not the sets of datasets themselves.

Just noting that the STAC Catalog object handles this case too. So pangeo-forge could choose to have a root STAC catalog with links to sub-catalogs, which could be organized by cloud or dataset (or both). And with intake-stac we'd get all the benefits of STAC while preserving the benefits of intake as a client API.
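In pystac terms, that root-plus-sub-catalogs layout is only a few lines (a sketch; the ids and per-cloud split are hypothetical):

```python
import pystac

# Hypothetical root catalog with one sub-catalog per cloud
root = pystac.Catalog(
    id="pangeo-forge",
    description="Root catalog for all Pangeo Forge cloud holdings",
)
for cloud in ["gcs", "aws", "osn"]:
    root.add_child(
        pystac.Catalog(
            id=f"pangeo-forge-{cloud}",
            description=f"Datasets stored on {cloud.upper()}",
        )
    )

root.normalize_and_save("catalog", catalog_type=pystac.CatalogType.SELF_CONTAINED)
```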