pangeo-data / pangeo-datastore

Pangeo Cloud Datastore
https://catalog.pangeo.io

Rethinking Pangeo's data catalogs with consideration to pangeo-forge #118

Closed charlesbluca closed 3 years ago

charlesbluca commented 3 years ago

As progress is made on pangeo-forge (especially towards the automation of catalog generation for published datasets) and activities on this repository dwindle, I think it's worthwhile to discuss the role of Pangeo's data catalogs moving forward.

Currently, there are two copies of these catalogs, each using a different specification: one set of Intake catalogs and one set of STAC catalogs.

In general, there has been a strong push to bring much of the functionality we have with Intake catalogs into STAC, with tools like Intake-STAC, a pending ESM extension, and a Zarr component for STAC Browser. Still, enough work remains that it is safe to say the Intake catalogs are currently the preferred way to open the data into xarray containers.
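To make the Intake workflow concrete, here is a hedged sketch of how a user typically opens one of these catalogs. The `intake-catalogs/master.yaml` path matches this repository's layout; the subcatalog and entry names below are hypothetical, and the network-dependent part is guarded so only the URL helper runs on import.

```python
# Sketch of the Intake-based access path (the preferred route into xarray).
# The raw-GitHub base URL points at this repo's intake-catalogs directory;
# the "atmosphere" subcatalog and "some_entry" names are hypothetical.

def catalog_url(subcatalog=None):
    """Build the raw-GitHub URL for an Intake catalog in pangeo-datastore."""
    base = ("https://raw.githubusercontent.com/pangeo-data/"
            "pangeo-datastore/master/intake-catalogs/")
    return base + (f"{subcatalog}.yaml" if subcatalog else "master.yaml")

if __name__ == "__main__":
    # Requires intake, intake-xarray, and gcsfs; hits the network.
    import intake
    cat = intake.open_catalog(catalog_url("atmosphere"))
    ds = cat["some_entry"].to_dask()  # hypothetical entry -> xarray.Dataset
```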

On the web side of things, there are two websites that allow users to crawl the Intake and STAC catalogs, respectively: Pangeo Catalog and STAC Browser.

Both options have their own issues. Pangeo Catalog has very long load times and frequent server failures, since it opens datasets in place to render their previews and has no persistent caching system. STAC Browser doesn't yet offer a way to render ESM collections, and cannot render datasets stored in requester-pays buckets (the latter could potentially be fixed by setting up Google Cloud application default credentials before generating the site).
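For reference, the requester-pays limitation boils down to passing billing credentials through to gcsfs. Below is a minimal sketch, assuming application default credentials are already configured; the bucket path and project name are hypothetical, and the network call is guarded so only the options helper runs on import. (`requester_pays` and `project` are real `gcsfs.GCSFileSystem` parameters.)

```python
# Sketch: opening a Zarr store in a requester-pays GCS bucket, as a
# STAC Browser build step would need to. Bucket and project are hypothetical.

def gcs_storage_options(project, requester_pays=True):
    """Assemble fsspec storage_options for a requester-pays GCS bucket."""
    return {"requester_pays": requester_pays, "project": project}

if __name__ == "__main__":
    # Requires xarray, zarr, and gcsfs; bills reads to `project`.
    import xarray as xr
    ds = xr.open_zarr(
        "gs://some-requester-pays-bucket/store.zarr",  # hypothetical
        storage_options=gcs_storage_options("my-gcp-project"),
        consolidated=True,
    )
```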

Finally, we have pangeo-forge. Though this project is not yet fully functional, there has been discussion suggesting that the processes of publishing and cataloging datasets should be tied together and generalized, so that users can specify their own endpoint to publish catalogs to (e.g. the bucket the datasets reside in). As the project progresses, we are likely to see more user-published catalogs residing in the cloud, with no "master" catalog listing them all.

All of this considered, some questions I have are:

My final question points more to the existence of this repo: who uses these catalogs? Unfortunately, we have little ability to measure how much the individual datasets in these catalogs are being accessed, beyond the number of issues opened about them or coarse plots of how often their containing buckets have been requested. @rabernat made the suggestion to have the catalogs link to an API instead of the direct dataset asset files; this API could then redirect users to the requested assets while logging each request in a database of some sort. This seems like a great start, but I'm interested in whether there are simpler solutions we've been overlooking.
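The redirect-and-log idea above could be sketched roughly as follows. This is not an existing implementation: the dataset name, asset URL, and in-memory log are all hypothetical stand-ins, and a real deployment would sit behind an HTTP endpoint (returning a 302 redirect) backed by a proper database.

```python
# Minimal sketch of the suggested API: resolve a catalog entry to its
# asset URL while recording the access. All names/URLs are hypothetical.
import time

ASSETS = {  # hypothetical catalog-entry -> asset-URL mapping
    "example_dataset": "gs://example-bucket/example.zarr",
}

class AccessLog:
    """In-memory stand-in for the access-metrics database."""
    def __init__(self):
        self.records = []

    def record(self, dataset_id):
        self.records.append({"dataset": dataset_id, "time": time.time()})

def resolve(dataset_id, log):
    """Look up the asset URL for a dataset, logging the hit.

    An HTTP front end would turn a non-None result into a 302 redirect
    and a None result into a 404.
    """
    url = ASSETS.get(dataset_id)
    if url is not None:
        log.record(dataset_id)
    return url
```

The appeal of this design is that existing clients keep working unchanged (the catalog URL still ultimately serves the asset), while every access leaves a row that can be aggregated per dataset.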

cc @jhamman