As progress is made on pangeo-forge (especially towards the automation of catalog generation for published datasets) and activity on this repository dwindles, I think it's worthwhile to discuss the role of Pangeo's data catalogs moving forward.
Currently, there are two parallel copies of these catalogs, each using a different specification:
Intake catalogs, which are generated and updated through manual pull requests
STAC catalogs, which are generated from the Intake catalogs and are currently out of date
In general, there has been a great push to get much of the functionality we have with Intake catalogs into STAC, with tools like Intake-STAC, a pending ESM extension, and a Zarr component for STAC Browser, but enough work remains that it is safe to say the Intake catalogs are currently the preferred way to open the data into xarray containers.
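For reference, the Intake path this refers to is essentially a one-liner into xarray. A minimal sketch, assuming the master catalog URL from this repo and using the ocean sea-surface-height entry purely as an example:

```python
# Minimal sketch of the Intake workflow; the catalog URL is the master catalog
# from this repo, and the "ocean"/"sea_surface_height" entry is illustrative.
import intake

cat = intake.open_catalog(
    "https://raw.githubusercontent.com/pangeo-data/pangeo-datastore/"
    "master/intake-catalogs/master.yaml"
)

# Lazily open the Zarr store behind the entry as an xarray.Dataset
ds = cat.ocean.sea_surface_height.to_dask()
print(ds)
```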
On the web side of things, there are two websites that allow users to crawl the Intake and STAC catalogs, respectively:
Pangeo Catalog, a Flask-based dynamic website deployed through Google App Engine
STAC Browser, a Vue-based static website which offers flexible hosting options
Both options have their own issues: Pangeo Catalog has very long load times and frequent server failures, since it opens datasets in place to render their previews and has no persistent caching layer; STAC Browser doesn't yet offer a way to render ESM collections and cannot render datasets stored in requester-pays buckets (the latter could potentially be fixed by setting up Google Cloud application default credentials before generating the site).
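For comparison, on the Python side requester-pays access already works when a billing project can be supplied. A rough sketch, assuming a gcsfs version that supports the requester_pays flag (the project and bucket names are placeholders):

```python
# Sketch only: the project and bucket names below are placeholders.
import gcsfs
import xarray as xr

fs = gcsfs.GCSFileSystem(
    project="my-billing-project",  # project that gets billed for egress
    requester_pays=True,           # attach the billing project to each request
)
store = fs.get_mapper("gs://some-requester-pays-bucket/dataset.zarr")
ds = xr.open_zarr(store, consolidated=True)
```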
Finally, we have pangeo-forge - though this project is not yet fully functional, there has been discussion suggesting that the process of publishing and cataloging datasets should be tied together, and generalized so that users can specify their own endpoint to publish the catalogs to (e.g. the bucket the datasets reside in). As this project progresses, we are likely to see more user-published catalogs residing in the cloud, with no "master" catalog listing them out.
All of this considered, some questions I have are:
Should the catalogs in this repo be made a use case for pangeo-forge? This would involve defining recipes for each of the datasets currently cataloged and configuring their associated catalogs to be published either to this repo or elsewhere.
Should these catalogs remain centralized? If we opt to publish them outside of this repo, they could remain in a dedicated bucket together, or be published separately to their relevant buckets. From here, a tool like STAC Index could be used to enumerate all of these decentralized catalogs in one place.
Should these catalogs be Intake, STAC, or both? Though it isn't hard to generate STAC catalogs from Intake or vice versa, this process becomes more complex if we introduce decentralized catalogs. In this case, it may be easier to focus on publishing catalogs in one format.
My final question points more to the existence of this repo itself: who uses these catalogs? Unfortunately, we have little ability to measure how often individual datasets in these catalogs are accessed, beyond the number of issues opened about them or general plots of how heavily their containing buckets have been requested. @rabernat suggested having the catalogs link to an API instead of directly to the dataset asset files; this API could then redirect users to the requested assets while logging each request in a database of some sort (a rough sketch of what that could look like follows below). This seems like a great start, but I'm interested in whether there are simpler solutions we've been overlooking.
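To make the API idea concrete, here is a deliberately rough sketch rather than a proposal: the catalogs would point at an endpoint such as /assets/<asset_id> instead of the direct storage URL, and the service would record each hit before redirecting. The asset mapping, the sqlite schema, and the Flask framing are all my own assumptions, chosen only because Pangeo Catalog is already Flask-based:

```python
# Hypothetical redirect-and-log service; the asset mapping and schema are made up.
import sqlite3
from datetime import datetime, timezone

from flask import Flask, abort, redirect

app = Flask(__name__)

# Hypothetical mapping from catalog asset IDs to the real storage URLs
ASSET_URLS = {
    "ocean/sea_surface_height": "https://storage.googleapis.com/example-bucket/sea_surface_height.zarr",
}


def log_request(asset_id: str) -> None:
    """Record one access in a local sqlite database."""
    with sqlite3.connect("access_log.db") as db:
        db.execute("CREATE TABLE IF NOT EXISTS hits (asset_id TEXT, ts TEXT)")
        db.execute(
            "INSERT INTO hits VALUES (?, ?)",
            (asset_id, datetime.now(timezone.utc).isoformat()),
        )


@app.route("/assets/<path:asset_id>")
def resolve_asset(asset_id: str):
    url = ASSET_URLS.get(asset_id)
    if url is None:
        abort(404)
    log_request(asset_id)
    return redirect(url, code=302)
```

Whether a plain redirect even makes sense for a Zarr store (many objects rather than one file) is part of what would need to be worked out; a simpler variant might just log the hit and return the storage URL as JSON for the client library to open directly.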
cc @jhamman