Open cisaacstern opened 2 years ago
I thought I left a comment here, but apparently not. The gist of it was about this item:

> For extracting metadata from xarray and building `pystac`-backed STAC objects, https://github.com/TomAugspurger/xstac is a good baseline, but may need extension for our specific use case(s)
The datacube STAC extension applies to both Collections and Items. I think that `xstac` could be updated pretty easily to support Items too. Let me know if you're interested in working on that.
> Let me know if you're interested in working on that.
Thanks, Tom. I definitely am. Once I get a little closer to figuring out what functionality we need, I'll follow up directly on `xstac` with an Issue or draft PR for discussion.
This is a great list. I'd love to try to help however I can. A few comments.
> It seems that current practice may already be deviating somewhat from the target layout structure as defined above.

We should fix this! We are still very early in this project, so we don't need to worry about preserving bad decisions to maintain some vague backwards compatibility. We should get consistent by either changing the file paths or changing the spec: `pangeo-forge/swot-adac/{feedstock_name}/{recipe_name}.zarr` is out of spec with ADR 3. There is one outstanding question in ADR 3, which we may now be ready to revisit: do we want to allow `recipe_name` to include "sub-directories"?
> I have yet to confirm whether STAC Catalogs can wrap other Catalogs
Yes they can. You can have as many layers of nesting as you want, by linking to other catalogs from catalogs. See STAC catalog overview.
> I believe following the CEDA approach of representing Zarr stores as STAC Items does make sense for our use case,
:+1: from me
> A database with an API endpoint makes the most sense middle-term, but in the very short term, a simpler solution which allows us to hammer out other points on this list is to store prototype catalogs on GitHub.
:+1: from me
> The `pystac`-based loading linked above currently appears to be the best way to open zarr datasets from STAC

Can `pystac` load Zarr STAC Items?
> Yes they can. You can have as many layers of nesting as you want, by linking to other catalogs from catalogs. See STAC catalog overview.
Yeah! Figured this out somewhere along the way too. My current thought (formal write-up to follow, which can tie in your ADR notes above) ... is that our mapping should be:
Layer | Object Name | STAC Type | Example
---|---|---|---
top | `pangeo-forge-catalog.json` | Catalog | `pangeo-forge-catalog.json`
middle-high | `{{ feedstock_name }}-catalog.json` | Catalog | `swot-adac-catalog.json`
middle-low | `{{ collection_name }}-collection.json` | Collection | `gigatl-collection.json`
bottom | `{{ dataset_unique_identifier }}.json` | Item | `region01-surf-fma.json`
This is the pattern I've followed in https://github.com/pangeo-forge/pangeo-forge-catalog/tree/dev/stac (which I just pushed, for discussion purposes).
> Can `pystac` load Zarr STAC Items?

It requires a few lines (or wrapping them in a function), but yeah! Check this out: https://nbviewer.jupyter.org/github/cisaacstern/stac-notebooks/blob/gigatl-reg01-surf-fma/example_notebook.ipynb (where `opener` is defined here).
- [x] Clarify how (technically and aesthetically) we want to run a STAC Browser instance
A prototyped version of this was completed by https://github.com/pangeo-forge/pangeo-forge-vue-website/pull/6 and should be usable at https://pangeo-forge.org/catalog once Netlify rebuilds. As noted in my last comment on that PR, this code will likely need to be refactored once STAC Browser components become installable directly from npm.
- [x] Consider staged rollout of where/how STAC objects are stored
With Ryan's thumbs up on using GitHub to start, I'm going to consider that the agreed-upon approach for the time being. This is where I'm directing our STAC Browser to the Pangeo Forge root catalog: https://github.com/pangeo-forge/pangeo-forge-vue-website/blob/main/src/main.js#L14. This can be changed over to a database-backed API whenever we see fit.
- [ ] Customizing the STAC Browser instance
I've recorded a number of specific points for consideration on this topic here: https://github.com/pangeo-forge/pangeo-forge-vue-website/pull/6#issuecomment-888875862.
Ok! More to follow tomorrow. That was a big push and I think I'm ready for a break.
> That was a big push and I think I'm ready for a break

Well deserved! Kudos for pushing this difficult and uncertain task forward with minimal guidance.
Following https://github.com/pangeo-forge/pangeo-forge-vue-website/pull/7, prototype catalog is now up at https://pangeo-forge.org/catalog#/.
This is great @cisaacstern. Apologies for the radio silence while you have been building this out; I've been out of the office (STAC is the one area of this project where I actually have some experience and might be able to contribute :]). I'll try to address points in the order described in your initial comment.
Layer | Object Name | STAC Type | Example
---|---|---|---
top | `catalog.json` | Catalog | `/pangeo-forge/catalog.json`
middle-high | `/pangeo-forge/{{ feedstock_name }}/catalog.json` | Catalog | `/pangeo-forge/swot-adac/catalog.json`
middle-low | `/pangeo-forge/{{ collection_name }}/collection.json` | Collection | `/pangeo-forge/gigatl/collection.json`
bottom | `{{ dataset_unique_identifier }}.json` | Item | `region01-surf-fma.json`
We are in a somewhat unique position because we are creating a catalog referencing subcatalogs for data hosted on a variety of cloud storage providers (there is a legacy issue discussing this case in `stac-spec` that I will try to locate). There are open questions about whether static STAC records should be stored inline with their corresponding data or in a unique storage location. I feel the most flexible solution would be for bakeries to provide a storage target specifically for STAC records. With this in place, bakery managers could manage catalogs for their own holdings while the pangeo-forge product could manage an overarching confederated catalog which references all of the bakery catalogs. The one complicating factor here will be the need for additional logic around PySTAC's `StacIO` to support inferring the correct read and write protocols for different cloud providers.
I suggest that building STAC records should be a separate CI workflow rather than being included directly as a method in `pangeo-forge-recipes`. With this approach we still have direct access to the `meta.yaml` as part of the CI context, and we can support the cross-organization record storage described above. One step would be responsible for submitting a flow to the bakery in order to build the STAC catalog, and a second step would create the reference link for this catalog in the central pangeo-forge catalog.
I support the use of GitHub for initial STAC record storage prototyping, but we should attempt to stand up centralized object storage for STAC records for pangeo-forge and the bakeries to facilitate PySTAC I/O usage.
Thanks for this incredibly helpful perspective, @sharkinsspatial. There's a lot to dig into, but one small point of clarification to start. Option A below is the Collection naming scheme as proposed in your table above. Is it indeed a STAC Best Practice to not store a Collection object within a subdirectory of its enclosing Catalog? Option B seems more intuitive to me, but of course just want to do whatever is considered mainstream within the ecosystem.
Option | Object Name | STAC Type | Example
---|---|---|---
A | `/pangeo-forge/{{ collection_name }}/collection.json` | Collection | `/pangeo-forge/gigatl/collection.json`
B | `/pangeo-forge/{{ enclosing_catalog_name }}/{{ collection_name }}/collection.json` | Collection | `/pangeo-forge/swot_adac/gigatl/collection.json`
I may be off-base, but https://github.com/radiantearth/stac-api-spec/issues/159 might be related to the catalog / collections layout discussion.
@cisaacstern Apologies, that is a typo in my comment. Your nested `collection` structure is the correct name.
Thanks to Tom for https://github.com/TomAugspurger/xstac/pull/11#event-5206242190 which will be of great help in generating STAC Items.
- [ ] Determine how & when (in the recipe workflow) we want to build STAC objects
I believe the best way to build this is as a standalone GitHub Action, to be called following completion of https://github.com/pangeo-forge/feedstock-creation-action here: staged-recipes/create-feedstock.yaml
A standalone Action repo should make local testing with https://github.com/nektos/act easier, and means we can maintain/update/etc. `xstac` JSON templates without having to commit them to `pangeo-forge/staged-recipes` (NB: keeping the staged-recipes commit history clean is important, as mentioned in https://github.com/pangeo-forge/staged-recipes/pull/80#pullrequestreview-751698255).
Here is the WIP repo for this Action: https://github.com/pangeo-forge/stac-creation-action. Updates to follow shortly.
`{prefix}/pangeo-forge/{feedstock_name}/{recipe_name}.{dataset_type}`

- The `pangeo-forge` level should be represented by a STAC Catalog, with the levels between `pangeo-forge` and `{feedstock_name}` corresponding to the project or, we might say, "collection" to which the recipe belongs. For example, in the case of the `swot-adac` project, each feedstock (i.e. PR repo corresponding to a particular model output) is stored with a target path in the style of `pangeo-forge/swot-adac/{feedstock_name}/{recipe_name}.zarr`.
- We may want a `build_stac` method or similar within the `XarrayZarrRecipe` class, and this method is something which we'll want to call at recipe build time (after `finalize_target`), so that it can draw upon additional metadata fields (either pre-existing, or which we will add for this purpose) in the recipe's `meta.yaml` file. While certain cataloging information can be extracted from the zarr metadata directly via xarray (e.g., spatiotemporal extent, variable names, etc.), other key metadata for building expressive STAC objects will need to be fed from outside the dataset (i.e., long-form description, provider url, license, etc.). Much of this is already captured in `meta.yaml`, so this is a natural place to pull it from.
- For extracting metadata from xarray and building `pystac`-backed STAC objects, https://github.com/TomAugspurger/xstac is a good baseline, but may need extension for our specific use case(s).
- [...] `pystac`'s type checker), which can be opened in a notebook as demonstrated here. This loading syntax is referenced from https://planetarycomputer.microsoft.com/dataset/daymet-monthly-hi#Example-Notebook and discussed further in the next to-do item.
- The `pystac`-based loading linked above currently appears to be the best way to open zarr datasets from STAC; `intake-stac` is under discussion here: https://github.com/intake/intake-stac/pull/90.

cc @rabernat, @sharkinsspatial, @TomAugspurger so you're aware of current progress on this