pangeo-forge / pangeo-forge-catalog

SpatioTemporal Asset Catalog (STAC) for Pangeo Forge.
Apache License 2.0

STAC catalog sprint: to-do items #1

Open cisaacstern opened 2 years ago

cisaacstern commented 2 years ago

cc @rabernat, @sharkinsspatial, @TomAugspurger so you're aware of current progress on this

TomAugspurger commented 2 years ago

I thought I left a comment here, but apparently not. The gist of it was about the item:

For extracting metadata from xarray and building pystac-backed STAC objects, https://github.com/TomAugspurger/xstac is a good baseline, but may need extension for our specific use case(s)

The datacube STAC extension applies to both Collections and Items. I think that xstac could be updated pretty easily to support items too. Let me know if you're interested in working on that.
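A minimal sketch of what "applies to both Collections and Items" means in practice, assuming datacube extension v2.0.0: the same `cube:dimensions` and `cube:variables` fields sit at the top level of a Collection but under `properties` in an Item. This uses plain dictionaries and does not reproduce xstac's API; the helper name `attach_datacube`, the variable name `temp`, and all extents are hypothetical placeholders.

```python
# Assumed schema URL for datacube extension v2.0.0.
DATACUBE_SCHEMA = "https://stac-extensions.github.io/datacube/v2.0.0/schema.json"

# Placeholder cube description; real values would come from the xarray dataset.
cube_fields = {
    "cube:dimensions": {
        "time": {"type": "temporal",
                 "extent": ["2000-01-01T00:00:00Z", "2000-12-31T00:00:00Z"]},
        "x": {"type": "spatial", "axis": "x", "extent": [-180.0, 180.0]},
        "y": {"type": "spatial", "axis": "y", "extent": [-90.0, 90.0]},
    },
    "cube:variables": {
        "temp": {"type": "data", "dimensions": ["time", "y", "x"]},
    },
}


def attach_datacube(stac_object, fields):
    """Add datacube fields where the extension expects them for this object type."""
    # Items ("Feature") carry the fields in properties; Collections at top level.
    if stac_object["type"] == "Feature":
        target = stac_object["properties"]
    else:
        target = stac_object
    target.update(fields)
    stac_object.setdefault("stac_extensions", []).append(DATACUBE_SCHEMA)
    return stac_object


item = attach_datacube({"type": "Feature", "properties": {}}, cube_fields)
collection = attach_datacube({"type": "Collection"}, cube_fields)
```

Only the placement differs between the two object types, which is why extending a Collection-oriented tool to Items is plausibly a small change.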

cisaacstern commented 2 years ago

Let me know if you're interested in working on that.

Thanks, Tom. I definitely am. Once I get a little closer to figuring out what functionality we need, I'll follow up directly on xstac with an Issue or draft PR for discussion.

rabernat commented 2 years ago

This is a great list. I'd love to try to help however I can. A few comments.

  • It seems that current practice may already be deviating somewhat from the target layout structure as defined above.

We should fix this! We are still very early in this project. We don't need to worry about preserving bad decisions to maintain some vague backwards compatibility. We should get consistent by either changing the file paths or changing the spec. pangeo-forge/swot-adac/{feedstock_name}/{recipe_name}.zarr is out of spec with ADR 3. There is one outstanding question in ADR 3, which we may now be ready to revisit: do we want to allow recipe_name to include "sub-directories"?

  • I have yet to confirm whether STAC Catalogs can wrap other Catalogs

Yes they can. You can have as many layers of nesting as you want, by linking to other catalogs from catalogs. See STAC catalog overview.

  • I believe following the CEDA approach of representing Zarr stores as STAC Items does make sense for our use case,

:+1: from me

  • A database with an API endpoint makes the most sense in the medium term, but in the very short term, a simpler solution which allows us to hammer out other points on this list is to store prototype catalogs on GitHub.

:+1: from me

  • The pystac-based loading linked above currently appears to be the best way to open zarr datasets from STAC

Can pystac load Zarr STAC items?

cisaacstern commented 2 years ago

Yes they can. You can have as many layers of nesting as you want, by linking to other catalogs from catalogs. See STAC catalog overview.

Yeah! Figured this out somewhere along the way too. My current thought (formal write-up to follow, which can tie in your ADR notes above) ... is that our mapping should be:

| Layer | Object Name | STAC Type | Example |
| --- | --- | --- | --- |
| top | pangeo-forge-catalog.json | Catalog | pangeo-forge-catalog.json |
| middle-high | {{ feedstock_name }}-catalog.json | Catalog | swot-adac-catalog.json |
| middle-low | {{ collection_name }}-collection.json | Collection | gigatl-collection.json |
| bottom | {{ dataset_unique_identifier }}.json | Item | region01-surf-fma.json |

This is the pattern I've followed in https://github.com/pangeo-forge/pangeo-forge-catalog/tree/dev/stac (which I just pushed, for discussion purposes).
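As a rough illustration of the four-layer mapping in the table, the sketch below builds the objects as plain STAC 1.0.0 dictionaries and follows their `child` and `item` links from the root. It deliberately avoids pystac so that only the link structure is shown, and it is not the code in the `dev` branch. The object ids, descriptions, license, extents, and datetime are placeholder assumptions; only the file names come from the table.

```python
STAC_VERSION = "1.0.0"


def link(rel, href):
    """Build one STAC link object with a relative href."""
    return {"rel": rel, "href": href}


# In-memory stand-in for the JSON files, keyed by the object names in the table.
files = {
    "pangeo-forge-catalog.json": {
        "type": "Catalog", "stac_version": STAC_VERSION,
        "id": "pangeo-forge", "description": "Root catalog (placeholder)",
        "links": [link("child", "swot-adac-catalog.json")],
    },
    "swot-adac-catalog.json": {
        "type": "Catalog", "stac_version": STAC_VERSION,
        "id": "swot-adac", "description": "Feedstock catalog (placeholder)",
        "links": [link("child", "gigatl-collection.json")],
    },
    "gigatl-collection.json": {
        "type": "Collection", "stac_version": STAC_VERSION,
        "id": "gigatl", "description": "Collection (placeholder)",
        "license": "proprietary",  # placeholder value
        "extent": {
            "spatial": {"bbox": [[-180.0, -90.0, 180.0, 90.0]]},
            "temporal": {"interval": [["2000-01-01T00:00:00Z", None]]},
        },
        "links": [link("item", "region01-surf-fma.json")],
    },
    "region01-surf-fma.json": {
        "type": "Feature", "stac_version": STAC_VERSION,
        "id": "region01-surf-fma", "geometry": None,
        "properties": {"datetime": "2000-01-01T00:00:00Z"},  # placeholder
        "links": [], "assets": {},
    },
}


def walk(name, depth=0):
    """Yield (depth, type, id) by following child and item links from `name`."""
    obj = files[name]
    yield depth, obj["type"], obj["id"]
    for lnk in obj["links"]:
        if lnk["rel"] in ("child", "item"):
            yield from walk(lnk["href"], depth + 1)


for depth, stac_type, stac_id in walk("pangeo-forge-catalog.json"):
    print("  " * depth + f"{stac_type}: {stac_id}")
```

Catalogs reach nested Catalogs and Collections through `child` links, and a Collection reaches its Items through `item` links, so the depth of nesting is not limited by the spec.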

Can pystac load Zarr STAC items?

It requires a few lines (or wrapping them in a function), but yeah! Check this out: https://nbviewer.jupyter.org/github/cisaacstern/stac-notebooks/blob/gigatl-reg01-surf-fma/example_notebook.ipynb (where opener is defined here).
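The "few lines" could look roughly like the sketch below. It is not the `opener` from the linked notebook. It assumes the Item stores the Zarr store as an asset whose media type is `application/vnd+zarr`, and that xarray (with fsspec for remote URLs) is installed when `open_item` is called. The asset key `zarr` and the bucket URL are hypothetical.

```python
ZARR_MEDIA_TYPE = "application/vnd+zarr"  # assumed media type for Zarr assets


def zarr_href(item):
    """Return the href of the first Zarr asset in a STAC Item dictionary."""
    for asset in item.get("assets", {}).values():
        if asset.get("type") == ZARR_MEDIA_TYPE:
            return asset["href"]
    raise ValueError(f"Item {item.get('id')!r} has no Zarr asset")


def open_item(item, **kwargs):
    """Open the Item's Zarr store lazily as an xarray Dataset."""
    import xarray as xr  # imported here so zarr_href works without xarray

    return xr.open_zarr(zarr_href(item), **kwargs)


# Hypothetical Item; the href is a placeholder, not a real store.
example_item = {
    "type": "Feature",
    "id": "region01-surf-fma",
    "assets": {
        "zarr": {
            "href": "s3://example-bucket/gigatl/region01-surf-fma.zarr",
            "type": ZARR_MEDIA_TYPE,
        }
    },
}

print(zarr_href(example_item))
```

With a pystac Item, `item.to_dict()` yields a dictionary of this shape, so the same helper can sit behind a pystac-based loader.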

cisaacstern commented 2 years ago

  • [x] Clarify how (technically and aesthetically) we want to run a STAC Browser instance

A prototyped version of this was completed by https://github.com/pangeo-forge/pangeo-forge-vue-website/pull/6 and should be usable at https://pangeo-forge.org/catalog once Netlify rebuilds. As noted in my last comment on that PR, this code will likely need to be refactored once STAC Browser components become installable directly from npm.

  • [x] Consider staged rollout of where/how STAC objects are stored

With Ryan's thumbs up on using GitHub to start, I'm going to consider that the agreed-upon approach for the time being. This is where I'm directing our STAC Browser to the Pangeo Forge root catalog: https://github.com/pangeo-forge/pangeo-forge-vue-website/blob/main/src/main.js#L14. This can be changed over to a database-backed API whenever we see fit.

  • [ ] Customizing the STAC Browser instance

I've recorded a number of specific points for consideration on this topic here: https://github.com/pangeo-forge/pangeo-forge-vue-website/pull/6#issuecomment-888875862.

Ok! More to follow tomorrow. That was a big push and I think I'm ready for a break. 😅

rabernat commented 2 years ago

That was a big push and I think I'm ready for a break

Well deserved! 🏆 for pushing this difficult and uncertain task forward with minimal guidance. 👏 👏 👏

cisaacstern commented 2 years ago

Following https://github.com/pangeo-forge/pangeo-forge-vue-website/pull/7, prototype catalog is now up at https://pangeo-forge.org/catalog#/.

sharkinsspatial commented 2 years ago

This is great @cisaacstern 🎊. Apologies for the radio silence while you have been building this out; I've been out of the office (STAC is the one area of this project where I actually have some experience and might be able to contribute :]). I'll try to address points in the order described in your initial comment.

| Layer | Object Name | STAC Type | Example |
| --- | --- | --- | --- |
| top | catalog.json | Catalog | /pangeo-forge/catalog.json |
| middle-high | /pangeo-forge/{{ feedstock_name }}/catalog.json | Catalog | /pangeo-forge/swot-adac/catalog.json |
| middle-low | /pangeo-forge/{{ collection_name }}/collection.json | Collection | /pangeo-forge/gigatl/collection.json |
| bottom | {{ dataset_unique_identifier }}.json | Item | region01-surf-fma.json |

cisaacstern commented 2 years ago

Thanks for this incredibly helpful perspective, @sharkinsspatial. There's a lot to dig into, but one small point of clarification to start. Option A below is the Collection naming scheme as proposed in your table above. Is it indeed a STAC Best Practice to not store a Collection object within a subdirectory of its enclosing Catalog? Option B seems more intuitive to me, but of course just want to do whatever is considered mainstream within the ecosystem.

| Option | Object Name | STAC Type | Example |
| --- | --- | --- | --- |
| A | /pangeo-forge/{{ collection_name }}/collection.json | Collection | /pangeo-forge/gigatl/collection.json |
| B | /pangeo-forge/{{ enclosing_catalog_name }}/{{ collection_name }}/collection.json | Collection | /pangeo-forge/swot_adac/gigatl/collection.json |

TomAugspurger commented 2 years ago

I may be off-base, but https://github.com/radiantearth/stac-api-spec/issues/159 might be related to the catalog / collections layout discussion.

sharkinsspatial commented 2 years ago

@cisaacstern Apologies, that is a typo in my comment. Your nested collection structure (Option B) is the correct one.
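For reference, the nested layout can be written as two small path helpers. This is only a sketch of the naming scheme from the tables above; the function names are hypothetical, and Item paths are left out because the thread does not specify a directory for them.

```python
import posixpath

ROOT = "/pangeo-forge"  # root prefix used in the tables above


def catalog_href(feedstock_name):
    """Path of a feedstock-level Catalog under the root."""
    return posixpath.join(ROOT, feedstock_name, "catalog.json")


def collection_href(feedstock_name, collection_name):
    """Path of a Collection nested inside its enclosing feedstock Catalog."""
    return posixpath.join(ROOT, feedstock_name, collection_name, "collection.json")


print(catalog_href("swot-adac"))
print(collection_href("swot-adac", "gigatl"))
```

Nesting each Collection under its enclosing Catalog keeps every object's parent recoverable from its path alone.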

cisaacstern commented 2 years ago

Thanks to Tom for https://github.com/TomAugspurger/xstac/pull/11#event-5206242190 which will be of great help in generating STAC Items.

cisaacstern commented 2 years ago

  • [ ] Determine how & when (in the recipe workflow) we want to build STAC objects

I believe the best way to build this is as a standalone GitHub Action, to be called following completion of https://github.com/pangeo-forge/feedstock-creation-action here: staged-recipes/create-feedstock.yaml

A standalone Action repo should make local testing with https://github.com/nektos/act easier and means we can maintain/update/etc. xstac JSON templates without having to commit them to pangeo-forge/staged-recipes (NB: keeping the staged-recipes commit history clean is important, as mentioned in https://github.com/pangeo-forge/staged-recipes/pull/80#pullrequestreview-751698255).

Here is the WIP repo for this Action: https://github.com/pangeo-forge/stac-creation-action. Updates to follow shortly.