Open cisaacstern opened 2 years ago
I thought I left a comment here, but apparently not. The gist of it was about this item:

> For extracting metadata from xarray and building `pystac`-backed STAC objects, https://github.com/TomAugspurger/xstac is a good baseline, but may need extension for our specific use case(s)
The datacube STAC extension applies to both Collections and Items. I think that `xstac` could be updated pretty easily to support Items too. Let me know if you're interested in working on that.
> Let me know if you're interested in working on that.
Thanks, Tom. I definitely am. Once I get a little closer to figuring out what functionality we need, I'll follow up directly on `xstac` with an Issue or draft PR for discussion.
This is a great list. I'd love to try to help however I can. A few comments.
> It seems that current practice may already be deviating somewhat from the target layout structure as defined above.

We should fix this! We are still very early in this project, so we don't need to worry about preserving bad decisions to maintain some vague backwards compatibility. We should get consistent by either changing the file paths or changing the spec: `pangeo-forge/swot-adac/{feedstock_name}/{recipe_name}.zarr` is out of spec with ADR 3. There is one outstanding question in ADR 3, which we may now be ready to revisit: do we want to allow `recipe_name` to include "sub-directories"?
> I have yet to confirm whether STAC Catalogs can wrap other Catalogs
Yes they can. You can have as many layers of nesting as you want, by linking to other catalogs from catalogs. See STAC catalog overview.
> I believe following the CEDA approach of representing Zarr stores as STAC Items does make sense for our use case,
:+1: from me
> A database with an API endpoint makes the most sense middle-term, but in the very short term, a simpler solution which allows us to hammer out other points on this list is to store prototype catalogs on GitHub.
:+1: from me
> The `pystac`-based loading linked above currently appears to be the best way to open zarr datasets from STAC

Can `pystac` load Zarr STAC Items?
> Yes they can. You can have as many layers of nesting as you want, by linking to other catalogs from catalogs. See STAC catalog overview.
Yeah! Figured this out somewhere along the way too. My current thought (formal write-up to follow, which can tie in your ADR notes above) ... is that our mapping should be:
Layer | Object Name | STAC Type | Example
---|---|---|---
top | `pangeo-forge-catalog.json` | Catalog | `pangeo-forge-catalog.json`
middle-high | `{{ feedstock_name }}-catalog.json` | Catalog | `swot-adac-catalog.json`
middle-low | `{{ collection_name }}-collection.json` | Collection | `gigatl-collection.json`
bottom | `{{ dataset_unique_identifier }}.json` | Item | `region01-surf-fma.json`
This is the pattern I've followed in https://github.com/pangeo-forge/pangeo-forge-catalog/tree/dev/stac (which I just pushed, for discussion purposes).
> Can `pystac` load Zarr STAC Items?

It requires a few lines (or wrapping them in a function), but yeah! Check this out: https://nbviewer.jupyter.org/github/cisaacstern/stac-notebooks/blob/gigatl-reg01-surf-fma/example_notebook.ipynb (where `opener` is defined here).
- [x] Clarify how (technically and aesthetically) we want to run a STAC Browser instance
A prototyped version of this was completed by https://github.com/pangeo-forge/pangeo-forge-vue-website/pull/6 and should be usable at https://pangeo-forge.org/catalog once Netlify rebuilds. As noted in my last comment on that PR, this code will likely need to be refactored once STAC Browser components become installable directly from npm.
- [x] Consider staged rollout of where/how STAC objects are stored
With Ryan's thumbs up on using GitHub to start, I'm going to consider that the agreed-upon approach for the time being. This is where I'm directing our STAC Browser to the Pangeo Forge root catalog: https://github.com/pangeo-forge/pangeo-forge-vue-website/blob/main/src/main.js#L14. This can be changed over to a database-backed API whenever we see fit.
- [ ] Customizing the STAC Browser instance
I've recorded a number of specific points for consideration on this topic here: https://github.com/pangeo-forge/pangeo-forge-vue-website/pull/6#issuecomment-888875862.
Ok! More to follow tomorrow. That was a big push and I think I'm ready for a break.
> That was a big push and I think I'm ready for a break

Well deserved! Kudos for pushing this difficult and uncertain task forward with minimal guidance.
Following https://github.com/pangeo-forge/pangeo-forge-vue-website/pull/7, prototype catalog is now up at https://pangeo-forge.org/catalog#/.
This is great @cisaacstern. Apologies for the radio silence while you have been building this out; I've been out of the office (STAC is the one area of this project where I actually have some experience and might be able to contribute :]). I'll try to address points in the order described in your initial comment.
Layer | Object Name | STAC Type | Example
---|---|---|---
top | `catalog.json` | Catalog | `/pangeo-forge/catalog.json`
middle-high | `/pangeo-forge/{{ feedstock_name }}/catalog.json` | Catalog | `/pangeo-forge/swot-adac/catalog.json`
middle-low | `/pangeo-forge/{{ collection_name }}/collection.json` | Collection | `/pangeo-forge/gigatl/collection.json`
bottom | `{{ dataset_unique_identifier }}.json` | Item | `region01-surf-fma.json`
We are in a somewhat unique position because we are creating a catalog referencing subcatalogs for data hosted on a variety of cloud storage providers (there is a legacy issue discussing this case in `stac-spec` that I will try to locate). There are open questions about whether static STAC records should be stored inline with their corresponding data or in a unique storage location. I feel the most flexible solution would be for bakeries to provide a storage target specifically for STAC records. With this in place, bakery managers could manage catalogs for their own holdings while the pangeo-forge product could manage an overarching confederated catalog which references all of the bakery catalogs. The one complicating factor here will be the need for additional logic around PySTAC's `StacIO` to support inferring the correct read and write protocols for different cloud providers.
I suggest that building STAC records should be a separate CI workflow rather than being included directly as a method in `pangeo-forge-recipes`. With this approach we still have direct access to the `meta.yaml` as part of the CI context, and we can support the cross-organization record storage described above. One step would be responsible for submitting a flow to the bakery in order to build the STAC catalog, and a second step would create the reference link for this catalog in the central pangeo-forge catalog.
I support the use of GitHub for initial STAC record storage prototyping, but we should attempt to stand up centralized object storage for STAC records for pangeo-forge and the bakeries to facilitate PySTAC I/O usage.
Thanks for this incredibly helpful perspective, @sharkinsspatial. There's a lot to dig into, but one small point of clarification to start. Option A below is the Collection naming scheme as proposed in your table above. Is it indeed a STAC Best Practice to not store a Collection object within a subdirectory of its enclosing Catalog? Option B seems more intuitive to me, but of course just want to do whatever is considered mainstream within the ecosystem.
Option | Object Name | STAC Type | Example
---|---|---|---
A | `/pangeo-forge/{{ collection_name }}/collection.json` | Collection | `/pangeo-forge/gigatl/collection.json`
B | `/pangeo-forge/{{ enclosing_catalog_name }}/{{ collection_name }}/collection.json` | Collection | `/pangeo-forge/swot_adac/gigatl/collection.json`
I may be off-base, but https://github.com/radiantearth/stac-api-spec/issues/159 might be related to the catalog / collections layout discussion.
@cisaacstern Apologies, that is a typo in my comment. Your nested `collection` structure is the correct name.
Thanks to Tom for https://github.com/TomAugspurger/xstac/pull/11#event-5206242190 which will be of great help in generating STAC Items.
- [ ] Determine how & when (in the recipe workflow) we want to build STAC objects
I believe the best way to build this is as a standalone GitHub Action, to be called following completion of https://github.com/pangeo-forge/feedstock-creation-action here: staged-recipes/create-feedstock.yaml
A standalone Action repo should make local testing with https://github.com/nektos/act easier, and means we can maintain/update/etc. `xstac` JSON templates without having to commit them to `pangeo-forge/staged-recipes` (NB: keeping the staged-recipes commit history clean is important, as mentioned in https://github.com/pangeo-forge/staged-recipes/pull/80#pullrequestreview-751698255).
Here is the WIP repo for this Action: https://github.com/pangeo-forge/stac-creation-action. Updates to follow shortly.
`{prefix}/pangeo-forge/{feedstock_name}/{recipe_name}.{dataset_type}`

- The `pangeo-forge` level should be represented by a STAC Catalog, with the levels between `pangeo-forge` and `{feedstock_name}` corresponding to the project or, we might say, "collection" to which the recipe belongs. For example, in the case of the `swot-adac` project, each feedstock (i.e. PR repo corresponding to a particular model output) is stored with a target path in the style of `pangeo-forge/swot-adac/{feedstock_name}/{recipe_name}.zarr`.
- We may want a `build_stac` method or similar within the `XarrayZarrRecipe` class, and this method is something which we'll want to call at recipe build time (after `finalize_target`), so that it can draw upon additional metadata fields (either pre-existing, or which we will add for this purpose) in the recipe's `meta.yaml` file. While certain cataloging information can be extracted from the zarr metadata directly via xarray (e.g., spatiotemporal extent, variable names, etc.), other key metadata for building expressive STAC objects will need to be fed from outside the dataset (i.e., long-form description, provider url, license, etc.). Much of this is already captured in `meta.yaml`, so this is a natural place to pull it from.
- For extracting metadata from xarray and building `pystac`-backed STAC objects, https://github.com/TomAugspurger/xstac is a good baseline, but may need extension for our specific use case(s).
- [...] `pystac`'s type checker), which can be opened in a notebook as demonstrated here. This loading syntax is referenced from https://planetarycomputer.microsoft.com/dataset/daymet-monthly-hi#Example-Notebook and discussed further in the next to-do item.
- The `pystac`-based loading linked above currently appears to be the best way to open zarr datasets from STAC; `intake-stac` is under discussion here: https://github.com/intake/intake-stac/pull/90.

cc @rabernat, @sharkinsspatial, @TomAugspurger so you're aware of current progress on this