How to return dynamic sub-catalogs

philvarner commented 1 year ago

Moved from:

https://github.com/radiantearth/stac-api-spec/issues/329

philvarner commented 1 year ago

Some text from stac-api-spec:

If sub-catalogs are used, it is recommended that these use the endpoint /catalogs/{catalogId} to avoid conflicting with other endpoints from the root.

| Endpoint | Media Type | Returns | Description |
| --- | --- | --- | --- |
| `/catalogs/{catalogId}` | application/json | Catalog | child Catalog object |

Structuring Catalog Hierarchies

A STAC API is more useful when it presents a complete Catalog representation of all the data contained in the API, such that all Item objects can be reached by traversing child and item link relations from the root. Being able to reach all Items in this way is formalized in the Browseable conformance class, but any Catalog can be structured for hierarchical traversal. Implementers who have search as their primary use case should consider also implementing this alternate view over the data by presenting it as a directed graph of catalogs, where the child link relations typically form a tree, and where each catalog can be retrieved with a single request (e.g., each Catalog JSON is small enough that it does not require pagination).

For example, child links to sub-catalogs may be structured as in this diagram:

graph LR
    A[Root] -->|child| B(sentinel-2-l2a)
    B --> |child| C(10SDG)
    B --> |child| D(10SDH)
    B --> |child| E(10SDJ)
    B --> |child| BB(...)

    C --> |child| F(2018)
    C --> |child| G(2019)
    C --> |child| CC(...)

    D --> |child| H(2018)
    D --> |child| DD(...)
    E --> |child| I(2018)
    E --> |child| EE(...)

    F --> |item| J(12.31.0)
    F --> |item| K(01.09.0)
    F --> |item| L(01.09.1)
    F --> |item| FF(...)

STAC API does not define which endpoint or endpoints should return these catalogs, but one approach would be to return them from an endpoint like /catalogs/{catalogId}.
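
As a rough sketch of what such an endpoint might return, the function below assembles a Catalog body with `child` links. The base URL `https://example.com`, the `CHILDREN` lookup, and the function name `build_catalog` are all assumptions made for illustration; only the top-level Catalog fields (`type`, `stac_version`, `id`, `description`, `links`) come from the STAC Catalog object.

```python
BASE_URL = "https://example.com"  # hypothetical API root

# Hypothetical lookup: catalog ID -> IDs of its child catalogs
CHILDREN = {
    "sentinel-2-l2a": ["10SDG", "10SDH", "10SDJ"],
}


def build_catalog(catalog_id, parent_href=BASE_URL + "/"):
    """Build a JSON-serializable body for GET /catalogs/{catalogId}."""
    links = [
        {"rel": "self", "href": f"{BASE_URL}/catalogs/{catalog_id}",
         "type": "application/json"},
        {"rel": "root", "href": f"{BASE_URL}/", "type": "application/json"},
        {"rel": "parent", "href": parent_href, "type": "application/json"},
    ]
    # One child link per sub-catalog, each pointing back at the same endpoint.
    for child_id in CHILDREN.get(catalog_id, []):
        links.append({"rel": "child",
                      "href": f"{BASE_URL}/catalogs/{child_id}",
                      "type": "application/json"})
    return {
        "type": "Catalog",
        "stac_version": "1.0.0",
        "id": catalog_id,
        "description": f"Sub-catalog {catalog_id}",
        "links": links,
    }
```

A web framework handler would simply serialize the returned dictionary as `application/json`.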

While OAFeat requires that all Items must be part of a Collection, this does not mean that the Collection needs to be part of the browseable tree. If they are part of the tree, it is recommended that there only be one Collection in a path through the tree, and that a collection never contain child collections.

These are the two standard ways of structuring a browseable tree of catalogs, the only difference being whether the Collection is used as part of the tree or not:

Catalog (root) -> Catalog* -> Item (recommended)
Catalog (root) -> Collection -> Catalog* -> Item

All Items must be part of a Collection, but the Collection itself does not need to be part of the browseable graph.

How you structure your graph of Catalogs can allow you to both group Collections together and create sub-groups of items within a Collection. For example, your collections may be grouped so that each represents a data product. This might mean you have a collection for each of Landsat 8 Collection 1, Landsat 8 Surface Reflectance, Sentinel-2 L1C, Sentinel-2 L2A, Sentinel-5P UV Aerosol Index, Sentinel-5P Cloud, MODIS MCD43A4, MODIS MOD11A1, and MODIS MYD11A1. You can also present each of these as a catalog, and create parent catalogs for them that allow you to group together all Landsat, Sentinel, and MODIS catalogs.

/ root catalog
- child -> /catalogs/landsat
  - child -> /catalogs/landsat_7
  - child -> /catalogs/landsat_8
    - child -> /catalogs/landsat_8_c1
    - child -> /catalogs/landsat_8_sr
- child -> /catalogs/sentinel
  - child -> /catalogs/sentinel_2
    - child -> /catalogs/sentinel_2_l1c
    - child -> /catalogs/sentinel_2_l2a
  - child -> /catalogs/sentinel_5p
    - child -> /catalogs/sentinel_5p_uvai
    - child -> /catalogs/sentinel_5p_cloud
- child -> /catalogs/modis
  - child -> /catalogs/modis_mcd43a4
  - child -> /catalogs/modis_mod11a1
  - child -> /catalogs/modis_myd11a1

Each of these catalog endpoints could in turn be its own STAC API root, allowing an interface where users can search over arbitrary groups of collections without needing to explicitly know and name every collection in the `collections` query parameter of a search. These catalogs-of-catalogs can be divided in multiple ways, e.g., per provider (e.g., Sentinel-2), per domain (e.g., cloud data), or per form of data (electro-optical, LIDAR, SAR).
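
One way a group catalog could back such a search is to expand its ID into the leaf collections beneath it and search over those. The sketch below assumes a hypothetical `CHILDREN` mapping and assumes that each leaf catalog ID is also a collection ID; neither is required by the spec.

```python
# Hypothetical mapping of group catalogs to their children; IDs that do not
# appear as keys are assumed to be leaf catalogs that match a collection ID.
CHILDREN = {
    "sentinel": ["sentinel_2", "sentinel_5p"],
    "sentinel_2": ["sentinel_2_l1c", "sentinel_2_l2a"],
    "sentinel_5p": ["sentinel_5p_uvai", "sentinel_5p_cloud"],
}


def leaf_collections(catalog_id):
    """Expand a group catalog ID into the collection IDs beneath it."""
    children = CHILDREN.get(catalog_id)
    if not children:
        return [catalog_id]
    leaves = []
    for child_id in children:
        leaves.extend(leaf_collections(child_id))
    return leaves
```

A search rooted at `/catalogs/sentinel` could then behave like a root-level search whose `collections` value is `leaf_collections("sentinel")`, so the user never has to list the four collections by hand.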

Going the other direction, collections can be sub-grouped into smaller catalogs. For example, the following groups a catalog of Landsat 8 Collection 1 items by path, row, and date (the path/row system is used by this product for gridding).

/ (root)
- /catalogs/landsat_8_c1
  - /catalogs/landsat_8_c1/139
    - /catalogs/landsat_8_c1/139_045
      - /catalogs/landsat_8_c1/139_045_20170304
        - /collections/landsat_8_c1/items/LC08_L1TP_139045_20170304_20170316_01_T1
      - /catalogs/landsat_8_c1/139_045_20170305
        - /collections/landsat_8_c1/items/LC08_L1TP_139045_20170305_20170317_01_T1
    - /catalogs/landsat_8_c1/139_046
      - /catalogs/landsat_8_c1/139_046_20170304
        - /collections/landsat_8_c1/items/LC08_L1TP_139046_20170304_20170316_01_T1
      - /catalogs/landsat_8_c1/139_046_20170305
        - /collections/landsat_8_c1/items/LC08_L1TP_139046_20170305_20170317_01_T1

If done in a consistent manner, these can also provide "templated" URIs, such that a user could directly request a specific path, row, and date simply by replacing the values in /catalogs/landsat_8_c1/{path}_{row}_{date}.
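
A client could fill in that template with a few lines of code. In this sketch the function name is hypothetical, and the three-digit zero padding and `YYYYMMDD` date format are inferred from the example paths above rather than stated anywhere as a rule.

```python
from datetime import date


def landsat_catalog_path(path, row, day):
    """Fill in /catalogs/landsat_8_c1/{path}_{row}_{date}.

    Assumes path and row are zero-padded to three digits and the date is
    written as YYYYMMDD, matching the example catalog IDs.
    """
    return f"/catalogs/landsat_8_c1/{path:03d}_{row:03d}_{day:%Y%m%d}"


print(landsat_catalog_path(139, 45, date(2017, 3, 4)))
# /catalogs/landsat_8_c1/139_045_20170304
```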

Similarly, a MODIS product using sinusoidal gridding could use paths of the form /{horizontal_grid}/{vertical_grid}/{date}. Since only around 300 scenes are produced every day for a MODIS product and there is a 20-year history of production, these could fit in a graph with path length 3 from the root Catalog to each leaf Item.

/ (root)
- /catalogs/mcd43a4 (~7,000 child relation links, one to each date)
  - /catalogs/mcd43a4/{date} (~300 item relation links to each Item)
    - /collections/mcd43a4/items/{itemId}
    - ...
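
The figure of roughly 7,000 child links can be checked by enumerating one date catalog per day over a 20-year span. The start and end dates and the `YYYY-MM-DD` form of `{date}` below are assumptions chosen for illustration, not the product's actual production window or naming.

```python
from datetime import date, timedelta


def date_catalog_hrefs(start, end):
    """Yield one /catalogs/mcd43a4/{date} href per day from start to end inclusive."""
    day = start
    while day <= end:
        yield f"/catalogs/mcd43a4/{day:%Y-%m-%d}"
        day += timedelta(days=1)


# Hypothetical 20-year window
hrefs = list(date_catalog_hrefs(date(2000, 1, 1), date(2019, 12, 31)))
print(len(hrefs))  # 7305
```

At around 300 items per date catalog, that is on the order of two million Items, each three links away from the root.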

Catalogs can also group related products. For example, here we group together synthetic aperture radar (SAR) products (Sentinel-1 and AfriSAR) and electro-optical (EO) bottom of atmosphere (BOA) products.

/ root catalog
- child -> /catalogs/sar
  - child -> /catalogs/sentinel_1_l2a
  - child -> /catalogs/afrisar
- child -> /catalogs/eo_boa
  - child -> /catalogs/landsat_8_sr
  - child -> /catalogs/sentinel_2_l2a

The catalog structure is a directed graph, which allows you to provide several different Catalog and Collection paths to reach leaf Items. For example, for a Landsat 8 data product, you may want to allow browsing either by date, then path, then row, or by path, then row, then date:

Catalog -> Catalog (product) -> Catalog (date) -> Catalog (path) -> Catalog (row)
Catalog -> Catalog (product) -> Catalog (path) -> Catalog (row) -> Catalog (date)

When more than one path to an Item is allowed, it is recommended that the final item link relation reference a consistent, canonical URL for each item, instead of a URL that is specific to the path of Catalogs that was followed to reach it.
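
A simple way to keep item links canonical is to derive every item href from the collection and item IDs alone, never from the catalog being rendered. In this sketch the helper name and the two leaf catalog IDs are hypothetical; the item href matches the `/collections/{collectionId}/items/{itemId}` form used in the examples above.

```python
def canonical_item_link(collection_id, item_id):
    """Return an item link that is the same whichever catalog embeds it."""
    return {
        "rel": "item",
        "href": f"/collections/{collection_id}/items/{item_id}",
        "type": "application/geo+json",
    }


ITEM_ID = "LC08_L1TP_139045_20170304_20170316_01_T1"

# Two hypothetical leaf catalogs reached by different browsing orders
by_path_row_date = {
    "id": "139_045_20170304",
    "links": [canonical_item_link("landsat_8_c1", ITEM_ID)],
}
by_date_path_row = {
    "id": "20170304_139_045",
    "links": [canonical_item_link("landsat_8_c1", ITEM_ID)],
}
```

Both catalogs end in the same href, so a client that crawls both branches can de-duplicate Items by URL.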

There are many options for how to structure these catalog graphs, so it will take some analysis work to figure out which one or ones best match the structure of your data and the needs of your consumers.

chiarch84 commented 1 year ago

Dear @philvarner, I read your proposal in detail, but it is not clear to me why you propose the following paths:

Catalog (root) -> Catalog* -> Item (recommended)
Catalog (root) -> Collection -> Catalog* -> Item

Rather than:

Catalog (root) -> Catalog* -> Collection -> Item (recommended)
Catalog (root) -> Collection -> Item

From my point of view, the following tree structure should work well. Do you see anything in it that clashes with the specs?

/
- child -> /catalogs/sentinel
  - child -> /catalogs/sentinel_2
    - child -> /collections/sentinel_2_l1c
      - items -> /collections/sentinel_2_l1c/items
    - child -> /collections/sentinel_2_l2a
      - items -> /collections/sentinel_2_l2a/items
  - child -> /catalogs/sentinel_1
    - child -> /collections/sentinel_1_l1a
      - items -> /collections/sentinel_1_l1a/items
    - child -> /collections/sentinel_1_l1c
      - items -> /collections/sentinel_1_l1c/items

stac-api-extensions / best-practices

How to return dynamic sub-catalogs #4
