Juanezm opened this issue 3 years ago
I'm starting to work on this now. The main issue with the STAC Collection -> Datacube Product mapping is that a STAC Collection lacks the per-band pixel-type information that a Datacube Product requires.
One way to work around this is to inspect STAC Assets, assuming that assets with the same name are consistent across STAC Items. Another option is to ask the user for that information; for example, the user might provide a dictionary from band name to pixel `dtype`, or just a single `dtype` if all the bands share the same one.
The other piece of information needed is a "fill value" for each band: the value to use for pixels not covered by any dataset. Same story here: one can either look up the `nodata` attribute in the data file (if it is set), ask the user to provide it, or default to some reasonable value based on the pixel `dtype`.
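To make the user-supplied option concrete, here is a rough sketch of turning a band-name-to-dtype dictionary (or a single shared dtype) plus optional nodata values into ODC-style measurement definitions. Everything here (the function name, the `DEFAULT_NODATA` table) is hypothetical illustration, not an existing odc-tools API:

```python
# Hypothetical sketch: build ODC-style measurement dicts from band names,
# with user-supplied dtypes/nodata and dtype-based fallbacks, as the
# comment above suggests. Not a real odc-tools function.

# Reasonable nodata defaults keyed by on-disk dtype (illustrative choices)
DEFAULT_NODATA = {"uint8": 0, "uint16": 0, "int16": -999, "float32": float("nan")}

def bands_to_measurements(band_names, dtypes=None, nodata=None, default_dtype="uint16"):
    """Build a list of measurement definitions.

    `dtypes` may be a single dtype string shared by all bands, or a dict
    mapping band name -> dtype; `nodata` works the same way.
    """
    measurements = []
    for name in band_names:
        if isinstance(dtypes, dict):
            dtype = dtypes.get(name, default_dtype)
        else:
            dtype = dtypes or default_dtype
        if isinstance(nodata, dict):
            fill = nodata.get(name, DEFAULT_NODATA.get(dtype, 0))
        else:
            fill = nodata if nodata is not None else DEFAULT_NODATA.get(dtype, 0)
        measurements.append({"name": name, "dtype": dtype, "nodata": fill, "units": "1"})
    return measurements
```

A caller could pass `dtypes={"qa": "uint8"}` to override a single band while the rest fall back to the default.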
Yes, it is tricky. I finally ended up adding a custom stac_extension object into the collection, since I was creating the collections too, but this workaround won't work with generic collections.
Ideally pixel data type would be part of this extension:
https://github.com/stac-extensions/eo#band-object
@gadomski was suggesting we propose additions to the `eo:bands` extension for that purpose, cc: @alexgleith
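For illustration, a band entry carrying the two missing pieces of information might look like the dict below. The `dtype` and `nodata` keys are the proposed/hypothetical additions, not fields in the current `eo` extension:

```python
# Sketch of an eo:bands-style entry if the proposed fields existed.
# "dtype" and "nodata" are hypothetical keys suggested by this thread,
# not part of the published eo extension.
proposed_band = {
    "name": "coastal",
    "common_name": "coastal",        # existing eo:bands field
    "center_wavelength": 0.44,       # existing eo:bands field
    "dtype": "uint16",               # proposed: pixel type as stored on "disk"
    "nodata": 0,                     # proposed: fill value outside covered area
}
```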
It would be nice to have the following information for every data band in a collection (without needing to fetch sample pixel imagery):

- `dtype` of the pixel data as stored on "disk"
- fill value to use when returning data outside of the covered area; typically this would be the same as the `nodata` value used in the images

Apologies if this is a slightly different aspect of this issue, but I am running into issues that relate to the STAC Collection -> Datacube Product mapping space, so I thought I would add a comment.
In the USGS STAC API - https://ibhoyw8md9.execute-api.us-west-2.amazonaws.com/prod or https://landsatlook.usgs.gov/stac-server/ - each Collection is a combination of all Landsat platforms, with clear differences between LS5, LS7, LS8, etc.
For example, LS7 and LS8 are both part of the same collection, but LS7 is of course missing the coastal aerosol band, among other obvious differences.
As an initial workaround, I am successfully overriding the `process_item()` function in stac_api_to_odc.py from my code, where I can intercept `meta['product']['name']` after it is created by `item_to_meta_uri()`. This is functional enough for my prototyping purposes for now, but I can see that the ability to map collections (e.g. with platform being specified in the query params) to a specific ODC product will be very useful.
For Digital Earth Africa, we added a field `odc:product` to the USGS STAC records and split the collection into six products: LS5, LS7 and LS8, each with `_SR` or `_ST`.
The reasoning is that the ODC doesn't handle missing bands by default, and since some scenes have SR, some have ST, and some have both, we needed two products there. And as you note, some bands differ between platforms while keeping the same name, and others are added.
Providing a way to split a collection into different products would be nice. I don't know the best way to do it, aside from hard-coding it, or encouraging the adoption of an ODC Extension for STAC, which could specify it.
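As a purely illustrative sketch of the hard-coded route, a product name could be picked from a STAC item's `platform` property and its asset keys. The table, names and suffix rule below are all hypothetical, and scenes carrying both SR and ST assets would need to be indexed into both products, which this sketch does not attempt:

```python
# Illustrative only: split a combined Landsat collection into per-platform
# products, keyed on the item's "platform" property and SR/ST asset keys.
# Not an odc-tools API; scenes with both SR and ST need extra handling.
PLATFORM_PREFIX = {
    "LANDSAT_5": "ls5",
    "LANDSAT_7": "ls7",
    "LANDSAT_8": "ls8",
}

def product_for_item(properties, asset_keys):
    """Pick a single product name from item properties and available assets."""
    prefix = PLATFORM_PREFIX.get(properties.get("platform"))
    if prefix is None:
        raise ValueError(f"Unknown platform: {properties.get('platform')!r}")
    # Surface reflectance vs surface temperature split, as described above
    suffix = "_sr" if any(k.endswith("_sr") for k in asset_keys) else "_st"
    return prefix + suffix
```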
I think the easiest place to fix this issue is in datacube-core. We need to support the "band is known to be absent for this dataset" case. It could be as simple as:

```yaml
coastal:
  path: _
```

to indicate that this dataset is "aware" of the `coastal` band, so it would match the product, but the actual data is absent.
> For Digital Earth Africa, we added a field `odc:product` to the USGS STAC records, and split the collection into 6 products, LS5, LS7, LS8 + _SR or _ST. The reasoning is that the ODC doesn't handle missing bands by default, and since some scenes have SR and some have ST and some have both, we needed two products there. And as you note, some bands are both different with the same name, or added, between platforms.
> Providing a way to split a collection into different products would be nice. I don't know the best way to do it, aside from hard-coding it. Or by encouraging the adoption of an ODC Extension for STAC, which could specify it.
Yes, this is effectively what I am going to do, but am doing so by changing the product name and then passing the relevant query params (e.g. platform=LANDSAT_8) to the STAC API. That way I can keep using the scripts that you guys have put together and get running quickly. For me, the model of passing a specific query to an API and then telling it exactly which product that I would like to add it to fits well and gives a suitable level of control.
Having the ability to permit absent bands (in a controlled way) could be useful though as well.
This is what I am doing (please note that this is not production code and is highly likely to break, but it demonstrates the use case):

```python
from typing import Optional, Tuple

from datacube import Datacube
from datacube.index.hl import Doc2Dataset
from pystac import Item

import odc.apps.dc_tools.stac_api_to_dc as stac
import odc.apps.dc_tools.utils as odcutils  # provides index_update_dataset

product_override = 'some_product_name'

# << OTHER CODE >> (dc, update, config, stac_api_url, allow_unsafe are set up here)

def process_item_new(
    item: Item,
    dc: Datacube,
    doc2ds: Doc2Dataset,
    update_if_exists: bool,
    allow_unsafe: bool,
    rewrite: Optional[Tuple[str, str]] = None,
):
    # Build the dataset document, then swap in the overridden product name
    meta, uri = stac.item_to_meta_uri(item, rewrite)
    meta['product']['name'] = product_override
    odcutils.index_update_dataset(
        meta,
        uri,
        dc,
        doc2ds,
        update_if_exists=update_if_exists,
        allow_unsafe=allow_unsafe,
    )

# Monkey-patch the module-level function before running the indexing
if product_override:
    stac.process_item = process_item_new

success, failure = stac.stac_api_to_odc(
    dc=dc,
    update_if_exists=update,
    config=config,
    catalog_href=stac_api_url,
    allow_unsafe=allow_unsafe,
)
```
I just realised that the product name for Sentinel 2 is hard-coded in transform.py: the collection name "sentinel-s2-l2a-cogs" (in the Element84 API) is mapped to "s2_l2a". Just a note that this may not work for everyone, highlighting the need to be able to provide a custom product name before passing to stac_api_to_odc - https://github.com/opendatacube/odc-tools/blob/237692e5bdc06a511859c104d0718394e5505bf7/libs/stac/odc/stac/transform.py#L92
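One low-effort alternative to a hard-coded rename would be a user-extensible mapping with an explicit override, sketched below with hypothetical names (this is not the actual transform.py API):

```python
# Sketch: replace the hard-coded collection -> product rename with a
# user-supplied table plus an explicit override. The default entry mirrors
# the hard-coded case mentioned above; names here are illustrative.
DEFAULT_RENAMES = {"sentinel-s2-l2a-cogs": "s2_l2a"}

def product_name(collection_id, renames=None, override=None):
    """Resolve an ODC product name for a STAC collection id.

    An explicit override wins; otherwise the rename table applies;
    otherwise fall back to the collection id itself.
    """
    if override:
        return override
    table = {**DEFAULT_RENAMES, **(renames or {})}
    return table.get(collection_id, collection_id)
```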
Hi @JonDHo yes, this was hardcoded a long time ago...
I think you're right, that we should enable passing in a product name. I'm not sure how to achieve it, currently, but it's a good idea.
This is my current solution, which lets me continue to use the majority of the stac_api_to_dc functions without having to fork and maintain them. I simply override the `process_item` function of `odc.apps.dc_tools.stac_api_to_dc`. It would of course be useful if a custom product name could be passed into the process and handled internally in the same way.
```python
from typing import Optional, Tuple

from datacube import Datacube
from datacube.index.hl import Doc2Dataset
from pystac import Item

import odc.apps.dc_tools.stac_api_to_dc as stac
import odc.apps.dc_tools.utils as odcutils  # provides index_update_dataset

product_override = "my_custom_name"  # In reality this is passed in as an Argo variable

### LOTS OF OTHER STUFF ###

def process_item_new(
    item: Item,
    dc: Datacube,
    doc2ds: Doc2Dataset,
    update_if_exists: bool,
    allow_unsafe: bool,
    rewrite: Optional[Tuple[str, str]] = None,
):
    meta, uri = stac.item_to_meta_uri(item, rewrite)
    # Replace the product name after the meta object has been created
    meta['product']['name'] = product_override
    odcutils.index_update_dataset(
        meta,
        uri,
        dc,
        doc2ds,
        update_if_exists=update_if_exists,
        allow_unsafe=allow_unsafe,
    )

# If a value has been provided to the product_override variable, swap out the function
if product_override:
    stac.process_item = process_item_new
```
Hey @JonDHo it shouldn't be too hard to make that an option on the CLI. Feel free to send a PR to make the change.
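For illustration, such an option could look like the sketch below. The actual dc-tools CLI is built with click, but this stdlib argparse version, with a hypothetical flag name, just shows the shape of the idea:

```python
# Sketch of exposing a product-name override as a CLI option.
# The flag name "--rename-product" is hypothetical, not an existing option.
import argparse

def build_parser():
    parser = argparse.ArgumentParser(prog="stac-to-dc-sketch")
    parser.add_argument(
        "--rename-product",
        default=None,
        help="Index all matched items under this ODC product name "
             "instead of the name derived from the collection.",
    )
    return parser
```

With something like this, the override would flow into `process_item` internally, replacing the monkey-patching shown above.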
Greetings,
I'm trying to add some product definitions into the ODC using the STAC Collection definition, similarly to what https://github.com/opendatacube/odc-tools/blob/4c017f6bb846e950d6889c476ae840bd223f541a/libs/index/odc/index/stac.py#L199 is doing for STAC Items and ODC Datasets.
Is there any library/tool for doing so?
Thanks, Juan