Juanezm opened this issue 3 years ago
I'm starting to work on this now. The main issue with the STAC Collection -> Datacube Product mapping is that a STAC Collection lacks the per-band pixel-type information that a Datacube Product requires.
One way to work around this is to inspect STAC Assets, assuming that assets with the same name are consistent across STAC Items. Another option is to ask the user for that information; for example, the user might provide a dictionary from band name to pixel `dtype`, or just a single `dtype` if all the bands share the same one.
The other piece of information needed is a "fill value" for each band: the value to use for pixels not covered by any dataset. Same story here: one can either look up the `nodata` attribute in the data file (if it is set), ask the user to provide it, or default to some reasonable value based on the pixel `dtype`.
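To make the user-supplied option concrete, here is a rough sketch of turning a band-name-to-dtype dictionary (or a single shared dtype) plus optional nodata values into ODC-style measurement definitions. Everything here (the function name, the `DEFAULT_NODATA` table) is hypothetical illustration, not an existing odc-tools API:

```python
# Hypothetical sketch: build ODC-style measurement dicts from band names,
# with user-supplied dtypes/nodata and dtype-based fallbacks, as the
# comment above suggests. Not a real odc-tools function.

# Reasonable nodata defaults keyed by on-disk dtype (illustrative choices)
DEFAULT_NODATA = {"uint8": 0, "uint16": 0, "int16": -999, "float32": float("nan")}

def bands_to_measurements(band_names, dtypes=None, nodata=None, default_dtype="uint16"):
    """Build a list of measurement definitions.

    `dtypes` may be a single dtype string shared by all bands, or a dict
    mapping band name -> dtype; `nodata` works the same way.
    """
    measurements = []
    for name in band_names:
        if isinstance(dtypes, dict):
            dtype = dtypes.get(name, default_dtype)
        else:
            dtype = dtypes or default_dtype
        if isinstance(nodata, dict):
            fill = nodata.get(name, DEFAULT_NODATA.get(dtype, 0))
        else:
            fill = nodata if nodata is not None else DEFAULT_NODATA.get(dtype, 0)
        measurements.append({"name": name, "dtype": dtype, "nodata": fill, "units": "1"})
    return measurements
```

A caller could pass `dtypes={"qa": "uint8"}` to override a single band while the rest fall back to the default.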
Yes, it is tricky. I finally ended up adding a custom stac_extension object into the collection, since I was creating the collections too, but this workaround won't work with generic collections.
Ideally pixel data type would be part of this extension:
https://github.com/stac-extensions/eo#band-object
@gadomski was suggesting we propose additions to the `eo:bands` extension for that purpose, cc: @alexgleith
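For illustration, a band entry carrying the two missing pieces of information might look like the dict below. The `dtype` and `nodata` keys are the proposed/hypothetical additions, not fields in the current `eo` extension:

```python
# Sketch of an eo:bands-style entry if the proposed fields existed.
# "dtype" and "nodata" are hypothetical keys suggested by this thread,
# not part of the published eo extension.
proposed_band = {
    "name": "coastal",
    "common_name": "coastal",        # existing eo:bands field
    "center_wavelength": 0.44,       # existing eo:bands field
    "dtype": "uint16",               # proposed: pixel type as stored on "disk"
    "nodata": 0,                     # proposed: fill value outside covered area
}
```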
It would be nice to have the following information for every data band in a collection (without needing to fetch sample pixel imagery):

- `dtype` of the pixel data as stored on "disk"
- fill value to use when returning data outside of the covered area; typically this would be the same as the `nodata` value used in the images

Apologies if this is a slightly different aspect of this issue, but I am running into issues that relate to the STAC Collection -> Datacube Product mapping space, so I thought I would add a comment.
In the USGS STAC API - https://ibhoyw8md9.execute-api.us-west-2.amazonaws.com/prod or https://landsatlook.usgs.gov/stac-server/ - each Collection is a combination of all Landsat platforms, with clear differences between LS5, LS7, LS8, etc.
For example, LS7 and LS8 are both part of the same collection, but LS7 is of course missing the coastal aerosol band, among other obvious differences.
As an initial workaround, I am successfully overriding the `process_item()` function in stac_api_to_odc.py from my code, where I can intercept `meta['product']['name']` after it is created by `item_to_meta_uri()`. This is functional enough for my prototyping purposes for now, but I can see that the ability to map collections (e.g. with platform being specified in the query params) to a specific ODC product will be very useful.
For Digital Earth Africa, we added a field `odc:product` to the USGS STAC records and split the collection into six products: LS5, LS7 and LS8, each with `_SR` or `_ST`.
The reasoning is that the ODC doesn't handle missing bands by default, and since some scenes have SR, some have ST, and some have both, we needed two products there. And as you note, some bands differ between platforms while keeping the same name, and others are added.
Providing a way to split a collection into different products would be nice. I don't know the best way to do it, aside from hard-coding it, or encouraging the adoption of an ODC Extension for STAC, which could specify it.
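As a purely illustrative sketch of the hard-coded route, a product name could be picked from a STAC item's `platform` property and its asset keys. The table, names and suffix rule below are all hypothetical, and scenes carrying both SR and ST assets would need to be indexed into both products, which this sketch does not attempt:

```python
# Illustrative only: split a combined Landsat collection into per-platform
# products, keyed on the item's "platform" property and SR/ST asset keys.
# Not an odc-tools API; scenes with both SR and ST need extra handling.
PLATFORM_PREFIX = {
    "LANDSAT_5": "ls5",
    "LANDSAT_7": "ls7",
    "LANDSAT_8": "ls8",
}

def product_for_item(properties, asset_keys):
    """Pick a single product name from item properties and available assets."""
    prefix = PLATFORM_PREFIX.get(properties.get("platform"))
    if prefix is None:
        raise ValueError(f"Unknown platform: {properties.get('platform')!r}")
    # Surface reflectance vs surface temperature split, as described above
    suffix = "_sr" if any(k.endswith("_sr") for k in asset_keys) else "_st"
    return prefix + suffix
```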
I think the easiest place to fix this issue is in datacube-core. We need to support the "band is known to be absent for this dataset" case. It could be as simple as:

```yaml
coastal:
  path: _
```

to indicate that this dataset is "aware" of the `coastal` band, so it would match the product, but the actual data is absent.
> For Digital Earth Africa, we added a field `odc:product` to the USGS STAC records, and split the collection into 6 products, LS5, LS7, LS8 + _SR or _ST. The reasoning is that the ODC doesn't handle missing bands by default, and since some scenes have SR and some have ST and some have both, we needed two products there. And as you note, some bands are both different with the same name, or added, between platforms.
> Providing a way to split a collection into different products would be nice. I don't know the best way to do it, aside from hard-coding it. Or by encouraging the adoption of an ODC Extension for STAC, which could specify it.
Yes, this is effectively what I am going to do, but am doing so by changing the product name and then passing the relevant query params (e.g. platform=LANDSAT_8) to the STAC API. That way I can keep using the scripts that you guys have put together and get running quickly. For me, the model of passing a specific query to an API and then telling it exactly which product that I would like to add it to fits well and gives a suitable level of control.
Having the ability to permit absent bands (in a controlled way) could be useful though as well.
This is what I am doing (please note that this is not production code and is highly likely to break, but it demonstrates the use case):

```python
from typing import Optional, Tuple

from datacube import Datacube
from datacube.index.hl import Doc2Dataset
from pystac import Item

import odc.apps.dc_tools.stac_api_to_dc as stac
import odc.apps.dc_tools.utils as odcutils  # provides index_update_dataset

product_override = 'some_product_name'

# << OTHER CODE >> (dc, update, config, stac_api_url, allow_unsafe are set up here)

def process_item_new(
    item: Item,
    dc: Datacube,
    doc2ds: Doc2Dataset,
    update_if_exists: bool,
    allow_unsafe: bool,
    rewrite: Optional[Tuple[str, str]] = None,
):
    # Build the dataset document, then swap in the overridden product name
    meta, uri = stac.item_to_meta_uri(item, rewrite)
    meta['product']['name'] = product_override
    odcutils.index_update_dataset(
        meta,
        uri,
        dc,
        doc2ds,
        update_if_exists=update_if_exists,
        allow_unsafe=allow_unsafe,
    )

# Monkey-patch the module-level function before running the indexing
if product_override:
    stac.process_item = process_item_new

success, failure = stac.stac_api_to_odc(
    dc=dc,
    update_if_exists=update,
    config=config,
    catalog_href=stac_api_url,
    allow_unsafe=allow_unsafe,
)
```
I just realised that the product name for Sentinel 2 is hard-coded in transform.py: the collection name "sentinel-s2-l2a-cogs" (in the Element84 API) is mapped to "s2_l2a". Just a note that this may not work for everyone, highlighting the need to be able to provide a custom product name before passing to stac_api_to_odc - https://github.com/opendatacube/odc-tools/blob/237692e5bdc06a511859c104d0718394e5505bf7/libs/stac/odc/stac/transform.py#L92
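One low-effort alternative to a hard-coded rename would be a user-extensible mapping with an explicit override, sketched below with hypothetical names (this is not the actual transform.py API):

```python
# Sketch: replace the hard-coded collection -> product rename with a
# user-supplied table plus an explicit override. The default entry mirrors
# the hard-coded case mentioned above; names here are illustrative.
DEFAULT_RENAMES = {"sentinel-s2-l2a-cogs": "s2_l2a"}

def product_name(collection_id, renames=None, override=None):
    """Resolve an ODC product name for a STAC collection id.

    An explicit override wins; otherwise the rename table applies;
    otherwise fall back to the collection id itself.
    """
    if override:
        return override
    table = {**DEFAULT_RENAMES, **(renames or {})}
    return table.get(collection_id, collection_id)
```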
Hi @JonDHo yes, this was hardcoded a long time ago...
I think you're right, that we should enable passing in a product name. I'm not sure how to achieve it, currently, but it's a good idea.
This is my current solution, which lets me continue to use the majority of the stac_api_to_dc functions without having to fork and maintain them. I simply override the `process_item` function of `odc.apps.dc_tools.stac_api_to_dc`. It would of course be useful if a custom product name could be passed into the process and handled internally in the same way.
```python
from typing import Optional, Tuple

from datacube import Datacube
from datacube.index.hl import Doc2Dataset
from pystac import Item

import odc.apps.dc_tools.stac_api_to_dc as stac
import odc.apps.dc_tools.utils as odcutils  # provides index_update_dataset

product_override = "my_custom_name"  # In reality this is passed in as an Argo variable

### LOTS OF OTHER STUFF ###

def process_item_new(
    item: Item,
    dc: Datacube,
    doc2ds: Doc2Dataset,
    update_if_exists: bool,
    allow_unsafe: bool,
    rewrite: Optional[Tuple[str, str]] = None,
):
    meta, uri = stac.item_to_meta_uri(item, rewrite)
    # Replace the product name after the meta object has been created
    meta['product']['name'] = product_override
    odcutils.index_update_dataset(
        meta,
        uri,
        dc,
        doc2ds,
        update_if_exists=update_if_exists,
        allow_unsafe=allow_unsafe,
    )

# If a value has been provided to the product_override variable, swap out the function
if product_override:
    stac.process_item = process_item_new
```
Hey @JonDHo it shouldn't be too hard to make that an option on the CLI. Feel free to send a PR to make the change.
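For illustration, such an option could look like the sketch below. The actual dc-tools CLI is built with click, but this stdlib argparse version, with a hypothetical flag name, just shows the shape of the idea:

```python
# Sketch of exposing a product-name override as a CLI option.
# The flag name "--rename-product" is hypothetical, not an existing option.
import argparse

def build_parser():
    parser = argparse.ArgumentParser(prog="stac-to-dc-sketch")
    parser.add_argument(
        "--rename-product",
        default=None,
        help="Index all matched items under this ODC product name "
             "instead of the name derived from the collection.",
    )
    return parser
```

With something like this, the override would flow into `process_item` internally, replacing the monkey-patching shown above.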
Greetings,
I'm trying to add some product definitions into the ODC using the STAC Collection definition, similarly to what https://github.com/opendatacube/odc-tools/blob/4c017f6bb846e950d6889c476ae840bd223f541a/libs/index/odc/index/stac.py#L199 is doing for STAC Items and ODC Datasets.
Is there any library/tool for doing so?
Thanks, Juan