stac-utils / xpystac

For extending xarray.open_dataset to accept pystac objects
MIT License

Switch from stackstac to odc-stac? #12

Closed jsignell closed 1 year ago

jsignell commented 1 year ago

It seems like there is a shift towards using odc-stac rather than stackstac. I'm wondering if that needs to be configurable somehow or if this library should just pick one.

gadomski commented 1 year ago

IMO the community is still unsure of which (if either) is The One™. If you do integrate/switch, I would love to hear your thoughts on the comparison between the two w.r.t. ease of integration, ease of use, etc.

jsignell commented 1 year ago

Yeah, I read through https://github.com/opendatacube/odc-stac/issues/54 and came out the other end thinking that odc-stac probably has more of a future. I'll see if I can come up with any ideas around how to make odc-stac more ergonomic.

weiji14 commented 1 year ago

Maybe have both? Currently, stackstac produces an xarray.DataArray whereas odc-stac produces an xarray.Dataset. An xr.DataArray is suited for 2D data + bands, whereas an xr.Dataset is suited for multi-dimensional datasets (e.g. climate model outputs), so slightly different use cases.
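To make the DataArray/Dataset distinction concrete, here is a minimal sketch on synthetic data. It does not call stackstac or odc-stac; the array shape and the band names are assumptions chosen only for illustration.

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for a stackstac-style result: one array with a "band" dimension.
da = xr.DataArray(
    np.zeros((2, 3, 4, 5)),
    dims=("time", "band", "y", "x"),
    coords={"band": ["red", "green", "blue"]},
    name="stack",
)

# Splitting along "band" gives an odc-stac-style layout: one variable per band.
ds = da.to_dataset(dim="band")

# The reverse conversion stacks the variables back along a new leading dimension.
da_again = ds.to_array(dim="band")
```

Both layouts hold the same pixels; the difference is whether bands are a dimension of one array or separate variables in a dataset.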

With xpystac=0.0.1, you have xr.open_dataset(item_collection, ...) using stackstac in the backend. But realistically, you could swap stackstac for odc-stac to remove the .to_dataset call here:

https://github.com/jsignell/xpystac/blob/65b08c26603b6f64d5fe388973b26c8a29bf16a9/xpystac/core.py#L30

In addition, you could register xr.open_dataarray() to use stackstac instead. Of course, this might need some documentation to be clear that STAC ItemCollections passed to xr.open_dataarray() are stacked using stackstac.stack while those passed to xr.open_dataset() are stacked with odc.stac.load.

jsignell commented 1 year ago

> In addition, you could register xr.open_dataarray() to use stackstac instead. Of course, this might need some documentation to be clear that STAC ItemCollections passed to xr.open_dataarray() are stacked using stackstac.stack while those passed to xr.open_dataset() are stacked with odc.stac.load.

Oh that is an interesting idea. I wonder if that would feel surprising to the user.

maawoo commented 1 year ago

I just stumbled upon this discussion and wanted to add to @weiji14's comment that another major difference is how STAC metadata is parsed into xarray, which in my opinion is important to consider. Quoting from https://github.com/opendatacube/odc-stac/issues/54#issuecomment-1103313511 :

Access to the original STAC metadata

  • odc-stac doesn't really expose any of that, and there is a fundamental design choice that makes it impossible to do in a general case, but we can certainly add it for special case data loading in the future.
  • stackstac exposes all the metadata fields in the returned xarray, combined with delayed computation enabled by Dask this can be very handy as you can leverage all the xarray conveniences to filter out unwanted data.

Here is an example of what it can look like in practice with a dataset created from https://github.com/SAR-ARD/S1_NRB :

image

Users can then easily filter the array based on the parsed STAC Item properties:

ds_filtered = ds.where((ds['sat:relative_orbit'] == 44), drop=True)
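For readers without such a dataset at hand, the filter above can be tried on a small synthetic one. This is only a sketch: the variable name "vv", the dates, and the orbit numbers are made up, and the 'sat:relative_orbit' coordinate is attached along time by hand here instead of being parsed from STAC Items.

```python
import numpy as np
import xarray as xr

# Synthetic dataset with a per-timestep metadata coordinate (hypothetical values).
ds = xr.Dataset(
    {"vv": (("time", "y", "x"), np.arange(12, dtype=float).reshape(3, 2, 2))},
    coords={
        "time": np.array(
            ["2021-01-01", "2021-01-07", "2021-01-13"], dtype="datetime64[ns]"
        ),
        "sat:relative_orbit": ("time", [44, 117, 44]),
    },
)

# Keep only the acquisitions from relative orbit 44; drop=True removes the others.
ds_filtered = ds.where(ds["sat:relative_orbit"] == 44, drop=True)
```

Because the condition varies only along time, drop=True removes whole time steps and leaves the remaining pixels untouched.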

I am working a lot with local, static STAC Catalogs without using an API or database to do the querying beforehand. @weiji14's suggestion is interesting and could be a bridge between both libraries. I don't think there is a shift to one or the other and I also don't think there will be The One™ anytime soon. I think it's best to not press forward too fast with #26.

jsignell commented 1 year ago

Thank you for commenting! I had reached a similar decision last week and updated #26 to make the stacking library configurable as suggested by @weiji14. I just renamed the PR to indicate that change in functionality.
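As a rough illustration of what "configurable" could mean, the sketch below maps a library name to its stacking function and imports it on demand. It is not the code from #26: the STACKERS table, resolve_stacker, and the importer parameter are hypothetical names. Only stackstac.stack and odc.stac.load come from the discussion above.

```python
import importlib
from typing import Any, Callable, Dict

# Hypothetical registry: module name -> name of its stacking function.
STACKERS: Dict[str, str] = {
    "odc.stac": "load",
    "stackstac": "stack",
}


def resolve_stacker(
    name: str,
    importer: Callable[[str], Any] = importlib.import_module,
) -> Callable[..., Any]:
    """Return the stacking function for the requested library.

    The module is imported only when requested, so neither library
    has to be installed unless it is actually selected.
    """
    if name not in STACKERS:
        raise ValueError(f"unknown stacking library {name!r}; expected one of {sorted(STACKERS)}")
    module = importer(name)
    return getattr(module, STACKERS[name])
```

A caller would then do something like resolve_stacker("odc.stac")(items, ...) and pass library-specific keyword arguments through unchanged.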