Open TomAugspurger opened 2 years ago
To me, Intake seems like the best way forward here.
In case you do want to go forward with this proposal, I'd like to propose another variation. Instead of having these conversion functions in Pystac itself, provide an `entry_point` for this. Then these conversion functions can be distributed in separate packages, users choose what to get by installing the plugins they want, the code stays simple, and the dependency handling becomes easier and more transparent. Of course, in some sense that moves you closer to Intake again.
Thanks for this @TomAugspurger, I am 100% on board with this and would love to see such convenience functions. I end up copying the same function across notebooks that converts STAC Items to a dataframe.
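Roughly, the helper in question tends to look like this (a hedged sketch: the `items_to_records` name and the plain-dict items are illustrative, and a real version would more likely use `geopandas.GeoDataFrame.from_features`):

```python
# Sketch: flatten STAC Item dicts into records for a DataFrame constructor.
# `items_to_records` is a hypothetical helper, not a pystac API.
def items_to_records(items):
    return [
        {"id": item["id"], "geometry": item["geometry"], **item["properties"]}
        for item in items
    ]

items = [
    {
        "id": "item-1",
        "geometry": {"type": "Point", "coordinates": [0.0, 0.0]},
        "properties": {"datetime": "2022-06-12T00:00:00Z", "eo:cloud_cover": 4.2},
    }
]
records = items_to_records(items)
# pandas.DataFrame(records) or geopandas.GeoDataFrame(records) from here
```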
I'd generally prefer targeting libraries to container types since container type names may be ambiguous.
I prefer this over using intake, only because I don't use intake-stac myself, and it would just be another library to maintain.
Thanks @zklaus.
Instead of having these conversion functions in Pystac itself, provide an entry_point for this
Let me see if I understand this correctly. pystac itself would define a few methods:

```python
import importlib.metadata

class ItemCollection:
    def to_geopandas_geodataframe(self, **kwargs) -> "geopandas.GeoDataFrame":
        entrypoint = importlib.metadata.entry_points().get(...)
        return entrypoint.load()(self, **kwargs)

    def to_xarray_datarray(self, **kwargs) -> "xarray.DataArray":
        entrypoint = importlib.metadata.entry_points().get(...)
        return entrypoint.load()(self, **kwargs)
```
I'm not 100% sure what goes in the `.get()`, but something that uniquely identifies the `ItemCollection -> geopandas.GeoDataFrame` route. That's pretty similar to what I had in mind, which was roughly
```python
def to_geopandas_geodataframe(self, **kwargs) -> "geopandas.GeoDataFrame":
    try:
        import some_library
    except ImportError:
        raise ImportError(...)  # raise with a nice message
    return some_library.item_collection_to_geopandas(self, **kwargs)
```
A benefit / drawback of entrypoints is that you could have multiple implementations for the same "route". Some third party can provide a better implementation without having to touch pystac or `some_library`. And pystac doesn't have to know anything about a particular implementation. pystac would just need to provide the `ItemCollection.to_geopandas_geodataframe` method (but it would defer the implementation).

We might need to figure out what to do if there are multiple implementations registered for a specific entrypoint. Perhaps provide a keyword like `engine` in each method to choose between them?
It might be worth saying a bit more about "why not intake" given that we are encroaching on its territory a bit. A laundry list:

- Intake is built around a `__getitem__` API, which implies a single, linear path between parent containers and children. But some STAC containers have two kinds of children: a Collection might have children items and children assets. That maps more naturally onto distinct `to_*` methods.
- With intake you subclass a `BaseCatalog` object for your implementation. That adds a few methods and properties to the public API of your object. And some intake methods like `Catalog.search` have different APIs than what would be expected from a STAC community.

Thanks for picking this up, @TomAugspurger. What you describe is close, but not quite what I had in mind. Indeed, I would say if Pystac has to know all the exporters, entry points would be a futile exercise. Instead, I think there should be only one entry point, say `pystac.exporters`, and any library can register a function to this. Pystac would attach those functions to `ItemCollection` by iterating over that entry point. This way, Pystac knows nothing about these functions and the user is free to install any exporter they want along with its dependencies.

Conceivably, other classes could use the same mechanism with a different entry point, say to export to an `iris.CubeList`.
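A sketch of that attachment loop; the `pystac.exporters` group name comes from the comment above, while `attach_exporters`, the stand-in class, and the fake entry point are all illustrative:

```python
from importlib.metadata import entry_points

class ItemCollection:
    """Stand-in for pystac.ItemCollection."""

def attach_exporters(cls, eps=None, group="pystac.exporters"):
    """Attach every function registered under `group` as a to_<name> method."""
    if eps is None:
        eps = entry_points(group=group)  # Python 3.10+ keyword filtering
    for ep in eps:
        setattr(cls, f"to_{ep.name}", ep.load())

# Demonstrate with a fake entry point standing in for a real registration.
class FakeEntryPoint:
    name = "geodataframe"

    @staticmethod
    def load():
        return lambda self, **kwargs: f"converted {type(self).__name__}"

attach_exporters(ItemCollection, eps=[FakeEntryPoint()])
result = ItemCollection().to_geodataframe()  # -> "converted ItemCollection"
```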
there should be only one entry point, say pystac.exporters, and any library can register a function to this.
Mmm, I'm not a huge fan of dynamically adding methods to classes as a side-effect of import. It makes debugging difficult, static typing impossible (I think), and will be incompatible with features like https://peps.python.org/pep-0690/ (I think).
What's the reason not to add various `to_*` methods? We already have `to_dict`. I'm not too worried about the cognitive burden of adding these methods since they'll all use the same `to_<foo>` name and will have similar, simple signatures.
Do we know how mypy, pylance etc. feel about the return types in downstream contexts? E.g., if the method returns "geopandas.GeoDataFrame", and the types of pystac are published and consumed by downstream code that doesn't have geopandas installed - are mypy and pylance (vscode) OK with that, or would that cause weird typing errors?
Both mypy and pylance seem to think the type is `Any` (using xarray instead of geopandas, since I don't think geopandas is currently statically typed).
Setup:
Test file:
```python
# file: foo.py
import pystac

ic = pystac.ItemCollection([])
reveal_type(ic.to_xarray_datarray())
```

Output without xarray:

```
$ mypy foo.py
foo.py:5: note: Revealed type is "Any"
Success: no issues found in 1 source file
```

Output with xarray:

```
$ mypy foo.py
foo.py:5: note: Revealed type is "xarray.core.dataarray.DataArray"
Success: no issues found in 1 source file
```
pylance has the same behavior.
Historically, PySTAC has been very much in the design camp of "do the most simple, scoped thing possible, and carry no dependencies". I think this has served the library well. In that stance, this suggestion could be identified as scope creep - as you mentioned, getting the results these "to_" methods are aimed at is already possible outside of PySTAC, and adding them has the potential to inflate the codebase/scope of PySTAC.
However, I think your other points are really good. Now that it's been around a while, PySTAC "owns the namespace" as you put it, and there's an opportunity to expand on the user convenience this library provides. Your experience with seeing this play out in the pandas ecosystem holds a lot of weight here.
As you've laid out, we can keep the dependencies out of the core PySTAC package by using deferred imports. Adding "extras" packages would be a fine way to enable this extra behavior, like we do for `pystac[validation]`. That way the "no dependencies" stance can still hold by default.
As far as library scope - I think what you suggested works, if the heavy lifting/implementations live in other libraries. You mentioned the possibility of an `Asset -> Zarr` implementation living in PySTAC - IMO we should limit the scope of implementation code in this codebase to focus on STAC concepts. My preference would be for all conversion implementation code to live outside PySTAC and be brought in as part of an extras dependency. This complicates things a bit in that the user will not only need e.g. Zarr installed, but some `zarr-pystac` library to use this if they have not installed `pystac[zarr]`. However, I think that would keep this codebase concentrated on the STAC of it all, and not have to concern itself with maintaining and reviewing code unrelated to the core STAC types. Perhaps we could have a "pystac_contrib" namespace package developed in its own repository to hold conversion implementations that do not already have a home.
All in all, I'm +1 on this idea.
It's a can of worms: Why `to_xarray` and not `to_numpy()`, `to_PIL`, ...? Why `to_pandas()` and not `to_spark()`, `to_modin`, ...?
Instead of (or complementary to) `pystac` providing the API and/or entrypoint, couldn't this be something provided by the container libraries? I'm not sure how this would work for Pandas and GeoPandas, but for Xarray there could be `some_library` implementing an IO backend such that:
```python
asset = (
    catalog
    .get_collection("sentinel-2-l2a")
    .get_item("S2B_MSIL2A_20220612T182919_R027_T24XWR_20220613T123251")
    .assets["B03"]
)
ds = xr.open_dataset(asset, engine="stac")

item_collection = (
    catalog
    .search(collections="sentinel-2-l2a", bbox=bbox)
    .get_all_items()
)
ds = xr.open_dataset(item_collection, engine="stac")
```
Having `to_*` methods in `pystac` would certainly be more discoverable. It depends on the user's point of view, though, which is why both approaches may be complementary even though they are two ways of doing the same thing.
I was poking around on intake-stac today and @scottyhq pointed me to this issue. I really like @benbovy's idea of specifying an IO backend that would be accessed from xarray itself. That seems like a natural fit to me and much more analogous to the `pd.read_csv` example that is the motivation for this issue.
Just to expand on that. I think one of the issues with a generic `xr.open_dataset` is that it can hide from the user the actual kwargs they have access to for a particular engine. You end up having to figure out what the engine maps to and then going and reading the docs for that engine's reader.

Maybe the convention of storing the open_kwargs in the STAC asset gets around that issue. Or maybe there needs to be a way of passing engine specific docstrings up into the xarray docs. This might already exist, I couldn't quite tell.
Thanks @jsignell!
Maybe the convention of storing the open_kwargs in the STAC asset gets around that issue.
That's my hope. That's what we're doing at https://github.com/TomAugspurger/xstac/blob/f17ed3bb79f34b747d1532fc425794442d938f9e/examples/terraclimate/terraclimate.json#L51.
But that's still compatible with the idea of an `engine='stac'`. It would just need to know what to look for, and then that engine would make the call to `xr.open_dataset(engine="zarr")` or whatever was in `xarray:open_kwargs`.
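For illustration, a dependency-free sketch of that lookup, assuming only the `xarray:open_kwargs` convention referenced in this thread (the asset-dict shape and the zarr default are assumptions):

```python
def resolve_open_call(asset):
    """Work out the xr.open_dataset call an engine="stac" backend could make.

    Purely illustrative: takes a plain asset dict rather than a pystac.Asset.
    """
    # Copy so we don't mutate the kwargs stored on the asset.
    kwargs = dict(asset.get("xarray:open_kwargs", {}))
    engine = kwargs.pop("engine", "zarr")  # assumed default, not part of any spec
    return asset["href"], engine, kwargs

asset = {
    "href": "abfs://example/terraclimate.zarr",
    "xarray:open_kwargs": {"engine": "zarr", "consolidated": True},
}
href, engine, kwargs = resolve_open_call(asset)
# A real backend would now call: xr.open_dataset(href, engine=engine, **kwargs)
```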
I think the `xr.open_dataset(stuff, engine="stac")` vs. `asset.to_xarray_dataset()` question is mostly around ergonomics for the user. I think the two implementations would look pretty similar internally.
But that's still compatible with the idea of an `engine='stac'`. It would just need to know what to look for and then that engine would make the call to `xr.open_dataset(engine="zarr")` or whatever was in `xarray:open_kwargs`.
Yes exactly!
I am going to mock up this approach now.
Ok I made https://github.com/jsignell/xpystac closely based off of https://github.com/TomAugspurger/staccontainers.

I'm trying to think what the next steps are here. I am happy to move `xpystac` into `stac-utils` for visibility.

Then maybe pystac would want to have a `to_xarray` stub that could try to import and run `xpystac.to_xarray` and fail gracefully if xpystac is not available?
I am happy to move xpystac into stac-utils for visibility.
Yup, we don't have a formal process (see: https://stac-utils.github.io/developer-guide/policies/repositories.html) so if it feels right, go for it!
Then maybe pystac would want to have a to_xarray stub that could try to import and run xpystac.to_xarray and fail gracefully if xpystac is not available?
Makes sense to me. Are there other stubs we should add at the same time? It'd be a splashy feature to release so it'd be cool to have a couple different adapters, if possible.
Ok! I just transferred the repository over to the stac-utils org.
Makes sense to me. Are there other stubs we should add at the same time? It'd be a splashy feature to release so it'd be cool to have a couple different adapters, if possible.
Tom has a whole list of potential stubs in the original post and some implementations in staccontainers. There are also some good arguments for not implementing stubs.
I think if we do want to add stubs it would make more sense to do that as part of 2.0. Maybe as a first step, I should update the examples to use xpystac via open_dataset.
I think if we do want to add stubs it would make more sense to do that as part of 2.0. Maybe as a first step, I should update the examples to use xpystac via open_dataset.
worksforme
For this sprint I am planning to add a little tutorial notebook that explains how to read an asset into an xarray object using xpystac.
Thanks for continuing to work on this @jsignell!
Just to expand on that. I think one of the issues with a generic `xr.open_dataset` is that it can hide from the user the actual kwargs they have access to for a particular engine. You end up having to figure out what the engine maps to and then going and reading the docs for that engine's reader.
This sounds like most of my life with Python! ;-)
Maybe the convention of storing the open_kwargs in the STAC asset gets around that issue. Or maybe there needs to be a way of passing engine specific docstrings up into the xarray docs. This might already exist, I couldn't quite tell
Now that I know about this `open_kwargs` in STAC assets, I quite like the idea. The asset 'knows' how it should be read... My journey here has been something like "get my head around `stac_load`, then find it doesn't work for some assets, then realise you need to use `xarray.open_dataset`, but also find you need to grab these `open_kwargs` and pass them in".
And, after all, I think it's fair to say that the whole point of STAC is to make large geospatial/temporal raster assets findable and then accessible, so some convenience functions (e.g. living in a namespace package?) that make for a more seamless bridge between the catalog/finding and the raster data exploitation would be hugely welcome.
Or maybe there needs to be a way of passing engine specific docstrings up into the xarray docs.
Does not exist but it sounds interesting. IMO a separate `stac_to_xarray` method is a lot cleaner, more readable, and less magic.
I have been mulling over what would be the most discoverable and ergonomic way to give access to read methods, and I am strongly leaning towards creating an xarray assets extension class. Then on this class there could be an `open` method. So the flow would be:

```python
import pystac
from pystac.extensions.xarray_assets import XarrayAssetsExtension

asset = pystac.Asset.from_file(...)
xr_ext = XarrayAssetsExtension.ext(asset, add_if_missing=True)
xr_ext.open()
```
Or if something like #1051 gains traction then:

```python
import pystac

asset = pystac.Asset.from_file(...)
asset.xarray.open()
```
Hi all,
I'm tentatively making a pitch to add convenience methods for converting pystac objects (Asset, Item, ItemCollection, ...) and their linked assets to commonly used data containers (`xarray.Dataset`, `geopandas.GeoDataFrame`, `pandas.DataFrame`, etc.).

I'm opening this in `pystac` since this is primarily for convenience, so that users can method-chain their way from STAC Catalog to data container, and `pystac` owns the namespaces I care about. You can already do everything I'm showing today without any changes to `pystac`, but it feels less nice. I really think that `pd.read_csv` is part of why Python is where it is today for data analytics; I want using STAC from Python to be as easy to use as `pd.read_csv`.

Secondarily, it can elevate the best-practice way to go from STAC to data containers, by providing a top-level method similar to `to_dict()`.

As a couple hypothetical examples, to give an idea:
Or building a datacube from a pystac-client search (which subclasses pystac).
Implementation details
This would be optional. `pystac` would not add required dependencies on `pandas`, `xarray`, etc. It would merely provide the methods `Item.to_xarray`, `Asset.to_xarray`, ... Internally those methods would try to import the implementation and raise an `ImportError` if the optional dependencies aren't met at runtime.

Speaking of the implementations, there are a few things to figure out. Some relatively complicated conversions (like ItemCollection -> xarray) are implemented multiple times (https://stackstac.readthedocs.io/, https://odc-stac.readthedocs.io/en/latest/examples.html). `pystac` certainly wouldn't want to re-implement that conversion and would dispatch to one or the other of those libraries (perhaps letting users decide with an `engine` argument).

Other conversions, like Asset -> Zarr, are so straightforward they haven't really been codified in a library yet (though I have a prototype at https://github.com/TomAugspurger/staccontainers/blob/086c2a7d46520ca5213d70716726b28ba6f36ba5/staccontainers/_xarray.py#L61-L63). Maybe those could live in pystac; I'd be happy to maintain them.
https://github.com/TomAugspurger/staccontainers might serve as an idea of what some of these implementations would look like. It isn't too complicated.
Problems
A non-exhaustive list of reasons not to do this:

- It's a can of worms: Why `to_xarray` and not `to_numpy()`, `to_PIL`, ...? Why `to_pandas()` and not `to_spark()`, `to_modin`, ...?

Alternatives
Alternatively, we could recommend using intake, along with intake-stac, which would wrap `pystac-client` and `pystac`. That would be the primary "user-facing" catalog people actually interface with. It already has a rich ecosystem of drivers that convert from files to data containers. I've hit some issues with trying to use intake-stac, but those could presumably be fixed with some effort.

Or more generally, another library could wrap pystac / pystac-client and provide these convenience methods. But (to me) that feels needlessly complicated.
Examples
A whole bunch of examples, under the `<details>`, to give some ideas of the various conversions. You can view the outputs at https://notebooksharing.space/view/db3c2096bf7e9c212425d00746bb17232e6b26cdc63731022fc2697c636ca4ed#displayOptions=.

Proposed Methods

We should figure out what the "target" in each of these `to_` methods is. I think there are a couple things to figure out, e.g. whether to support variants through keywords (`to_dataframe(..., engine="dask")` or `to_dataframe(..., npartitions=...)`) or through alternative methods (`to_dask_dataframe()`).

But I would loosely propose:

- `Asset.to_xarray` -> `xarray.Dataset`
- `{Item,ItemCollection}.to_xarray` -> `xarray.Dataset`
- `Asset.to_geopandas` -> `geopandas.GeoDataFrame`
- `Asset.to_pandas` -> `pandas.DataFrame`
- `Asset.to_dask_geopandas` -> `dask_geopandas.GeoDataFrame`
- `Asset.to_dask_dataframe` -> `dask.dataframe.DataFrame`
- `ItemCollection.to_geopandas` -> `geopandas.GeoDataFrame`
There's a bunch more to figure out, but hopefully that's enough to get started.