Open TomAugspurger opened 2 years ago
To me, Intake seems like the best way forward here.
In case you do want to go forward with this proposal, I'd like to propose another variation. Instead of having these conversion functions in Pystac itself, provide an `entry_point` for this. Then these conversion functions can be distributed in separate packages, users choose what to get by installing the plugins they want, the code stays simple, and the dependency handling becomes easier and more transparent. Of course, in some sense that moves you closer to Intake again.
Thanks for this @TomAugspurger, I am 100% on board with this and would love to see such convenience functions. I end up copying the same function across notebooks that converts STAC Items to a dataframe.
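Roughly, the helper in question tends to look like this (a hedged sketch: the `items_to_records` name and the plain-dict items are illustrative, and a real version would more likely use `geopandas.GeoDataFrame.from_features`):

```python
# Sketch: flatten STAC Item dicts into records for a DataFrame constructor.
# `items_to_records` is a hypothetical helper, not a pystac API.
def items_to_records(items):
    return [
        {"id": item["id"], "geometry": item["geometry"], **item["properties"]}
        for item in items
    ]

items = [
    {
        "id": "item-1",
        "geometry": {"type": "Point", "coordinates": [0.0, 0.0]},
        "properties": {"datetime": "2022-06-12T00:00:00Z", "eo:cloud_cover": 4.2},
    }
]
records = items_to_records(items)
# pandas.DataFrame(records) or geopandas.GeoDataFrame(records) from here
```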
I'd generally prefer targeting libraries to container types since container type names may be ambiguous.
I prefer this over using intake, only because I don't use intake-stac myself, and it would just be another library to maintain.
Thanks @zklaus.
Instead of having these conversion functions in Pystac itself, provide an entry_point for this
Let me see if I understand this correctly. pystac itself would define a few methods:

```python
import importlib.metadata

class ItemCollection:
    def to_geopandas_geodataframe(self, **kwargs) -> "geopandas.GeoDataFrame":
        entrypoint = importlib.metadata.entry_points().get(...)
        return entrypoint.load()(self, **kwargs)

    def to_xarray_datarray(self, **kwargs) -> "xarray.DataArray":
        entrypoint = importlib.metadata.entry_points().get(...)
        return entrypoint.load()(self, **kwargs)
```
I'm not 100% sure what goes in the `.get()`, but something that uniquely identifies the `ItemCollection -> geopandas.GeoDataFrame` route. That's pretty similar to what I had in mind, which was roughly
```python
def to_geopandas_geodataframe(self, **kwargs) -> "geopandas.GeoDataFrame":
    try:
        import some_library
    except ImportError:
        raise ImportError(...)  # raise with a nice message
    return some_library.item_collection_to_geopandas(self, **kwargs)
```
A benefit / drawback of entrypoints is that you could have multiple implementations for the same "route". Some third party can provide a better implementation without having to touch pystac or `some_library`. And pystac doesn't have to know anything about a particular implementation. pystac would just need to provide the `ItemCollection.to_geopandas_geodataframe` method (but it would defer the implementation).

We might need to figure out what to do if there are multiple implementations registered for a specific entrypoint. Perhaps provide a keyword like `engine` in each method to choose between them?
It might be worth saying a bit more about "why not intake" given that we are encroaching on its territory a bit. A laundry list:

- Intake is built around a `__getitem__` API, which implies a single, linear path between parent containers and children. But some STAC containers have two kinds of children: a Collection might have children items and children assets. That maps more naturally onto distinct `to_*` methods.
- With intake you subclass a `BaseCatalog` object for your implementation. That adds a few methods and properties to the public API of your object. And some intake methods like `Catalog.search` have different APIs than what would be expected from a STAC community.

Thanks for picking this up, @TomAugspurger. What you describe is close, but not quite what I had in mind. Indeed, I would say if Pystac has to know all the exporters, entry points would be a futile exercise. Instead, I think there should be only one entry point, say `pystac.exporters`, and any library can register a function to this. Pystac would attach those functions to `ItemCollection` by iterating over that entry point. This way, Pystac knows nothing about these functions and the user is free to install any exporter they want along with its dependencies.

Conceivably, other classes could use the same mechanism with a different entry point, say to export to an `iris.CubeList`.
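A sketch of that attachment loop; the `pystac.exporters` group name comes from the comment above, while `attach_exporters`, the stand-in class, and the fake entry point are all illustrative:

```python
from importlib.metadata import entry_points

class ItemCollection:
    """Stand-in for pystac.ItemCollection."""

def attach_exporters(cls, eps=None, group="pystac.exporters"):
    """Attach every function registered under `group` as a to_<name> method."""
    if eps is None:
        eps = entry_points(group=group)  # Python 3.10+ keyword filtering
    for ep in eps:
        setattr(cls, f"to_{ep.name}", ep.load())

# Demonstrate with a fake entry point standing in for a real registration.
class FakeEntryPoint:
    name = "geodataframe"

    @staticmethod
    def load():
        return lambda self, **kwargs: f"converted {type(self).__name__}"

attach_exporters(ItemCollection, eps=[FakeEntryPoint()])
result = ItemCollection().to_geodataframe()  # -> "converted ItemCollection"
```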
there should be only one entry point, say pystac.exporters, and any library can register a function to this.
Mmm, I'm not a huge fan of dynamically adding methods to classes as a side-effect of import. It makes debugging difficult, static typing impossible (I think), and will be incompatible with features like https://peps.python.org/pep-0690/ (I think).
What's the reason not to add various `to_*` methods? We already have `to_dict`. I'm not too worried about the cognitive burden of adding these methods since they'll all use the same `to_<foo>` name and will have similar, simple signatures.
Do we know how mypy, pylance etc. feel about the return types in downstream contexts? E.g., if the method returns "geopandas.GeoDataFrame", and the types of pystac are published and consumed by downstream code that doesn't have geopandas installed - are mypy and pylance (vscode) OK with that, or would that cause weird typing errors?
Both mypy and pylance seem to think the type is `Any` (using xarray instead of geopandas, since I don't think geopandas is currently statically typed).
Setup:
Test file:
```python
# file: foo.py
import pystac

ic = pystac.ItemCollection([])
reveal_type(ic.to_xarray_datarray())
```

Output without xarray:

```
$ mypy foo.py
foo.py:5: note: Revealed type is "Any"
Success: no issues found in 1 source file
```

Output with xarray:

```
$ mypy foo.py
foo.py:5: note: Revealed type is "xarray.core.dataarray.DataArray"
Success: no issues found in 1 source file
```
pylance has the same behavior.
Historically, PySTAC has been very much in the design camp of "do the most simple, scoped thing possible, and carry no dependencies". I think this has served the library well. In that stance, this suggestion could be identified as scope creep - as you mentioned, getting the results these "to_" methods are aimed at is already possible outside of PySTAC, and adding them has the potential to inflate the codebase/scope of PySTAC.
However, I think your other points are really good. Now that it's been around a while, PySTAC "owns the namespace" as you put it, and there's an opportunity to expand on the user convenience this library provides. Your experience with seeing this play out in the pandas ecosystem holds a lot of weight here.
As you've laid out, we can keep the dependencies out of the core PySTAC package by using deferred imports. Adding "extras" packages would be a fine way to enable this extra behavior, like we do for `pystac[validation]`. That way the "no dependencies" stance can still hold by default.
As far as library scope - I think what you suggested works, if the heavy lifting/implementations live in other libraries. You mentioned the possibility of an `Asset -> Zarr` implementation living in PySTAC - IMO we should limit the scope of implementation code in this codebase to focus on STAC concepts. My preference would be for all conversion implementation code to live outside PySTAC and be brought in as part of an extras dependency. This complicates things a bit in that the user will not only need e.g. Zarr installed, but some `zarr-pystac` library to use this if they have not installed `pystac[zarr]`. However, I think that would keep this codebase concentrated on the STAC of it all, and not have to concern itself with maintaining and reviewing code unrelated to the core STAC types. Perhaps we could have a "pystac_contrib" namespace package developed in its own repository to hold conversion implementations that do not already have a home.
All in all, I'm +1 on this idea.
It's a can of worms: Why `to_xarray` and not `to_numpy()`, `to_PIL`, ...? Why `to_pandas()` and not `to_spark()`, `to_modin`, ...?
Instead of (or complementary to) `pystac` providing the API and/or entrypoint, couldn't this be something provided by the container libraries? I'm not sure how this would work for Pandas and GeoPandas, but for Xarray there could be `some_library` implementing an IO backend such that:
```python
asset = (
    catalog
    .get_collection("sentinel-2-l2a")
    .get_item("S2B_MSIL2A_20220612T182919_R027_T24XWR_20220613T123251")
    .assets["B03"]
)
ds = xr.open_dataset(asset, engine="stac")

item_collection = (
    catalog
    .search(collections="sentinel-2-l2a", bbox=bbox)
    .get_all_items()
)
ds = xr.open_dataset(item_collection, engine="stac")
```
Having `to_*` methods in `pystac` would certainly be more discoverable. It depends on the user's point of view, though, which is why both approaches may be complementary even though they are two ways of doing the same thing.
I was poking around on intake-stac today and @scottyhq pointed me to this issue. I really like @benbovy's idea of specifying an IO backend that would be accessed from xarray itself. That seems like a natural fit to me and much more analogous to the `pd.read_csv` example that is the motivation for this issue.
Just to expand on that. I think one of the issues with a generic `xr.open_dataset` is that it can hide from the user the actual kwargs they have access to for a particular engine. You end up having to figure out what the engine maps to and then going and reading the docs for that engine's reader.

Maybe the convention of storing the open_kwargs in the STAC asset gets around that issue. Or maybe there needs to be a way of passing engine specific docstrings up into the xarray docs. This might already exist, I couldn't quite tell.
Thanks @jsignell!
Maybe the convention of storing the open_kwargs in the STAC asset gets around that issue.
That's my hope. That's what we're doing at https://github.com/TomAugspurger/xstac/blob/f17ed3bb79f34b747d1532fc425794442d938f9e/examples/terraclimate/terraclimate.json#L51.
But that's still compatible with the idea of an `engine='stac'`. It would just need to know what to look for, and then that engine would make the call to `xr.open_dataset(engine="zarr")` or whatever was in `xarray:open_kwargs`.
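For illustration, a dependency-free sketch of that lookup, assuming only the `xarray:open_kwargs` convention referenced in this thread (the asset-dict shape and the zarr default are assumptions):

```python
def resolve_open_call(asset):
    """Work out the xr.open_dataset call an engine="stac" backend could make.

    Purely illustrative: takes a plain asset dict rather than a pystac.Asset.
    """
    # Copy so we don't mutate the kwargs stored on the asset.
    kwargs = dict(asset.get("xarray:open_kwargs", {}))
    engine = kwargs.pop("engine", "zarr")  # assumed default, not part of any spec
    return asset["href"], engine, kwargs

asset = {
    "href": "abfs://example/terraclimate.zarr",
    "xarray:open_kwargs": {"engine": "zarr", "consolidated": True},
}
href, engine, kwargs = resolve_open_call(asset)
# A real backend would now call: xr.open_dataset(href, engine=engine, **kwargs)
```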
I think the `xr.open_dataset(stuff, engine="stac")` vs. `asset.to_xarray_dataset()` question is mostly around ergonomics for the user. I think the two implementations would look pretty similar internally.
But that's still compatible with the idea of an `engine='stac'`. It would just need to know what to look for and then that engine would make the call to `xr.open_dataset(engine="zarr")` or whatever was in `xarray:open_kwargs`.
Yes exactly!
I am going to mock up this approach now.
Ok I made https://github.com/jsignell/xpystac closely based off of https://github.com/TomAugspurger/staccontainers.

I'm trying to think what the next steps are here. I am happy to move `xpystac` into `stac-utils` for visibility.

Then maybe pystac would want to have a `to_xarray` stub that could try to import and run `xpystac.to_xarray` and fail gracefully if xpystac is not available?
I am happy to move xpystac into stac-utils for visibility.
Yup, we don't have a formal process (see: https://stac-utils.github.io/developer-guide/policies/repositories.html) so if it feels right, go for it!
Then maybe pystac would want to have a to_xarray stub that could try to import and run xpystac.to_xarray and fail gracefully if xpystac is not available?
Makes sense to me. Are there other stubs we should add at the same time? It'd be a splashy feature to release so it'd be cool to have a couple different adapters, if possible.
Ok! I just transferred the repository over to the stac-utils org.
Makes sense to me. Are there other stubs we should add at the same time? It'd be a splashy feature to release so it'd be cool to have a couple different adapters, if possible.
Tom has a whole list of potential stubs in the original post and some implementations in staccontainers. There are also some good arguments for not implementing stubs.
I think if we do want to add stubs it would make more sense to do that as part of 2.0. Maybe as a first step, I should update the examples to use xpystac via open_dataset.
I think if we do want to add stubs it would make more sense to do that as part of 2.0. Maybe as a first step, I should update the examples to use xpystac via open_dataset.
worksforme
For this sprint I am planning to add a little tutorial notebook that explains how to read an asset into an xarray object using xpystac.
Thanks for continuing to work on this @jsignell!
Just to expand on that. I think one of the issues with a generic `xr.open_dataset` is that it can hide from the user the actual kwargs they have access to for a particular engine. You end up having to figure out what the engine maps to and then going and reading the docs for that engine's reader.
This sounds like most of my life with Python! ;-)
Maybe the convention of storing the open_kwargs in the STAC asset gets around that issue. Or maybe there needs to be a way of passing engine specific docstrings up into the xarray docs. This might already exist, I couldn't quite tell
Now that I know about this `open_kwargs` in STAC assets, I quite like the idea. The asset 'knows' how it should be read... My journey here has been something like "get my head around `stac_load`, then find it doesn't work for some assets, then realise you need to use `xarray.open_dataset`, but also find you need to grab these `open_kwargs` and pass them in".
And, after all, I think it's fair to say that the whole point of STAC is to make large geospatial/temporal raster assets findable and then accessible, so some convenience functions (e.g. living in a namespace package?) that make for a more seamless bridge between the catalog/finding and the raster data exploitation would be hugely welcome.
Or maybe there needs to be a way of passing engine specific docstrings up into the xarray docs.
Does not exist but it sounds interesting. IMO a separate `stac_to_xarray` method is a lot cleaner, more readable, and less magic.
I have been mulling over what would be the most discoverable and ergonomic way to give access to read methods, and I am strongly leaning towards creating an xarray assets extension class. Then on this class there could be an `open` method. So the flow would be:

```python
import pystac
from pystac.extensions.xarray_assets import XarrayAssetsExtension

asset = pystac.Asset.from_file(...)
xr_ext = XarrayAssetsExtension.ext(asset, add_if_missing=True)
xr_ext.open()
```
Or if something like #1051 gains traction then:

```python
import pystac

asset = pystac.Asset.from_file(...)
asset.xarray.open()
```
Hi all,
I'm tentatively making a pitch to add convenience methods for converting pystac objects (Asset, Item, ItemCollection, ...) and their linked assets to commonly used data containers (`xarray.Dataset`, `geopandas.GeoDataFrame`, `pandas.DataFrame`, etc.).

I'm opening this in `pystac` since this is primarily for convenience, so that users can method-chain their way from STAC Catalog to data container, and `pystac` owns the namespaces I care about. You can already do everything I'm showing today without any changes to `pystac`, but it feels less nice. I really think that `pd.read_csv` is part of why Python is where it is today for data analytics; I want using STAC from Python to be as easy to use as `pd.read_csv`.

Secondarily, it can elevate the best-practice way to go from STAC to data containers, by providing a top-level method similar to `to_dict()`.

As a couple hypothetical examples, to give an idea:
Or building a datacube from a pystac-client search (which subclasses pystac).
Implementation details
This would be optional. `pystac` would not add required dependencies on `pandas`, `xarray`, etc. It would merely provide the methods `Item.to_xarray`, `Asset.to_xarray`, ... Internally those methods would try to import the implementation and raise an `ImportError` if the optional dependencies aren't met at runtime.

Speaking of the implementations, there are a few things to figure out. Some relatively complicated conversions (like ItemCollection -> xarray) are implemented multiple times (https://stackstac.readthedocs.io/, https://odc-stac.readthedocs.io/en/latest/examples.html). `pystac` certainly wouldn't want to re-implement that conversion and would dispatch to one or the other of those libraries (perhaps letting users decide with an `engine` argument).

Other conversions, like Asset -> Zarr, are so straightforward they haven't really been codified in a library yet (though I have a prototype at https://github.com/TomAugspurger/staccontainers/blob/086c2a7d46520ca5213d70716726b28ba6f36ba5/staccontainers/_xarray.py#L61-L63). Maybe those could live in pystac; I'd be happy to maintain them.
https://github.com/TomAugspurger/staccontainers might serve as an idea of what some of these implementations would look like. It isn't too complicated.
Problems
A non-exhaustive list of reasons not to do this:

- It's a can of worms: Why `to_xarray` and not `to_numpy()`, `to_PIL`, ...? Why `to_pandas()` and not `to_spark()`, `to_modin`, ...?

Alternatives
Alternatively, we could recommend using intake, along with intake-stac, which would wrap `pystac-client` and `pystac`. That would be the primary "user-facing" catalog people actually interface with. It already has a rich ecosystem of drivers that convert from files to data containers. I've hit some issues with trying to use intake-stac, but those could presumably be fixed with some effort.

Or more generally, another library could wrap pystac / pystac-client and provide these convenience methods. But (to me) that feels needlessly complicated.
Examples
A whole bunch of examples, under the `<details>`, to give some ideas of the various conversions. You can view the outputs at https://notebooksharing.space/view/db3c2096bf7e9c212425d00746bb17232e6b26cdc63731022fc2697c636ca4ed#displayOptions=.

Proposed Methods

We should figure out what the "target" in each of these `to_` methods is. I think there are a couple things to figure out, e.g. whether to support variants through keywords (`to_dataframe(..., engine="dask")` or `to_dataframe(..., npartitions=...)`) or through alternative methods (`to_dask_dataframe()`).

But I would loosely propose:

- `Asset.to_xarray` -> `xarray.Dataset`
- `{Item,ItemCollection}.to_xarray` -> `xarray.Dataset`
- `Asset.to_geopandas` -> `geopandas.GeoDataFrame`
- `Asset.to_pandas` -> `pandas.DataFrame`
- `Asset.to_dask_geopandas` -> `dask_geopandas.GeoDataFrame`
- `Asset.to_dask_dataframe` -> `dask.dataframe.DataFrame`
- `ItemCollection.to_geopandas` -> `geopandas.GeoDataFrame`
There's a bunch more to figure out, but hopefully that's enough to get started.