opendatacube / odc-stac

Load STAC items into xarray Datasets.
Apache License 2.0

clarification on difference between this library and stackstac? #54

Open rbavery opened 2 years ago

rbavery commented 2 years ago

Hi! I just learned about this library and I'm curious what the feature overlap is with stackstac and what the differences are if someone is familiar with both? which should I use in which scenario? https://github.com/gjoseph92/stackstac

Asking because currently we are considering how to show newcomers to geospatial python how to load imagery into dask-backed xarrays: https://github.com/carpentries-incubator/geospatial-python/pull/102. Any tips appreciated.

matthewhanson commented 2 years ago

Hello @rbavery ! The intent of both is the same, which is to create xarrays from STAC Items. The way the xarrays are constructed is slightly different. odc-stac has more capability overall, such as being able to handle any type of CRS rather than just an EPSG code, but stackstac is a bit simpler to use. odc-stac is under active development, whereas the pace of stackstac is slower.

One thing to note is that as part of the odc-stac effort, odc-geo was released, https://github.com/opendatacube/odc-geo/, which can serve as a foundational library that a library like stackstac may be able to use.

@Kirill888 did some performance benchmarking vs stackstac that you can see here: https://benchmark-odc-stac-vs-stackstac.netlify.app/ and he may weigh in as well.

Also interested in @gjoseph92's thoughts on this.

gjoseph92 commented 2 years ago

It's a great question. stackstac was a proof-of-concept side project I put together in a month—I had wanted to see if STAC, COG, rasterio, and dask could work together nicely to have STAC->xarray "just work". I'd say it proved the concept. It's been great to see that it's useful to many people. But stackstac isn't my full-time job, so I don't get to work on it that much anymore.

The way the xarrays are constructed is slightly different

I think this is the biggest difference. Pretty much all of my effort in stackstac went into low-level optimization of both parallel data loading with GDAL and dask array construction, specifically for running on clusters. stackstac does a few things:

- Supports threadsafe parallel reads (I'm not sure how ODC handles concurrency to safely do multithreaded reads)
- Manages the GDAL cache so multithreaded reading is as fast as possible
- Constructs the dask graph as efficiently as possible (you can construct graphs of any size in constant time, whereas ODC generates low-level graphs, which will get very slow for large datasets)

One other thing is stackstac.show, which lets you view raster data on a map. But this doesn't actually require stackstac at all; you could use it with odc-stac arrays as well.

odc-stac is probably more robust/reasonable at handling the STAC metadata and picking a default GeoBox. Also, I wish odc-geo had existed when I started stackstac; I would have loved to just use that. This sort of thing has been a weak point in the ecosystem, and hopefully now that there's a package for it, we can all consolidate on that!

Frankly, I don't think we really need to have two packages that do such similar things, and I'd love to see them combined someday. ODC has a lot more legacy, and institutional support, so maybe that'll become the standard. Mostly I care about the low-level optimizations; it would be nice to see those picked up in ODC. I do like the name stackstac though :)

performance benchmarking vs stackstac

These are great benchmarks. The "wide" case isn't quite an accurate comparison, since odc-stac has the groupby feature which allows you to push the mosaicing into the data-loading, instead of doing it after stacking like stackstac does. groupby is a very useful optimization when it works for your use-case, and it's something I've thought about adding to stackstac, but it's a little restrictive (you can't mask by a cloud band, or do some band math, and then groupby, for example).

Kirill888 commented 2 years ago

Some History

While odc-stac itself is relatively new, it is really an offshoot of the Open Datacube project, which has a long history. In its current form odc-stac reuses the data loading components of the core datacube library, which has been used for data production by Digital Earth Australia, Digital Earth Africa, CSIRO (Australia's national science agency, particularly their EASI hub) and other members of the Open Datacube community.

Open Datacube predates STAC and in fact had some influence on the evolution of the STAC spec (the proj extension in particular), but because of that it has its own metadata format and its own metadata management systems. The metadata management part of the Open Datacube system is the biggest obstacle to adoption. The large organizations listed above have the resources to set up and maintain systems that let their data scientists focus on data experiments rather than metadata wrangling, but a PhD student trying out a new ML model doesn't have that luxury.

odc-stac does away with all that metadata management hassle by outsourcing metadata handling to the STAC ecosystem. You don't need to set up any databases or fiddle with metadata indexing: you just get some STAC items from a STAC API query and get back an xarray dataset, the same way you would from an Open Datacube system. So it was a way to enable convenient access to STAC data for users of Open Datacube. But odc-stac is useful on its own, even if you are not part of the Open Datacube ecosystem.

Comparison to stackstac

Both take STAC items and some data loading configuration and produce Dask-backed xarrays of raster pixels, but there are some user-facing differences that I'll try my best to enumerate:

Fundamentally, stackstac gives you an xarray view of STAC Items; you can explore metadata fields more conveniently than with just pystac.Item or plain dictionaries. Basically it's [pystac.Item] -> pandas DataFrame + delayed pixels. There is a clear one-to-one mapping from a STAC Item to an index within the returned xarray.DataArray.

In odc-stac it's all about raster planes instead: given some STAC Items, group them into timestamped pixel planes and expose those pixels after fusing multiple, possibly overlapping, items in some deterministic order.

^The benchmark report Matt linked earlier reflects that difference in focus nicely. stackstac excels at spatially narrow but temporally deep workflows, while odc-stac is significantly more performant in wide-area workflows. If you need to build a high-resolution mosaic from hundreds of STAC items you are better off with odc-stac, as stackstac will be quite a bit slower and more memory hungry. But if you need convenient access to STAC metadata while working with "deep time" data and need a clean 1:1 correspondence between STAC items and pixel planes, stackstac is the best choice.

^ I need to update it with results from the more recent stackstac (https://github.com/Kirill888/benchmark-report/issues/2); there have been some improvements in the stackstac lib since, but fundamentally the wide-area case remains harder for stackstac than for odc-stac.

Kirill888 commented 2 years ago

@gjoseph92 thanks for the response, some comments below:

Supports threadsafe parallel reads (I'm not sure how ODC handles concurrency to safely do multithreaded reads)

We use rasterio's sharing=False when opening data; this was added back when we first encountered those issues in datacube-core. We also have external hard locking when working with NetCDF sources, not that it matters in the cloud.

Manages the GDAL cache so multithreaded reading is as fast as possible

It's a very handy optimization that I'm planning to use in the next version of odc-stac, which does away with the datacube dependency.

Constructs the dask graph as efficiently as possible (you can construct graphs of any size in constant time, whereas ODC generates low-level graphs, which will get very slow for large datasets)

I have not benchmarked dask graph construction, but yes, I agree that odc does significantly more work up-front: for every spatial chunk we figure out which STAC items contribute to that chunk, in a rather efficient way, but it still takes more time. The advantage is that you end up with a much more compact Dask graph in the "wide" area case. And since graph size has an impact on graph submission time, the overall time from STAC items in to first pixel out can be lower. This also means that empty slots are extremely cheap on the wire and for compute (no need to call into a GDAL VRT to get nodata pixels out).

These are great benchmarks. The "wide" case isn't quite an accurate comparison, since odc-stac has the groupby feature which allows you to push the mosaicing into the data-loading, instead of doing it after stacking like stackstac does. groupby is a very useful optimization when it works for your use-case, and it's something I've thought about adding to stackstac, but it's a little restrictive (you can't mask by a cloud band, or do some band math, and then groupby, for example).

That is indeed the reason for better performance in the wide-area mosaic case. With odc-stac it is possible to disable grouping at construction time and instead perform that fusing in Dask as well, though I have not benchmarked that case. In theory one could detect "empty" slots and optimize the Dask graph to avoid unnecessary fusing that way as well. I wish Dask arrays had a sparse mode where you simply don't populate missing blocks. In odc-stac we know which blocks are empty, which have just one item, and which require fusing several; I think the time spent figuring that out upfront is worth it, especially when constructing large mosaics.

rbavery commented 2 years ago

Thanks so much for all this context and comparison @matthewhanson @Kirill888 @gjoseph92 . I'm super impressed with both libraries and am excited that so much progress has been made on the problem of munging both wide and tall datasets with xarray.

idantene commented 1 year ago

These are great benchmarks. The "wide" case isn't quite an accurate comparison, since odc-stac has the groupby feature which allows you to push the mosaicing into the data-loading, instead of doing it after stacking like stackstac does. groupby is a very useful optimization when it works for your use-case, and it's something I've thought about adding to stackstac, but it's a little restrictive (you can't mask by a cloud band, or do some band math, and then groupby, for example).

Sorry for the bump, I'd like to revisit this specifically and ask for clarification (might be off-topic, but I wasn't sure where to bring this up).
The documentation for the groupby kwarg does not really explain what it does behind the scenes. I could only infer that it perhaps chooses some single pixel as a representative sample (similar maybe to stackstac.mosaic?), but it could just as well do some filtering, averaging, or other arithmetic; I couldn't find documentation for this.

Could you please elaborate on what odc-stac does in this groupby (e.g. in groupby='solar_day') vs how stackstac would achieve the same end result?

Thanks!

EDIT: Upon further inspection I found https://github.com/opendatacube/odc-stac/blob/52a016be2115f180e7059f67bcc6106fbeba7e8d/odc/stac/_load.py#LL701C9-L701C57, which suggests that the first pixel in each day is chosen. Is that indeed the case? Is there additional filtering behind the scenes?

Kirill888 commented 1 year ago

The documentation for groupby kwarg does not really explain what it does behind the scene.

True. The original use case was stitching "true tiles"...

There are really two independent dimensions to groupby

  1. Deciding what stac items should be stitched together into one raster plane
  2. Exactly how this "stitching" should happen
    • merging method
    • precedence of data

Right now merging method is always "first observed valid pixel", with precedence for defining "first" being either

My understanding of the stackstac.mosaic approach is to basically let the user choose what is needed and leverage xarray's rich ecosystem of data handling options. The cost of that approach is a significantly increased Dask graph size in the "wide" use-case, as there are a lot of "nothing to do here" tasks for which there is no proper Dask support (you can't tell which chunk is "empty"; each one needs to "run" before it can be used by downstream code). You can absolutely use the same approach with odc-stac: omit the groupby parameter and use xarray.groupby to achieve whatever custom merging logic you want.
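The post-stack alternative described above can be sketched on toy data (this is an illustrative sketch, not either library's actual code). Any sentinel nodata values must first be masked to NaN; xarray's `groupby(...).first()` then keeps the first non-NaN value in each group for float data:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Toy "stacked" array: three scenes, the first two on the same day.
times = pd.to_datetime(["2021-06-01T01:00", "2021-06-01T02:00", "2021-06-02T01:00"])
stack = xr.DataArray(
    [[[np.nan, 1.0]], [[2.0, 9.0]], [[4.0, 4.0]]],
    dims=("time", "y", "x"),
    coords={"time": times},
)

# Custom merging in xarray: group scenes by day, keep first non-NaN pixel.
day = stack.time.dt.floor("D").rename("day")
mosaic = stack.groupby(day).first()
# Day 1: the NaN at (0, 0) is filled from the second scene; (0, 1) keeps 1.0.
```

The cost Kirill mentions is visible even here: every (scene, chunk) pair becomes a task in the Dask case, whether or not the chunk contains any data.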

By the way PRs for documentation enhancements are welcome :).

idantene commented 1 year ago

Thanks @Kirill888, that helps a lot! I will hopefully make a PR for documentation once I'm done with our implementation using odc-stac (and stackstac).

If I understand correctly, to mimic groupby='solar_day' in stackstac, one needs to do something like stack.groupby(stack.time.astype('datetime64[D]')).first() and "hope" that the first pixel is indeed valid? How does odc-stac define a "valid pixel" in "first observed valid pixel"?

Kirill888 commented 1 year ago

If I understand correctly, to mimic groupby='solar_day' in stackstac, one needs to do something like stack.groupby(stack.time.astype('datetime64[D]')).first() and "hope" that the first pixel is indeed valid?

@idantene it's a bit more complicated than that:

  1. time is in UTC, so just looking at the day component of the timestamp is not enough. We adjust the timestamp based on the mean longitude to correct for UTC.
  2. .first() won't deal with missing values, and so will only work with pure tiles without missing pixels.
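Point 1 can be sketched as follows (an illustrative approximation, not odc-stac's actual implementation): shift the UTC timestamp by roughly longitude/15 hours so that late-UTC acquisitions land on the correct local solar day.

```python
from datetime import datetime, timedelta

def solar_day(utc_time, mean_longitude):
    # Approximate local solar date: 15 degrees of longitude ~ 1 hour offset.
    return (utc_time + timedelta(hours=mean_longitude / 15.0)).date()

# 23:50 UTC over eastern Australia (~150E) is already the *next* solar day,
# which grouping on the raw UTC date would assign to the wrong day.
t = datetime(2021, 6, 1, 23, 50)
solar_day(t, 150.0)  # → 2021-06-02
solar_day(t, 0.0)    # → 2021-06-01
```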

How does odc-stac define a "valid pixel" in "first observed valid pixel"?

A pixel is valid if it is a finite number (not NaN or inf for floating-point data) and not equal to the nodata attribute, if that is set on the source imagery.
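That validity rule, combined with first-observed precedence, can be sketched in plain NumPy (a toy illustration, not odc-stac's actual fusing code):

```python
import numpy as np

def fuse_first_valid(planes, nodata):
    """Fuse overlapping rasters: keep the first valid pixel at each location.

    A pixel is valid when it is finite and not equal to ``nodata``, per the
    rule described above. ``planes`` are visited in precedence order.
    """
    out = np.full_like(planes[0], nodata)
    filled = np.zeros(out.shape, dtype=bool)
    for plane in planes:
        valid = np.isfinite(plane) & (plane != nodata) & ~filled
        out[valid] = plane[valid]
        filled |= valid
    return out

nodata = 0
a = np.array([[0.0, 5.0], [0.0, 0.0]])  # partially empty tile, higher precedence
b = np.array([[7.0, 9.0], [0.0, 3.0]])  # overlapping neighbour
fused = fuse_first_valid([a, b], nodata)
# a wins where it has data (5.0); b fills the gaps (7.0, 3.0);
# pixel (1, 0) has no valid source and stays nodata.
```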

clausmichele commented 1 year ago

Thanks for this nice recap of the differences between the projects. On my side, not having all the metadata parsed by odc-stac is what stops me from adopting it over stackstac. That metadata is so useful for understanding the data content and filtering out what's unnecessary.

maawoo commented 6 months ago

I think this recent discussion from the Pangeo Discourse forum should be linked here as well: https://discourse.pangeo.io/t/comparing-odc-stac-load-and-stackstac-for-raster-composite-workflow/4097/13