intake / intake-stac

Intake interface to STAC data catalogs
https://intake-stac.readthedocs.io/en/latest/
BSD 2-Clause "Simplified" License
108 stars 25 forks source link

Stack / Mosaic STAC item collections #65

Open rmg55 opened 3 years ago

rmg55 commented 3 years ago

This STAC blog (specifically the section under Additional Flexibility - proj:shape and proj:transform) gave me the idea that gdal vrt with vrt mosaicking could be leveraged in intake-STAC to stack and mosaic intake-stac item collections.

Here is a gist showing an example of temporally stacking a single band of Sentinel-2a COGS using this approach: https://gist.github.com/rmg55/1acf804ef1af0c7934b265b3a653a486

A similar approach could also be implemented to mosaic data as well (although I have yet to try it). To apply this feature, the STAC projection extension would need to be implemented in the STAC catalog. I could not find any existing work by the Open Data Cube folks on this effort (they were referenced in the blog post mentioned above).

A few questions:

  1. Is this the optimum way to stitch together these datasets that are organized in granules/tiled files?
  2. Does it seem likely that the projection extension will be incorporated in other catalogs?
  3. Is this a feature that would make sense to incorporate directly into intake-stac (if so I would be happy to work on this)?

Tagging @scottyhq and @jhamman

scottyhq commented 3 years ago

Hi @rmg55 , thanks for sharing your ideas the detailed example!

Is this the optimum way to stitch together these datasets that are organized in granules/tiled files?

There are certainly different ways to go about this, and I'm sure of an optimal approach. I have often just used calls to gdalbuildvrt from within python code https://github.com/scottyhq/sentinel1-rtc/blob/main/Sentinel1-RTC-example.ipynb.

Does it seem likely that the projection extension will be incorporated in other catalogs?

does seem likely. but the work-around if it isn't there is open the asset file (metadata only) and store the CRS

Is this a feature that would make sense to incorporate directly into intake-stac (if so I would be happy to work on this)?

definitely I think this will be a common use case, but i think the same functionality could be easily implemented in intake-xarray, which does the actual loading and mapping to an xarray dataset. I'm thinking having the option to use rioxarray would be great https://github.com/intake/intake-xarray/issues/87. See also https://gis.stackexchange.com/questions/376685/issue-while-performing-merge-mosaic-of-multiple-tiff-files-using-rioxarray-pytho/

That said, maybe an catalog.write_vrt() method without a dependency on GDAL could be handy, since VRTs are quite useful for other GIS applications and sharing with collaborators!...

rmg55 commented 3 years ago

Thanks for comments/suggestions @scottyhq! A few follow-up questions/thoughts:

i think the same functionality could be easily implemented in intake-xarray

Agreed! However, I think there are a few advantages of including a custom version of this for intake-stac.

  1. The metadata is included in the catalog. Therefore, for remote data, and with many files, it would be faster (aka - no need for http requests to open the metadata) to use the catalog directly. I did a quick comparison of these approaches using the gist I posted above:
    1. stac catalog to build the metadata: ~7 seconds
    2. using to_dask() to build the metadata: ~840 seconds (14 minutes) But perhaps there is a faster approach to gathering metadata rather than using to_dask() (perhaps async requests?).
  2. Ability to map an arbitrary metadata feature into the band dimension/coordinate (datetime in the above gist).

definitely I think this will be a common use case, but i think the same functionality could be easily implemented in intake-xarray, which does the actual loading and mapping to an xarray dataset. I'm thinking having the option to use rioxarray would be great intake/intake-xarray#87. See also https://gis.stackexchange.com/questions/376685/issue-while-performing-merge-mosaic-of-multiple-tiff-files-using-rioxarray-pytho/

hmmm, I am not sure I 100% follow this. I agree that adding RioXarray as an option in intake-xarray would be great. However, I have not found an efficient method to reproject/mosaicing xarray objects. I have used rioxarray.rio.reproject followed by xr.combine_by_coords for smaller datasets (which seems to work well). However, I don't think it is possible to reproject dask-arrays without first using rasterio warped_vrt (see https://github.com/corteva/rioxarray/issues/119#issuecomment-674447208). Also, I think the gdal-backed mosaicing features a bit more flexible than what is currently in xarray (see - https://gis.stackexchange.com/questions/324602/averaging-overlapping-rasters-with-gdal).

I would envision a set of functions to build vrt files like:

For a single band: catalog.item.band.warp(target_crs=)

For stacking bands (typically temporally) with no spatial mosaicing (like the gist I linked above): catalog.stack_band(band=, target_crs=, remap_band_to=)

For stacking bands and spatially mosaicing: catalog.mosaic_stack_band(band=, target_crs=, overlapping_func=)

where
      band = the band to use in the stac item       target_crs = optional, specify to reproject       overlap_func= set of functions (mean,max,min,last, etc...) to use for overlapping tifs (see - https://gis.stackexchange.com/questions/324602/averaging-overlapping-rasters-with-gdal)

Thoughts?

scottyhq commented 3 years ago

@rmg55 I had a bit more time to think about this and look at your example. I think this would be great to implement perhaps in two separate PRs!

I think the primary goal of intake-stac should be to give you a DataArray (or DataSet) with minimal alteration to how the data is stored. to keep the scope narrow, reprojection, mosaicing, and other operations are available via the xarray api (or rioxarray api). That said, the workflow you're going for is really common and it'd be nice to have some built-in functions to make it easier! So how about:

  1. catalog.to_vrt(band , target_crs=None, output='items.vrt'). seems your approach is good here. requires optional GDAL dependency. A VRT is essentially another catalog format, with the advantage that GDAL does some trustworthy reprojection, and it's easy to share so that later on users just to da = xr.open_rasterio('items.vrt'). the Item id can be stored under each raster band description tag: <Description>S1B_20161121_12SYJ_ASC/Gamma0_VV.tif</Description>. Perhaps a stac2metdata keyword argument could take a dictionary of values to write out to VRT <Metadata> tags as well?

  2. catalog.stack_items(band, band_mapping=date).to_dask(). (note no reprojection or mosaicing). This is where i'm reluctant to add the intermediary VRT TempFile. One reason is that a major use-case for intake-stac is loading up data for distributed computing (e.g. dask KubeCluster). Another reason is that we can already do this with current dependencies, and just need a more convenient function. See this example: https://nbviewer.jupyter.org/gist/scottyhq/c30b2c4114e72f8f30c3d1df8330af26.

rmg55 commented 3 years ago

@scottyhq, really appreciate the explanations and guidance on this!

  1. catalog.to_vrt(band , target_crs=None, output='items.vrt')

Just to clarify, this would be for an entire catalog, correct? Each vrt band would correspond to a stac item in the catalog. This seems very doable. My only question/confusion would be for catalogs that span multiple tiles (e.g. there would be different spatial extents / grids) - would that be a valid .vrt file or xarray object? Perhaps we should check the bounding boxes on the items and throw an error if there are more than one bbox (or output mutliple files with _X suffixes - items_1.vrt, items_2.vrt, etc...)?

  1. catalog.stack_items(band, band_mapping=date).to_dask(). (note no reprojection or mosaicing). This is where i'm reluctant to add the intermediary VRT TempFile

I agree with your conclusion here - I really like the "path_as_pattern" approach you linked to in the gist (thanks for sharing that btw!). A more convenient function would be great to include. Do you know if the STAC spec insures that the catalogs be subdivided into "date" folders?

I think the one case that I am still unsure of how to accomplish using xarray/rioxarray + dask arrays + intake-stac would be the reproject + spatially mosaic + temporally stack workflow. Perhaps there is a workflow I am not familiar with that can efficiently accomplish this... This is the specific workflow I was levitating towards for the STAC --> vrt approach. Something along the lines of:

catalog.combine_to_vrt(band, mosaic_by=None, target_crs=None, output='mosaiced_stacked.vrt')

Where mosaic_by is optional (only needed if spatially mosaicing), but generally set to metadata field that represents the COG's date. Perhaps this is something to put on the back-burner and revisit at a later date if there seems to be need/desire within the community?

scottyhq commented 3 years ago

My only question/confusion would be for catalogs that span multiple tiles (e.g. there would be different spatial extents / grids) - would that be a valid .vrt file or xarray object?

Good point. definitely a complication is that a STAC catalog could contain items with assets that are all over the map and in different projections, so that suggestion applies to collections where the CRS is the same (a given S2 tile and L8 path row) being good examples.

Do you know if the STAC spec insures that the catalogs be subdivided into "date" folders?

Another good point! This is not enforced, so while that path-as-pattern approach is nice I don't think it can be relied on, instead we'd have to iterate through the catalog for paths. Looking into that here https://github.com/intake/intake-stac/issues/66.

I think the one case that I am still unsure of how to accomplish using xarray/rioxarray + dask arrays + intake-stac would be the reproject + spatially mosaic + temporally stack workflow

Let's keep experimenting with some simple examples (1 MRGS tile set --> web mercator tiles, adjacent MGRS grids into single xarray dataset?). Hopefully we can document some ways to do this even if it doesn't end up being a single function in this library.

bluetyson commented 3 years ago

Yes, e.g. I was looking at SA yesterday, and hence definitely more than one utm projection there

RichardScottOZ commented 3 years ago

However, median is not yet implemented on dask arrays so not sure a median mosaic will be doable easily? @rmg55

RichardScottOZ commented 3 years ago

and DS['blue'].quantile(0.5,dim='time') ValueError: dimension time on 0th function argument to apply_ufunc with dask='parallelized' consists of multiple chunks, but is also a core dimension. To fix, either rechunk into a single dask array chunk along this dimension, i.e., .chunk(time: -1), or pass allow_rechunk=True in dask_gufunc_kwargs but beware that this may significantly increase memory usage.

which I am not sure about - or how good the quantile would be either - maybe ok if a mosaic of thousands for a whole state?

rmg55 commented 3 years ago

Hi @RichardScottOZ and @bluetyson

Not sure I am fully following your comments, but will try to add my input (perhaps this is related to your posts on the Pangeo Discourse page?). Sounds like you want to temporal stack and mosaic a bunch of images via .vrt files. Can you expand on what you mean by "median mosaic". If you are referring to the spatial mosaicing in a vrt file, you should be able to do this with a numpy function written into the vrt file (see here - https://gis.stackexchange.com/questions/324602/averaging-overlapping-rasters-with-gdal). I should note that I have very little experience composing python functions in vrt files, so I don't really know all the tips/tricks yet. If you develop a good workflow - it would be great if you wanted to share it in this thread!

@scottyhq - still planning on working on this at some point, but just have not been able to carve out any time yet.

RichardScottOZ commented 3 years ago

Median mosaic - for example, get all the sentinel 2A images over an area for a time period, take the median for each pixel as a representative composite.

xarray or vrt - I don't mind, but having options would be good

rmg55 commented 3 years ago

@RichardScottOZ - ok got it. My thinking would be to rechunk to the time-intervals, and then apply a map_blocks function that calculates the median.

bluetyson commented 3 years ago

Thanks very much Rowan, do you have an example of such?

TomAugspurger commented 3 years ago

For those following this, but not https://github.com/pangeo-data/cog-best-practices/issues/4

Here's the repo: https://github.com/stac-utils/stac-vrt and docs: https://stac-vrt.readthedocs.io/en/latest. Pretty rough right now (likely only works for NAIP STAC items / COGs in Azure), but happy to hack on it with people.