pangeo-data / cog-best-practices

Best practices with cloud-optimized-geotiffs (COGs)
BSD 3-Clause "New" or "Revised" License
77 stars 9 forks source link

Mosaic multiple rasters #4

Open TomAugspurger opened 3 years ago

TomAugspurger commented 3 years ago

I'm investigating the best way to mosiac multiple arrays into a single array.

https://desktop.arcgis.com/en/arcmap/10.3/manage-data/raster-and-images/what-is-a-mosaic.htm has a nice description. I'll follow up with what I learn. If anyone has additional resources I'd love to hear about them.

scottyhq commented 3 years ago

@TomAugspurger - to get you started the command line GDAL approach is typically to use https://gdal.org/programs/gdalbuildvrt.html (or rio merge https://rasterio.readthedocs.io/en/latest/api/rasterio.merge.html) , which have a few common built-in options for overlapping pixel selection rules (first, last, min, max)

If you want some sort of custom pixel selection scheme a python script is the way to go. I really like Xarray syntax for creating mosaics. See this notebook where I load a stack of many swath images into an xarray dataarray, then extract a monthly mean mosaic (https://github.com/scottyhq/sentinel1-rtc/blob/main/Sentinel1-RTC-example.ipynb), in particular: mosaic_mean = da.where(da!=0).resample(time='1m').mean()

And ongoing discussion in intake-stac to make this process simpler https://github.com/intake/intake-stac/issues/65

TomAugspurger commented 3 years ago

Thanks for sharing that @scottyhq, it's very helpful.

Question about building the vrt. I don't see the code for generating the file with the list of paths s3paths.txt. Is it something roughly like this?

vsis3paths = [x.replace("s3://", "/vsis3/") for x in s3Paths]
with open("s3paths.txt", "w") as f:
    f.write("\n".join(vsis3paths))

and then using the CLI like you did, or Python API?

import gdal

options, *_ = gdal.BuildVRTOptions(separate=True)

vrtName = f'stack{zone}{latLabel}{square}-3.vrt'
out = gdal.BuildVRT(vrtName, vsis3paths, separate=True)

Just using the keys s3:// didn't work, since gdal wants it to use its /vsis3 URL.

bluetyson commented 3 years ago

Thanks very much for all this work... state scale sentinel mosaics are what I am aiming for. Starting with a median I think.

RichardScottOZ commented 3 years ago

So starting with a STAC of sentinel 2 cogs - get catalog info and a dataframe to get the datetimes The GDAL environment variables are important? For metatadata/filesize etc. - e.g. 250MB for a sentinel 2A single band?

RichardScottOZ commented 3 years ago

https://github.com/RichardScottOZ/dask-era5/blob/main/notebook/pangeoFargateTrialSA.ipynb https://github.com/RichardScottOZ/dask-era5/blob/main/notebook/PANGEO-Sentinel-Intake-SA-OZLocation.ipynb

RichardScottOZ commented 3 years ago

Something i have been wondering being new to this, does computing a list of datasets try to bring all that into memory?

rmg55 commented 3 years ago

@TomAugspurger - if trying to stack rasters (typically time - creating dims {x,y,t,band}) - I have used three approaches with STAC collections (alluded to in @scottyhq post above).

  1. If the STAC catalog has the "proj" extension (shape + transform) - then create vrts with the metadata in the catalog and then mosaic with gdal.BuildVrt. Gist here: https://gist.github.com/rmg55/1acf804ef1af0c7934b265b3a653a486
  2. If the STAC catalog does not have the "proj" extension (shape + transform) - then open/read the metadata directly from the file - this can be slow.
  3. If the STAC items have sidecar xml files, than read the xml file to acquire the needed metadata to create the vrt files. Then mosaic with gdal.BuildVrt. Gist here (work in progress fyi): https://gist.github.com/rmg55/3e30dbb1afeee80e8c1d68db40c3b56c

I think this approach would also work with spatial mosaics as well (possibly using pixel functions to mosaic overlapping rasters), but that is something I have not yet worked on.

I would be interested to hear more about the solution/workflow you develop for mosaicing/stacking. I think this is a common bottlekneck in working with EO data (Landsat, Sentinel, Aster, etc....). Adding this sort of functionality to intake-stac would be great, but I have not been able to carve out any time to work on that yet.

TomAugspurger commented 3 years ago

Thanks for sharing @rmg55. I'll be sure to update this thread once we get a design together.

I'm glad to hear about creating a VRT from the Catalog items. I've been pleased with building VRTs from the COGs themselves, but as you say this can be a bit slow.

TomAugspurger commented 3 years ago

Quick status update: I have a prototype for building a VRTs from a STAC query (thanks for the idea and code @rmg55). This VRT can then be fed into rioxarray.open_rasterio.

This approach can be much faster than using gdal.BuildVRT, since it doesn't have to open any datasets; it can do it all from the STAC metadata as long as the required fields are present (most of the fields in the proj extension: https://github.com/radiantearth/stac-spec/blob/cc6c309d72c24cbaae802709ed58c2bf1e231f37/extensions/projection/README.md).

I'll share more soonish.

TomAugspurger commented 3 years ago

Here's the repo: https://github.com/stac-utils/stac-vrt and docs: https://stac-vrt.readthedocs.io/en/latest. Pretty rough right now (likely only works for NAIP STAC items / COGs in Azure), but happy to hack on it with people.

rmg55 commented 3 years ago

Wow - this looks great @TomAugspurger. Will dig into it deeper and see if/how i might be able to contribute. I think this has the potential to really improve a lot of existing workflows. Seems like getting existing/new catalogs to incorporating the STAC projection extension would really help this effort.

gjoseph92 commented 3 years ago

Hey folks! Haven't met many of you before, but I've been tinkering with this question as well, of how to go from STAC items to a ndarray-ish stack of imagery. Sorry for not getting in touch sooner!

So I came up with this library: https://github.com/gjoseph92/stackstac (docs at https://stackstac.readthedocs.io). The overall idea is to generate a dask array representing the whole stack of imagery—only using the STAC metadata, to avoid the latency of opening datasets. But I've also added a bit of logic to:

1) Manage GDAL's (lack of) concurrency model within Dask's threadpools, offering fully multi-threaded (no locks) reads on GeoTIFFs automatically, and falling back to thread-safe (but locked) reads with other drivers. The multi-threaded reads are kinda cheating, since we just open a new thread-local copy of the dataset for each thread, which does have some memory cost, but at least we keep the latency low by carefully ensuring that dataset-open requests remain in the VSI_CACHE. 1) Make the dask graph construction fast (see https://github.com/dask/dask/issues/5913 for how repeated da.stack doesn't scale well) by ensuring the graph is an un-materialized Blockwise layer (which is basically constant-time to generate). The goal is to be able to create xarrays pointing to terabytes/petabytes of imagery in a few milliseconds, and (eventually) have slicing be just as fast, so you can work in the xarray data model the whole time without worrying about pre-filtering your data for performance. 1) Make warping data to a common grid straightforward, and in many cases automatic. Using only STAC metadata, we try to pick a common CRS/resolution, and all reads are warped to that grid through a VRT. If the metadata is missing, or conflicts, or you want something different, specifying the CRS and resolution of the output array is easy. 1) Put as much STAC metadata as possible into xarray coordinates, so you can index and filter by eo:cloud_fraction, time, spatial coordinates, etc. 1) Be modular to new backends. Currently, rasterio is the only one, but I'd like to experiment with going no-GDAL when possible via aiocogeo or geotiff.

There's a ton of overlap here with @TomAugspurger's work on stac-vrt, and @scottyhq + @rmg55's work on intake-stac (https://github.com/intake/intake-stac/pull/75). I think we're all working on similar problems, coming from slightly different assumptions of what the problems are. I'd love to hear if you think what I've come up with is helpful for your use-cases (@RichardScottOZ, curious if this is helpful for your median composites), or if not, how we can use bits of it to build something that works for everyone.

RichardScottOZ commented 3 years ago

Thanks, sounds interesting - I will take a look!

TomAugspurger commented 3 years ago

Thanks for sharing @gjoseph92! Making a "dask native" way of doing this was on my list too, as these VRTs can get somewhat large. I'll take a closer look later.

rmg55 commented 3 years ago

@gjoseph92 - awesome project and agreed there is a lot of overlap. Hopefully I will have some more time to explore the project over the next week. I had some quick questions that I was hoping you might be able to clarify:

gjoseph92 commented 3 years ago

@rmg55 thanks for the questions!

it looks like there is a method to build the transform (affine) from the GSD and BBox

Not quite—I don't use gsd, but in some cases yes, I do use bbox (if proj:bbox is missing). Which indeed could be less accurate. (I should switch from reprojecting bbox to reprojecting geometry, then taking its bbox in the output CRS.)

But overall, it's actually okay if it's a little inaccurate, because we're just estimating a good output transform. When actually reading the data, GDAL will still use whatever geotrans is set on the dataset, then warp it to the output transform we picked. So it's never "inaccurate", per se (pixels will always be correctly georeferenced), but it's true that the output resolution might not always match the native resolution of the data by default. But since getting things to a common grid is the goal anyway, that's kind of the point. And if you have a specific resolution and CRS you want, you're always free to set that too.

Any plans to incorporate the mosaic workflow

I opened https://github.com/gjoseph92/stackstac/issues/1 to discuss this, but the short answer is: you can instead do the mosaicking with some np.wheres on the dask array, which IMO has a few advantages over baking it into the VRT at the GDAL level.

scottyhq commented 3 years ago

Thanks for opening this up @gjoseph92, I had a brief look at stackstac (love the name by the way, and couldn't help thinking the palindrome stackcats would also be fun :) I'm really impressed by the docs and appreciate your links to the dask scaling issues. There is definitely a lot of overlap, but to be honest I think you're a fair ways ahead on the path that we've just been starting down! Will definitely be good to think about how to consolidate effort and collaborate where possible on this, I think it's a really exciting time to push these tools forward with STAC1.0 coming out and more and more COGs+STAC available to make the most of.

TomAugspurger commented 3 years ago

Anyone else interested in joining a call to hash out next steps? I feel like we've had a good amount of exploration in the area of STAC + xarray + Dask, but we might be due for some consolidation, or at least understanding where different projects have different visions or different approaches.

If you are interested, give a +1 to this and I'll send out an invite sometime later this week.

TomAugspurger commented 3 years ago

cc @snowman2 as well: https://github.com/pangeo-data/cog-best-practices/issues/4#issuecomment-799562484, if you're interested in joining a call on this topic. https://github.com/pangeo-data/cog-best-practices/issues/4#issuecomment-796416909 is a good summary.

I'll send out a doodle poll tomorrow to find a time for next week. I suspect we'll cover STAC + Dask + xarray first (and stac-vrt / stackstac specifically). But we'll inevitably spill into larger operations so it'd be good to have someone from rioxarray and xarray-spatial there. (and maybe @jorisvandenbossche or someone else from geopandas, if we get into vector / raster operations? Sounds like a full meeting 😄)

TomAugspurger commented 3 years ago

I sent the following email to people who indicated interest. Anyone is welcome to join.

Link to the doodle poll: https://doodle.com/poll/kpz9rtrqmk4tiqy2. I think we have a pretty wide range of timezones, but I gave options in hourly increments for the US working day for next week. I think we'll use an hour productively, and people of course are welcome to come and go.

Here's a link to a proposed agenda: https://docs.google.com/document/d/1IXnk2fvpjWaRZHAwZ1i_d6l0B7iIvV_6Kmjy2IXK5Vo/edit?usp=sharing. Feel free to add other items.

scottyhq commented 3 years ago

ping @dcherian and @JessicaS11 would be great if you have availability to join!

rbavery commented 3 years ago

When folks converge on a solution for reading STAC rasters into xarray Datasets, I'd like to showcase that solution in this lesson: https://github.com/carpentries-incubator/geospatial-python It's geared toward novice and intermediate programmers who do GIS/remote sensing. I'll attend mostly to listen.

vincentsarago commented 3 years ago

👋 I'm a bit late here, but Mosaic (+ STAC) is something we have worked a lot in the last years. rio-tiler (which I'm one of the core dev) has native support for mosaic (and for STAC), you may found this example interesting https://cogeotiff.github.io/rio-tiler/examples/Using-rio-tiler-mosaic/

On the mosaic side, we have developed a simple spec: https://github.com/developmentseed/mosaicjson-spec. The spec was made to support dynamic tiling, but help created https://github.com/developmentseed/cogeo-mosaic which can in fact work for other purpose than map tiles but we recently introduced the concept of dynamic mosaic which use API or database to store the COG geometry informations and use the cogeo-mosaic backend to create data array for the list of intersecting COG: https://developmentseed.org/cogeo-mosaic/examples/Create_a_Dynamic_StacBackend/

rio-tiler/cogeo-mosaic are just one solution for some specific problem and it's nice to see other tooling begin developed on this subject.

TomAugspurger commented 3 years ago

Looks like Monday the 22nd at 1:00 US Central time worked for the most people (reminder: the US just had a daylight savings time transition). I sent an invite to people whose email I had. If you're interested in joining send me an email at taugspurger@microsoft.com and I'll add you, or just show up then.

Agenda: https://docs.google.com/document/d/1IXnk2fvpjWaRZHAwZ1i_d6l0B7iIvV_6Kmjy2IXK5Vo/edit?usp=sharing

Call link: https://teams.microsoft.com/l/meetup-join/19%3ameeting_NTVjN2YwNjktNzU0My00MTM2LWIwZmQtZDBiYjNhMzY4ZTk2%40thread.v2/0?context=%7b%22Tid%22%3a%2272f988bf-86f1-41af-91ab-2d7cd011db47%22%2c%22Oid%22%3a%22ee57c564-493d-43d5-a2c8-537f22cd9501%22%7d

TomAugspurger commented 3 years ago

I updated https://github.com/pangeo-data/cog-best-practices/issues/4#issuecomment-803301629 with a link to the video call.

TomAugspurger commented 3 years ago

Thanks to everyone who joined. The agenda doc has some detailed notes for those who missed: https://docs.google.com/document/d/1IXnk2fvpjWaRZHAwZ1i_d6l0B7iIvV_6Kmjy2IXK5Vo/edit?usp=sharing

Some takeaways:

  1. We're broadly in agreement that it'd be great if libraries doing I/O (rioxarray, a hypothetical library that has the I/O component of stackstac) exposed geospatial information consistently. Then downstream analysis / visualization libraries can rely on this information, regardless of how the data was loaded. https://corteva.github.io/rioxarray/stable/getting_started/crs_management.html has a note on how rioxarray handles this
  2. Both stackstac and stac-vrt are narrowly scoped. There's interest in a library that handles a wider set of functionality. It's not clear (to me) how to scope that library, but at a minimum things like stacking a time-series of images and making a mosaic of many images are in scope, which implies that the kind of I/O stackstac had to implement is likely in scope. What else?

Anything else to add?

orianac commented 3 years ago

Hi all, First time joining a github conversation (!) but just thought I'd jump in as a way to keep an eye on where this goes. Very interested as I've been banging my head against the wall trying to use these tools to mosaic rasters for a while now. Thanks for everyone's contributions to it!

RichardScottOZ commented 3 years ago

Median composites...no median in dask, how is this working? A 'make me one of these' from whatever generic or user function mean, median, mode, hdstats once parallelized et al would be beneficial?

RichardScottOZ commented 3 years ago

Improved performance...eg stackstac

RichardScottOZ commented 3 years ago

Plotting of large outputs/dimensions

RichardScottOZ commented 3 years ago

Thanks for the notes for the middle of the night people

pl-marasco commented 3 years ago

Thanks, @TomNicholas to bring the issue here.

A little bit more context, for an institutional reason I'm not able to use the Synergise/AWS products. This brings me to deal with the overlapping issue that has been already "solved" (with an average ? I've no info about this)in the AWS product.

...this issue ...

Has been more accurately described here Seems that the best option to solve this is, as suggested by @JessicaS11 , is to use rioxarray.merge but I'm still struggling to create a full mosaic over a large time period.

gjoseph92 commented 3 years ago

@pl-marasco see https://github.com/gjoseph92/stackstac/issues/1 for a quick mosaic implementation (not quite what you're looking for, since it doesn't take coordinates into account, but may work).

I'm still struggling to create a full mosaic over a large time period

stackstac may be helpful for this task in general.

RichardScottOZ commented 3 years ago

Yes, don't think I could have done the SA dataset without an AWS (GCS/Azure) or whatever machine with 300+ GB of RAM.

ErinKenna commented 2 years ago

Hi, Just passing through looking for answer to something but thought I'd mention this in case it is relevant/of interest. https://github.com/opendatacube/odc-stac/tree/odc-geo

snowman2 commented 2 years ago

Hi, Just passing through looking for answer to something but thought I'd mention this in case it is relevant/of interest. https://github.com/opendatacube/odc-stac/tree/odc-geo

Updated location: https://github.com/opendatacube/odc-geo