microsoft / PlanetaryComputer

Issues, discussions, and information about the Microsoft Planetary Computer
https://planetarycomputer.microsoft.com/
MIT License
185 stars 8 forks source link

Ability to Download Only Data Inside Polygon #198

Closed elliot-lvs closed 1 year ago

elliot-lvs commented 1 year ago

I am currently using Planetary Computer and its APIs, and I have been working with the intersects=area_of_interest parameter, which retrieves all satellite images that intersect a given polygon. However, this parameter requires the data to be merged on the spatial dimensions, which can be computationally expensive, especially when working with larger images.

Is there a way to download only the data that is inside the polygon, without having to merge the data on spatial dimensions? This would be helpful in reducing the amount of time it takes to process the data, and make it easier to work with larger images.

TomAugspurger commented 1 year ago

How are you currently loading the data? Are you using something like rasterio?

Since the files are stored on disk as COGs, you should be able to read only the subsets of the large files where the internal tiles of the COG overlap with your area of interest.

If you clip each COG before merging, then your memory usage should be lower.

lukasValentin commented 1 year ago

@elliot-lvs we had similar problems when working with satellite imagery in the past. We solved it (alongside many other nasty things when working with EO data) in our open-source software EOdal (Earth Observation Data Analysis Library). EOdal allows you to read only those parts of the images that overlap your input geometry and then merges only these smaller subsets into a new mosaic. Btw: EOdal can be installed easily on Planetary Computer using pip install eodal. A collection of examples how to use it is available here.

Here's a small example:

url_list = []  # this should be a list of URLs, i.e., assets hrefs

from eodal.core.algorithms import merge_datasets

# read data into a RasterCollection object
merged = merge_datasets(url_list)

# or write the results directly to disk
merge_datasets(url_list, outfile='merged.tif')
TomAugspurger commented 1 year ago

Here's a snippet, based off this example: https://planetarycomputer.microsoft.com/dataset/sentinel-2-l2a#Example-Notebook

import stackstac
import rioxarray
import shapely.geometry

ds = stackstac.stack([least_cloudy_item], assets=["B03", "B04", "B05"]).squeeze()
clipped = ds.rio.clip([shapely.geometry.shape(area_of_interest)], crs="WGS84").compute()
clipped

That will clip the DataArray down to the area of interest. Once GDAL is eventually used to read the data, it will only load data from tiles of the COG intersecting with your area of interest.

I'll close this, but feel free to post any followup questions here if you have any.