Closed: elliot-lvs closed this issue 1 year ago
How are you currently loading the data? Are you using something like rasterio?
Since the files are stored on disk as COGs, you should be able to read only the subsets of the large files where the internal tiles of the COG overlap with your area of interest.
If you clip each COG before merging, then your memory usage should be lower.
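To see why clipping first keeps memory low, it helps to count how few of a COG's internal tiles an area of interest actually touches. The sketch below is a pure-Python illustration (the function `overlapping_tiles`, the 10980-pixel Sentinel-2 band size, the 512-pixel tile size, and the pixel-space AOI are all assumptions for the example, not part of any library API):

```python
def overlapping_tiles(width, height, tile, aoi):
    """Return the (col, row) origins of internal tiles that overlap the AOI.

    aoi is (col_off, row_off, ncols, nrows) in pixel coordinates.
    """
    c0, r0, nc, nr = aoi
    tiles = []
    for row in range(0, height, tile):
        for col in range(0, width, tile):
            # a tile overlaps the AOI iff their pixel intervals intersect
            if col < c0 + nc and col + tile > c0 and row < r0 + nr and row + tile > r0:
                tiles.append((col, row))
    return tiles

# a 10980x10980 band with 512-px tiles and a 1000x1000 AOI:
# only 9 of the ~462 tiles need to be fetched
print(len(overlapping_tiles(10980, 10980, 512, (4000, 4000, 1000, 1000))))  # prints 9
```

A COG reader such as GDAL performs exactly this kind of intersection internally, issuing range requests for only the overlapping tiles, which is why a clipped read of a large remote file stays cheap.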
@elliot-lvs we had similar problems when working with satellite imagery in the past. We solved this (alongside many other nasty things that come up when working with EO data) in our open-source software EOdal (Earth Observation Data Analysis Library). EOdal lets you read only those parts of the images that overlap your input geometry and then merges just these smaller subsets into a new mosaic. Btw: EOdal can be installed easily on the Planetary Computer using pip install eodal. A collection of examples showing how to use it is available here.
Here's a small example:
from eodal.core.algorithms import merge_datasets

url_list = []  # this should be a list of URLs, i.e., asset hrefs

# read the data into a RasterCollection object
merged = merge_datasets(url_list)
# or write the result directly to disk
merge_datasets(url_list, outfile='merged.tif')
Here's a snippet, based on this example: https://planetarycomputer.microsoft.com/dataset/sentinel-2-l2a#Example-Notebook
import stackstac
import rioxarray  # registers the .rio accessor on xarray objects
import shapely.geometry

# least_cloudy_item and area_of_interest come from the linked notebook
ds = stackstac.stack([least_cloudy_item], assets=["B03", "B04", "B05"]).squeeze()
clipped = ds.rio.clip([shapely.geometry.shape(area_of_interest)], crs="WGS84").compute()
clipped
That clips the DataArray down to the area of interest. When GDAL eventually reads the data, it will load only the COG tiles that intersect your area of interest.
I'll close this, but feel free to post any followup questions here if you have any.
I am currently using the Planetary Computer and its APIs with the intersects=area_of_interest parameter, which retrieves all satellite images that intersect a given polygon. However, the retrieved data then has to be merged along the spatial dimensions, which can be computationally expensive, especially when working with larger images.
Is there a way to download only the data that falls inside the polygon, without having to merge the data along the spatial dimensions? This would reduce processing time and make it easier to work with larger images.