isciences / exactextract

Fast and accurate raster zonal statistics
Apache License 2.0

exactextract Python, no coverage_fraction? #125

Open ZZMitch opened 1 week ago

ZZMitch commented 1 week ago

Hello,

I have a shapefile (GeoDataFrame in Python) that I want to convert to a raster (e.g., an xarray DataArray in Python) on a certain pixel grid, with each pixel quantifying the % of its area covered by the shapefile. After some searching, I came across exactextract.

I was able to install the Python version of exactextract using pip install exactextract and confirmed it is functioning properly:

import exactextract as ee
ee.exact_extract(rast=xras, vec=gdf, ops='mean')

Where xras is an xarray DataArray with the pixel grid I want to match and gdf is the GeoDataFrame where I want the % coverage. I know this test code does not get that result, but it does show the package is installed and working.

From what I can tell, there is no equivalent of coverage_fraction for the current Python version of exactextract?

I know the Python version is pretty new, so I wanted to confirm this functionality is not yet available. If it isn't, does anyone have options for achieving this result in Python right now? I know about rasterio's rasterize, but it classifies pixels based on pixel centers or on any intersection ("all_touched"), rather than % coverage.
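The closest approximation I can think of is to rasterize on a finer grid and then average each block of subpixels back down to the target grid, roughly like this (the grid parameters below are placeholders standing in for the grid of xras):

import numpy as np
from rasterio import features
from rasterio.transform import from_origin

# Target grid (placeholder values standing in for the grid of xras)
xmin, ymax = 0.0, 100.0   # upper-left corner of the target grid
res = 10.0                # target pixel size
nrows, ncols = 10, 10     # target grid shape
factor = 10               # supersampling factor (factor x factor subpixels per pixel)

# Rasterize the geometries from gdf on a grid `factor` times finer than the target
fine_transform = from_origin(xmin, ymax, res / factor, res / factor)
fine = features.rasterize(
    [(geom, 1) for geom in gdf.geometry],
    out_shape=(nrows * factor, ncols * factor),
    transform=fine_transform,
    fill=0,
    dtype='uint8',
)

# Block-average back to the target grid: the mean of each factor x factor block
# approximates the coverage fraction of that pixel
coverage = fine.reshape(nrows, factor, ncols, factor).mean(axis=(1, 3))

But that is only an approximation and gets slow as the supersampling factor grows, which is why an exact coverage_fraction would be much nicer.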

Thanks!

dbaston commented 1 week ago

At the moment, this function only exists on the R side. It would not be difficult to expose in Python, but I haven't done so yet.

Here are some questions that come to mind on the implementation.

A limitation of the R function is that it returns a list of in-memory rasters, which is simple but greatly limits the size of the data you can process. Should Python do the same thing, and if so, in what format? A list of GDAL in-memory rasters? A list of NumPy arrays with associated extents?
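For the NumPy option, I am picturing something like the following (nothing here exists yet, it is just a sketch of a possible return type):

from dataclasses import dataclass
import numpy as np

# Hypothetical wrapper for one feature's result: the per-pixel coverage
# fractions plus the extent that the array covers
@dataclass
class CoverageFraction:
    values: np.ndarray                          # 2D array of coverage fractions
    bounds: tuple[float, float, float, float]   # (xmin, ymin, xmax, ymax)

# A Python coverage_fraction(rast, vec) could then return one of these per feature:
# results: list[CoverageFraction] = coverage_fraction(xras, gdf)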

Alternatively (or additionally) should it write the outputs directly to disk to avoid memory limitations? As individual rasters - how would they be named? Or as a multiband raster with one band per input feature?
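For the multiband option, the on-disk layout could look roughly like this with rasterio (again just a sketch; arrays stands in for per-feature coverage arrays on a common grid, and transform for that grid's affine transform):

import rasterio

with rasterio.open(
    'coverage_fraction.tif', 'w',
    driver='GTiff',
    height=arrays[0].shape[0],
    width=arrays[0].shape[1],
    count=len(arrays),        # one band per input feature
    dtype='float32',
    transform=transform,
) as dst:
    for band, (fid, arr) in enumerate(zip(gdf.index, arrays), start=1):
        dst.write(arr.astype('float32'), band)
        dst.set_band_description(band, str(fid))   # record the feature ID on the band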

ZZMitch commented 1 week ago

Hey @dbaston, thanks for the information!

I am not sure how it would be applied, but I enjoy using xarray (which is backed by dask and numpy) to work with larger-than-memory and multi-dimensional raster files. On the vector side, there is dask support for geopandas.

For larger-than-memory operations, could there be some process that runs on chunks of the input vector file and produces raster tiles saved to disk, which could later be combined with xarray operations? I am not sure about naming; generally I am interested in rasterizing to a single layer (i.e., not different %s for different vector attributes).
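For the combining step, I am imagining something like this with rioxarray (tile_paths is just a placeholder for whatever naming scheme the tiles end up with):

import rioxarray
from rioxarray.merge import merge_arrays

# Open each tile written by the chunked process and stitch them into one grid
tiles = [rioxarray.open_rasterio(p) for p in tile_paths]
mosaic = merge_arrays(tiles)
mosaic.rio.to_raster('coverage_fraction_mosaic.tif')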

Alternatively, if the operation would use too much memory, the function could just error out, and it would be up to the user to chunk their vector file into pieces that fit in memory and combine the rasterized outputs later?