nens / dask-geomodeling

On-the-fly operations on geographical maps.
https://dask-geomodeling.readthedocs.org
BSD 3-Clause "New" or "Revised" License
29 stars 5 forks source link

AggregateRaster (Zonal statistics) discrepancy with QGIS Zonal statistics #43

Open tastatham opened 4 years ago

tastatham commented 4 years ago

Hi Dask-geomodeling,

I came across this package a few months back and @caspervdw very kindly provided an example for applying zonal statistics. See here for the implementation we are using.

I have spot checked the zonal statistic outputs from your package with QGIS zonal statistics and the outputs are very different. For example; comparing the zonal statistics outputed from QGIS (GHSLsum) and dask-geomodeling (ghsl) we can see there is a clear difference. ghsl The highlighted cell in yellow is a 1km2 vector grid cell and the two highlighted red cells are the ghsl raster layers (250m). Just these two raster cells alone have a VALUE of 7.29 and 32.12, which suggests the QGIS approach is more realistic.

I believe that these differences is due to the fact that the QGIS zonal statistics includes a weight (fraction of each cell covered by the polygon), that takes into account partial overlays, whereas the dask-geomodeling approach does not.

Is this assumption is correct, are you planning to add this functionality in later releases of this package?

caspervdw commented 4 years ago

Hi @tastatham , thanks for your report.

I wasn't aware on the differences in approach between QGis and dask-geomodeling. It is an interesting idea to weigh the partial pixel overlap.

Which approach is the best approach would however depend on your problem. I think this is only useful for large cell sizes, for smaller cell sizes the effect would be negligible (but the effect on performance would not be).

I don't for see that we are going to work on this functionality in the near future though. If you're up to it, we would welcome a contribution in this area.

tastatham commented 4 years ago

@caspervdw yes I agree, it's entirely dependent on your problem.

I was considering applying a function for partial overlaps for this package myself but I came across this command line tool: https://github.com/isciences/exactextract, which is written entirely in C++. As such, this implementation is very fast and accounts for on the fly computations, much like QGIS. The documentation here explains other common utilities and how they differ. I thought I would share this with you, as this could be useful your end/other users.

caspervdw commented 4 years ago

That looks very interesting, I didn't know this library. Tagging @arjanverkerk as he will probably be interested as well.

Just wondering have you tried https://rasterio.readthedocs.io/en/latest/?

tastatham commented 4 years ago

I have used rasterio but to my knowledge, rasterio does not support zonal statistics directly? https://pythonhosted.org/rasterstats/ supports this operation. Like your package, this can be ran multithreaded e.g. https://github.com/perrygeo/python-rasterstats/blob/master/examples/multiproc.py but the cool thing about your package is that it offers the ability for lazy evaluation via Dask & a geometry tiler to define how much memory to allocate.

There are several other packages I have tested. Most packages out there do not account for partial overlaps because polygon clipping is expensive. I might create a notebook comparing common approaches that are available at present.