isciences / exactextractr

R package for fast and accurate raster zonal statistics
https://isciences.gitlab.io/exactextractr/
281 stars 26 forks source link

Possibility to add additional Statistics #30

Closed lbusett closed 4 years ago

lbusett commented 4 years ago

Hi,

thanks for this: really helpful and fast!

Since I am in the process of "switching" my "R" packages to usage of exact_extractr (abandoning velox, or "standard" raster::extract): would it be possible/useful to include additional statistic indicators, such as variance/standard deviation/variation coefficient ? I know I can extract all "pixel" values and then compute them myself, but I think having them ready-made could be useful.

HTH,

Lorenzo

dbaston commented 4 years ago

Yes, I can add variance etc. soon.

dbaston commented 4 years ago

So, I'm a bit unsure of how to handle coverage fraction when calculating variance.

I'm using this dummy data:

values <- c(3.4, 2.9, 1.7, 8.8, -12.7, 100.4, 8.4, 11.3)
weights <- c(1.0, 0.1, 1.0, 0.2,  0.44,   0.3, 0.3, 0.83)

I've implemented formula WV2 from West (1979) and am trying to check it against other implementations. I've seen this weighted.var referenced a couple of times. There are also implementations in various R packages.

> weighted.var(values, weights)
[1] 817.1162
> Hmisc::wtd.var(values, weights)
[1] 882.4848
> Hmisc::wtd.var(values, weights, normwt=TRUE)
[1] 817.1162
> Weighted.Desc.Stat::w.var(values, weights)
[1] 670.8578
> SDMTools::wt.var(values, weights)
[1] 817.1162

I've tentatively gone with the version resulting in 817, though I'm not confident of this. The 817 comes from the weights being interpreted as "reliability weights", 882 comes from them being interpreted as "frequency weights", and 670 is a population variance.

Any thoughts on the correct choice here?

dbaston commented 4 years ago

Looks like ArcGIS is using the population value: https://desktop.arcgis.com/en/arcmap/10.3/tools/spatial-analyst-toolbox/h-how-zonal-statistics-works.htm#GUID-CB4BBDF2-7A34-427D-A8EA-1BA1C0DB265E

lbusett commented 4 years ago

@dbaston

I am no statistician, so take this for what is worth... From a quick look, the "frequency weights" option does not seem to be useful in this use case, since weights are interpreted as counts of the values of the distribution:

 values <- c(1, 1, 10)
  weights <- c(1,1, 10)
  Hmisc::wtd.var(values, weights)
[1] 12.27273
 var(c(1,1, rep(10,10)))
[1] 12.27273

About the other two alternatives, I really do not know which one would be better...

dbaston commented 4 years ago

I've gone ahead and implemented population value for consistency with ArcGIS and QGIS. I'll wait for someone to request the others if there's an application for them.

lbusett commented 4 years ago

Thanks!