isciences / exactextract

Fast and accurate raster zonal statistics
Apache License 2.0

Output histogram as tuple-formatted string as separate column? #44

Closed trcull closed 10 months ago

trcull commented 1 year ago

If I'm reading the code in raster_stats.h correctly, it looks like you already have the count of whatever ends up being the majority and the minority cell value. It would be TREMENDOUSLY helpful if those counts could be output, themselves, as metrics, too, as a kind of histogram. For example, I'd run something like this:

exactextract -r "the_metric:/root/tmpdata/USGS_NLCD.tif[1]" -p /root/tmpdata/h3_res10_dfw.shp -f h3_cell -s "majority(the_metric)" -s "hist(the_metric)" -o /root/tmpdata/USGS_NLCD.csv

Where hist() outputs a JSON-formatted string of [value, count] pairs, like this: "[[12,1.5],[15,3.0],[25,4.75]]"
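To make the proposed format concrete, here is a minimal sketch of how a downstream script might read such a column. Note that hist() is only the statistic requested in this issue, not an existing exactextract option, and parse_hist is a hypothetical helper name.

```python
import json

def parse_hist(cell: str) -> dict[int, float]:
    """Parse a string of [value, count] pairs into a {value: count} dict.

    Assumes the format proposed in this issue, e.g. "[[12,1.5],[15,3.0]]".
    """
    return {int(value): float(count) for value, count in json.loads(cell)}

hist = parse_hist("[[12,1.5],[15,3.0],[25,4.75]]")

# The majority value is the key with the largest (coverage-weighted) count.
majority_value = max(hist, key=hist.get)
```

Because the pairs are plain JSON arrays, any CSV reader plus a JSON parser would be enough to consume the column.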

This would tell me not only which value was the majority, but also what its count was and what all the other values' counts were. This would be kind of buyer-beware in the sense that it might output unhelpfully-large histograms for rasters that aren't categorical. But it's tremendously useful for categorical data.

As long as you're in there, simple majority_count() and minority_count() metrics, each giving the count of the value that ended up winning, would also be super useful and nearly free, and easier to work with in some circumstances.

In case it's helpful, my use case is that I'm processing zonal stats across lots of disjoint raster tiles, in parallel, and overlaying an h3 grid on top of them. Then, for cases where an h3 cell partially overlaps two different tiles, I'd like to do some post-processing to determine the "true winner" of the majority. This would be a get-out-of-jail-free card for other kinds of post-processing, too, I imagine.
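The post-processing described above could look roughly like the sketch below, assuming each tile run yields (h3_cell, {value: count}) rows. The function names, the "cell_a" identifier, and the sample counts are all hypothetical and only illustrate the idea.

```python
from collections import defaultdict

def merge_tile_histograms(rows):
    """Sum per-value counts for each h3 cell across all tile results.

    rows: iterable of (h3_cell, {value: count}) tuples, one per tile run.
    """
    merged = defaultdict(lambda: defaultdict(float))
    for h3_cell, hist in rows:
        for value, count in hist.items():
            merged[h3_cell][value] += count
    return {cell: dict(hist) for cell, hist in merged.items()}

def true_majority(hist):
    """Return the (value, count) pair with the largest combined count."""
    return max(hist.items(), key=lambda item: item[1])

# One h3 cell that straddles two tiles (made-up numbers).
rows = [
    ("cell_a", {21: 3.0, 41: 2.5}),  # tile 1: majority is 21
    ("cell_a", {41: 1.0, 82: 1.5}),  # tile 2: majority is 82
]
merged = merge_tile_histograms(rows)
winner = true_majority(merged["cell_a"])
```

In this made-up case the combined winner is 41, which neither tile reports as its own majority. That is why the per-tile majority alone is not enough and the full counts are needed.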

I'd make an attempt at a PR myself but, sadly, it's been ~30 years since the last time I wrote C++.

dbaston commented 1 year ago

I'm hoping to have resources to work on this later in 2023. For now, the histogram functionality is available via https://github.com/isciences/exactextractr. For your use case, you might also consider constructing a VRT of the disjoint tiles to avoid edge effects.

dbaston commented 10 months ago

These counts can now be obtained by multiplying the count and frac statistics.
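As a rough illustration of that arithmetic, the sketch below assumes that, for a single polygon, you have the total count, the list of distinct raster values, and a parallel list of frac entries (one fraction per distinct value). How these are laid out in the output file is not described in this thread, so the inputs here are plain Python values and the helper name is hypothetical.

```python
def counts_from_fractions(count, values, fracs):
    """Recover per-value counts as count * frac for each distinct value.

    Assumes values and fracs are parallel sequences for one polygon.
    """
    if len(values) != len(fracs):
        raise ValueError("values and fracs must have the same length")
    return {value: count * frac for value, frac in zip(values, fracs)}

# Made-up numbers: 8 covered cells split across three land-cover classes.
hist = counts_from_fractions(8.0, [12, 15, 25], [0.25, 0.25, 0.5])
majority_value = max(hist, key=hist.get)
```

The resulting dictionary plays the same role as the histogram requested at the top of this issue, so it can feed the same kind of cross-tile post-processing.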