isciences / exactextract

Fast and accurate raster zonal statistics
Apache License 2.0

Make include_nodata argument available for built-in operations #147

Open dbaston opened 2 weeks ago

dbaston commented 2 weeks ago

Would make count(include_nodata=true) available as a more legible alternative to e.g. count(default_value=0). Not sure if it's worth it.

theroggy commented 2 weeks ago

Not sure what the effort is, so it is difficult to judge the cost/benefit, but it is at least quite a bit cleaner. In quite a few cases it might even be impossible to use the workaround.

Finding a value that is surely not in use, other than the actual nodata value, will at least be a fuss, and in some cases it will even be impossible without actually recoding the data, or using a fake value if that works?

E.g. for data that has been encoded to a byte, often all values are "taken", so choosing a default_value different from the actual nodata value can be "difficult". Maybe it is possible to just pick any int value... but anyway it is a hassle. Not sure why it is called default_value anyway instead of nodata or a variant.

dbaston commented 2 weeks ago

> Maybe it is possible to just pick any int value... but anyway it is a hassle.

I was thinking of "count", where it doesn't matter if the value is "taken." But I guess it's also useful for "unique" and "frac".
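To make the distinction concrete, here is a minimal pure-Python sketch, not exactextract code. It assumes that "count" is the sum of coverage fractions over cells that have a value, and it uses NaN to stand in for nodata. The helper names `fill` and `weighted_count` and the sample arrays are hypothetical.

```python
import math

def fill(values, default_value):
    # Substitute default_value wherever a cell is nodata (NaN in this sketch).
    return [default_value if math.isnan(v) else v for v in values]

def weighted_count(values, coverage):
    # Assumed meaning of "count": sum of coverage fractions of cells with a value.
    return sum(c for v, c in zip(values, coverage) if not math.isnan(v))

# Hypothetical byte-like data where both 0 and 255 are real values.
values = [0.0, 255.0, math.nan, 7.0]
coverage = [1.0, 0.5, 0.25, 1.0]

count_without_fill = weighted_count(values, coverage)
count_fill_0 = weighted_count(fill(values, 0.0), coverage)
count_fill_255 = weighted_count(fill(values, 255.0), coverage)

# For "unique", the filled cell becomes indistinguishable from a real 0.
unique_fill_0 = sorted(set(fill(values, 0.0)))
```

Under these assumptions the count is the same whichever substitute is chosen, because only the presence of a value matters. The unique list silently merges the nodata cell into an existing class, which is where a "taken" value becomes a problem.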

> Not sure why it is called default_value anyway instead of nodata or a variant.

I would expect nodata to specify a value to be ignored, whereas this is specifying a value to use in place of nodata. I think of it like an SQL COALESCE(value, default_value). For example, for a population raster that uses NaN for ocean cells, you would want these to be considered as 0 for most stats. The problem would really best be solved with a GDAL VRT, but it's unfortunately a bit cumbersome to construct one for this scenario.
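The COALESCE analogy can be sketched in plain Python. This is an illustration of the semantics described above, not the library's implementation: the `coalesce` and `weighted_mean` helpers, the sample population values, and the coverage fractions are all hypothetical, and NaN stands in for nodata.

```python
import math

def coalesce(value, default_value):
    # Like SQL COALESCE(value, default_value), with NaN playing the role of NULL.
    return default_value if math.isnan(value) else value

def weighted_mean(values, coverage):
    # Coverage-weighted mean that skips nodata cells entirely.
    pairs = [(v, c) for v, c in zip(values, coverage) if not math.isnan(v)]
    return sum(v * c for v, c in pairs) / sum(c for _, c in pairs)

# Hypothetical population raster: NaN marks ocean cells.
population = [100.0, math.nan, 50.0, math.nan]
coverage = [1.0, 1.0, 0.5, 0.5]

# Nodata ignored: only the two land cells contribute.
mean_ignoring_nodata = weighted_mean(population, coverage)

# Nodata replaced by 0: ocean cells count as zero population.
filled = [coalesce(v, 0.0) for v in population]
mean_with_default = weighted_mean(filled, coverage)
```

With these numbers the mean drops from about 83.3 to about 41.7 once ocean cells are treated as zero population, which is the behavior a `default_value` of 0 is meant to give.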

theroggy commented 2 weeks ago

> > Not sure why it is called default_value anyway instead of nodata or a variant.
>
> I would expect nodata to specify a value to be ignored, whereas this is specifying a value to use in place of nodata. I think of it like an SQL COALESCE(value, default_value). For example, for a population raster that uses NaN for ocean cells, you would want these to be considered as 0 for most stats. The problem would really best be solved with a GDAL VRT, but it's unfortunately a bit cumbersome to construct one for this scenario.

True... clear names (for everyone) are sometimes difficult to find. Once explained it does make sense... I also misunderstood how it worked, now I understand.

Adding a keyword will be clearer, but I suppose just documenting it properly with some examples should be OK as well? Having keywords for many different use cases that actually map to the same thing might make the API even less understandable in the end, even though that is not yet a problem with this one case.