locationtech / rasterframes

Geospatial Raster support for Spark DataFrames
http://rasterframes.io
Apache License 2.0

Group values by nearest decimal place rather than integer in histograms #519

Closed: courtney-layman closed this issue 3 years ago

courtney-layman commented 3 years ago

In rf_agg_approx_histogram and rf_tile_histogram, would it be possible to add an option to group values to the nearest decimal place rather than the nearest integer? After performing z-score normalization on some Landsat images, I am seeing some really sparse histograms. It would be useful to be able to group by float values with 1 decimal place.
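To make the request concrete, here is a minimal sketch of grouping values to one decimal place versus to the nearest integer. It uses only the Python standard library and does not call RasterFrames; the function name `histogram_by_decimals` and the sample values are hypothetical and stand in for z-scored cell values.

```python
from collections import Counter


def histogram_by_decimals(values, decimals=1):
    """Count values after rounding each one to `decimals` decimal places.

    Hypothetical helper for illustration only; it is not part of RasterFrames.
    """
    counts = Counter(round(v, decimals) for v in values)
    # Sort by bin value so the result reads like a histogram.
    return sorted(counts.items())


# Made-up sample of z-scored cell values (assumption, not real Landsat data).
z_values = [-1.24, -1.21, 0.02, 0.04, 0.31, 1.18]

by_tenth = histogram_by_decimals(z_values, decimals=1)
by_integer = histogram_by_decimals(z_values, decimals=0)
```

With `decimals=1` the sample falls into four bins (-1.2, 0.0, 0.3, 1.2), while `decimals=0` leaves only three (-1.0, 0.0, 1.0), which shows how much detail integer grouping discards for data that mostly lies within a few units of zero.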

vpipkt commented 3 years ago

For context, I first want to confirm: is this happening with a floating-point cell_type?

And by sparse histogram, do you mean that many bins have very low counts?

Can you confirm the issue is happening with rf_tile_histogram? The implementation under the hood should already be returning float-valued breaks.

To answer your question more directly, at this point it would be a pretty big implementation effort to make such an option available, especially in the case of the aggregate histogram.

metasim commented 3 years ago

The Scala API provides the option to set the number of buckets. Would this help if available in Python?

https://github.com/locationtech/rasterframes/blob/19191ba6c13bc6be0b92e50d5b2914e976b28483/core/src/main/scala/org/locationtech/rasterframes/functions/AggregateFunctions.scala#L58
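As a rough illustration of what a configurable bucket count changes, the sketch below bins values into a fixed number of equal-width buckets. It is an assumption-laden stand-in written in plain Python: `fixed_bucket_histogram` is a hypothetical name, and equal-width binning is not necessarily how the RasterFrames aggregate (an approximate histogram) chooses its breaks.

```python
def fixed_bucket_histogram(values, num_buckets):
    """Return (lower, upper, count) triples for equal-width buckets.

    Hypothetical helper for illustration; not the RasterFrames implementation.
    """
    if num_buckets < 1:
        raise ValueError("num_buckets must be at least 1")
    lo, hi = min(values), max(values)
    if lo == hi:
        # All values are identical, so a single bucket holds everything.
        return [(lo, hi, len(values))]
    width = (hi - lo) / num_buckets
    counts = [0] * num_buckets
    for v in values:
        # Clamp so the maximum value lands in the last bucket.
        idx = min(int((v - lo) / width), num_buckets - 1)
        counts[idx] += 1
    return [(lo + i * width, lo + (i + 1) * width, c) for i, c in enumerate(counts)]


sample = [0.0, 1.0, 2.0, 3.0, 4.0]
buckets = fixed_bucket_histogram(sample, num_buckets=4)
```

Raising `num_buckets` gives finer breaks over the same value range, which is the kind of control the bucket-count option would offer without needing a separate "decimal places" setting.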

courtney-layman commented 3 years ago

I think we can close this, since @vpipkt found a bug in my code that was converting the floats to integers. It might still be useful to have an option for setting the number of buckets, though.
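The thread does not show the actual bug, so the following is only a hypothetical sketch of how an accidental integer cast can make a histogram of z-scored values look integer-grouped. The function name `distinct_before_and_after_cast` and the sample values are made up for illustration.

```python
def distinct_before_and_after_cast(values):
    """Return how many distinct values exist as floats and after int() casting.

    Hypothetical helper; int() truncates toward zero, as a typical cast does.
    """
    as_float = set(values)
    as_int = {int(v) for v in values}
    return len(as_float), len(as_int)


# Made-up z-scored values (assumption, not data from the issue).
z_values = [-1.24, -0.61, -0.02, 0.04, 0.31, 0.87, 1.18]
n_float, n_int = distinct_before_and_after_cast(z_values)
```

Here seven distinct float values collapse to three integers (-1, 0, 1), so any histogram computed after the cast can only have breaks at whole numbers, regardless of how the histogram function itself handles floats.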