groutr opened 1 year ago
A quick benchmark using some datasets from your blog @perrygeo (https://www.perrygeo.com/zonal-stats-with-postgis-rasters-part-2.html)
```python
import time

import rasterstats
from rasterstats import zonal_stats

print(rasterstats)  # confirm which installed version is being benchmarked

start = time.perf_counter()
stats = zonal_stats(
    "ne_50m_admin_0_countries/ne_50m_admin_0_countries.shp",
    "wc2.1_2.5m_tavg/wc2.1_2.5m_tavg_07.tif",
    stats=[
        "unique", "range", "std", "sum", "mean", "count",
        "min", "max", "median",
        "percentile_.1", "percentile_.5", "percentile_.9",
    ],
)
print(time.perf_counter() - start)
```
On my laptop, the latest version of rasterstats runs in about 11.85 s. With this PR, that runtime drops to about 8.61 s.
Initial testing
```python
zonal_stats(polygon, raster, stats=["min", "max", "count"])
```

```shell
command time --verbose python bench.py
```
master (fastest, lowest memory footprint)

```
Maximum resident set size (kbytes): 2458736
Elapsed (wall clock) time: 0:15.30
```

groutr:perf1

```
Maximum resident set size (kbytes): 2742956
Elapsed (wall clock) time: 0:22.77
```
So, as I suspected: when the user requests only basic stats, this PR increases the memory footprint (+12% in this case) and the run time (+48%).
There are a significant number of duplicates in this dataset (the tif contains integer elevation values, which are highly spatially autocorrelated), so I'd expect datasets with larger integers or floats to have more unique values and worse numbers. But for now, we'll just focus on this dataset...
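The PR's internals aren't shown in this thread, but the unique-values idea it relies on can be sketched roughly like this (the function name and exact bookkeeping are hypothetical, not the PR's actual code): derive the basic stats from `(value, count)` pairs so each distinct pixel value is handled once.

```python
import numpy as np

def stats_from_unique(arr):
    """Hypothetical sketch: compute basic stats from (value, count) pairs
    returned by np.unique, instead of re-traversing the raw pixel array."""
    values, counts = np.unique(arr, return_counts=True)  # values come back sorted
    total = int(counts.sum())
    weighted_sum = float((values * counts).sum())
    return {
        "min": values[0],            # sorted, so first value is the minimum
        "max": values[-1],           # and last value is the maximum
        "count": total,
        "sum": weighted_sum,
        "mean": weighted_sum / total,
        "unique": len(values),
    }
```

With highly duplicated integer data like this elevation tif, `values` stays tiny relative to the raster; with noisy floats it approaches the full pixel count, which is where the regression discussed below would come from.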
Let's add `unique` to invoke `np.unique` calls in both:
```python
zonal_stats(polygon, raster, stats=["min", "max", "count", "unique"])
```

```shell
command time --verbose python bench.py
```
master (lowest memory footprint)

```
Maximum resident set size (kbytes): 2458940
Elapsed (wall clock) time: 0:25.61
```

groutr:perf1 (fastest)

```
Maximum resident set size (kbytes): 2743124
Elapsed (wall clock) time: 0:22.59
```
It's a draw. This PR is faster at the expense of an increased memory footprint.
Let's try it with every statistic we can calculate:
```python
zonal_stats(polygon, raster, stats="*")
```

```shell
command time --verbose python bench.py
```
master

```
Maximum resident set size (kbytes): 3435952
Elapsed (wall clock) time: 0:35.89
```

groutr:perf1 (fastest, lowest memory footprint)

```
Maximum resident set size (kbytes): 3149592
Elapsed (wall clock) time: 0:25.81
```
This PR shines :sparkles:
There's no definitive answer, only tradeoffs :-)
This PR's approach is faster, with a lower memory footprint, when all the stats are requested.
This PR's approach is slower, with a higher memory footprint, when only a small subset of stats is requested and there are fewer duplicate pixel values.
For this dataset, the number of unique values was 4 orders of magnitude lower than the total count, i.e. plenty of duplicates. We can intuit that the memory footprint and performance of this PR would worsen as the number of unique values approached the total count. If I have time in the future, I'd like to test with a floating point grid of random noise to simulate the worst case scenario.
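The worst-case intuition can be probed without a real raster. This is an illustrative sketch (not part of the PR, and not a substitute for the proposed floating-point grid test): random floats have essentially no duplicates, so `np.unique` returns roughly N values and its sort dominates the cost, while elevation-like integers collapse to a few thousand uniques.

```python
import time

import numpy as np

n = 1_000_000
elev_like = np.random.randint(0, 3000, size=n)  # many duplicates, few uniques
noise = np.random.random(n)                     # effectively all values unique

for name, arr in [("integer elevation-like", elev_like), ("float noise", noise)]:
    start = time.perf_counter()
    values, _counts = np.unique(arr, return_counts=True)
    elapsed = time.perf_counter() - start
    print(f"{name}: {len(values):>7} unique of {n} in {elapsed:.3f}s")
```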
We know that this PR improves performance in certain cases but introduces regressions in others; it's not a clear win over the current approach. The question is: on average, do the benefits outweigh the negatives across a wide variety of use cases? I'm going to need to see more benchmark data from real-world datasets to make the decision.
@groutr we could also try to make this `np.unique` approach optional, so that users could opt in to this behavior if they knew their datasets would benefit from it.
@groutr Would you like to split the percentile computation into a separate PR? That seems like a clear win to me and I don't want to hold that part up due to the np.unique issues.
@perrygeo I will need to dig up my real-world test case today. IIRC, the changes in the PR yielded a dramatic performance increase.
> @groutr Would you like to split the percentile computation into a separate PR? That seems like a clear win to me and I don't want to hold that part up due to the np.unique issues.

Sure!
@perrygeo My real-world test case was flawed. The performance difference was largely accounted for by a change in formats that escaped my notice.
However, I've been thinking about this further and wanted to hear your thoughts on the following idea. What if the stats were broken out into functions that are called on the zone? For example, requesting the minimum of a zone would call a function that finds the minimum and returns it. This would allow users to supply their own functions, tailored to specific datasets, so computations could be optimized or even parallelized. The main loop of rasterstats would simply orchestrate the execution of these functions and provide the appropriate input data. At a very high level, it would map a list of functions over the zone data and return the collected results from each function.
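The idea above could be sketched like this (all names here are hypothetical, not rasterstats API): a registry of built-in stat functions, with user callables accepted alongside them, and a driver that maps the requested functions over one zone's pixel values.

```python
import numpy as np

# Hypothetical registry: each built-in stat is a plain function of the
# zone's valid pixel values.
BUILTIN_STATS = {
    "min": np.min,
    "max": np.max,
    "mean": np.mean,
    "count": len,
}

def compute_zone_stats(zone_values, requested):
    """Map requested stats (built-in names or user callables) over one zone."""
    results = {}
    for stat in requested:
        if callable(stat):
            # User-defined stat: keyed by the function's name.
            results[stat.__name__] = stat(zone_values)
        else:
            results[stat] = BUILTIN_STATS[stat](zone_values)
    return results

# A user-defined stat tailored to a specific dataset:
def p90(values):
    return np.percentile(values, 90)

zone = np.array([1.0, 4.0, 2.0, 8.0])
print(compute_zone_stats(zone, ["min", "mean", p90]))
```

The main loop would then only be responsible for extracting each zone's masked pixels and feeding them to this dispatcher, which is also a natural seam for parallelizing across zones.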
A small collection of performance enhancements aimed at reducing the number of times the data is traversed.
Also improved percentile computation by computing all percentiles at once instead of one at a time (and avoiding recreating the compressed array each time).
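The batched-percentile change can be illustrated in isolation (this is a sketch of the technique, not the PR's actual code): `np.percentile` accepts a sequence of q values, so one call on one compressed array replaces a loop that rebuilds the valid-pixel array for every percentile.

```python
import numpy as np

# Mask out nodata pixels and build the valid-pixel array once.
masked = np.ma.masked_equal(np.array([1, 2, 3, 4, 5, -999]), -999)
compressed = masked.compressed()

# One traversal computes all requested percentiles together.
p10, p50, p90 = np.percentile(compressed, [10, 50, 90])
print(p10, p50, p90)
```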