perrygeo / python-rasterstats

Summary statistics of geospatial raster datasets based on vector geometries.
BSD 3-Clause "New" or "Revised" License

Performance optimizations #268

Open groutr opened 1 year ago

groutr commented 1 year ago

A small collection of performance enhancements aimed at reducing the number of times the data is traversed.

This also improves percentile computation by computing all requested percentiles at once instead of one at a time (and avoids recreating the compressed array each time).
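For context, NumPy's np.percentile accepts a sequence of percentiles, so a single call over the compressed (unmasked) zone pixels can replace a per-percentile loop. A minimal sketch of the idea, with illustrative variable names rather than the PR's actual code:

import numpy as np

# Illustrative zone: a masked array standing in for one zone's pixels.
zone = np.ma.masked_values(
    np.array([[1.0, 2.0, -9999.0], [4.0, 5.0, 6.0]]), -9999.0
)

compressed = zone.compressed()          # materialize the valid pixels once
qs = [10, 50, 90]                       # requested percentiles
values = np.percentile(compressed, qs)  # one call computes all of them
percentiles = {f"percentile_{q}": v for q, v in zip(qs, values)}
print(percentiles)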

groutr commented 1 year ago

A quick benchmark using some datasets from your blog @perrygeo (https://www.perrygeo.com/zonal-stats-with-postgis-rasters-part-2.html)

import time

import rasterstats
from rasterstats import zonal_stats

print(rasterstats)  # confirm which rasterstats build is being benchmarked

start = time.perf_counter()
stats = zonal_stats(
    "ne_50m_admin_0_countries/ne_50m_admin_0_countries.shp",
    "wc2.1_2.5m_tavg/wc2.1_2.5m_tavg_07.tif",
    stats=[
        "unique", "range", "std", "sum", "mean", "count",
        "min", "max", "median",
        "percentile_.1", "percentile_.5", "percentile_.9",
    ],
)
print(time.perf_counter() - start)

On my laptop, the latest version of rasterstats runs in 11.85 s. With this PR, the runtime drops to 8.61 s (roughly 27% faster).

perrygeo commented 1 year ago

Initial testing

Setup

Results (basic stats)

zonal_stats(polygon, raster, stats=["min", "max", "count"])

command time --verbose python bench.py

master (fastest, lowest memory footprint)

Maximum resident set size (kbytes): 2458736
Elapsed (wall clock) time: 0:15.30

groutr:perf1

Maximum resident set size (kbytes): 2742956
Elapsed (wall clock) time: 0:22.77

So, as I suspected, when the user requests only basic stats, this PR increases the memory footprint (here +12%) and the runtime (+48%).

There is a significant number of duplicate values in this dataset (the tif holds integer elevations that are highly spatially autocorrelated), so I'd suspect that datasets with wider integer ranges or floats would have more unique values and the numbers would be worse. But for now, we'll just focus on this dataset.
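As a rough illustration of why the duplicate ratio matters under a unique-values approach (this is not the PR's actual code): once np.unique has returned the distinct values and their counts, several statistics fall out of that much smaller pair of arrays, but the cost of building it grows with the number of distinct values.

import numpy as np

# Illustrative zone data: integer elevations with many repeated values.
pixels = np.array([120, 120, 121, 121, 121, 130, 130, 130, 130, 135])

values, counts = np.unique(pixels, return_counts=True)  # sort-based, extra memory

stats = {
    "min": values[0],                # values are sorted, so the endpoints are min/max
    "max": values[-1],
    "count": counts.sum(),
    "sum": (values * counts).sum(),  # weighted by multiplicity
    "unique": len(values),
}
stats["mean"] = stats["sum"] / stats["count"]
print(stats)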

Results (basic stats + unique)

Let's add unique so that both versions invoke np.unique:

zonal_stats(polygon, raster, stats=["min", "max", "count", "unique"])

command time --verbose python bench.py

master: (lowest memory footprint)

Maximum resident set size (kbytes): 2458940
Elapsed (wall clock) time: 0:25.61

groutr:perf1 (fastest)

Maximum resident set size (kbytes): 2743124
Elapsed (wall clock) time: 0:22.59

It's a draw. This PR is faster at the expense of an increased memory footprint.

Results (all stats)

Let's try it with every statistic we can calculate:

zonal_stats(polygon, raster, stats="*")

command time --verbose python bench.py

master:

Maximum resident set size (kbytes): 3435952
Elapsed (wall clock) time: 0:35.89

groutr:perf1 (fastest, lowest memory footprint)

Maximum resident set size (kbytes): 3149592
Elapsed (wall clock) time: 0:25.81

This PR shines :sparkles:

Conclusion

There's no definitive answer, only tradeoffs :-)

This PR's approach is faster, with a lower memory footprint, when all the stats are requested.

This PR's approach is slower, with an increased memory footprint, when only a small subset of stats is requested and the pixel values contain fewer duplicates.

For this dataset, the number of unique values was 4 orders of magnitude lower than the total count, i.e. plenty of duplicates. We can intuit that the memory footprint and performance of this PR would worsen as the number of unique values approached the total count. If I have time in the future, I'd like to test with a floating point grid of random noise to simulate the worst case scenario.
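For anyone who wants to try that worst case, a GeoTIFF of float32 noise (essentially no duplicate values) can be generated with rasterio along these lines; the output path and grid size below are arbitrary choices, not anything from this thread:

import numpy as np
import rasterio
from rasterio.transform import from_origin

height, width = 4320, 8640  # roughly a 2.5-minute global grid
data = np.random.random((height, width)).astype("float32")

transform = from_origin(-180.0, 90.0, 360.0 / width, 180.0 / height)
with rasterio.open(
    "random_noise.tif", "w",
    driver="GTiff", height=height, width=width, count=1,
    dtype="float32", crs="EPSG:4326", transform=transform,
) as dst:
    dst.write(data, 1)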

We know that this PR improves performance for certain cases but introduces performance regressions in others; it's not a clear win over the current approach. The question is: on average, do the benefits outweigh the negatives across a wide variety of use cases? I'm going to need to see more benchmark data from real-world datasets to make a decision.

perrygeo commented 1 year ago

@groutr we could also try to make this np.unique approach optional, so that users could opt in to this behavior if they knew their datasets would benefit from it.
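From the caller's side, the opt-in might look something like the sketch below, assuming a hypothetical keyword argument (this argument does not exist in rasterstats today, and the file paths are placeholders):

from rasterstats import zonal_stats

# Hypothetical opt-in keyword: the unique-values code path would only
# run when explicitly requested by the user.
stats = zonal_stats("zones.shp", "elevation.tif", stats="*", unique_optimization=True)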

perrygeo commented 1 year ago

@groutr Would you like to split the percentile computation into a separate PR? That seems like a clear win to me and I don't want to hold that part up due to the np.unique issues.

groutr commented 1 year ago

@perrygeo I will need to dig up my real-world test case today. IIRC, the changes in the PR yielded a dramatic performance increase.

groutr commented 1 year ago

@groutr Would you like to split the percentile computation into a separate PR? That seems like a clear win to me and I don't want to hold that part up due to the np.unique issues.

Sure!

groutr commented 1 year ago

@perrygeo My real-world test case was flawed. The performance difference was largely accounted for by a change in formats that escaped my notice.

However, I've been thinking about this further and wanted to hear your thoughts on the following idea. What if the stats were broken out into functions that were called on the zone? For example, requesting the minimum of a zone would call a function that found the minimum and returned it. This would allow user-defined functions to be supplied, tailored to specific datasets, so that computations could be optimized or even parallelized. The main loop of rasterstats would simply orchestrate the execution of these functions and provide the appropriate input data. At a very high level, it would map a list of functions over the zone data and return the collected results from each function, as sketched below.
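A minimal sketch of that orchestration idea, assuming nothing about rasterstats internals (the registry and function names here are purely illustrative):

import numpy as np

# Built-in stats expressed as plain functions of one zone's valid pixels.
BUILTIN_STATS = {
    "min": np.min,
    "max": np.max,
    "mean": np.mean,
    "count": len,
}

def zone_stats(zone_pixels, requested, user_funcs=None):
    """Map the requested stat functions over one zone's pixel array."""
    funcs = dict(BUILTIN_STATS)
    funcs.update(user_funcs or {})  # user-defined stats extend or override built-ins
    return {name: funcs[name](zone_pixels) for name in requested}

# Example: a user-supplied function tailored to a specific dataset.
def p90(pixels):
    return np.percentile(pixels, 90)

pixels = np.array([1.0, 2.0, 3.0, 4.0, 10.0])
print(zone_stats(pixels, ["min", "max", "p90"], user_funcs={"p90": p90}))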