sdtaylor opened this issue 1 year ago
Sdtaylor, that was a good idea... From an old fellow 😉
@sdtaylor that's a great result! By constraining the problem a bit, you've found some solid speedups here. I have some technical questions about the benchmarks, and some thoughts on including this in python-rasterstats...
I haven't run the code yet so I'm going on the assumption that the speedups can be credited to two things:
1) a single `rasterio.rasterize` call (one numpy allocation) vs one call per feature
2) less raster IO: a single big read vs one read per feature
I'd love to see the breakdown of the timing. In my experience, 2) is less of an issue on real-world datasets because GDAL's caching behavior and efficient block IO mean that, in aggregate, there is little advantage to loading rasters fully into memory vs relying on GDAL's RasterIO mechanism. It largely depends on how well the raster format is laid out on disk and the IO bus or network in between. The `slope.tif` test dataset is a toy and does not demonstrate block IO well. Which is to say, it would be good to test this on more realistic raster datasets, including COGs on S3, to see if the speedup holds.
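To make that suggestion concrete, here is a minimal timing sketch, assuming a hypothetical `cog_path` and an iterable of GeoJSON-like `features`; it contrasts per-feature windowed reads (which lean on GDAL's block cache) with a single full read:

```python
import time

import rasterio
from rasterio.windows import from_bounds
from shapely.geometry import shape


def time_windowed_reads(cog_path, features):
    """Read only the window covering each feature's bounds."""
    start = time.perf_counter()
    with rasterio.open(cog_path) as src:
        for feat in features:
            geom = shape(feat["geometry"])
            window = from_bounds(*geom.bounds, transform=src.transform)
            src.read(1, window=window)
    return time.perf_counter() - start


def time_full_read(cog_path):
    """Read the entire first band into memory in one call."""
    start = time.perf_counter()
    with rasterio.open(cog_path) as src:
        src.read(1)
    return time.perf_counter() - start
```

The gap between these two numbers on realistic, well-tiled data is exactly the thing worth measuring.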
The rasterize-all-at-once optimization is a really good one. Again, I'm interested in seeing more granular timing so we could find out whether it was, e.g., the rasterization algorithm or the numpy array allocations (and subsequent garbage collection) that were slowing things down. But that's mostly curiosity; I think it's clearly a speed win in certain cases.
This is a great optimization, but it does change the behavior, so I don't think it belongs in the current `zonal_stats` function. I think we should add it as a separate function with a distinct name (`zonal_stats_full_coverage`?) and we can factor out common pieces later if needed.
The additional constraints are acceptable in some circumstances, but users should be aware that the results will never be the same as the other algorithms'. Specifically, not handling overlap of any kind (at either the geometry or the pixel level) runs the risk of "losing" features; IOW we can no longer guarantee that the set of input features will correspond 1:1 to the output stats. Worse, it means that any overlap could hide parts of other valid polygons, making the statistics invalid. For many applications of rasterstats, this would be a non-starter.
This also means that polygon order has an impact on the result. Another caveat to document: the geometries need to be sorted in priority order if you suspect overlap. This could bias the results, as geometries painted last would have slightly larger areas (they would "win" the edge pixels).
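One possible way to apply that caveat, as a sketch only: paint larger polygons first and smaller ones last, so that no small polygon is swallowed entirely where it overlaps a big one (`features` here is a placeholder for GeoJSON-like dicts):

```python
from shapely.geometry import shape


def sort_for_rasterize(features):
    """Order features largest-area first so small polygons are painted last."""
    return sorted(features, key=lambda f: shape(f["geometry"]).area, reverse=True)
```

This only changes which feature wins contested pixels; it does not remove the edge-pixel bias itself.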
It's not clear that every use case would see an improvement either; reading the entire raster is not a good move unless you have close to 100% coverage of its extent. Sparse vector datasets without full coverage will waste IO by fetching the entire raster. Just a caveat to document: applying this optimization depends on knowing both your raster and your vector dataset well.
The unconstrained memory usage is pretty concerning. Most production use cases of rasterstats that I've personally worked on involve some sort of container infrastructure like k8s, where you're likely to be memory-constrained and dealing with arbitrary inputs; pure speed is less important than staying within some memory bound, since that characteristic lets you scale economically. Example: if you can run 10 containers × 1 GB in 6 minutes vs 1 container with 64 GB in 1 minute, the horizontally-scalable-but-6x-slower solution still wins in terms of economics (assuming cost scales roughly with GB-minutes). As I'm often dealing with arbitrary user-input rasters and geometries, having the ability to keep memory reasonably constrained is critical.
This optimization is firmly in the "one raster band, many geometries" camp. There is much interest in the opposite problem, "a few geometries, thousands of raster bands", for which this approach would clearly not be optimal. Another caveat for the docs.
So: it has different (potentially incorrect) behavior, is optimized for a polygon dataset with nearly full coverage, and uses more resources... It's not going to replace `zonal_stats` for every use case. To be completely honest, I don't see myself using this in its current form. Speed is important, but correctness and scalability matter more for the type of work I do, which is mostly backend web services. If you're on a mondo workstation trying to crank through a big analysis in a notebook, and you're manually inspecting your inputs and outputs, this could be perfect. Different operational environments have different tradeoffs.
We could potentially modify the algorithm to solve the correctness issues somehow. Or at least produce a warning along the lines of:

Input:  345 features, area 1,000,000 sq units
Output: 326 features, area 889,000 sq units
(Ignoring 19 features and losing 111,000 sq units of coverage)
If we could track the features that were hidden or whose footprint was altered by overlap, we could run those geometries separately. Tracking the feature shapes and running those overlap checks might be expensive enough to negate all the optimizations, though. 🤔
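A rough sketch of what such a check could look like, assuming a `labels` array produced by rasterizing `(geom, id)` pairs with ids 1..N, a matching list of input `features`, and a known `pixel_area`; the 50% threshold is arbitrary and purely illustrative:

```python
import warnings

import numpy as np
from shapely.geometry import shape


def check_coverage(labels, features, pixel_area, background=0):
    """Warn when rasterized features were hidden or clipped by overlap."""
    ids, counts = np.unique(labels[labels != background], return_counts=True)
    painted = dict(zip(ids.tolist(), counts.tolist()))
    lost, clipped = [], []
    for i, feat in enumerate(features, start=1):  # ids assigned 1..N at rasterize time
        expected = shape(feat["geometry"]).area / pixel_area  # rough pixel-count estimate
        got = painted.get(i, 0)
        if got == 0:
            lost.append(i)
        elif got < 0.5 * expected:  # arbitrary threshold, for illustration only
            clipped.append(i)
    if lost or clipped:
        warnings.warn(
            f"{len(lost)} features were hidden entirely and {len(clipped)} lost a "
            "large part of their footprint to overlap; consider re-running them "
            "through the per-feature code path."
        )
    return lost, clipped
```

Features flagged this way could then be routed back through the existing per-feature path, at the cost of some of the speedup.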
Another thought: this doesn't have to be an all-or-nothing proposition. We could split the space into quadrants or smaller tiles and work in chunks (with some overhead to track the edges). That would help keep memory constrained.
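A sketch of the chunked idea, restricted to the mean for brevity; statistics that decompose over pixels (count, sum, min, max) can be accumulated window by window, while quantiles would need more care. Function and argument names here are illustrative, not rasterstats API:

```python
from collections import defaultdict

import numpy as np
import rasterio
from rasterio import features as rio_features


def chunked_zonal_mean(raster_path, geoms_with_ids):
    """geoms_with_ids: iterable of (geometry, integer id) pairs, ids >= 1."""
    geoms_with_ids = list(geoms_with_ids)
    sums = defaultdict(float)
    counts = defaultdict(int)
    with rasterio.open(raster_path) as src:
        nodata = src.nodata
        for _, window in src.block_windows(1):
            data = src.read(1, window=window)
            transform = src.window_transform(window)
            # In practice you would pre-filter geometries to this window's
            # bounds instead of rasterizing the full list for every block.
            labels = rio_features.rasterize(
                geoms_with_ids,
                out_shape=data.shape,
                transform=transform,
                fill=0,
                dtype="int32",
            )
            valid = labels != 0
            if nodata is not None:
                valid &= data != nodata
            idx = labels[valid]
            vals = data[valid].astype("float64")
            win_sums = np.bincount(idx, weights=vals)
            win_counts = np.bincount(idx)
            for i in np.nonzero(win_counts)[0]:
                sums[int(i)] += win_sums[i]
                counts[int(i)] += int(win_counts[i])
    return {i: sums[i] / counts[i] for i in counts}
```

Memory is then bounded by the block size rather than the full raster, which speaks directly to the scaling concern above.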
In any case, thanks @sdtaylor for this work. I'm happy to include it as an alternative zonal_stats function under a new function name!
I have interest in this. Our use case is composed of many geometries, analyzed against several single-band rasters.
If it's possible to cache the rasterized version of the polygons between runs, it would be even better in our scenario.
Happy to help!
This is a very healthy discussion.
I did not realize the implications of the optimization process.
Potentially this could be sorted out by changing the rasterization of the polygons: create one array with all cells touched by the polygons and another with the percentage of each cell covered by the polygon (a rough sketch follows below).
With that you could evolve the `all_touched` option into `pixel_treatment=all_touched | all_inside | weighted`.
Both the built-in statistics and the user-function statistics should account for these possibilities.
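A rough illustration of the "weighted" idea: rasterize each polygon on a grid `factor` times finer and block-average back down to approximate per-pixel coverage fractions. Projects like exactextract compute exact fractions; this supersampling trick is only an approximation, and the names below are placeholders:

```python
import numpy as np
from affine import Affine
from rasterio import features as rio_features


def coverage_fraction(geom, transform, out_shape, factor=10):
    """Approximate the fraction of each pixel covered by ``geom``."""
    fine_shape = (out_shape[0] * factor, out_shape[1] * factor)
    fine_transform = transform * Affine.scale(1.0 / factor)
    fine = rio_features.rasterize(
        [(geom, 1)],
        out_shape=fine_shape,
        transform=fine_transform,
        fill=0,
        dtype="uint8",
    )
    # Block-average the supersampled mask back down to the target grid.
    blocks = fine.reshape(out_shape[0], factor, out_shape[1], factor)
    return blocks.mean(axis=(1, 3))
```

The resulting fraction array could then be used as weights in the statistics instead of the binary all_touched/all_inside masks.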
I really liked the option of caching (saving) the rasterized polygons. That would save a good chunk of processing... I discussed this with a professor at UC Davis, and he mentioned that this step normally saves a lot of processing.
Matthew, as the father of this process, how complicated is it?
Very interested in the performance improvements mentioned here! Are there any updates since last year?
In particular, @hardcore-code and @george-silva's point about caching the rasterized polygons would offer significant improvements for my use case where I'm recalculating zonal stats in the same locations across a time series.
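For the time-series case, a small sketch of what such caching could look like, assuming every raster in the stack shares the same grid and transform; the file names and the bincount-based mean are illustrative only:

```python
from pathlib import Path

import numpy as np
import rasterio
from rasterio import features as rio_features


def get_labels(geoms_with_ids, template_path, cache_path="labels.npy"):
    """Rasterize (geom, id) pairs once and cache the label array on disk."""
    cache = Path(cache_path)
    if cache.exists():
        return np.load(cache)
    with rasterio.open(template_path) as src:
        labels = rio_features.rasterize(
            geoms_with_ids,
            out_shape=(src.height, src.width),
            transform=src.transform,
            fill=0,
            dtype="int32",
        )
    np.save(cache, labels)
    return labels


def zonal_means_for_stack(raster_paths, labels):
    """Reuse the cached labels for every raster in a time series."""
    results = []
    for path in raster_paths:
        with rasterio.open(path) as src:
            data = src.read(1).astype("float64")
            nodata = src.nodata
        valid = labels != 0
        if nodata is not None:
            valid &= data != nodata
        sums = np.bincount(labels[valid], weights=data[valid])
        counts = np.bincount(labels[valid])
        with np.errstate(invalid="ignore", divide="ignore"):
            results.append(sums / counts)  # index 0 is the background
    return results
```

The rasterize cost is then paid once per polygon set rather than once per timestep.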
Hi, I had to do some optimizations recently for zonal stats on a very large number of polygons, and I thought the method might fit well here.
Essentially, instead of iterating through each feature one at a time, I use the `rasterio.features.rasterize` function to label each pixel of the raster with an id for all features. Then some array reshaping is used to create a sorted masked array, and all stats can be generated in a vectorized fashion. It offers huge performance boosts when feature counts become large (see figure below). I set up just a few stats (['mean', 'max', 'min', 'sum', 'std']) to compare with the original zonal method as a proof of concept.
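For readers who haven't opened the linked branch, this is the general shape of the approach as described above (a sketch, not the actual code from the branch):

```python
import numpy as np
import rasterio
from rasterio import features as rio_features


def fast_zonal_stats(raster_path, geoms):
    """Return mean/min/max/sum/std per input geometry (ids assigned 1..N)."""
    shapes = [(geom, i) for i, geom in enumerate(geoms, start=1)]
    with rasterio.open(raster_path) as src:
        data = src.read(1).astype("float64")
        labels = rio_features.rasterize(
            shapes, out_shape=data.shape, transform=src.transform,
            fill=0, dtype="int32",
        )
        if src.nodata is not None:
            labels[data == src.nodata] = 0  # treat nodata pixels as background

    valid = labels != 0
    ids, vals = labels[valid], data[valid]
    n = len(shapes) + 1  # index 0 is the background

    counts = np.bincount(ids, minlength=n)
    sums = np.bincount(ids, weights=vals, minlength=n)
    sq_sums = np.bincount(ids, weights=vals * vals, minlength=n)
    with np.errstate(invalid="ignore", divide="ignore"):
        means = sums / counts
        stds = np.sqrt(sq_sums / counts - means ** 2)

    # min/max per id via one sort plus reduceat over the group boundaries
    mins = np.full(n, np.nan)
    maxs = np.full(n, np.nan)
    if ids.size:
        order = np.argsort(ids, kind="stable")
        sorted_ids, sorted_vals = ids[order], vals[order]
        starts = np.concatenate(([0], np.flatnonzero(np.diff(sorted_ids)) + 1))
        mins[sorted_ids[starts]] = np.minimum.reduceat(sorted_vals, starts)
        maxs[sorted_ids[starts]] = np.maximum.reduceat(sorted_vals, starts)

    return [
        {"mean": means[i], "min": mins[i], "max": maxs[i],
         "sum": sums[i], "std": stds[i]}
        for i in range(1, n)
    ]
```

The per-id reductions replace the per-feature Python loop entirely, which is where the speedup comes from as feature counts grow.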
Some downsides of this:
- overlapping geometries aren't handled, since each pixel can only receive a single feature id
- the entire raster band is read into memory at once
You can see the code here https://github.com/sdtaylor/python-rasterstats/tree/zonal_refactor_poc
And a testing/comparison script is below.
@perrygeo any interest in having this in rasterstats somehow?
Test code