pysal / tobler

Spatial interpolation, Dasymetric Mapping, & Change of Support
https://pysal.org/tobler
BSD 3-Clause "New" or "Revised" License
145 stars 30 forks source link

REF: use STRtree.query_bulk in _area_tables_binning #110

Closed martinfleis closed 3 years ago

martinfleis commented 3 years ago

xref #104

This is the latest version of single-core refactoring of area_interpolate. It turned out, that using pygeos-backed STRtree and its vectorized query_bulk in geopandas is even faster than the numba version @darribas mentioned in #104.

While pygeos is not strictly necessary and geopandas can use rtree for this operation, pygeos will bring speedup of the order of magnitude so I am adding it to requirements to make sure user have it installed.

codecov-io commented 3 years ago

Codecov Report

Merging #110 (c652fb2) into master (c10dbda) will decrease coverage by 6.72%. The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #110      +/-   ##
==========================================
- Coverage   41.34%   34.62%   -6.73%     
==========================================
  Files          11       11              
  Lines         549      491      -58     
==========================================
- Hits          227      170      -57     
+ Misses        322      321       -1     
Impacted Files Coverage Δ
tobler/area_weighted/area_interpolate.py 41.57% <100.00%> (-13.94%) :arrow_down:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update c10dbda...c652fb2. Read the comment docs.

martinfleis commented 3 years ago

Timings:

from tobler.area_weighted import area_interpolate
import geopandas

p = ("https://geographicdata.science/book/_downloads/"\
     "f2341ee89163afe06b42fc5d5ed38060/sandiego_tracts.gpkg")
src = geopandas.read_file(p)

p = ("https://geographicdata.science/book/_downloads/"\
     "d740a1069144baa1302b9561c3d31afe/sd_h3_grid.gpkg")
tgt = geopandas.read_file(p).to_crs(src.crs)

%timeit estimates = area_interpolate(src, tgt, ['total_pop'])

master - 2.4 s ± 23.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) this PR - 394 ms ± 35 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

darribas commented 3 years ago

+1 on going ahead with this approach. After several experiments and thinking, my sense with the issue is that the second test case was different balance of polygons, but also polygons were distributed differently over space, giving some bins very few but others a lot of polygons to check, and making the binning approach less efficient. Think of the case of western MSAs in the US, if we bin uniformly, the coastal buckets will have a lot more than the inlands... Not sure if that's the case, but it's a hunch.

More generally, what we need for this operation is a spatial index and, given the ecosystem now has very good off-the-shelf options, my vote would be to jump on that wagon and enjoy further improvements down the line.

martinfleis commented 3 years ago

@knaaptime there is a test_dasymetric.py file in the root folder, I guess that is not an intention :).

knaaptime commented 3 years ago

@martinfleis yep, not sure how that got there...

In general though this is greatT dramatically simplifies the code too. I'm all for it

sjsrey commented 3 years ago

Excellent improvement!