Closed martinfleis closed 3 years ago
Merging #110 (c652fb2) into master (c10dbda) will decrease coverage by
6.72%
. The diff coverage is100.00%
.
@@ Coverage Diff @@
## master #110 +/- ##
==========================================
- Coverage 41.34% 34.62% -6.73%
==========================================
Files 11 11
Lines 549 491 -58
==========================================
- Hits 227 170 -57
+ Misses 322 321 -1
Impacted Files | Coverage Δ | |
---|---|---|
tobler/area_weighted/area_interpolate.py | 41.57% <100.00%> (-13.94%) |
:arrow_down: |
Continue to review full report at Codecov.
Legend - Click here to learn more
Δ = absolute <relative> (impact)
,ø = not affected
,? = missing data
Powered by Codecov. Last update c10dbda...c652fb2. Read the comment docs.
Timings:
from tobler.area_weighted import area_interpolate
import geopandas
p = ("https://geographicdata.science/book/_downloads/"\
"f2341ee89163afe06b42fc5d5ed38060/sandiego_tracts.gpkg")
src = geopandas.read_file(p)
p = ("https://geographicdata.science/book/_downloads/"\
"d740a1069144baa1302b9561c3d31afe/sd_h3_grid.gpkg")
tgt = geopandas.read_file(p).to_crs(src.crs)
%timeit estimates = area_interpolate(src, tgt, ['total_pop'])
master - 2.4 s ± 23.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
this PR - 394 ms ± 35 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
+1 on going ahead with this approach. After several experiments and thinking, my sense with the issue is that the second test case was different balance of polygons, but also polygons were distributed differently over space, giving some bins very few but others a lot of polygons to check, and making the binning approach less efficient. Think of the case of western MSAs in the US, if we bin uniformly, the coastal buckets will have a lot more than the inlands... Not sure if that's the case, but it's a hunch.
More generally, what we need for this operation is a spatial index and, given the ecosystem now has very good off-the-shelf options, my vote would be to jump on that wagon and enjoy further improvements down the line.
@knaaptime there is a test_dasymetric.py
file in the root folder, I guess that is not an intention :).
@martinfleis yep, not sure how that got there...
In general though this is greatT dramatically simplifies the code too. I'm all for it
Excellent improvement!
xref #104
This is the latest version of single-core refactoring of
area_interpolate
. It turned out, that using pygeos-backedSTRtree
and its vectorizedquery_bulk
in geopandas is even faster than the numba version @darribas mentioned in #104.While pygeos is not strictly necessary and geopandas can use
rtree
for this operation, pygeos will bring speedup of the order of magnitude so I am adding it to requirements to make sure user have it installed.