Closed · knaaptime closed this pull request 3 years ago
Merging #130 (86223cc) into master (6945c6d) will decrease coverage by 2.55%. The diff coverage is 33.33%.
```diff
@@            Coverage Diff             @@
##           master     #130      +/-   ##
==========================================
- Coverage   82.52%   79.97%   -2.56%
==========================================
  Files          15       16       +1
  Lines         767      809      +42
==========================================
+ Hits          633      647      +14
- Misses        134      162      +28
```

| Impacted Files | Coverage Δ |
|---|---|
| tobler/dasymetric/raster_tools.py | 31.70% <31.70%> (ø) |
| tobler/dasymetric/__init__.py | 100.00% <100.00%> (ø) |
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
> I'm wondering if pygeos could do something to speed it up
The input here is JSON, right? I don't think there's a faster way than a loop now. GeoPandas has almost the same code in from_features.
Cool, thanks. In that case, I'm pretty happy with this.
OK, this should be good to go. It uses joblib to parallelize the geometry creation (performance example here). There are still a few ways I could imagine parallelizing this further. We could maybe use something like rioxarray to do the clipping in lines 72-73 (though I'm not sure that would also return the transform object, so we would need to write the intermediate rasters out to temp files, which would probably negate the performance gain from splitting things apart). But the slowest remaining pieces are the unary_union at the beginning and the rasterio shape creation at lines 85-86, both of which could easily be parallelized if we could spatially partition the geodataframe using its spatial index or something.
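As a rough illustration (not tobler's actual code), the chunked joblib pattern for the geometry-creation step might look like the sketch below; `_to_geoms` is a hypothetical toy worker standing in for the shapely geometry construction:

```python
from joblib import Parallel, delayed


def _to_geoms(chunk):
    # stand-in worker: in the real code this would build shapely
    # geometries from the raster shapes in the chunk
    return [rec * 2 for rec in chunk]


def parallel_convert(records, n_jobs=-1, chunksize=1000):
    if n_jobs == 1:
        # sequential fallback: skip joblib overhead entirely
        return _to_geoms(records)
    chunks = [records[i:i + chunksize] for i in range(0, len(records), chunksize)]
    # threads suffice for this toy worker; a heavier workload would
    # typically use joblib's default process-based backend
    results = Parallel(n_jobs=n_jobs, backend="threading")(
        delayed(_to_geoms)(c) for c in chunks
    )
    # flatten the per-chunk results back into one list
    return [g for chunk in results for g in chunk]
```

Exposing `n_jobs=1` as a sequential path also means joblib could stay an optional dependency.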
> performance example here
That is interesting! Why do you think there's such a difference between dask and joblib? Is that dask-scheduler overhead?
This looks great. A couple of thoughts: if you want to make joblib a hard dependency, this is fine. Otherwise, you could check n_jobs=1 and run it without the parallel implementation. The code should be what happens inside a chunk, applied to the entire table.

Yeah, dask with threads took as long or longer than the sequential approach, so I didn't even bother including it. I'll go ahead and update the gist for completeness.
That’s really interesting... I wonder why that extra overhead...
— Daniel Arribas-Bel (via email)
I have no idea what's up with these results, only that it's clear joblib is the winner. I'd be fine making it a hard dependency.

The real trouble, though, is that the motivation for this function was to support a workflow like this one, where the resulting gdf is used to clip an input like census data before passing the clipped data on to area_interpolate. But the clip operation absolutely explodes when using these raster polygons as masking features. I can take a single row of an input frame and clip that, but using the full input dataframe with a few hundred tracts takes hours and hours to complete.
I think the clip speed issue I mention above is twofold.

tl;dr: we can get around it by defining a more efficient clip function (which could also be parallelized):
```python
import geopandas as gpd
import pandas as pd


def iterclip(source, mask):
    # clip each source feature using only the mask polygons that
    # intersect it, instead of the unary_union of the whole mask
    tree = mask.sindex
    features = []
    for i in range(len(source)):
        geom = source.geometry.iloc[i]
        # sindex.query returns *positional* indices of candidate matches
        hits = tree.query(geom, predicate="intersects")
        submask = mask.iloc[hits]
        single = source.iloc[[i]]
        features.append(gpd.clip(single, submask))
    frame = gpd.GeoDataFrame(pd.concat(features), crs=source.crs)
    return frame
```
There's a gist here that walks through the issue. The test data has 400+ census polygons and roughly 24,000 polygons extracted from a raster. Calling clip on those data directly is a non-starter because (1) calling unary_union on the 24k frame takes a long time, and (2) the unary_union of the rasters eclipses the bbox of each feature in the census frame, so the spatial index isn't helping. Instead, I tried looping over each feature in the census df, taking only the raster polys that hit its spatial index and using those as the clip feature. Much faster. Is this an approach we might want to send upstream to geopandas?
> Is this an approach we might want to send upstream to geopandas?
Yes please.
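The principle behind the per-feature spatial-index query can be shown without any geo dependencies: for each source feature, keep only the mask features whose bounding boxes overlap it, rather than unioning the whole mask. A minimal sketch of that idea (the real code uses `mask.sindex`, an R-tree, rather than a linear scan):

```python
def bbox_intersects(a, b):
    # boxes are (minx, miny, maxx, maxy) tuples
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])


def candidates_per_feature(source_bboxes, mask_bboxes):
    # for each source box, the positional indices of overlapping mask
    # boxes, mimicking mask.sindex.query(geom, predicate="intersects")
    return [
        [j for j, mb in enumerate(mask_bboxes) if bbox_intersects(sb, mb)]
        for sb in source_bboxes
    ]
```

Because each clip then sees only a handful of mask polygons instead of a 24k-feature union, the per-feature loop stays cheap and is trivially parallelizable.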
This PR adds two new functions:

- get_gadm is a nice convenience to grab GADM data. It can be super handy when demoing tobler, but it also seems like it fits in here.
- extract_raster_features converts a subset of pixels to a geodataframe. There are several ways this could be useful in the library, and it's more performant than I was expecting, though it can still take a couple of minutes for large problems. The time-consuming part is this line that converts the pixel features to polygons. Again, I'm not terribly displeased with the performance, though I'm wondering if pygeos could do something to speed it up. What do you think @martinfleis?