Closed · knaaptime closed this pull request 3 years ago
Merging #130 (86223cc) into master (6945c6d) will decrease coverage by 2.55%. The diff coverage is 33.33%.
```diff
@@            Coverage Diff             @@
##           master     #130      +/-   ##
==========================================
- Coverage   82.52%   79.97%   -2.56%
==========================================
  Files          15       16       +1
  Lines         767      809      +42
==========================================
+ Hits          633      647      +14
- Misses        134      162      +28
```

| Impacted Files | Coverage Δ |
|---|---|
| tobler/dasymetric/raster_tools.py | 31.70% <31.70%> (ø) |
| tobler/dasymetric/__init__.py | 100.00% <100.00%> (ø) |
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
> I'm wondering if pygeos could do something to speed it up
The input here is JSON, right? I don't think there's a faster way than a loop now. GeoPandas has almost the same code in from_features.
Cool, thanks. In that case, I'm pretty happy with this.
OK, this should be good to go. It uses joblib to parallelize the geometry creation (performance example here). There are still a few ways I could imagine parallelizing this further. We could maybe use something like rioxarray to do the clipping in lines 72-73 (though I'm not sure that would also return the transform object, so we would need to write the intermediate rasters out to temp files, which would probably negate the performance gain from splitting things apart). But the slowest remaining pieces are the unary_union at the beginning and the rasterio shape creation at lines 85-86, both of which could easily be parallelized if we could spatially partition the geodataframe using its spatial index or something.
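As a rough illustration (not tobler's actual code), the chunked joblib pattern for the geometry-creation step might look like the sketch below; `_to_geoms` is a hypothetical toy worker standing in for the shapely geometry construction:

```python
from joblib import Parallel, delayed


def _to_geoms(chunk):
    # stand-in worker: in the real code this would build shapely
    # geometries from the raster shapes in the chunk
    return [rec * 2 for rec in chunk]


def parallel_convert(records, n_jobs=-1, chunksize=1000):
    if n_jobs == 1:
        # sequential fallback: skip joblib overhead entirely
        return _to_geoms(records)
    chunks = [records[i:i + chunksize] for i in range(0, len(records), chunksize)]
    # threads suffice for this toy worker; a heavier workload would
    # typically use joblib's default process-based backend
    results = Parallel(n_jobs=n_jobs, backend="threading")(
        delayed(_to_geoms)(c) for c in chunks
    )
    # flatten the per-chunk results back into one list
    return [g for chunk in results for g in chunk]
```

Exposing `n_jobs=1` as a sequential path also means joblib could stay an optional dependency.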
> performance example here
That is interesting! Why do you think there's such a difference between dask and joblib? Is that dask-scheduler overhead?
This looks great. A couple of thoughts: if you want to make joblib a hard dependency, this is fine. Otherwise, you could check n_jobs=1 and run it without the parallel implementation. The code should be what happens inside a chunk, applied to the entire table.

Yeah, dask with threads took as long or longer than the sequential approach, so I didn't even bother including it. I'll go ahead and update the gist for completeness.
That’s really interesting... I wonder why that extra overhead...
— Daniel Arribas-Bel (via email)
I have no idea what's up with these results, only that it's clear joblib is the winner. I'd be fine making it a hard dependency.

The real trouble, though, is that the motivation for this function was to support a workflow like this one, where the resulting gdf is used to clip an input like census data before passing the clipped data on to area_interpolate. But the clip operation absolutely explodes when using these raster polygons as masking features. I can take a single row of an input frame and clip that, but using the full input dataframe with a few hundred tracts takes hours and hours to complete.
I think the clip speed issue I mention above is twofold.

tl;dr: we can get around it by defining a more efficient clip function (which could also be parallelized):
```python
import geopandas as gpd
import pandas as pd


def iterclip(source, mask):
    # clip each source feature using only the mask polygons that
    # intersect it, instead of the unary_union of the whole mask
    tree = mask.sindex
    features = []
    for i in range(len(source)):
        geom = source.geometry.iloc[i]
        # sindex.query returns *positional* indices of candidate matches
        hits = tree.query(geom, predicate="intersects")
        submask = mask.iloc[hits]
        single = source.iloc[[i]]
        features.append(gpd.clip(single, submask))
    frame = gpd.GeoDataFrame(pd.concat(features), crs=source.crs)
    return frame
```
There's a gist here that walks through the issue. The test data has 400+ census polygons and roughly 24,000 polygons extracted from a raster. Calling clip on those data directly is a non-starter because (1) calling unary_union on the 24k frame takes a long time, and (2) the unary_union of the rasters eclipses the bbox of each feature in the census frame, so the spatial index isn't helping. Instead, I tried looping over each feature in the census df, taking only the raster polys that hit its spatial index and using those as the clip feature. Much faster. Is this an approach we might want to send upstream to geopandas?
> Is this an approach we might want to send upstream to geopandas?
Yes please.
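The principle behind the per-feature spatial-index query can be shown without any geo dependencies: for each source feature, keep only the mask features whose bounding boxes overlap it, rather than unioning the whole mask. A minimal sketch of that idea (the real code uses `mask.sindex`, an R-tree, rather than a linear scan):

```python
def bbox_intersects(a, b):
    # boxes are (minx, miny, maxx, maxy) tuples
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])


def candidates_per_feature(source_bboxes, mask_bboxes):
    # for each source box, the positional indices of overlapping mask
    # boxes, mimicking mask.sindex.query(geom, predicate="intersects")
    return [
        [j for j, mb in enumerate(mask_bboxes) if bbox_intersects(sb, mb)]
        for sb in source_bboxes
    ]
```

Because each clip then sees only a handful of mask polygons instead of a 24k-feature union, the per-feature loop stays cheap and is trivially parallelizable.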
This PR adds two new functions:

- get_gadm is a nice convenience to grab GADM data. It can be super handy when demoing tobler, but it also seems like it fits in here.
- extract_raster_features converts a subset of pixels to a geodataframe. There are several ways this could be useful in the library, and it's more performant than I was expecting, though it can still take a couple of minutes for large problems. The time-consuming part is this line that converts the pixel features to polygons. Again, I'm not terribly displeased with the performance, though I'm wondering if pygeos could do something to speed it up. What do you think @martinfleis?