mapbox / robosat

Semantic segmentation on aerial and satellite imagery. Extracts features such as: buildings, parking lots, roads, water, clouds
MIT License
2.02k stars 383 forks source link

Optimizing rs merge #208

Closed Venka97 closed 4 years ago

Venka97 commented 4 years ago

rs merge is currently very slow for files with a large number of polygons (>20k). The merging part can be parallelized but the graph building takes very long.

I will benchmark the results of graph construction, traversal etc. and post the results.

Open for a discussion to improve this, if anyone has faced a similar issue, please feel free to contribute!

daniel-j-h commented 4 years ago

Here is the graph: https://github.com/mapbox/robosat/blob/b593a6b12957649767aaeb2dc60951e453a088ca/robosat/graph/core.py

It's a simple hand-rolled undirected graph, with depth-first search and connected component clustering capabilities.

Per vertex s it is storing all outgoing edges to vertices t_n in a set. And because we want an undirected graph, we do the same for every vertex t for all outgoing edges to vertices s_n. The graph semantic is super simple to implement with a defaultdict(set) but there's quite some overhead in it.

If your benchmark results show the graph construction or traversal to take the majority of time spent here, you have two options:

If you are actively using this, I highly recommend you looking at the issues and potentially first upgrading dependencies.

I no longer actively maintain this and I am no longer with Mapbox.

daniel-j-h commented 4 years ago

Hey @Venka97 - what came out of this? Have you looked into it?

Venka97 commented 4 years ago

Hey @daniel-j-h, I was occupied this week with some other work. I will do it as soon as I get free and post an update.

Venka97 commented 4 years ago

Hey, @daniel-j-h , so I tried integrating undirected graph using something pre-made. I tried out both networkx and graph-tool and there wasn't much difference in graph construction. If anything, networkx is slightly slower than your implementation.

I am yet to look into more efficient graph representations.

Venka97 commented 4 years ago

@daniel-j-h, I think have managed to solve the issue. The graph building was taking a lot of time but it didn't have anything to do with it.

From pyproj documentation, given that we use same transformations, i.e. epsg:4346 -> equal area and back, we can pre-initialize it rather than passing the CRS as args in buffer and unbuffer methods.

Doing this gave me a massive improvement, for a sample geojson before it took me ~23 minutes for graph building loop. After this, it took ~30 seconds.

daniel-j-h commented 4 years ago

Cool! Can you open a pull request with this change? 🤗

On August 12, 2020 10:00:18 AM UTC, Venkatesh notifications@github.com wrote:

@daniel-j-h, I have managed to solve the issue. The graph building was taking a lot of time but it didn't have anything to do with it.

From pyproj documentation, given that we use same transformations, i.e. epsg:4346 -> equal area and back, we can pre-initialize it rather than passing the CRS as args in buffer and unbuffer methods.

Doing this gave me a massive improvement, for a sample geojson before it took me ~23 minutes for graph building loop. After this, it took ~30 seconds.

Venka97 commented 4 years ago

Sure, I'll do that!

Venka97 commented 4 years ago

Opened #209