Fix + improve performance for join with shared coords=false

theroggy commented 2 years ago

Code to find junctions with shared_coords=False was quite slow and according to some tests (both mine + stated in unit tests) didn't seem to give consistent results. Because of the slow performance I opted for a reimplementation rather than trying to fix. The new implementation is based on using spatial overlays (intersection) instead of looping over points.

A test on a 2.5 MB geopackage with (multi)polygons showed the following timings to create the topology:

shared_coords=True: ~10 s
shared_coords=False, original implementation (release version): ~160 s
shared_coords=False, new implementation in this PR: ~ 18 s There are definitely more opportunities for performance improvements, but the "join" phase seemed to be the main bottleneck.

Regarding the resulting junctions, there were some choices to be made:

"real" junctions: the coordinates where two features touch for a distance > 1 point, are (should be) detected consistently now.
"artificial" junctions: the original javascript topojson implementation also adds all start and end points of linestring objects (not for polygons/rings) as junctions. As they state in the comment in the code linked above this is to simplify some code regarding rotating rings versus splitting later on in the cut step. Because this is a kind of obscure way to force the non-rotating behaviour of lines and because it is easy to implement in a more transparent way in cut I opted not to add these "artificial" junctions. Additionally, artificially increasing the number of junctions is bound to reduce performance in the cut step.
- PS: the original python code didn't seem to be very consistent in adding these end points either: when I added this behaviour, more unit tests showed up "red" than without.

I had to change quite some "asserts" in unit tests, for most of them I tried to deduct if the change was logical based on the input data and this seemed to be the case.

Obviously, feedback very welcome!

references #178

theroggy commented 2 years ago

For reference, the comment copy/pasted from javascript topojson implementation.

// Given an extracted (pre-)topology, identifies all of the junctions. These are
// the points at which arcs (lines or rings) will need to be cut so that each
// arc is represented uniquely.
//
// A junction is a point where at least one arc deviates from another arc going
// through the same point. For example, consider the point B. If there is a arc
// through ABC and another arc through CBA, then B is not a junction because in
// both cases the adjacent point pairs are {A,C}. However, if there is an
// additional arc ABD, then {A,D} != {A,C}, and thus B becomes a junction.
//
// For a closed ring ABCA, the first point A’s adjacent points are the second
// and last point {B,C}. For a line, the first and last point are always
// considered junctions, even if the line is closed; this ensures that a closed
// line is never rotated.

mattijn commented 2 years ago

Thanks for this PR. Much appreciated that you took a dive into the code! Also great that you tried a new approach using geopandas and pygeos. Currently these two packages are not hard dependencies and only shapely and numpy are.

But since geopandas is using shapely under the hood and the ufuncs of pygeos are getting integrated within shapely 2.0 and the 2.0a1 version of shapely is recently released I tried to reproduce your approach using shapely (with shapely=2.0a1) and numpy alone.

# return tree as well from `select_unique_combs`
idx_combs, tree = select_unique_combs(data["linestrings"])
# collect geoms from tree (this are views or copies?)
geom_combs = np.asarray([
    [tree.geometries.take(idx[0]), tree.geometries.take(idx[1])] 
    for idx in idx_combs])

# find intersection between linestrings 
# use relate pattern to filter out single point intersections
g1s = geom_combs[:,0]
g2s = geom_combs[:,1]
intersect_1dim = shapely.relate_pattern(g1s, g2s, pattern='1********')
segments = so.intersection_all(geom_combs[intersect_1dim], axis=1)
merged_segments = shapely.line_merge(segments)

# the start and end points of the merged_segments are the junctions
coords, index_group_coords = shapely.get_coordinates(merged_segments, return_index=True)
_, marker_groups = np.unique(index_group_coords, return_index=True)
idx_startend = np.append(marker_groups, marker_groups -1)

# junctions can appear multiple times in multiple segments, remove duplicates
idx_startend_group = np.unique(idx_startend)
junctions = geometry.MultiPoint(coords[idx_startend_group])

# result
print(len(junctions.geoms)),
print(junctions.wkt)

2
MULTIPOINT (522 1108, 530 1100)

Its an incomplete attempt since I was not yet able to profile the approaches, but hopefully this will results in similar timings. Again, much appreciated! (I did not yet look to the other PR)

theroggy commented 2 years ago

Interesting.

I do understand that introducing geopandas as a hard dependency indeed needs conscious thought (which I didn't give it ;-))... and isn't obvious.

However, possibly using it more might help simplifying and speeding up topojson (on longer term)... For several data structures in topojson a GeoDataFrame sounds like a very good match: mainly the (temporary) lists of linestings, junctions and coordinates. Using a GeoDataFrame the index is already there and doesn't need to be recreated for each new lookup and a lot of looping in python can be avoided which reduces boiler-plate code and should be faster because it is moved to looping in a c library (eg. geos).

Also, as far as I know, the dependencies of geopandas have been slimmed down recently. Mainly fiona (and thus gdal) is now an optional dependency... so pandas and geopandas itself would be the extra dependencies.

But, it is also clear to me that is definitely not a must-have dependency in topojson... I might just simplify/speed up some things.

theroggy commented 2 years ago

For information: this is the file I used to test the performance.

Probably performance testing with different kinds of data would be a good idea:

one with data with many rows with few points/polygon (eg. cadastral parcels)
one with few rows but many points/polygon (eg. (detailed) country borders)

This file is a bit in between I think: few rows, with an "average" number of points.

topo1778-ferraris_05_309_BEFL-topo-1778-ferraris_raw.zip

mattijn commented 2 years ago

Fair enough. No problem either. You could not have known it.

In the beginning I used geopandas and shapely more intensively, but slowly the development has moved to hash tables and numpy array broadcasting. The latter scale nice, but not always, sometimes simple loops/list comprehensions in combination with an strtree works even better.

I even think it is possible to do it without shapely. Single dependency, numpy, would be a nice achievement.

But now with shapely 2 coming, they are back in game again I think.

Thanks for the test file, let me see if I can do some timings. You are right that depending the file the timings can be very different. The geopandas.datasets.get_path('nybb') is the one that I timed against, which often had quite different timing behavior than 'naturalearth_lowres'.

theroggy commented 2 years ago

I even think it is possible to do it without shapely. Single dependency, numpy, would be a nice achievement.

As you stated already, you are not planning to do so, but because I think it is an interesting topic...

I'm not sure that only numpy would be a good idea. Optimisations specific to GIS data are quite important if you want to be able to achieve some efficiëncy and scalability and numpy doesn't offer those as far as I know.

Using a spatial index to find relevant features is one thing, but GIS libraries like geos (and thus shapely/pygeos) also (can) use e.g. spatial indexes within a geometry to speed up processing. To give a simple example: if you have a linestring of 15.000 points (eg. the border of flanders), and you need to loop over all points to check if a point is in the list... this will take a lot more time than an approach where the 15.000 points are spatially indexed... The same (or even more so) for spatial operations like intersection,... like I used in my alternative implementation: these operations are heavily optimised in the underlying libraries (GEOS) which is the only (at least fastest) way to get decent performance for such things.

The optimisations geopandas offers are rather in use cases where there are a lot of features that need to be treated. Because the geometry data in geopandas is already stored in arrays in a way that can be used directly by geos without having to loop over all rows in python code, which is slow.

mattijn commented 2 years ago

I agree with the most, spatial indexes/strtree are very powerful and this package make and will make good use of these.

But not all geos functions are as optimized as you would think, see https://github.com/pygeos/pygeos/issues/73. Up until now, with this PR (again, thank you for reaching out!), the main bottleneck of this package could not be optimized using pygeos.

This is also recently discussed at the geos repo: https://github.com/libgeos/geos/issues/640.

theroggy commented 2 years ago

I agree with the most, spatial indexes/strtree are very powerful and this package make and will make good use of these.

But not all geos functions are as optimized as you would think, see pygeos/pygeos#73. Up until now, with this PR (again, thank you for reaching out!), the main bottleneck of this package could not be optimized using pygeos.

This is also recently discussed at the geos repo: libgeos/geos#640.

If I gave the impression any of the packages I mentioned was perfect, I stand corrected ;-), so I agree 100% with your comment.

I wonder how the performance of the sharedpaths function compares to intersection, as they basically do the same thing, so I would expect that "under the hood" they would share the same implementation, or do I miss something?

mattijn commented 2 years ago

Since these changes are dependent on shapely 2, which is still in alpha phase, how to approach this?

I can see two options, but open to others:

Trying to see if the core of the logic in this PR can be backported to shapely 1 and include that in this PR.
Setup a topojson 2 branch with a minimal required version of shapely 2 and replace all written logic that was necessary to get a bit more speed with shapely 1.

With the first option we could do an official release where everybody benefits. But the second option is maybe easier to maintain?

theroggy commented 2 years ago

I'd been thinking about that as well... There is quite a lot of activity going on on shapely2, and most work on it has definitely been finished... but still it hard (for me) to judge when the first stable release will arrive...

I was also wondering about the shared_coords=True code path... is there a realworld use case for it or was it only introduced for performance reasons?

mattijn commented 2 years ago

In my opinion it's only there for performance reasons. So if the 'path-connected' approach is reasonably fast it can be the only one.

But based on this bullet, without reading much of the code, so I might be wrong, the approach currently introduced in GEOS, seems to be a 'coords-connected' one. They use a union on the linestrings to get the coverage. Which is interesting as well

theroggy commented 2 years ago

I'm looking into avoiding the use of new features in shapely 2... and that doesn't seem to be an issue.

In my opinion it's only there for performance reasons. So if the 'path-connected' approach is reasonably fast it can be the only one.

OK, that would cleanup quite some code + the tests.

But based on this bullet, without reading much of the code, so I might be wrong, the approach currently introduced in GEOS, seems to be a 'coords-connected' one. They use a union on the linestrings to get the coverage. Which is interesting as well

Yes, I was thinking along those lines as well to speedup the cut step... I already gave it a quick try to do a union of the linestrings and the junctions, but this didn't seem to work at first sight. The next step was indeed to try to union the lines directly, which is apparently the approach chosen in geos.

Another option I was thinking about would be to calculate both the intersection and the symmetric difference (= "union" in ESRI speak) and save those in join already so the "cut" step has less lines to split...

There are many ways to Rome :-).

mattijn commented 2 years ago

There is a constant available SHAPELY_GE_20 that you can use for the parts that require shapely 2 features, like here: https://github.com/mattijn/topojson/pull/184/files#diff-3f40272a908f4690d8fe0530e35685547cdd07d499dcb45ff94bee078bc5abe2R556.

Before I also have considered to do multiple things in once, but on purpose have done the development step by step to make it a bit easier for testing and tracing. With the current test-set in place, there are lots of corner cases tracked so if there can be an integration of steps, reducing code while improving speed, then its something to consider for sure.

theroggy commented 2 years ago

Performance suffered slightly: ~16 s for the test file we've been using vs ~14 s for the shapely 2 version.

But I personally don't think it is worth maintaining 2 versions for that difference... and I'd wait till it is oficially released before making a dependent version....

Possibly the performance degradation is larger for files with many rows because vectorized operations is the main change in shapely2.

A point of interest I want to raise is that this implementation gives another result for the number of arcs in topo.output['arcs'] on that same test file: 12.864 arcs vs. 13.620 arcs (shapely2 + geopandas version) for 4220 input polygons even though all implementations pass all the tests.

This might also have an impact on performance, one way or the other, because more or less junctions,.... can make a difference.

theroggy commented 2 years ago

I wonder, do you have some benchmarking files/code that you use to follow up the evolutions/differences in performance for different types of test files as we discussed before?

If so, it might be usefull to add them to the code base as well.

If not, this is an example on how this could look: https://github.com/geofileops/geofileops/tree/master/benchmark/results

mattijn commented 2 years ago

So you have reduced the shapely 1 implementation from 160 secs to 16 seconds?! Impressive!

So the current implementation using shapely 2 has created somehow more junctions. I remember I used linemerge of shapely before, but then implemented something different because of issues, maybe around there🤔?

I'm very open for this type of benchmarking! Before I just add some timings or profiling in the PR, but something normalized would be great.

mattijn commented 2 years ago

Upon second thought, I would first check the len() of what is returned from select_unique_combs, maybe some logic is missing there that you had before (eg. .query("index_1 != index_2"), here https://github.com/mattijn/topojson/commit/4a018ff77ae4bd1864e88b30cf1548cbc0632c59#diff-d9cc69527fda279cb84b6499a9e0e6499442f9f7b895f00f5046e60bd5c817cbR211)

theroggy commented 2 years ago

There was an issue in the extraction of (only) linestrings out of the intersection result in combination with the linemerge.

For info: apparently in shapely1 you need to make sure you only pass lines to linemerge, in shapely2 linemerge does this cleanup.

Timing to treat the test file is now 14.5 seconds...

mattijn commented 2 years ago

I think this pr is ready for merging. Agreed?

theroggy commented 2 years ago

Inspired by the change/improvement of linemerge in shapely2 I restructured some code so it mimics this behaviour. Potato-potato, but it simplifies the code in join, and looks a bit nicer like this I think?

theroggy commented 2 years ago

For me it is ready for merge!

mattijn commented 2 years ago

Love this! Thanks a lot. Really impressive work. Beautiful code. Superlatives etc.

mattijn / topojson

Fix + improve performance for join with shared coords=false #179