mattijn / topojson

Encode spatial data as topology in Python! 🌍 https://mattijn.github.io/topojson
BSD 3-Clause "New" or "Revised" License
178 stars 27 forks source link

shared_coords=True vs shared_coords=False #187

Closed theroggy closed 2 years ago

theroggy commented 2 years ago

As already briefly touched here: https://github.com/mattijn/topojson/pull/179#issuecomment-1236073935

I was also wondering about the shared_coords=True code path... is there a realworld use case for it or was it only introduced for performance reasons?

In my opinion it's only there for performance reasons. So if the 'path-connected' approach is reasonably fast it can be the only one.

I'm not sure what the best way forward is for this?

theroggy commented 2 years ago

Regarding the section about coverages in geos you mentioned:

 * A polygon is coverage-valid if:
 *
 *   * The polygon interior does not intersect the interior of other polygons.
 *   * If the polygon boundary intersects another polygon boundary, the vertices
 *     and line segments of the intersection match exactly.

As you probably know, gaps and slivers are a big issue in GIS data in general. As long as not all borders between polygons/features are explicitly matched to each other including by having matching vertices everywhere the features touch having a perfect topology is impossible due to rounding errors,... Just doing a conversion from one file format to another can create gaps because of a conversion from double to discrete numbers,... so data that seemed properly matched/snapped will change to gappy/slivery data.

So, even though the current implementations using intersections doesn't need all vertices to be there, once you get into more complex data the places where a point was snapped to the middle of a line without adding the snap-vertex to the neighbour, some of those cases won't be properly "topologized". The only structural way to get perfect results all the time and with all operations (this is not limited to creating a topology) is that data is perfectly matched... So, if I understand correctly they (geos developers) are in first focussing on tools to cleanup the data... (which is something I've been hoping for to appear for a long time) and which is useful for many more cases than topology. Once the data is cleaned, it will be a lot easier to create topologies from it.

The data I'm working with at the moment, and that I've been using to test, is "happy day scenario" data. It is the result of a polygonize of raster data, so all intersections between data are perfectly matched: every segment is either perfectly horizontal or perfectly vertical, so no gaps and slivers in the data. Most data out there isn't like that though :-(...

theroggy commented 2 years ago

So, as long as the shared_coords=True, is faster, an alternative approach could be to change the default from shared_coords=True to shared_coords=False, as this will give the best results for most datasets, but if the user is sure the data is already 100% cleaned/prepared (~coverage-valid), he can use shared_coords=False to get the bit of extra performance.

Mind: when I was adding tests, I first started by running them both using shared_coords=False and shared_coords=True. But, in the first 2 cases where I did this this resulted in what seemed to be a bug at first sight in the shared_coords=True path. So, I might be wrong, I was focused on shared_coords=False, but at first sight there are still some bugs there... that are best fixed if the option is kept alive.

mattijn commented 2 years ago

I'm fine with changing the default from shared_coords=True to shared_coords=False (edit: done by #190).

Did you play with the prequantize parameter? Its applying a type of normalisation on range, see docs: https://mattijn.github.io/topojson/example/settings-tuning.html#prequantize. It will improve the quality of the topology if the input geometry is messy. Current default is 1e6, but I like to change the default to 1e5 (edit: done by #189). But one can play with this value (any integer number is valid) to see what is best for each situation.

mattijn commented 2 years ago

this resulted in what seemed to be a bug at first sight in the shared_coords=True path

Also observed a, what seems like, bug with shared_coords=True (new default in master is shared_coords=False):

import geopandas
from topojson import Topology
nybb_path = geopandas.datasets.get_path("nybb")
data = geopandas.read_file(nybb_path)
topo = Topology(
    data=data, prequantize=200, shared_coords=True
)
topo.to_alt()
image

There should be a line string near the orange arrow.

theroggy commented 2 years ago

Did you play with the prequantize parameter? Its applying a type of normalisation on range, see docs: https://mattijn.github.io/topojson/example/settings-tuning.html#prequantize. It will improve the quality of the topology if the input geometry is messy. Current default is 1e6, but I like to change the default to 1e5 (edit: done by #189). But one can play with this value (any integer number is valid) to see what is best for each situation.

No I haven't. I turned it off because the data I've used till now didn't need any cleaning. So I don't have any opionion on what a good value would be...