mattijn / topojson

Encode spatial data as topology in Python! 🌍 https://mattijn.github.io/topojson
BSD 3-Clause "New" or "Revised" License

Fix: Linestrings that follow the same path but where one contains extra redundant points are not deduplicated #192

Closed · theroggy closed this 2 years ago

theroggy commented 2 years ago

Fixes #191

mattijn commented 2 years ago

Interesting approach. So, simplify(0) will not change the shape of the arc, but can remove points that lie on a straight stretch of the arc. Is it cheap in time?

theroggy commented 2 years ago

> Interesting approach. So, simplify(0) will not change the shape of the arc, but can remove points that lie on a straight stretch of the arc. Is it cheap in time?

For the test file, processing time increases from 14.5 s to 16.5 s.
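For context, a minimal sketch of the behaviour being relied on here (assuming Shapely, which topojson uses for its geometries): simplify with tolerance 0 keeps the shape of a line but drops vertices that lie exactly on the segment between their neighbours.

```python
# Hedged illustration (not the PR's code): Shapely's Douglas-Peucker
# simplify with tolerance 0 preserves the shape but removes exactly
# collinear interior vertices.
from shapely.geometry import LineString

line = LineString([(0, 0), (1, 1), (2, 2), (3, 1)])  # (1, 1) is collinear
simplified = line.simplify(0)
print(list(simplified.coords))  # the redundant (1, 1) is dropped
```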

theroggy commented 2 years ago

I had a quick further look at point 3:

It doesn't seem possible to avoid the conversions without rewriting quite a lot of code. And then the question is whether it would really be faster... So it doesn't seem like "short-term, low-hanging fruit"...

theroggy commented 2 years ago

I implemented a numpy function to remove the collinear points. It has similar performance to simplify(0), but the extra conversions are avoided, so the test file can now be processed in roughly 15.2 s.
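A numpy collinearity test along these lines could look as follows (a hedged sketch, not the PR's actual implementation; the function name merely mirrors the one discussed in this thread). It flags interior points whose 2D cross product with their neighbours is zero:

```python
import numpy as np

# Hedged sketch: remove exactly collinear interior points from one
# linestring's (n, 2) coordinate array with a vectorized cross-product test.
def remove_collinear_points(coords):
    """Drop interior points lying exactly on the segment between their
    neighbours; the two endpoints are always kept."""
    if len(coords) < 3:
        return coords
    p_prev = coords[:-2]   # neighbour before each interior point
    p_mid = coords[1:-1]   # interior points under test
    p_next = coords[2:]    # neighbour after each interior point
    # 2D cross product of (mid - prev) and (next - prev); zero => collinear
    cross = ((p_mid[:, 0] - p_prev[:, 0]) * (p_next[:, 1] - p_prev[:, 1])
             - (p_mid[:, 1] - p_prev[:, 1]) * (p_next[:, 0] - p_prev[:, 0]))
    keep = np.ones(len(coords), dtype=bool)
    keep[1:-1] = cross != 0
    return coords[keep]

line = np.array([(0, 0), (1, 1), (2, 2), (3, 1)], dtype=float)
out = remove_collinear_points(line)  # the collinear (1, 1) is dropped
```

Real code would likely compare against a small epsilon rather than exact zero to absorb floating-point noise.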

I also had a look at the simplify function in ops, but:

mattijn commented 2 years ago

Yes, I think a numpy approach would work as well, but from what I understand it's still being done line by line. I think it's possible to do it at once for all linestrings using broadcasting.

theroggy commented 2 years ago

> Yes, I think a numpy approach would work as well, but from what I understand it's still being done line by line. I think it's possible to do it at once for all linestrings using broadcasting.

I suppose it should be possible, but my search on the subject didn't bring up many ideas for vectorized solutions...

mattijn commented 2 years ago

One way (as you said before, there is more than one road to Rome):

  1. Use as starting point the delta-encoded variant of the linestrings, as implemented here: https://github.com/mattijn/topojson/blob/4ebe2b15f830840a4908694639e6d4fad64a74f4/topojson/ops.py#L911
  2. Calculate the angle before and after each point.
  3. If these angles are equal, the point in between can be categorized as redundant.
  4. Index these points and remove or flag them all at once.

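The four steps above could be sketched like this for a single linestring (a hedged illustration using np.diff and np.arctan2, not the actual ops.py code):

```python
import numpy as np

# Hedged sketch of the angle-based idea: delta-encode a linestring, then
# flag a point as redundant when the segment direction does not change.
coords = np.array([(0, 0), (1, 1), (2, 2), (3, 1)], dtype=float)

deltas = np.diff(coords, axis=0)                  # 1. delta-encoded segments
angles = np.arctan2(deltas[:, 1], deltas[:, 0])   # 2. direction of each segment
redundant = angles[:-1] == angles[1:]             # 3. equal angles => collinear
keep = np.ones(len(coords), dtype=bool)
keep[1:-1] = ~redundant                           # 4. drop all flagged points at once
cleaned = coords[keep]
```

Exact float equality works here only because the repeated deltas are bit-identical; a robust version would compare angles within a tolerance.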
theroggy commented 2 years ago

Don't quite know why I didn't think of it sooner, but the current remove_collinear_points was actually quite easy to vectorize...

The profiler says remove_collinear_points now takes 0.28 s for the test file, versus 0.382 s with the unvectorized version... The total timing didn't really change, but that's probably because there is some fluctuation in the performance of my development machine (it's on shared infrastructure).

mattijn commented 2 years ago

No problem! Once you get used to thinking in terms of adding another dimension, it becomes more fun ;). This solution should probably scale better as well.

I set up another branch to test integrating continuous benchmarking into GitHub Actions, inspired by your other PR. This would also give some timings for different types of files as part of each PR.

theroggy commented 2 years ago

Clarification: it is vectorized per linestring, not for all lines in one go. That would be possible as well, but it would need quite ugly code with sparse arrays... I doubt it is worth the trouble; it should scale reasonably well as it is now, I think...

theroggy commented 2 years ago

> I set up another branch to test integrating continuous benchmarking into GitHub Actions, inspired by your other PR. This would also give some timings for different types of files as part of each PR.

Sounds quite cool... I wonder how stable the results will be performance-wise on the GitHub infrastructure, but if they are reasonably stable it would be great.

Sadly I won't be able to "borrow" the idea for geofileops, because that project is focused on parallelizing spatial operations, so you need a reasonable number of CPUs to properly benchmark it...

mattijn commented 2 years ago

Ragged arrays are indeed not great. I use this function to create one big array, based on the maximum arc length, and fill the remainder with NaN:

https://github.com/mattijn/topojson/blob/master/topojson/ops.py#L513

This way I can get the most out of numpy. But if one arc is super long and the others are very tiny, then it is probably not optimal.
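The padding idea could be sketched as follows (a hedged illustration, not the ops.py implementation; the helper name `pad_arcs` is hypothetical): stack variable-length arcs into one dense (n_arcs, max_len, 2) array so a single vectorized pass can process all of them.

```python
import numpy as np

# Hedged sketch: pad ragged arcs to the longest arc's length with NaN,
# producing one dense array suitable for vectorized numpy operations.
def pad_arcs(arcs):
    max_len = max(len(arc) for arc in arcs)
    out = np.full((len(arcs), max_len, 2), np.nan)  # NaN marks the padding
    for i, arc in enumerate(arcs):
        out[i, : len(arc)] = arc
    return out

arcs = [
    np.array([(0.0, 0.0), (1.0, 1.0)]),
    np.array([(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]),
]
padded = pad_arcs(arcs)  # shape (2, 3, 2); the short arc's tail is NaN
```

As noted above, the trade-off is memory: one very long arc forces every row to that length.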

When I find some time I can have a look as well.

theroggy commented 2 years ago

> Ragged arrays are indeed not great. I use this function to create one big array, based on the maximum arc length, and fill the remainder with NaN:
>
> https://github.com/mattijn/topojson/blob/master/topojson/ops.py#L513
>
> This way I can get the most out of numpy. But if one arc is super long and the others are very tiny, then it is probably not optimal.

Yeah... certainly memory-footprint-wise this will be quite heavy for larger files, so it doesn't sound very scalable. Performance-wise it also creates quite some overhead... but possibly the per-linestring looping is slow enough that the padding still pays off...

> When I find some time I can have a look as well.

Feel free!

mattijn commented 2 years ago

I'm fine with this as is. Thanks again!

mattijn commented 2 years ago

Generated benchmark image: https://github.com/mattijn/topojson/commit/81036496f27bd51a5041fbb07d3e2783290e5161