Closed · theroggy closed this issue 2 years ago
Interesting approach. So, simplify(0) will not change the shape of the arc, but can remove collinear points along the arc. Is it cheap in time?
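As a quick illustration of that behaviour (a minimal sketch using shapely, assuming that is the geometry library in play here): with a tolerance of 0, Douglas-Peucker only drops vertices that lie exactly on the segment between their neighbours, so the shape is unchanged.

```python
from shapely.geometry import LineString

# (1, 1) lies exactly on the segment (0, 0)-(2, 2), so it is
# redundant; (3, 1) bends the line, so it must be kept
line = LineString([(0, 0), (1, 1), (2, 2), (3, 1)])
simplified = line.simplify(0)

print(list(simplified.coords))  # the collinear point (1, 1) is dropped
```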
For the test file, processing time increases from 14.5 s to 16.5 s.
I had a quick further look at point 3:
It doesn't seem possible to avoid the conversions without rewriting quite a bit of code, and then the question is whether it would really be faster... So it doesn't seem like "short-term, low-hanging fruit"...
I implemented a numpy function to remove the collinear points. It has similar performance to simplify(0), but the extra conversions are avoided, so the test file can now be processed in ± 15.2 s.
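The idea can be sketched in a few lines of numpy (a hypothetical reimplementation for illustration, not the actual function added here): an interior vertex is collinear with its neighbours exactly when the cross product of the two adjacent segment vectors is zero.

```python
import numpy as np

def remove_collinear_points(coords):
    """Drop interior vertices of a linestring that are collinear with
    their neighbours (hypothetical sketch; endpoints are always kept)."""
    coords = np.asarray(coords, dtype=float)
    if len(coords) <= 2:
        return coords
    v1 = coords[1:-1] - coords[:-2]   # segment into each interior vertex
    v2 = coords[2:] - coords[1:-1]    # segment out of each interior vertex
    cross = v1[:, 0] * v2[:, 1] - v1[:, 1] * v2[:, 0]
    keep = np.ones(len(coords), dtype=bool)
    keep[1:-1] = cross != 0           # keep only vertices that bend the line
    return coords[keep]
```

Like simplify(0) this leaves the geometry unchanged, but it avoids the round trip through shapely/GEOS objects.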
I also had a look at the simplify function in ops, but:
Yes, I think a numpy approach would work as well, but from what I understand it's still being done line by line. I think it's possible to do it in one go for all linestrings using broadcasting.
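For illustration, here is one way the broadcasting could look (a sketch under the assumption that all linestrings have been padded with NaN into one (n_lines, max_len, 2) array; this is not the actual topojson code):

```python
import numpy as np

def collinear_mask(padded):
    """For a NaN-padded (n_lines, max_len, 2) array, return a boolean
    mask marking interior vertices that are collinear with their
    neighbours, for all linestrings at once (hypothetical sketch)."""
    v1 = padded[:, 1:-1] - padded[:, :-2]   # segments into interior vertices
    v2 = padded[:, 2:] - padded[:, 1:-1]    # segments out of interior vertices
    cross = v1[..., 0] * v2[..., 1] - v1[..., 1] * v2[..., 0]
    # NaN padding yields NaN cross products, which compare as False,
    # so padded positions are never flagged for removal
    return cross == 0
```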
I suppose it should be possible, but my search on the subject didn't bring up a lot of ideas for vectorized solutions...
One way (as you said before, there is more than one road to Rome) is to use a delta_encoded variant of the linestrings, as is implemented here: https://github.com/mattijn/topojson/blob/4ebe2b15f830840a4908694639e6d4fad64a74f4/topojson/ops.py#L911

Don't quite know why I didn't think of it sooner, but the current remove_collinear_points was actually quite easy to vectorize...
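Delta encoding itself is simple: keep the first coordinate absolute and store every following coordinate as the difference from its predecessor. A minimal sketch of the idea (not the linked ops.py implementation):

```python
import numpy as np

def delta_encode(coords):
    """Keep the first point absolute, store the rest as successive
    differences (hypothetical sketch of the delta-encoding idea)."""
    coords = np.asarray(coords, dtype=float)
    return np.vstack([coords[:1], np.diff(coords, axis=0)])

def delta_decode(deltas):
    """Invert delta_encode by taking the cumulative sum."""
    return np.cumsum(deltas, axis=0)
```

One nice property for this discussion: on delta-encoded coordinates, a run of collinear points shows up as consecutive deltas that are parallel vectors.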
The profiler says remove_collinear_points now takes 0.28 seconds for the test file, versus 0.382 seconds for the unvectorized version... The total timing didn't really change, but that's probably because there is some fluctuation in the performance of my development machine (it's on shared infrastructure).
No problem! Once you get used to thinking in terms of adding another dimension, it becomes more fun ;). This solution should probably scale better as well.
I set up another branch to test integrating continuous benchmarking into GitHub Actions, inspired by your other PR. This would also give some timings for different types of files as part of each PR.
Clarification: it is vectorized per linestring, not for all lines in one go. That would be possible as well, but it would need quite ugly code with sparse arrays... I doubt it's worth the trouble... it should scale reasonably well as it is now, I think...
Sounds quite cool... I wonder how stable the results will be performance-wise on the GitHub infrastructure, but if it's reasonably stable it would be great.
Sadly I won't be able to "borrow" the idea for geofileops, because that project is focused on parallelizing spatial operations, so you need a reasonable number of CPUs to benchmark it properly...
Ragged arrays are indeed not great. I use this function to create one big array, based on the max length of the arcs, and fill the remainder with NaN: https://github.com/mattijn/topojson/blob/master/topojson/ops.py#L513

Like this I can get the most out of numpy. But if one arc is super long and the others are very tiny, it is probably not optimal.
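The padding step itself can be sketched like this (a hypothetical stand-in for the linked ops.py function, for illustration only):

```python
import numpy as np

def pad_arcs(arcs):
    """Stack variable-length arcs into one (n_arcs, max_len, 2) array,
    filling the tail of shorter arcs with NaN (hypothetical sketch,
    not the linked ops.py implementation)."""
    max_len = max(len(arc) for arc in arcs)
    padded = np.full((len(arcs), max_len, 2), np.nan)
    for i, arc in enumerate(arcs):
        padded[i, : len(arc)] = arc
    return padded
```

The memory cost is O(n_arcs × max_len), which is exactly why one very long arc among many tiny ones is the unfavourable case.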
When I find some time I can have a look as well.
Yeah... certainly memory-footprint-wise this will be quite heavy for larger files, so it doesn't sound very scalable. Performance-wise it also creates quite some overhead... but possibly the looping is bad enough that it still compensates...
Feel free!
I'm fine with this as is. Thanks again!
Generated benchmark image: https://github.com/mattijn/topojson/commit/81036496f27bd51a5041fbb07d3e2783290e5161
Fixes #191