holoviz / spatialpandas

Pandas extension arrays for spatial/geometric operations
BSD 2-Clause "Simplified" License
308 stars 24 forks source link

HilbertRtree query is slower than Rtree and Pygeos #52

Closed carlosg-m closed 2 years ago

carlosg-m commented 3 years ago

For me the sindex is one of the most interesting parts of this project. I was able to run some tests using only geometry bounds and numpy arrays.

from spatialpandas.spatialindex import HilbertRtree

The creation of the HilbertRtree is vectorized and relatively fast, however queries are not vectorized and really slow (2x slower compared to Rtree and 10x slower compared to Pygeos). Is there anything that can be done to improve scalability and decrease time complexity?

Also, could someone please explain how the sindex is distributed, more specifically how the partition_sindex works. For example, how is the tree creation distributed? Each right dataframe partition has its own sindex? Is it necessary to query each right sindex with all left index partitions (n^2)? Is the sindex serializable and thread safe?

jonmmease commented 3 years ago

It's been a while, but FWIW here is the benchmarking notebook I used during development to compare performance to RTree (https://anaconda.org/jonmmease/spatialpandas_hilbertrtree_benchmarking_pr/notebook). This predated Pygeos, so there aren't any comparisons there.

Things may have changed in RTree, but at the time the HilbertRTree was equivalent or faster in these cases.

If you're interested in repeating this notebook with the current versions of things, and adding pygeos, that would be great.