tzaeschke / phtree

PH-Tree
Apache License 2.0
121 stars 22 forks source link

how well it performs on high dimensional data, such as SIFT #39

Open wlzhao22 opened 1 month ago

wlzhao22 commented 1 month ago

Thanks for your work! I am just wondering how well it works on high dimensional data, such as SIFT or various deep features. Comparing with Hilbert curve, does it hold similar property in preserving the locality of the high dimensional points. That is, when two points a, b \in R^d are neighbors, they will be neighbors as well when they have been converted to one-dimensional values.

tzaeschke commented 1 month ago

I assume you are interested in KNN query performance? It works "reasonably" well with high dimensional data (I have tested the Java implementation with 1000 dimensions and the C++ implementation with 60 dimension).

Short answer

You will have to measure it. I think there is a good chance the PH-tree works very well, it maybe 20% slower than other or equally likely considerably faster.

Long answer

There are several points to this:

I have put some of my measurements (up to 40 dimensions) in a document: https://github.com/tzaeschke/TinSpin/blob/master/doc/benchmark-2017-01/Diagrams.pdf:

If your are happy with approximate KNN, you can also have a look at https://github.com/erikbern/ann-benchmarks

Hilbert Curves

The PH-tree follows a Z-curve (Morton order) which has almost the same locality preserving features that Hilbert curves have, however z-curves a much easier and faster to calculate than Hilbert curves.