lessthanoptimal / ddogleg

Java numerics library for optimization, polynomial root finding, sorting, robust model fitting, and more.
http://ddogleg.org
49 stars 18 forks source link

Vantage point tree implementation for nearest neighbor search #1

Closed karelp closed 9 years ago

karelp commented 10 years ago

Contains a vp-tree implementation based on http://aidblab.cse.iitm.ac.in/cs625/vptree.pdf.

lessthanoptimal commented 10 years ago

The distance metric being used is Euclidean distance. Any reason it can't be Euclidean distance squared? That way you can avoid having any sqrt() operations. Scanning through the code I don't see anything which requires Euclidean distance.

karelp commented 9 years ago

The squared Euclidean distance does not satisfy triangle inequality which vp-trees depend on. It would be nice to shave off the extra time the sqrt takes but the search produces incorrect results without it.

lessthanoptimal commented 9 years ago

Yeah this line would get messed up by the change: if (node.left != null && dist - tau <= node.threshold) {

There's a comment in the code saying quick select might be better. A library with quick select is already included. Think it's worth the effort of changing the code?

karelp commented 9 years ago

Yes, that's precisely the problematic line.

The QuickSelect class cannot be currently used as it is not generic enough for the vp-tree needs. Basically, each step of the vp-tree building procedure partitions the points to two groups - "closer than the pivot" and "farther than the pivot". That's where quick select is used to do this efficiently in an expected O(N) time. The problem is that each step requires a different comparator as it has a different pivot. Also we only points not yet in the tree are partitioned and as a bonus the partition is applied to two lists - items and itemData. This means the QuickSelect class would need to support a custom Comparator and allow running on subarrays. Then the vp-tree implementation would need to merge items and itemData into one array.

This would probably lead to some overhead and the question is whether it's worth it - hence the TODO.