spotify / annoy

Approximate Nearest Neighbors in C++/Python optimized for memory usage and loading/saving to disk
Apache License 2.0

idea for faster/better tree structure #64

Closed erikbern closed 9 years ago

erikbern commented 9 years ago

What if at every split we just sample two items and compute the hyperplane that separates them?

For angular distance you would normalize the two items first. This would also generalize to arbitrary pseudometrics nicely: Manhattan, KL-divergence, etc.

For high-dimensional distributions where there's structure in the data, this is more likely to sample from that structure. It will also be much faster in general, since we are guaranteed to split the point set into two subsets.
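
Roughly, in Python (just a sketch of the idea, not Annoy's actual internals; names like `two_point_split` are made up):

```python
import numpy as np

def two_point_split(points, angular=False, rng=np.random.default_rng()):
    """Sample two items and return the hyperplane that separates them.

    Returns (normal, offset) so that a point p falls on the left side when
    np.dot(normal, p) - offset < 0, and on the right side otherwise.
    """
    i, j = rng.choice(len(points), size=2, replace=False)
    a, b = points[i], points[j]
    if angular:
        # For angular distance, normalize the two items first.
        a = a / np.linalg.norm(a)
        b = b / np.linalg.norm(b)
    normal = a - b
    # The midpoint of the two samples lies on the splitting hyperplane;
    # for the angular case the plane passes through the origin.
    offset = 0.0 if angular else np.dot(normal, (a + b) / 2.0)
    return normal, offset

def split_indices(points, normal, offset):
    # Each sampled item ends up on its own side, so both subsets are non-empty.
    left = points @ normal - offset < 0
    return np.where(left)[0], np.where(~left)[0]
```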

a1k0n commented 9 years ago

Yeah, that makes a lot of sense.

I mean, you could also take a random direction, compute the dot product for all points in the leaf, and split at the median, doing that over and over. But for angular distance you'd have to introduce a bias coefficient to make that work, and I don't know if that destroys the cosine distance guarantees you get with signed random projections. It would be better for the Euclidean case though.
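
Something like this, as a sketch (Euclidean setting; the nonzero median threshold is the "bias coefficient" that wouldn't carry over cleanly to the angular case):

```python
import numpy as np

def random_direction_median_split(points, rng=np.random.default_rng()):
    # Pick a uniformly random unit direction.
    direction = rng.standard_normal(points.shape[1])
    direction /= np.linalg.norm(direction)
    # Project every point in the leaf onto it and split at the median.
    projections = points @ direction
    threshold = np.median(projections)
    left = np.where(projections < threshold)[0]
    right = np.where(projections >= threshold)[0]
    return direction, threshold, left, right
```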

This is a really neat idea though.

erikbern commented 9 years ago

I'll think of a way to test how well it works.

erikbern commented 9 years ago

See #65 – results are very good

stefansavev commented 8 years ago

Hi Erik,

First, congratulations on the excellent results with the new method. I just wanted to point out a paper by Sanjoy Dasgupta where he considers a related rule. Maybe the paper will be useful: http://cseweb.ucsd.edu/~dasgupta/papers/rptree-stoc.pdf (Random projection trees and low dimensional manifolds). You can look for the text: "choose a random unit direction v ∈ R^D; pick any x ∈ S; let y ∈ S be the farthest point from it". No idea how it relates performance-wise to your rule.

Cheers,

Stefan

erikbern commented 8 years ago

Thanks a lot – looks pretty interesting! Here's the relevant snippet (I think)

[screenshot of the split rule from the paper]
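
As far as I can tell the rule goes roughly like this (a sketch from my reading of the paper, not their exact pseudocode; in particular the jitter scale is how I recall their δ, so worth double-checking against the snippet):

```python
import numpy as np

def rptree_max_split(points, rng=np.random.default_rng()):
    n, dim = points.shape
    # Choose a random unit direction v.
    v = rng.standard_normal(dim)
    v /= np.linalg.norm(v)
    # Pick any x in S and let y be the farthest point from it;
    # ||x - y|| acts as an estimate of the cell's diameter.
    x = points[rng.integers(n)]
    y = points[np.argmax(np.linalg.norm(points - x, axis=1))]
    projections = points @ v
    # Split at the median projection plus a random offset whose scale depends
    # on the diameter estimate (as I read it: delta ~ U[-1, 1] * 6 * ||x - y|| / sqrt(dim)).
    delta = rng.uniform(-1.0, 1.0) * 6.0 * np.linalg.norm(x - y) / np.sqrt(dim)
    threshold = np.median(projections) + delta
    left = np.where(projections <= threshold)[0]
    right = np.where(projections > threshold)[0]
    return v, threshold, left, right
```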

I'll implement this at some point and benchmark!

a1k0n commented 8 years ago

That method still picks a uniformly random direction; the points x and y are only used to find the split plane along the direction, which annoy doesn't really support -- the split plane is always at the point where the dot product is 0.

erikbern commented 8 years ago

@a1k0n – that's only for cosine. For Euclidean, you can pick any split plane

a1k0n commented 8 years ago

@erikbern of course. Yes, this might make sense for the Euclidean method. I don't follow the argument in the paper yet but it's very interesting. Thanks @stefansavev

erikbern commented 8 years ago

This also came out of ICML – looks pretty interesting: https://izbicki.me/blog/fast-nearest-neighbor-queries-in-haskell.html (probably not related to the above discussion though)

stefansavev commented 8 years ago

@a1k0n, @erikbern Thanks guys for your comments. I just brought up this paper because it uses a sample of two points to estimate a global property of the dataset (in their case the diameter of the dataset). You are also using two points towards a similar goal, but differently. As Andy pointed out, Dasgupta uses x and y only to figure out the split point, not the direction, while you use x and y for the direction. I would think that your rule makes more sense.

I am not sure if Dasgupta's rule will bring something, but just to clarify his argument: his setup is that there are two clusters (this is his figure 2), Bi and Bj, which have small diameter but are somewhat far apart [a bit stronger assumption]. He wants to separate them cleanly. He is saying that by adding a random offset to the median we can improve the chance of a good split. Bi and Bj themselves live in a top-level cluster, and he uses x and y to find the diameter of that top-level cluster.

Under your rule, if you have two clusters, you will get a good split if you happen to pick x and y from different clusters. I think this rule works better when you have a larger diameter (more dimensions in the dataset, more data points, more diverse data points), so it would be interesting to see if the effectiveness of the rule diminishes in the lower nodes of the tree. I know that for Euclidean distance you try a couple of directions and choose the best one (the one with the widest projection). In conjunction with the new rule, this means you also want x and y to be far apart; together both rules imply that you will tend to select the split points from different "clusters".

One thing I saw people do in a similar context (PCA-trees) is, when they choose a direction d (in your case x - y), they subtract it from the data. This gives them the property that subsequent splits are orthogonal. I'm keen to try this myself, but I'm planning to subtract the previous direction from the new direction (not from the data), where the new direction is obtained by sampling two new points. I think it is also safe here to require that ||x - y|| is not too small, because the orthogonality will give rise to a different split than at the parent level. When ||x - y|| is small the rule approaches the random hyperplane rule, so it might not be too bad (as long as x - y does not have too many zeros).
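
A sketch of the orthogonalization I have in mind (Gram-Schmidt style; `min_norm` and the random-hyperplane fallback are just illustrative choices):

```python
import numpy as np

def orthogonalized_direction(x, y, previous_direction,
                             rng=np.random.default_rng(), min_norm=1e-8):
    """Take the new two-point direction x - y and remove its component along
    the parent's split direction, so consecutive splits are orthogonal."""
    d = x - y
    prev = previous_direction / np.linalg.norm(previous_direction)
    d = d - np.dot(d, prev) * prev  # subtract the previous direction (Gram-Schmidt step)
    norm = np.linalg.norm(d)
    if norm < min_norm:
        # ||x - y|| was too small (or x - y was almost parallel to the parent
        # direction); fall back to a plain random hyperplane.
        d = rng.standard_normal(len(x))
        norm = np.linalg.norm(d)
    return d / norm
```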

I'm pretty excited about your method because it has quite nice properties: 1) it is cheap; 2) it has randomness; 3) it is informed by the data.

Cheers,

Stefan

erikbern commented 8 years ago

I'm actually not trying separate directions for the Euclidean case, but it's something I should do.

Let me know if orthogonality helps, sounds like it could be useful!

Separately from the discussion – I really think the best method wouldn't be to use smart heuristics; it would be to define some loss function and optimize the whole tree for that. What you're really trying to do is take some high-dimensional distribution and represent it as a tree, in such a way that items that are close in the original space are also close in the tree, and items that are far apart in the original space are far apart in the tree. I think it's possible to define a loss function that looks at discordant pairs of nearest neighbors and then optimize the tree splits against that.
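
As a rough illustration of what I mean by discordant pairs (the `tree_distance` callable is a placeholder for whatever distance the tree induces, e.g. something based on the lowest common ancestor of two leaves):

```python
import numpy as np

def discordant_pairs_for_query(q_idx, points, tree_distance, k=10):
    """Count pairs among the query's true top-k neighbours whose ordering
    under the tree disagrees with the ordering in the original space."""
    dists = np.linalg.norm(points - points[q_idx], axis=1)
    dists[q_idx] = np.inf                # exclude the query itself
    neighbours = np.argsort(dists)[:k]   # closest first
    discordant = 0
    for a in range(k):
        for b in range(a + 1, k):
            i, j = neighbours[a], neighbours[b]
            # i precedes j in the true ordering; the pair is discordant
            # if the tree orders them the other way round.
            if tree_distance(q_idx, i) > tree_distance(q_idx, j):
                discordant += 1
    return discordant
```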

stefansavev commented 8 years ago

Sounds like a very promising idea (the loss function). Maybe this discussion deserves a separate page; I'd be happy to contribute comments on it. One cheap way to try it is to take a dataset, compute the top 10 NN, and then run it through boosted decision trees for learning to rank (there is a high-quality implementation here: https://github.com/yasserg/jforests). Learning to rank considers discordant pairs in its objective.

I'll write separately for the other ideas.