pixelogik / NearPy

Python framework for fast (approximated) nearest neighbour search in large, high-dimensional data sets using different locality-sensitive hashes.
MIT License

Retrieving `k` nearest neighbours? #5

Open piskvorky opened 10 years ago

piskvorky commented 10 years ago

Is there a process or parameter setting that will make sure queries return no fewer than k ANN? (And preferably also not much more than k... exactly k would be ideal.)

Like when the user asks for exactly top-20 most similar items.

I'm imagining something like searching nearby buckets if there are too few candidates. I haven't checked the code yet, maybe it's already there :)

pixelogik commented 10 years ago

Not really yet. If the combined candidate list of all buckets that were "hit" by the hashes has fewer than k vectors, then this is not possible yet. If the candidate set is larger, the NearestFilter returns exactly k.

I do not see a simple solution to this, because searching "nearby" buckets would mean asking a hash for buckets "near" the given query vector. That is equivalent to computing the hash multiple times for points around the query vector, which in the end amounts to using bigger bins/buckets (in terms of space) or using multiple hashes.

So a simple solution for now would be to just use multiple hashes at the same time, which is built into the architecture of the library (see the sketch below). Using multiple hashes helps with this problem as long as the data is distributed approximately evenly in space. If that is not the case and your data has varying densities, this is hard. In that case we need a new type of hash function that takes density into account and that can be trained on a training set to learn the densities. I would be happy to implement something like that together with you if you are interested. The library has been lying around for some time now anyway, and that hurts my eyes :)
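
A minimal sketch of that multi-hash setup, assuming the current Engine API; the hash names, dimension and counts are made up for illustration:

```python
import numpy

from nearpy import Engine
from nearpy.filters import NearestFilter
from nearpy.hashes import RandomBinaryProjections

dim = 100  # feature space dimension (illustrative)
k = 20     # desired number of neighbours

# Several independent hashes: each query hits one bucket per hash, and the
# candidate lists of all hashes are combined, enlarging the candidate set.
hashes = [RandomBinaryProjections('rbp%d' % i, 10) for i in range(4)]
engine = Engine(dim, lshashes=hashes, vector_filters=[NearestFilter(k)])

for i in range(10000):
    engine.store_vector(numpy.random.randn(dim), 'item_%d' % i)

# NearestFilter(k) returns at most k neighbours; the engine can still
# return fewer than k if the combined candidate set is too small.
neighbours = engine.neighbours(numpy.random.randn(dim))
```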

Dunno how quickly you need this, but if you would like to implement something like that, I would be happy to help. Giving a lot of support is currently hard, though, because I have a project to finish by February.

piskvorky commented 10 years ago

Alright, let's do it.

I'm currently benchmarking some ANN libs [1] because I want to add an ANN algorithm to gensim (gensim only has brute-force linear search now). The NearPy code looks high quality, so I'm thinking of adding it as a dependency instead of re-implementing something from scratch.

There are other people interested in contributing to a faster gensim k-nn algo, so we may get more help too.

[1] http://radimrehurek.com/2013/12/performance-shootout-of-nearest-neighbours-contestants

pixelogik commented 10 years ago

Alright. We might have to do some experiments regarding the speed of NearPy. As far as I remember (I have not touched the code base since the last commit), the sparse representation in the feature branch was slow compared to the non-sparse representation. However, the feature branch contains additions to store the hashes in storage (18a00641c432dade349c1267cbcc5322f191b159), which is important in order to persist a given configuration.

If you gain any insight into whether NearPy suits your needs or not, tell me. I would be happy to take NearPy to another level.

piskvorky commented 10 years ago

I published the benchmark numbers: http://radimrehurek.com/2014/01/performance-shootout-of-nearest-neighbours-querying/

I had to leave out LSH because of the k problems. Do you have some plan of action as to how to mitigate/solve it?

I would love to help/get more help, if you propose concrete steps of what to do next @pixelogik.

pixelogik commented 10 years ago

So you want to always get k results, independent of the query vector? As long as the density of your dataset is almost everywhere the same, this can be achieved by adjusting the "bin sizes" of the discretized projections (RandomDiscretizedProjections) so that there are always >k results in each bucket (see the sketch below). Another possibility is to just use multiple LSHs in the NearPy pipeline to increase the result sets. But this only works with constant densities.
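
A hedged sketch of the bin-size approach; the projection count and bin width here are arbitrary:

```python
from nearpy import Engine
from nearpy.filters import NearestFilter
from nearpy.hashes import RandomDiscretizedProjections

dim = 100
k = 20

# RandomDiscretizedProjections(name, projection_count, bin_width):
# fewer projection vectors and a larger bin width both enlarge the
# buckets, so more candidates are available per query.
rdp = RandomDiscretizedProjections('rdp', 5, 2.0)
engine = Engine(dim, lshashes=[rdp], vector_filters=[NearestFilter(k)])
```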

It is tricky when it comes to real world datasets with varying densities.

The LSH methods in NearPy use projections onto individual vectors in the feature space. These projection vectors are either generated randomly (RandomDiscretizedProjections, RandomBinaryProjections) or using a PCA analysis of a given training data set (PCADiscretizedProjections, PCABinaryProjections).

After indexing with the chosen LSHs, the size of each bucket depends on the following parameters:

1. With both binary and discretized projections, the more projection vectors one LSH has, the fewer vectors fall into one bucket.
2. With discretized projections, the larger the bin size, the more vectors fall into one bucket.
3. With both projection types, the more LSHs you use (each having, for example, only one projection vector), the more candidate vectors you get for a query. But this is limited by the actual density; more LSHs only means better sampling of the local neighbourhood.

So only (1) and (2) have an impact with respect to density, and the only way (in this NearPy setup) to force bucket sizes up is (2).

(2), however, works against the whole ANN idea: we want small buckets in order to make ANN fast.

So with the existing setup, NearPy cannot be density-sensitive.
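
For concreteness, a sketch of those three knobs using the current hash constructors; all names and numbers are illustrative:

```python
from nearpy.hashes import RandomBinaryProjections, RandomDiscretizedProjections

# (1) More projection vectors per hash -> smaller buckets.
fine = RandomBinaryProjections('fine', 16)     # many small buckets
coarse = RandomBinaryProjections('coarse', 4)  # fewer, larger buckets

# (2) Larger bin width -> larger buckets (discretized projections only).
wide_bins = RandomDiscretizedProjections('wide', 8, 4.0)
narrow_bins = RandomDiscretizedProjections('narrow', 8, 0.5)

# (3) More hashes in one Engine -> more buckets hit per query, i.e. better
#     sampling of the local neighbourhood (limited by the actual density).
many_hashes = [RandomBinaryProjections('rbp%d' % i, 8) for i in range(10)]
```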

We could do something like a hierarchical projection onto the projection vectors, with buckets and sub-buckets and so forth (like Binary Space Partitioning but not binary).
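
A toy illustration of that hierarchical-bucket idea (not NearPy code; hierarchical_key and the bin widths are hypothetical):

```python
import numpy

def hierarchical_key(value, bin_widths=(8.0, 2.0, 0.5)):
    """Bucket key as a coarse-to-fine path of bin indices along one projection."""
    return tuple(int(numpy.floor(value / w)) for w in bin_widths)

# A query could start at the finest level and fall back to coarser parent
# buckets until at least k candidates have been collected.
print(hierarchical_key(5.3))  # -> (0, 2, 10)
```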

Unfortunately I am still busy with this client project, but as soon as I have more time I will do some research on existing methods / papers.

I did a little drawing to help me think :)


pixelogik commented 10 years ago

@piskvorky So regarding next steps. Of course help and working together would be great.

If you are interested in this, you could check out papers about density-sensitive LSHs, or "Density Sensitive Hashing". I did a first Google search on that; there are some papers about stuff like that, but it seems to be a field without many publications.

pixelogik commented 10 years ago

@piskvorky Maybe using a tree structure in each individual LSH would solve that: a kd-tree or range tree or something (just some buzzwords, I have to check the details first). Because we can control the dimensionality of the LSHs, we can keep it low to make these trees efficient.

For example, one could use multiple RandomDiscretizedProjections hashes, each having three projection vectors (dim=3). In those three-dimensional projection spaces such trees might be efficient enough. The actual type of tree should be chosen for retrieval and insertion performance; I guess memory is not a big issue. A configuration sketch follows below.
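
A sketch of that configuration, assuming the current constructors; the hash count, bin width and dimensions are made up:

```python
from nearpy import Engine
from nearpy.filters import NearestFilter
from nearpy.hashes import RandomDiscretizedProjections

dim = 100
k = 20

# Each hash projects into a 3-dimensional space; a per-hash tree structure
# (kd-tree, range tree, ...) could later live in these low-dim spaces.
hashes = [RandomDiscretizedProjections('rdp%d' % i, 3, 1.0) for i in range(8)]
engine = Engine(dim, lshashes=hashes, vector_filters=[NearestFilter(k)])
```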

piskvorky commented 10 years ago

Thanks a lot for the great write-up!

I didn't think of the low-density areas; my first, naive idea stopped at what your sketch suggests for the "high density" areas :) Namely: automatically keep decreasing the number of projection vectors, or increasing the number of hashes, until every conceivable query returns >=k results, then keep that as the final index.

I'll check some research as well and think about your "tree" proposal.

pixelogik commented 10 years ago

My pleasure. I really like this topic and would enjoy it if NearPy improved collaboratively.

pixelogik commented 9 years ago

Will close this in favour of the other ticket, "Experiment with binary trees". Will try to do that soon.

pixelogik commented 9 years ago

So I added a new hash called RandomBinaryProjectionTree that uses a binary tree to always guarantee N results (when combined with a NearestFilter(N)), if you are still looking for that (it took some time). A usage sketch is below. Cheers
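
A usage sketch, assuming the constructor takes a hash name, a projection count and the minimum result size; the numbers are illustrative:

```python
import numpy

from nearpy import Engine
from nearpy.filters import NearestFilter
from nearpy.hashes import RandomBinaryProjectionTree

dim = 100
N = 20

# The tree is meant to supply at least N candidates per query, so combined
# with NearestFilter(N) the engine should return exactly N neighbours
# (once at least N vectors have been indexed).
rbpt = RandomBinaryProjectionTree('rbpt', 10, N)
engine = Engine(dim, lshashes=[rbpt], vector_filters=[NearestFilter(N)])

for i in range(10000):
    engine.store_vector(numpy.random.randn(dim), 'item_%d' % i)

neighbours = engine.neighbours(numpy.random.randn(dim))
```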

piskvorky commented 9 years ago

Yes I am :)

Is this the same algo as used in Annoy? Though multiple trees were used in Annoy IIRC, so I guess not.

I'll plug this implementation into the "shootout benchmark", and let you know how it fares.

pixelogik commented 9 years ago

I am not sure; I have not read the Annoy source code yet. This is a somewhat naive implementation, not really tested for quality or performance, so getting benchmark feedback would be awesome!

It would be a wonder if it were good right from the start. I want to do experiments with data of my own soon, so right now it is really a shot in the dark :)

pixelogik commented 9 years ago

Do you have predefined configurations for the hashes, I mean the projection count? Because for a given feature-space dimension and data-set size, the projection count is relevant to the quality.

I added this "candidate_count" method to the engine to get information about how many candidates are retrieved from the buckets for a given query vector. This allows optimizing the hash: increase the projection count if the candidate count is too high, and lower it if the candidate count is too low.

I also want to run experiments on this soon, to derive a rule of thumb for picking the projection count depending on feature-space dimensionality and data-set size. A tuning sketch is below.
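
A hedged sketch of that tuning loop using engine.candidate_count(); the data, thresholds and bounds are arbitrary:

```python
import numpy

from nearpy import Engine
from nearpy.filters import NearestFilter
from nearpy.hashes import RandomBinaryProjections

dim, k = 100, 20
vectors = [numpy.random.randn(dim) for _ in range(10000)]
sample_queries = [numpy.random.randn(dim) for _ in range(50)]

def build_engine(projection_count):
    engine = Engine(dim,
                    lshashes=[RandomBinaryProjections('rbp', projection_count)],
                    vector_filters=[NearestFilter(k)])
    for i, v in enumerate(vectors):
        engine.store_vector(v, 'item_%d' % i)
    return engine

# Raise the projection count while the average candidate set is too large
# (slow exact re-ranking); stop before buckets get so small that queries
# see fewer than k candidates.
projection_count = 4
while projection_count < 20:
    engine = build_engine(projection_count)
    avg = numpy.mean([engine.candidate_count(q) for q in sample_queries])
    if avg <= 10 * k:
        break
    projection_count += 1
```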

piskvorky commented 9 years ago

You mean in Annoy? There's only a param controlling the number of trees (perf/precision tradeoff) -- I evaluated various choices for this param in the "wikipedia shootout benchmark", from 1 tree to 500.

pixelogik commented 9 years ago

Ok. If you want to check different configurations with the new NearPy hash, you could vary the projection count; that is the main parameter there.

pixelogik commented 9 years ago

So apparently the RandomBinaryProjectionTree sucks, because it does not take the Hamming distance of the binary bucket keys into account (Xing's contribution made that clear).

I am currently experimenting with different approaches that do respect the Hamming distance of binary bucket keys. I think it does not make sense to perform experiments on NearPy right now, because performance would vary strongly with the Engine configuration. I will try to make this simpler.
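
For context, a toy illustration (not NearPy code) of what Hamming-distance-aware probing could look like: query not only the exact binary bucket key, but also all keys one bit-flip away, multi-probe style.

```python
def keys_within_hamming_one(key):
    """Yield the binary bucket key itself plus all keys one bit-flip away."""
    yield key
    for i in range(len(key)):
        flipped = '1' if key[i] == '0' else '0'
        yield key[:i] + flipped + key[i + 1:]

print(list(keys_within_hamming_one('101')))
# -> ['101', '001', '111', '100']
```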

duhaime commented 8 years ago

Hi @pixelogik, have you by chance had an opportunity to resume work on this? I'm also very interested in specifying/increasing k, and would love to hear whether the more recent commits have contributed to this functionality.

MaxwellRebo commented 8 years ago

What was the resolution to this?

ayush488 commented 7 years ago

So does NearPy have a parameter to input the 'k' value? Instead of returning only 10 NN, I want to get a user-given 'k' neighbours. How can I do that?