pixelogik / NearPy

Python framework for fast (approximated) nearest neighbour search in large, high-dimensional data sets using different locality-sensitive hashes.
MIT License

Nearest Neighbours in Cosine distance #19

Closed: shixing closed this issue 9 years ago

shixing commented 10 years ago

A query vector V is hashed into a bucket B, and V is then compared with all vectors in B. Using only this single bucket B usually leads to poor retrieval performance.

A good way to solve this is to also look up the closest neighbour buckets in terms of Hamming distance. E.g. for B = "1001", the neighbour buckets could be "1000", "1011", ... Then V is compared with all vectors in these buckets.
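For illustration, a minimal sketch of enumerating such neighbour bucket keys, assuming binary string keys (this is a hypothetical helper, not the extension class mentioned below):

```python
from itertools import combinations

def neighbour_bucket_keys(bucket_key, max_hamming_distance=1):
    """Enumerate binary bucket keys within max_hamming_distance of bucket_key."""
    keys = [bucket_key]
    for d in range(1, max_hamming_distance + 1):
        for positions in combinations(range(len(bucket_key)), d):
            flipped = list(bucket_key)
            for p in positions:
                flipped[p] = '1' if flipped[p] == '0' else '0'
            keys.append(''.join(flipped))
    return keys

# neighbour_bucket_keys('1001') -> ['1001', '0001', '1101', '1011', '1000']
```

Exhaustive enumeration like this grows quickly with key length and Hamming radius, which is why the permuted-index approach described later in this thread scales better.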

What do you think about this? I've implemented an extension class based on your Engine class and would like to incorporate it into your code.

Thanks, Xing

pixelogik commented 10 years ago

Hi Xing,

it is true that if you only use one hash, only one bucket is returned for a query vector.

You can use multiple hashes in the engine to get better results. Doing this is also a good idea because the quality of retrieval really varies with different random projections.
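For reference, a minimal sketch of an engine with several hashes, along the lines of the README example (dimension, hash names and counts here are arbitrary):

```python
import numpy
from nearpy import Engine
from nearpy.hashes import RandomBinaryProjections

dimension = 100

# Several independent 10-bit random projections; for a query the engine
# collects candidates from the matching bucket of every hash, not just one.
hashes = [RandomBinaryProjections('rbp_%d' % i, 10) for i in range(4)]
engine = Engine(dimension, lshashes=hashes)

for i in range(10000):
    engine.store_vector(numpy.random.randn(dimension), 'data_%d' % i)

neighbours = engine.neighbours(numpy.random.randn(dimension))
```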

Another possibility to get more than one bucket is to use the new RandomBinaryProjectionTree hash. But I have not done experiments on it regarding quality / performance. Will do so soon.

So yeah, having another method to get "close" buckets would be cool. Why don't you fork and open a pull request, or send the code to me at ole@pixelogik.de? I am always happy if people wanna help improve NearPy!

Best, Ole

pixelogik commented 10 years ago

So do you wanna contribute it? Would be really awesome! Neighbour buckets sound cool. Do you use a tree to speed up the bucket-similarity computation?

shixing commented 10 years ago

Yes, I do want to contribute. I use several sorted lists of randomly permuted keys to find similar buckets.

For example:
1) A binary key list: ['0001', '1000', '0101', '0100'].
2) One permutation mapping [0,1,2,3] -> [1,0,3,2], e.g. '0001' => '0010'.
3) The permuted list would then be ['0010', '0100', '1010', '1000'].
4) Sort the list of (permuted key, original key) pairs: [('0010','0001'), ('0100','1000'), ('1000','0100'), ('1010','0101')].

Then, suppose the query key is '0010' and we want to find its neighbours:
1) Permute the query key: '0010' => '0001'.
2) Binary-search for ('0001', '0010') in the sorted permuted list; this returns position 0.
3) Take the b entries around position 0 as neighbour keys. Here, with b = 1, the neighbour entry is ('0010', '0001'), so the original key '0001' is a neighbour key of '0010'.
4) In practice you need several permuted key lists to find the neighbours, and the result is still only an approximation of the true Hamming neighbours.
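For illustration, a minimal sketch of this permuted-index lookup, assuming binary string keys (the function names are hypothetical, and this is not the code from the pull request):

```python
import bisect
import random

def build_permuted_lists(bucket_keys, num_permutations=4):
    """Build several sorted lists of (permuted key, original key) pairs."""
    key_len = len(bucket_keys[0])
    permuted_lists = []
    for _ in range(num_permutations):
        perm = list(range(key_len))
        random.shuffle(perm)
        pairs = sorted((''.join(key[i] for i in perm), key) for key in bucket_keys)
        permuted_lists.append((perm, pairs))
    return permuted_lists

def neighbour_keys(permuted_lists, query_key, beam=1):
    """Collect bucket keys whose permuted form lies next to the permuted query key."""
    result = set()
    for perm, pairs in permuted_lists:
        permuted_query = ''.join(query_key[i] for i in perm)
        # Binary search for the permuted query key in the sorted list
        pos = bisect.bisect_left(pairs, (permuted_query, ''))
        window = pairs[max(0, pos - beam):pos + beam]
        result.update(original for _, original in window)
    return result
```

Each permuted list catches a different subset of near-Hamming keys, so unioning the windows from several lists approximates the true Hamming neighbourhood of the query key.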

The method is described in detail in the attached paper (section 3).

Thanks, Xing

shixing commented 10 years ago

The problem is that this method only works in terms of Hamming distance, so it won't work if others use non-binary keys. Currently, I just use a single 15-bit RandomBinaryProjections hash plus this similar-key method. I wonder how my code should be integrated into your well-designed framework?

Here is my plan:

I will add one component, permutation, which is responsible for building the permuted index and returning the neighbour keys.

I will change the neighbours function like this:

    def neighbours(self, v):
        candidates = []
        for lshash in self.lshashes:
            for bucket_key in lshash.hash_vector(v, querying=True):
                bucket_content = []
                if isinstance(lshash, RandomBinaryProjections):
                    # Also fetch the buckets whose keys are close to bucket_key
                    neighbour_keys = self.permutation.get_neighbour_keys(lshash.hash_name, bucket_key)
                    for bkey in neighbour_keys:
                        bucket_content += self.storage.get_bucket(lshash.hash_name, bkey)
                else:
                    bucket_content = self.storage.get_bucket(lshash.hash_name, bucket_key)
                candidates.extend(bucket_content)

        # Apply distance implementation if specified
        if self.distance:
            candidates = [(x[0], x[1], self.distance.distance(x[0], v)) for x in candidates]

        # Apply vector filters if specified and return filtered list
        if self.vector_filters:
            filter_input = candidates
            for vector_filter in self.vector_filters:
                filter_input = vector_filter.filter_vectors(filter_input)
            # Return output of last filter
            return filter_input

        # Otherwise return the unfiltered candidates
        return candidates

What do you think about this?

Thanks, Xing

pixelogik commented 10 years ago

That it only works for binary keys is fine. Most people use binary random hashes anyway.

I will take time on the weekend to read into your method and review the pull request. So far, big thanks for the contribution! Always cool to learn new approaches.

shixing commented 10 years ago

Hi Ole,

A question when using LSH.

I now have millions of vectors, and when I just use a random binary hash, the buckets are not balanced: most buckets contain only one vector, whereas some contain about 2000 vectors. Is there any technique that can break down these big buckets, i.e. generate the hyperplanes according to the data instead of randomly pre-generating them?

Thanks, Xing

pixelogik commented 10 years ago

There is a new class RandomBinaryProjectionTree to cope with this. It uses a binary tree to combine buckets in order to always return N results. So you could set the projection count high enough that most buckets contain only a few vectors. There is example code in the README of NearPy.

I have not tested the RandomBinaryProjectionTree with data / experiments, maybe this would be a good opportunity?
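For reference, a rough sketch of how this could be wired up; the RandomBinaryProjectionTree constructor arguments (hash name, projection count, minimum result size) are assumed from the README example:

```python
import numpy
from nearpy import Engine
from nearpy.hashes import RandomBinaryProjectionTree

dimension = 100

# 20 random projections; the tree merges sibling buckets until at least
# 20 candidates are collected for a query.
rbpt = RandomBinaryProjectionTree('rbpt', 20, 20)
engine = Engine(dimension, lshashes=[rbpt])

for i in range(100000):
    engine.store_vector(numpy.random.randn(dimension), 'data_%d' % i)

neighbours = engine.neighbours(numpy.random.randn(dimension))
```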

pixelogik commented 10 years ago

Also: you could use PCABinaryProjections to use hyperplanes learned from the data.
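A rough sketch of that variant; the PCABinaryProjections constructor arguments (hash name, projection count, training set) and the orientation of the training matrix are assumptions here, so check the class docstring before relying on them:

```python
import numpy
from nearpy import Engine
from nearpy.hashes import PCABinaryProjections

dimension = 100

# Train the splitting hyperplanes on a sample of the real data so they follow
# its principal directions instead of being purely random.
training_sample = numpy.random.randn(dimension, 1000)
pca = PCABinaryProjections('pca', 10, training_sample)
engine = Engine(dimension, lshashes=[pca])
```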

shixing commented 10 years ago

That sounds great, I'll test RandomBinaryProjectionTree.

shixing commented 10 years ago

Hi Ole,

I've looked at your RandomBinaryProjectionTree code. Your goal is to return exactly n results to the user. The bucket keys under a certain subtree share the same prefix, but a shared prefix doesn't mean those bucket keys have a small Hamming distance to the matching bucket key. This may cause some issues, but since the goal is to provide approximate neighbours anyway, this should be fine.

Thx, Xing

pixelogik commented 10 years ago

Hi Xing,

yes, that is true. The intention of that class was to try a simple way to always retrieve N results. Hamming distance was not a concern in that design.

That's why your contribution is cool and I will check the pull request tomorrow :)

Best, Ole

pixelogik commented 10 years ago

I am closing this issue because Xing provided a new component that performs permutations of binary bucket keys.

pixelogik commented 10 years ago

Hi again Xing,

after thinking a bit about Hamming-distance-optimized meta-hashes and working on another variant of this approach, I figured that your component could use some more documentation and comments.

Could you maybe do a fork of the develop branch and add documentation and comments to the three files permutation.py, permute.py and permutedIndex.py? The code is hard to read / understand. This would be really awesome because I want to keep the library easy to understand.

shixing commented 10 years ago

Okay. I'll do it tonight.

pixelogik commented 10 years ago

Awesome, thanks!

bduffany commented 8 years ago

Could someone post a link to the paper that was attached via e-mail (mentioned way early in this thread)? I'd really like to take a look at it :)

shixing commented 8 years ago

Here it is https://www.cs.cmu.edu/~hovy/papers/05ACL-clustering.pdf
