pixelogik / NearPy

Python framework for fast (approximated) nearest neighbour search in large, high-dimensional data sets using different locality-sensitive hashes.
MIT License
759 stars 152 forks source link

Crashing in storing and retrieve when data is string or list of string #9

Closed linearregression closed 10 years ago

linearregression commented 10 years ago

Try this library when data is a string or a list of string. Expect storage a success and get nearest neighbor etc upon querying.

Code: import numpy from numpy import random from nearpy import Engine from nearpy.hashes import RandomBinaryProjections from nearpy.filters import NearestFilter from nearpy.distances import EuclideanDistance

dimension=5000 encoding_length = 10 rbp = RandomBinaryProjections('rbp', encoding_length) #or PCABinaryProjection engine = Engine(dimension, lshashes=[rbp], distance=EuclideanDistance(), vector_filters=[NearestFilter(encoding_length)]) engine.clean_all_buckets() i = numpy.random.randn(dimension) s = 'abcde' engine.store_vector(i, s) y = engine.neighbours(s)

CRASH: Traceback (most recent call last): File "", line 1, in File "/usr/local/lib/python2.7/dist-packages/nearpy/engine.py", line 93, in neighbours for bucket_key in lshash.hash_vector(v): File "/usr/local/lib/python2.7/dist-packages/nearpy/hashes/randombinaryprojections.py", line 63, in hash_vector return [''.join(['1' if x > 0.0 else '0' for x in projection])] TypeError: 'NotImplementedType' object is not iterable ret = engine.neighbours(s)

pixelogik commented 10 years ago

You are calling engine.neighbours with the data s as parameter. This is invalid usage of the method. The method expects the query vector as a parameter. This is what the engine does: Indexing data (your string) using the vector. When you later want to retrieve data, you specify a query vector to get the nearest neighbours (vectors) with their data.

The methods are very well documented in code, it really helps to dive into that to understand how to use the library:

def neighbours(self, v):
"""
Hashes vector v, collects all candidate vectors from the matching
buckets in storage, applys the (optional) distance function and
finally the (optional) filter function to construct the returned list
of either (vector, data, distance) tuples or (vector, data) tuples.
"""

So in your case just replace the s with the i in your call and you get the same vector with its data:

y = engine.neighbours(i)
print y[0]

this gives

(array([ 1.39976552, -0.24555724,  0.79080975, ...,  0.90452441, -0.11857054,  0.74494529]),
'abcde', 0.0)

the first entry y[0][0] is the nearest vector (in this case i itself), the second entry y[0][1] is the data (your string 'abcde') and the third entry y[0][2] is the distance of the vector to the query vector. in this case 0.0 because i is used as query vector.

So you get the same result if you use a query vector that is different from i, but close:

i = i + 0.0001
y = engine.neighbours(i)
print y[0]

also gives

(array([ 1.39976552, -0.24555724,  0.79080975, ...,  0.90452441, -0.11857054,  0.74494529]),
'abcde', 0.0)

that is the idea of LSH. You mostly do not query with the same vectors but with different vectors because you want to get indexed vectors and data that is close ("near") to your query. For example if you are looking for images that are similar to each other in a feature space that you construct. You query with a feature vector of one image, and get tons of other vectors from the engine, that represent other images (the data could be the id from a database where the images are stored).

If you have more questions, please ask.