pixelogik / NearPy

Python framework for fast (approximated) nearest neighbour search in large, high-dimensional data sets using different locality-sensitive hashes.
MIT License
763 stars 151 forks source link

Empty Sequence returned by Nearest Neighbour query #59

Closed deshanadesai closed 8 years ago

deshanadesai commented 8 years ago

Hi,

I am using the example usage code snippet to hash vectors into buckets and find the nearest neighbour. I have stored 20 vectors of 4096 dimensions each in the engine. Querying the engine for the nearest neighbour of a test vector occasionally returns an empty sequence.

I tried increasing the number of random projections to 4000 with the same results. Can you help me to understand why this is happening?

Thanks! Deshana

amorgun commented 8 years ago

@deshanadesai Can you provide an example code?

deshanadesai commented 8 years ago

@amorgun - Here is an example code:

dimension = 4096
rbp = RandomBinaryProjections('rbp', 50)
engine = Engine(dimension, lshashes=[rbp])

for index in range(0,20):
    # fetch a numpy array 'arr' of shape (4096,1)
    engine.store_vector(arr, 'data_%d' % index)

# Passing a query vector to search for the nearest neighbour:
N = engine.neighbours(query)
print N

Any help will be appreciated. Thanks!

amorgun commented 8 years ago

@deshanadesai It looks like the length of hash is too high. Effectively you split your space into 2^50 parts and at most 20 of them contains some points. If you choose a query randomly chance to find some neighbours is at most 20 / 2^50 = 1.7763568e-14.

Some code for experiments:

import numpy as np

from nearpy import Engine
from nearpy.hashes import RandomBinaryProjections

np.random.seed(123)

dimension = 4096
rbp = RandomBinaryProjections('rbp', 5, rand_seed=20)
engine = Engine(dimension, lshashes=[rbp])
vectors = [np.random.randn(dimension) for _ in range(20)]
for index, vector in enumerate(vectors):
    engine.store_vector(vector, 'data_%d' % index)
for vector in vectors:
    N = len(engine.neighbours(vector))
    assert N > 0

N_random = [len(engine.neighbours(np.random.randn(dimension)))
            for _ in range(1000)]
print("Nonzero: %d out of 1000" % sum(i > 0 for i in N_random))
deshanadesai commented 8 years ago

I understood this. Thank you very much @amorgun . It would be nice to have a method that selects the optimum number of binary projections. I would be willing to further read up on this and contribute to the code if you think it is worthwhile.