pixelogik / NearPy

Python framework for fast (approximated) nearest neighbour search in large, high-dimensional data sets using different locality-sensitive hashes.
MIT License
759 stars, 152 forks

stored vector revised #86

Open xyyimian opened 5 years ago

xyyimian commented 5 years ago

I was trying to compute the recall rate using NearPy, but I found that the vectors I stored have been changed.

Here is what I did. I stored a set of vectors in the engine and then ran recall_list = engine.neighbours(vectors[0]). When I printed recall_list[0], its third element (the distance) was 0.0, which indicates that recall_list[0] corresponds to vectors[0]. But when I compared the actual element values, I found that the returned vector differs from the one I stored.

That's why I cannot look up the returned vector in my original vectors.

I don't know what's wrong. Thanks for your reply.

xyyimian commented 5 years ago

I just found out why the vectors are changed: the library uses unitvec to normalize vectors before storing them. I think this will cause bugs if users want to use the Euclidean metric.

I ran a test. I stored [1,1,1.1] and [0.1,0.1,0.1] in the engine and queried with [1,1,1]. The nearest neighbor returned was the normalized [0.1,0.1,0.1], even though [1,1,1.1] is the closer vector in Euclidean distance.
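The effect in this test can be reproduced without NearPy. The following is a minimal sketch in plain Python, not the library's code: the helpers euclidean, unit, and nearest are hypothetical names introduced here, and the sketch assumes both the stored vectors and the query are scaled to unit length before distances are computed.

```python
import math

def euclidean(x, y):
    # Plain Euclidean distance between two equal-length vectors.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def unit(v):
    # Scale v to unit length (hypothetical stand-in for a unit-vector helper).
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def nearest(stored, query):
    # Brute-force search: index of the stored vector closest to the query.
    return min(range(len(stored)), key=lambda i: euclidean(stored[i], query))

stored = [[1.0, 1.0, 1.1], [0.1, 0.1, 0.1]]
query = [1.0, 1.0, 1.0]

# On the raw vectors, [1, 1, 1.1] (index 0) is the Euclidean nearest neighbour.
raw_winner = nearest(stored, query)

# Assumption: everything is normalized first. [0.1, 0.1, 0.1] and [1, 1, 1]
# point in the same direction, so they collapse onto the same unit vector.
normalized_winner = nearest([unit(v) for v in stored], unit(query))

print(raw_winner, normalized_winner)
```

Normalization discards vector length, which is exactly the information Euclidean distance depends on, so the ranking changes once the vectors are scaled.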

pixelogik commented 5 years ago

@xyyimian You are right. For Euclidean distance the normalization is simply a bug. It probably went undetected because the default is cosine and most people use cosine anyway.

We should add a new function to the Distance class, called _normalizevector(), which the Engine would call instead of always normalizing. Do you want to do this? If not, I will do it at some point in the future.
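One way the proposal above could look is sketched below. This is only an illustration under stated assumptions, not NearPy's actual code: the classes Distance, EuclideanDistance, CosineDistance, and ToyEngine are hypothetical stand-ins written for this example, the engine does a brute-force scan with no hashing, and the hook name _normalizevector is taken from the comment above.

```python
import math

class Distance:
    """Hypothetical stand-in base class, not NearPy's Distance."""

    def distance(self, x, y):
        raise NotImplementedError

    def _normalizevector(self, v):
        # Default hook: store and query vectors unchanged.
        return list(v)

class EuclideanDistance(Distance):
    # Inherits the no-op hook, so vector lengths are preserved.
    def distance(self, x, y):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

class CosineDistance(Distance):
    def _normalizevector(self, v):
        # Cosine distance ignores length, so unit scaling is harmless here.
        norm = math.sqrt(sum(a * a for a in v))
        return [a / norm for a in v] if norm else list(v)

    def distance(self, x, y):
        dot = sum(a * b for a, b in zip(x, y))
        nx = math.sqrt(sum(a * a for a in x))
        ny = math.sqrt(sum(b * b for b in y))
        if nx == 0 or ny == 0:
            return 1.0  # assumption: treat zero vectors as maximally distant
        return 1.0 - dot / (nx * ny)

class ToyEngine:
    """Brute-force toy engine; it only shows where the hook is called."""

    def __init__(self, distance):
        self.distance = distance
        self.vectors = []

    def store_vector(self, v):
        # The metric, not the engine, decides whether to normalize.
        self.vectors.append(self.distance._normalizevector(v))

    def neighbours(self, q):
        q = self.distance._normalizevector(q)
        scored = [(v, self.distance.distance(v, q)) for v in self.vectors]
        return sorted(scored, key=lambda pair: pair[1])

engine = ToyEngine(EuclideanDistance())
for v in ([1.0, 1.0, 1.1], [0.1, 0.1, 0.1]):
    engine.store_vector(v)
best_vector, best_distance = engine.neighbours([1.0, 1.0, 1.0])[0]
print(best_vector, best_distance)
```

With the Euclidean metric the toy engine stores vectors untouched, so the query [1,1,1] from the test above resolves to [1,1,1.1], and returned vectors can be matched against the originals. Only the cosine metric opts in to unit scaling.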