microsoft / SPTAG

A distributed approximate nearest neighborhood search (ANN) library which provides a high quality vector index build, search and distributed online serving toolkits for large scale vector search scenario.
MIT License

the SPTAGClient.AnnClient.Search method treats query and input vectors differently wrt normalization #82

Open RZachLamberty opened 5 years ago

RZachLamberty commented 5 years ago

Bug description: When we build an SPTAGClient.AnnClient object against an ANN server that has loaded an index with Index.DistCalcMethod=Cosine (the default), I expect that if I Search for any vector in that index (including non-unit vectors), the nearest neighbor returned by Search is that vector itself, with distance=0. The current behavior is to return a distance of 1 - np.linalg.norm(non_unit_vector) instead.

To reproduce:

  1. copy the Singlebox Python Wrapper example in the GettingStart.md file
  2. edit the last two lines to replace L2 with Cosine
  3. run this file

Expected behavior: For the 10-element query vectors [0, 0, ..., 0], [2, 2, ..., 2], and [4, 4, ..., 4], the cosine distance to every vector in the input index (all [n, n, ..., n] for 0 < n < 100) should be exactly 0, since they have different magnitudes but the same direction.
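As a rough check of that expectation, here is a minimal NumPy sketch, independent of SPTAG. The helper name `cosine_distance` is made up for illustration, and the vectors are rebuilt from the description above rather than copied from GettingStart.md. The all-zero query is left out because a zero vector has no direction, so the standard formula is undefined for it.

```python
import numpy as np

def cosine_distance(x, y):
    """Standard cosine distance, 1 - cos(x, y); undefined for zero vectors."""
    return 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

dim = 10

# Index vectors [n, n, ..., n] for 0 < n < 100, as described above.
index_vectors = [np.full(dim, n, dtype=np.float64) for n in range(1, 100)]

# The two nonzero query vectors from the description.
queries = [np.full(dim, 2.0), np.full(dim, 4.0)]

# Largest absolute cosine distance over all (query, index vector) pairs.
max_distance = max(
    abs(cosine_distance(q, v)) for q in queries for v in index_vectors
)
print(max_distance)
```

Up to floating-point rounding, every pair should come out at distance 0, because all of these vectors point in the same direction.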

Observed behavior: The measured distance is not 1 - CosineSim(x, y), but instead 1 - |x| * CosineSim(x, y). For the three test query vectors, this means we see
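The sketch below is not the server output. It only evaluates the two formulas quoted above in plain NumPy, to show how far apart they are for a non-unit query. It assumes x is the query vector, which the formula does not state, and the function names are hypothetical.

```python
import numpy as np

def cosine_sim(x, y):
    """Cosine similarity of two nonzero vectors."""
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def expected_distance(x, y):
    """The distance I expect: 1 - CosineSim(x, y)."""
    return 1.0 - cosine_sim(x, y)

def observed_distance_model(x, y):
    """Model of the formula reported above: 1 - |x| * CosineSim(x, y).

    Assumption: x is the query vector. This is a model of the report,
    not SPTAG's actual code path.
    """
    return 1.0 - np.linalg.norm(x) * cosine_sim(x, y)

query = np.full(10, 2.0)    # non-unit query [2, 2, ..., 2]
indexed = np.full(10, 2.0)  # the identical vector stored in the index

print(expected_distance(query, indexed))
print(observed_distance_model(query, indexed))
```

When the two vectors share a direction, CosineSim is 1, so the second formula reduces to 1 - |x|, which matches the 1 - np.linalg.norm(non_unit_vector) value in the bug description and is negative for any query with norm greater than 1.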

Desktop: using the current Dockerfile build

RZachLamberty commented 4 years ago

Is there any update on this issue?