A distributed approximate nearest neighborhood search (ANN) library which provides a high quality vector index build, search and distributed online serving toolkits for large scale vector search scenario.
MIT License
the SPTAGClient.AnnClient.Search method treats query and input vectors differently wrt normalization #82
Bug description
When we build an `SPTAGClient.AnnClient` object off of an ANN server that loaded an `Index.DistCalcMethod=Cosine` (default) index, I expect that if I `Search` for any vector in that index (including non-unit vectors), the nearest neighbor returned by `Search` should be that vector itself, and it should have `distance=0`. The current behavior is to actually return `1 - np.linalg.norm(non_unit_vector)`.
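For reference, the expectation can be checked independently of SPTAG with plain NumPy. This is only a minimal sketch of the textbook cosine distance; `cosine_distance` is a helper name made up for this illustration and is not part of the SPTAG API.

```python
import numpy as np

def cosine_distance(x, y):
    """Textbook cosine distance: 1 - (x . y) / (|x| * |y|).

    Hypothetical helper for illustration only, not an SPTAG function.
    """
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    return 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

# A non-unit vector compared with itself and with a rescaled copy of itself.
non_unit_vector = np.full(10, 2.0)
self_distance = cosine_distance(non_unit_vector, non_unit_vector)
scaled_distance = cosine_distance(non_unit_vector, 7.0 * non_unit_vector)

print(self_distance, scaled_distance)
```

Both printed values should be 0 up to floating-point rounding, since cosine distance depends only on direction and not on magnitude.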
To Reproduce
Steps to reproduce the behavior:
`GettingStart.md`
file `L2`
with `Cosine`
Expected behavior
For 10-element query vectors `[0, 0, ..., 0]`, `[2, 2, ..., 2]`, and `[4, 4, ..., 4]`, the cosine distance of every vector in the input index (all `[n, n, ..., n]` for `0 < n < 100`) should be exactly 0 (they have different magnitudes but the same direction).
Observed Behavior
The measured distance is not `1 - CosineSim(x, y)`, but instead `1 - |x| * CosineSim(x, y)`. For the three test query vectors, this means we see
`[0, 0, 0]`
`[-5.324554920196533, -5.324554920196533, -5.324554920196533]` (note: `-5.324554920196533 = 1 - np.sqrt(10 * (2 ** 2)) = 1 - np.linalg.norm(q[1])`)
`[-11.649109840393066, -11.649109840393066, -11.649109840393066]` (note: `-11.649109840393066 = 1 - np.sqrt(10 * (4 ** 2)) = 1 - np.linalg.norm(q[2])`)
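The numbers above can be reproduced without a running server by evaluating the reported formula directly. This is a sketch under the assumption that the observed distance really is `1 - |x| * CosineSim(x, y)`; `cosine_sim` and `observed_distance` are hypothetical helper names, and nothing here calls SPTAG itself.

```python
import numpy as np

def cosine_sim(x, y):
    """Plain cosine similarity between two non-zero vectors."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

def observed_distance(x, y):
    """Distance formula reported in this issue: 1 - |x| * CosineSim(x, y).

    Hypothetical helper that models the observed behavior; it is an
    assumption about what the server computes, not SPTAG code.
    """
    return 1.0 - float(np.linalg.norm(x)) * cosine_sim(x, y)

# Any index vector [n, n, ..., n] with n > 0 has the same direction.
index_vector = np.full(10, 5.0)

# The two non-zero test queries (the all-zero query has no defined cosine).
queries = [np.full(10, 2.0), np.full(10, 4.0)]

observed = [observed_distance(q, index_vector) for q in queries]

# Same formula, but with the query rescaled to unit length first.
normalized = [observed_distance(q / np.linalg.norm(q), index_vector) for q in queries]

print(observed)
print(normalized)
```

The first list should match the reported distances up to float32 rounding, and the second should be 0 up to rounding. That is consistent with the index vectors being normalized while the query vector is not, which suggests that normalizing the query on the client side before calling `Search` might work around the problem, although I have not confirmed that against the server.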
Desktop (please complete the following information):
using the current `Dockerfile` build