microsoft / DiskANN

Graph-structured Indices for Scalable, Fast, Fresh and Filtered Approximate Nearest Neighbor Search
Other
1.02k stars 208 forks source link

[BUG] Cosine + StaticMemoryIndex not working #533

Open daxpryce opened 5 months ago

daxpryce commented 5 months ago

Expected Behavior

Created an index via diskannpy.build_memory_index(), then searched it via diskannpy.StaticMemoryIndex.search()

# points is a np.ndarray of dtype np.float32, ~ 16m points x 1024 dims

dap.build_memory_index(
    data=points, 
    distance_metric="cosine", 
    index_directory="/home/daxpryce/data/index", 
    complexity=128, 
    graph_degree=100, 
    num_threads=0
)

...

index = dap.StaticMemoryIndex(
    index_directory="/home/daxpryce/data/index", 
    initial_search_complexity=128, 
    num_threads=0
)

...

index.search(query_vector, k_neighbors=25, complexity=128)

I expect this to give me the 25 ANN identifiers and their respective distances, with 0 being the closest and 1 or 2 being the most different (note: I don't actually know if it should be 1 or 2 here)

Actual Behavior

I get back the nearest neighbors in the identifiers part of the NamedTuple, but distances are coming back as:

QueryResponse(identifiers=array([ 5345615,  3628601,  4692239,  8175590, 13827252,   125668,
        5478736,   830329,  7447148,  2656141,  7112823, 15433974,
       11870150, 16225298,  1824701,  7252223, 13124497,  4977132,
       11459158,  2871678,  9336432,  3713107, 13674569,  4685049,
        3025678], dtype=uint32), distances=array([-27.868996, -24.84326 , -24.837467, -24.808733, -24.793375,
       -24.774914, -24.771217, -24.758133, -24.736687, -24.698872,
       -24.68946 , -24.63995 , -24.63334 , -24.623182, -24.619724,
       -24.584936, -24.578463, -24.56511 , -24.550564, -24.539438,
       -24.507341, -24.501265, -24.500622, -24.490398, -24.486555],
      dtype=float32))

record scratch -27.86? come again? cosine? no.

Dataset Description

Please tell us about the shape and datatype of your data, (e.g. 128 dimensions, 12.3 billion points, floats)

Your Environment

Additional Details

The query vectors aren't being normalized prior to searching. The relative results seem to be correct, but if I manually normalize the query vector and then search, I get distances within ranges I would fully expect. Using scipy.spatial.distance.cdist(..., ..., metric="cosine") I verified that these distances are not, in fact, -27. I also verified that if I manually normalize the query vector prior to search, I do get identical distances.

Long story short, search does not always normalize the query vector. I also verified this happens on the native diskann class object created via https://github.com/microsoft/DiskANN/blob/9e876376349132236723d1d167c071887d807e9c/python/src/static_memory_index.cpp#L34

raw_index = dap._diskannpy.StaticMemoryFloatIndex(
    distance_metric=dap._diskannpy.COSINE,
    num_points=16287459,
    dimensions=1024,
    index_path="/home/daxpryce/data/chatlogs/e5/v2/index/ann",
    num_threads=0,
    initial_search_complexity=128,
)

raw_index.search(query_vector, 25, 128)

Results:

(array([ 5345615,  3628601,  4692239,  8175590, 13827252,   125668,
         5478736,   830329,  7447148,  2656141,  7112823, 15433974,
        11870150, 16225298,  1824701,  7252223, 13124497,  4977132,
        11459158,  2871678,  9336432,  3713107, 13674569, 11299587,
         4685049], dtype=uint32),
 array([-27.868996, -24.84326 , -24.837467, -24.808733, -24.793375,
        -24.774914, -24.771217, -24.758133, -24.736687, -24.698872,
        -24.68946 , -24.63995 , -24.63334 , -24.623182, -24.619724,
        -24.584936, -24.578463, -24.56511 , -24.550564, -24.539438,
        -24.507341, -24.501265, -24.500622, -24.493336, -24.490398],
       dtype=float32))

When I normalize it:

norm = np.linalg.norm(query_vector, keepdims=True)
norm_vec = query_vector / norm
index.search(norm_vec, 25, 128)

Results:

QueryResponse(identifiers=array([ 5345615,  3628601,  4692239,  8175590, 13827252,   125668,
        5478736,   830329,  7447148,  2656141,  7112823, 15433974,
       11870150, 16225298,  1824701,  7252223, 13124497,  4977132,
       11459158,  2871678,  9336432,  3713107, 13674569, 11299587,
        4685049], dtype=uint32), distances=array([3.57627869e-07, 1.04809523e-01, 1.05010152e-01, 1.06005609e-01,
       1.06537521e-01, 1.07176900e-01, 1.07304990e-01, 1.07758224e-01,
       1.08501136e-01, 1.09810948e-01, 1.10136986e-01, 1.11852050e-01,
       1.12080872e-01, 1.12432718e-01, 1.12552464e-01, 1.13757730e-01,
       1.13982022e-01, 1.14444435e-01, 1.14948213e-01, 1.15333557e-01,
       1.16445422e-01, 1.16655886e-01, 1.16678119e-01, 1.16930664e-01,
       1.17032349e-01], dtype=float32))

non-diskannpy/diskann cosine distance calcs:

comparisons = [3628601,  4692239,  8175590, 13827252,   125668, 5478736,   830329,  7447148,  2656141,  7112823, 15433974, 11870150, 16225298,  1824701,  7252223, 13124497,  4977132, 11459158,  2871678,  9336432,  3713107, 13674569, 11299587, 4685049 ]
vecs = all_points[comparisons]

import scipy
scipy.spatial.distance.cdist(np.array([query_vector]), vecs, metric="cosine")

Results: Note: I didn't bother with vector @ 5345615, as that was the query vector.

array([[0.10480966, 0.10500997, 0.10600552, 0.10653754, 0.1071771 ,
        0.10730503, 0.10775789, 0.10850099, 0.10981107, 0.11013656,
        0.11185234, 0.11208073, 0.11243275, 0.11255261, 0.11375798,
        0.11398193, 0.11444477, 0.11494837, 0.11533353, 0.11644555,
        0.11665589, 0.11667847, 0.11693091, 0.11703238]])

Last note: I am like 99% confident I have seen this work properly.