microsoft / DiskANN

Graph-structured Indices for Scalable, Fast, Fresh and Filtered Approximate Nearest Neighbor Search

[BUG] Low recall rate on a custom dataset #537

Closed igmor closed 4 months ago

igmor commented 4 months ago

Expected Behavior

I ran the benchmarks from the DiskANN repo on a custom dataset of 100K vectors with 512 dimensions and got some strange results: a low recall rate that goes down as L goes up, and low QPS.

I would normally expect QPS to go down as the dimensionality goes up, but not that dramatically, and I would expect recall to rise with L rather than fall.

Actual Behavior

Here are the results:

Loading the cache list into memory....done.
     L   Beamwidth             QPS    Mean Latency    99.9 Latency        Mean IOs         CPU (s)       Recall@10
=============================================================================================
    10           2           70.72         6706.17        25599.00            9.81          668.29           66.52
    20           2           60.99        11898.02        22016.00           18.96          980.30           46.50
    30           2           44.41        23712.32       195452.00           28.11         1261.76           43.39
    40           2           33.59        38685.81       326683.00           37.23         1606.08           43.05
    50           2           26.99        52989.25       394442.00           46.21         1943.90           42.04
   100           2           13.76       124475.75       566027.00           90.94         3552.57           36.63
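
To sanity-check the Recall@10 column independently of the benchmark tool, here is a minimal sketch of how recall@k can be computed against brute-force ground truth. It assumes Euclidean (L2) distance and in-memory NumPy arrays; the distance metric used in the run above is not stated, and `brute_force_knn` and `recall_at_k` are hypothetical helper names, not DiskANN functions. The random data only stands in for the real dataset.

```python
import numpy as np


def brute_force_knn(base, queries, k):
    """Exact k nearest neighbors of each query under squared L2 distance."""
    # ||q - b||^2 = ||q||^2 - 2 q.b + ||b||^2, computed for all pairs at once.
    d2 = (
        (queries ** 2).sum(axis=1, keepdims=True)
        - 2.0 * queries @ base.T
        + (base ** 2).sum(axis=1)
    )
    # Sort each row by distance and keep the ids of the k closest points.
    return np.argsort(d2, axis=1)[:, :k]


def recall_at_k(result_ids, truth_ids, k):
    """Percentage of true top-k ids that appear in the returned top-k ids."""
    hits = sum(
        len(set(r[:k].tolist()) & set(t[:k].tolist()))
        for r, t in zip(result_ids, truth_ids)
    )
    return 100.0 * hits / (len(truth_ids) * k)


# Placeholder data with the same dimensionality as the dataset in this report.
rng = np.random.default_rng(0)
base = rng.standard_normal((1000, 512)).astype(np.float32)
queries = rng.standard_normal((20, 512)).astype(np.float32)

truth = brute_force_knn(base, queries, k=10)

# Comparing the ground truth with itself must give perfect recall.
print(recall_at_k(truth, truth, k=10))
```

If the ids returned by the index score much higher against ground truth computed this way than the numbers in the table, the ground-truth file or the distance metric passed to the benchmark would be worth checking first.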

Example Code

I can share a sample of the dataset for you to run on your side.

Dataset Description

100K vectors of 512 dimensions, as described above.

Error

No explicit error; see the results above. Recall drops as L increases, and QPS is low.

Your Environment