[Question] Low Recall in Filtered-Disk ANN on Sift Dataset-1M with Uniform and Zipf Distributions

Issue Summary: I am experimenting with the Filtered-Disk ANN algorithm (the memory version) on the Sift dataset (1M). I followed the steps outlined in the markdown file to generate synthetic labels, build the index, and perform the search. However, I'm encountering an issue with low recall levels when using distributions other than random, such as the one_per_point (uniform distribution) and the Zipf distribution.

Expected Behavior: I expected the recall levels to be consistent across different label distributions, as suggested in the Filtered-Disk ANN paper. Specifically, I expected the recall to remain high even with distributions other than random.

Observed Behavior: When using the one_label_per_row and Zipf distributions, the recall levels are significantly lower compared to the random distribution. The only difference is that in the random setting we're dealing with 50% specificity, in the other cases the specificity is low (5%)

Steps to Reproduce:

I used the same parameters as described in the Filtered-Disk ANN paper. I have attached the Bash script to reproduce the experiment.

#!/bin/bash

function json_time {
  command="$@"
  echo "Executing $command"
  /usr/bin/time --quiet -o /home/anas.aitaomar/logs/time.log -a --format '{"command":"%C", "wallclock": %e, "user": %U, "sys": %S}' $command
  ret=$?
  if [ $ret -ne 0 ]; then
    echo "{\"command\": \"$command\", \"status_code\": $ret}" >> /home/anas.aitaomar/logs/time.log
  fi
}

rm -rf data
mkdir data
rm /home/anas.aitaomar/logs/time.log
touch /home/anas.aitaomar/logs/time.log
chmod 666 /home/anas.aitaomar/logs/time.log

if [ -d "build/home/anas.aitaomar/s" ]; then
    export BASE_PATH="build/home/anas.aitaomar/s"
else
    export BASE_PATH="build/apps"
fi

# var definition
INDEX_TYPE="filtered_vanama"
# INDEX_TYPE="stitched_vanama"

# LABELS_DISTRIBUTION="zipf"
# LABELS_DISTRIBUTION="random"
LABELS_DISTRIBUTION="one_per_point"

if [ "$INDEX_TYPE" = "filtered_vanama" ]; then
    R="96"
    L="90"
else
    R="32"
    L="100"
fi

R_STITCHED="64"
L_FILTER=$L
ALPHA="1.2"
L_SEARCH="10 100 600 650"

LABEL_COUNT="12"
LABEL_TO_SEARCH="1"

NEIGHBORS_COUNT="10"
THREADS_COUNT="48"

json_time $BASE_PATH/utils/fvecs_to_bin float sift/sift_base.fvecs data/sift_base.bin
json_time $BASE_PATH/utils/fvecs_to_bin float sift/sift_query.fvecs data/sift_query.bin

# generate labels
json_time $BASE_PATH/utils/generate_synthetic_labels  --num_labels $LABEL_COUNT --num_points 1000000  --output_file data/labels_base.txt --distribution_type $LABELS_DISTRIBUTION

# labels dist 
echo "----------------------------------------------------"
echo ""
json_time $BASE_PATH/utils/stats_label_data --labels_file data/labels_base.txt --universal_label 0
echo ""
echo "---------------------------------------------------"
# generate gt - label x(12)  
json_time $BASE_PATH/utils/compute_groundtruth_for_filters --data_type float --dist_fn l2 --base_file data/sift_base.bin --query_file data/sift_query.bin --gt_file data/sift_gt.bin --K $NEIGHBORS_COUNT --filter_label $LABEL_TO_SEARCH --label_file data/labels_base.txt 

if [ "$INDEX_TYPE" = "filtered_vanama" ]; then
    # build index 
    json_time $BASE_PATH/build_memory_index  --data_type float --dist_fn l2 --data_path data/sift_base.bin --index_path_prefix data/sift_filtered_index -R $R  --FilteredLbuild $L_FILTER --alpha $ALPHA --label_file data/labels_base.txt --universal_label 0 --num_threads $THREADS_COUNT
else
    # build stitched-index 
    json_time $BASE_PATH/build_stitched_index  --data_type float --data_path data/sift_base.bin --index_path_prefix data/sift_filtered_index -R $R -L $L --stitched_R $R_STITCHED --alpha $ALPHA --label_file data/labels_base.txt --universal_label 0 --num_threads $THREADS_COUNT
fi

# search 
json_time $BASE_PATH/search_memory_index  --data_type float --dist_fn l2 --index_path_prefix data/sift_filtered_index --query_file data/sift_query.bin --gt_file data/sift_gt.bin --filter_label $LABEL_TO_SEARCH -K $NEIGHBORS_COUNT -L $L_SEARCH --result_path data/filtered_search_results --num_threads $THREADS_COUNT

Results

one_per_point

Reading truthset file data/sift_gt.bin ...
Metadata: #pts = 10000, #dims = 10... 
L2: Using AVX2 distance computation DistanceL2Float
Resizing took: 0.0387484s
From graph header, expected_file_size: 379395480, _max_observed_degree: 96, _start: 123742, file_frozen_pts: 0
Loading vamana graph data/sift_filtered_index...done. Index has 1000000 nodes and 93848864 out-edges, _start is set to 123742
Identified 13 distinct label(s)
Num frozen points:0 _nd: 1000000 _start: 123742 size(_location_to_tag): 0 size(_tag_to_location):0 Max points: 1000000
Index loaded
Using 48 threads to search
Ls         QPS     Avg dist cmps  Mean Latency (mus)   99.9 Latency   Recall@10
=================================================================================
10    51225.34            296.26              869.38       24950.13       46.54
100    13521.95           1165.10             3472.06       30612.68       49.84
600     3233.68           4394.70            14768.81       55484.78       49.64
650     3143.47           4660.87            14896.70       55146.52       49.64
Done searching. Now saving results

Random

Metadata: #pts = 10000, #dims = 10... 
L2: Using AVX2 distance computation DistanceL2Float
Resizing took: 0.0366022s
From graph header, expected_file_size: 380582992, _max_observed_degree: 96, _start: 123742, file_frozen_pts: 0
Loading vamana graph data/sift_filtered_index...done. Index has 1000000 nodes and 94145742 out-edges, _start is set to 123742
Identified 13 distinct label(s)
Num frozen points:0 _nd: 1000000 _start: 123742 size(_location_to_tag): 0 size(_tag_to_location):0 Max points: 1000000
Index loaded
Using 48 threads to search
Ls         QPS     Avg dist cmps  Mean Latency (mus)   99.9 Latency   Recall@10
=================================================================================
10    43119.56            599.91             1056.97       29310.13       81.11
100     8915.00           2456.43             5341.05       30007.46       98.84
600     1743.98           8778.31            27424.71       92325.24       99.86
650     1612.75           9283.70            29650.47       90650.07       99.87
Done searching. Now saving results s

Questions: I'm trying to understand the cause of this low recall, is it because of the attribute distribution or linked directly to the specificity or maybe the set of parameters used. Thanks in advance.

microsoft / DiskANN

[Question] Low Recall in Filtered-Disk ANN on Sift Dataset-1M with Uniform and Zipf Distributions #550