microsoft / DiskANN

Graph-structured Indices for Scalable, Fast, Fresh and Filtered Approximate Nearest Neighbor Search
Other
1.02k stars 208 forks source link

[Question] Low Recall in Filtered-Disk ANN on Sift Dataset-1M with Uniform and Zipf Distributions #550

Open AnasAito opened 3 months ago

AnasAito commented 3 months ago

Issue Summary: I am experimenting with the Filtered-Disk ANN algorithm (the memory version) on the Sift dataset (1M). I followed the steps outlined in the markdown file to generate synthetic labels, build the index, and perform the search. However, I'm encountering an issue with low recall levels when using distributions other than random, such as the one_per_point (uniform distribution) and the Zipf distribution.

Expected Behavior: I expected the recall levels to be consistent across different label distributions, as suggested in the Filtered-Disk ANN paper. Specifically, I expected the recall to remain high even with distributions other than random.

Observed Behavior: When using the one_label_per_row and Zipf distributions, the recall levels are significantly lower compared to the random distribution. The only difference is that in the random setting we're dealing with 50% specificity, in the other cases the specificity is low (5%)

Steps to Reproduce:

I used the same parameters as described in the Filtered-Disk ANN paper. I have attached the Bash script to reproduce the experiment.

#!/bin/bash

function json_time {
  command="$@"
  echo "Executing $command"
  /usr/bin/time --quiet -o /home/anas.aitaomar/logs/time.log -a --format '{"command":"%C", "wallclock": %e, "user": %U, "sys": %S}' $command
  ret=$?
  if [ $ret -ne 0 ]; then
    echo "{\"command\": \"$command\", \"status_code\": $ret}" >> /home/anas.aitaomar/logs/time.log
  fi
}

rm -rf data
mkdir data
rm /home/anas.aitaomar/logs/time.log
touch /home/anas.aitaomar/logs/time.log
chmod 666 /home/anas.aitaomar/logs/time.log

if [ -d "build/home/anas.aitaomar/s" ]; then
    export BASE_PATH="build/home/anas.aitaomar/s"
else
    export BASE_PATH="build/apps"
fi

# var definition
INDEX_TYPE="filtered_vanama"
# INDEX_TYPE="stitched_vanama"

# LABELS_DISTRIBUTION="zipf"
# LABELS_DISTRIBUTION="random"
LABELS_DISTRIBUTION="one_per_point"

if [ "$INDEX_TYPE" = "filtered_vanama" ]; then
    R="96"
    L="90"
else
    R="32"
    L="100"
fi

R_STITCHED="64"
L_FILTER=$L
ALPHA="1.2"
L_SEARCH="10 100 600 650"

LABEL_COUNT="12"
LABEL_TO_SEARCH="1"

NEIGHBORS_COUNT="10"
THREADS_COUNT="48"

json_time $BASE_PATH/utils/fvecs_to_bin float sift/sift_base.fvecs data/sift_base.bin
json_time $BASE_PATH/utils/fvecs_to_bin float sift/sift_query.fvecs data/sift_query.bin

# generate labels
json_time $BASE_PATH/utils/generate_synthetic_labels  --num_labels $LABEL_COUNT --num_points 1000000  --output_file data/labels_base.txt --distribution_type $LABELS_DISTRIBUTION

# labels dist 
echo "----------------------------------------------------"
echo ""
json_time $BASE_PATH/utils/stats_label_data --labels_file data/labels_base.txt --universal_label 0
echo ""
echo "---------------------------------------------------"
# generate gt - label x(12)  
json_time $BASE_PATH/utils/compute_groundtruth_for_filters --data_type float --dist_fn l2 --base_file data/sift_base.bin --query_file data/sift_query.bin --gt_file data/sift_gt.bin --K $NEIGHBORS_COUNT --filter_label $LABEL_TO_SEARCH --label_file data/labels_base.txt 

if [ "$INDEX_TYPE" = "filtered_vanama" ]; then
    # build index 
    json_time $BASE_PATH/build_memory_index  --data_type float --dist_fn l2 --data_path data/sift_base.bin --index_path_prefix data/sift_filtered_index -R $R  --FilteredLbuild $L_FILTER --alpha $ALPHA --label_file data/labels_base.txt --universal_label 0 --num_threads $THREADS_COUNT
else
    # build stitched-index 
    json_time $BASE_PATH/build_stitched_index  --data_type float --data_path data/sift_base.bin --index_path_prefix data/sift_filtered_index -R $R -L $L --stitched_R $R_STITCHED --alpha $ALPHA --label_file data/labels_base.txt --universal_label 0 --num_threads $THREADS_COUNT
fi

# search 
json_time $BASE_PATH/search_memory_index  --data_type float --dist_fn l2 --index_path_prefix data/sift_filtered_index --query_file data/sift_query.bin --gt_file data/sift_gt.bin --filter_label $LABEL_TO_SEARCH -K $NEIGHBORS_COUNT -L $L_SEARCH --result_path data/filtered_search_results --num_threads $THREADS_COUNT

Results