harsha-simhadri / big-ann-benchmarks

Framework for evaluating ANNS algorithms on billion scale datasets.
https://big-ann-benchmarks.com
MIT License

add hanns OOD solution #304

Closed AndrewHYu closed 3 months ago

AndrewHYu commented 3 months ago

Our OOD track solution consists of a Vamana index, a multi-scale spatial clustering index, and a layout-optimized quantization acceleration index. The retrieval process runs from coarse to fine. First, the Vamana index is used to quickly find the nearest clusters. Then, within these clusters, the quantization-accelerated index is used for fast distance comparisons to identify coarsely ranked candidates. Finally, SIMD instructions are used to re-rank these candidates, and the final results are returned.
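The three stages described above can be illustrated with a small, self-contained sketch. This is not the Hanns implementation: it assumes squared Euclidean distance, swaps the Vamana graph for a brute-force scan over centroids, uses randomly sampled points as centroids instead of real clustering, and uses plain 8-bit scalar quantization in place of the layout-optimized quantized index. All names (`CoarseToFineIndex`, `n_probe`, `n_reorder`) are hypothetical. The `n_probe` and `n_reorder` knobs may loosely correspond to the `tree=` and `reorder=` values in the parameter strings in this thread, but that mapping is a guess.

```python
import random


def l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(v, lo, scale):
    """8-bit scalar quantization (a stand-in for the real quantized index)."""
    return [min(255, max(0, round((x - lo) / scale))) for x in v]


def dequantize(code, lo, scale):
    return [c * scale + lo for c in code]


class CoarseToFineIndex:
    def __init__(self, data, n_clusters, seed=0):
        rng = random.Random(seed)
        self.data = data
        # Assumption: random points as centroids instead of real clustering.
        self.centroids = rng.sample(data, n_clusters)
        self.members = [[] for _ in range(n_clusters)]
        for i, v in enumerate(data):
            c = min(range(n_clusters), key=lambda j: l2(v, self.centroids[j]))
            self.members[c].append(i)
        flat = [x for v in data for x in v]
        self.lo = min(flat)
        self.scale = (max(flat) - self.lo) / 255 or 1.0
        self.codes = [quantize(v, self.lo, self.scale) for v in data]

    def search(self, q, k, n_probe, n_reorder):
        # Stage 1: pick the nearest clusters (brute force, not a Vamana graph).
        nearest = sorted(
            range(len(self.centroids)),
            key=lambda j: l2(q, self.centroids[j]),
        )[:n_probe]
        # Stage 2: rank cluster members by approximate (quantized) distance.
        candidates = [i for c in nearest for i in self.members[c]]
        coarse = sorted(
            candidates,
            key=lambda i: l2(q, dequantize(self.codes[i], self.lo, self.scale)),
        )[:n_reorder]
        # Stage 3: re-rank the survivors with exact distances.
        return sorted(coarse, key=lambda i: l2(q, self.data[i]))[:k]
```

Raising `n_probe` or `n_reorder` trades throughput for recall; when every cluster is probed and every candidate is re-ranked, the sketch degenerates to exact brute-force search.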

text2image-10M https://github.com/AndrewHYu/Hanns

magdalendobson commented 3 months ago

Thanks for your contribution. I am evaluating it now and will get back to you on how it goes!

magdalendobson commented 3 months ago

I ran with the downloaded index and got the following results:

hanns,"hanns,tree=27/40000,reorder=111",text2image-10M,10,53085.03723024023,0.0,1.2159347534179688e-05,5185752.0,0.0,0.0,ood,0.8774520000000001
hanns,"hanns,tree=27/40000,reorder=130",text2image-10M,10,51222.16584203003,0.0,1.2159347534179688e-05,5185752.0,0.0,0.0,ood,0.882422
hanns,"hanns,tree=32/40000,reorder=140",text2image-10M,10,46858.49102240073,0.0,1.2159347534179688e-05,5185752.0,0.0,0.0,ood,0.8944110000000001
hanns,"hanns,tree=32/40000,reorder=150",text2image-10M,10,46771.317990241405,0.0,1.2159347534179688e-05,5185752.0,0.0,0.0,ood,0.896185
hanns,"hanns,tree=34/40000,reorder=150",text2image-10M,10,45381.62378698972,0.0,1.2159347534179688e-05,5185752.0,0.0,0.0,ood,0.899572
hanns,"hanns,tree=34/40000,reorder=155",text2image-10M,10,45685.10712457384,0.0,1.2159347534179688e-05,5185752.0,0.0,0.0,ood,0.900311
hanns,"hanns,tree=36/40000,reorder=150",text2image-10M,10,44630.44910364101,0.0,1.2159347534179688e-05,5185752.0,0.0,0.0,ood,0.9026080000000001
hanns,"hanns,tree=37/40000,reorder=145",text2image-10M,10,44957.96616927795,0.0,1.2159347534179688e-05,5185752.0,0.0,0.0,ood,0.9031560000000001
hanns,"hanns,tree=38/40000,reorder=140",text2image-10M,10,44787.13982548163,0.0,1.2159347534179688e-05,5185752.0,0.0,0.0,ood,0.903562
hanns,"hanns,tree=42/40000,reorder=160",text2image-10M,10,41713.34961169815,0.0,1.2159347534179688e-05,5185752.0,0.0,0.0,ood,0.911723

These seem to agree with your posted figure. I am now running without the downloaded index. By the way, your index-building code seems to download a file called config.pb even when downloading is disabled. On inspection, it looks like it contains only parameters, but can you confirm that it doesn't contain any pre-computed index information?
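For anyone who wants to work with result rows like the ones above, here is a minimal parsing sketch. The rows carry no header in this thread, so the field meanings are assumptions: the second field is taken as the parameter string, the fifth as throughput (QPS), and the last as recall, which is consistent with the recall and QPS values reported later in the thread. The 0.9 recall cut-off and the helper names are illustrative only. Three of the ten rows are copied verbatim as sample input.

```python
import csv
import io

# Three rows copied from the results above (no header row is available).
RAW = """\
hanns,"hanns,tree=27/40000,reorder=111",text2image-10M,10,53085.03723024023,0.0,1.2159347534179688e-05,5185752.0,0.0,0.0,ood,0.8774520000000001
hanns,"hanns,tree=34/40000,reorder=155",text2image-10M,10,45685.10712457384,0.0,1.2159347534179688e-05,5185752.0,0.0,0.0,ood,0.900311
hanns,"hanns,tree=42/40000,reorder=160",text2image-10M,10,41713.34961169815,0.0,1.2159347534179688e-05,5185752.0,0.0,0.0,ood,0.911723
"""


def parse_rows(text):
    """Parse result rows; the csv module handles the quoted, comma-bearing field."""
    rows = []
    for fields in csv.reader(io.StringIO(text)):
        if not fields:
            continue
        rows.append({
            "params": fields[1],          # assumed: parameter string
            "qps": float(fields[4]),      # assumed: queries per second
            "recall": float(fields[-1]),  # assumed: recall
        })
    return rows


def best_at_recall(rows, min_recall):
    """Return the highest-QPS row whose recall meets the threshold, or None."""
    eligible = [r for r in rows if r["recall"] >= min_recall]
    return max(eligible, key=lambda r: r["qps"]) if eligible else None


best = best_at_recall(parse_rows(RAW), min_recall=0.9)
print(best["params"], best["qps"], best["recall"])
```

Splitting on commas by hand would break here because the parameter string itself contains commas inside quotes.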

AndrewHYu commented 3 months ago

Yes, it's the parameters for search.

magdalendobson commented 3 months ago

I was able to build the index from scratch and confirm that it builds within the time and memory limits. I got the following results:
2: hanns,tree=34/40000,reorder=150 0.899 46754.467
4: hanns,tree=42/40000,reorder=160 0.911 42331.832
6: hanns,tree=32/40000,reorder=140 0.894 48426.372
9: hanns,tree=36/40000,reorder=150 0.902 45949.345
12: hanns,tree=32/40000,reorder=150 0.895 47507.687
14: hanns,tree=27/40000,reorder=111 0.877 54890.437
15: hanns,tree=27/40000,reorder=130 0.882 53006.511
16: hanns,tree=38/40000,reorder=140 0.903 45162.090
20: hanns,tree=34/40000,reorder=155 0.899 46443.915
21: hanns,tree=37/40000,reorder=145 0.902 45388.283

These agree with the results you shared and with those I found using the pre-computed index. I will approve the merge and speak with the other admins about updating our official results. Great entry!
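The agreement between the two runs can be checked mechanically. The sketch below assumes each summary line has the form `N: params recall qps` (the column meaning is inferred, not stated in the thread) and compares two configurations from the from-scratch build against the corresponding figures from the downloaded-index run. The function names are hypothetical.

```python
import re

# (recall, qps) from the downloaded-index run, copied from earlier in the thread.
DOWNLOADED = {
    "hanns,tree=34/40000,reorder=150": (0.899572, 45381.62378698972),
    "hanns,tree=42/40000,reorder=160": (0.911723, 41713.34961169815),
}

# Matching lines from the from-scratch build.
SUMMARY = """\
2: hanns,tree=34/40000,reorder=150 0.899 46754.467
4: hanns,tree=42/40000,reorder=160 0.911 42331.832
"""

LINE = re.compile(r"^\d+:\s+(\S+)\s+([\d.]+)\s+([\d.]+)$")


def parse_summary(text):
    """Map each parameter string to (recall, qps); column order is assumed."""
    out = {}
    for line in text.splitlines():
        m = LINE.match(line.strip())
        if m:
            out[m.group(1)] = (float(m.group(2)), float(m.group(3)))
    return out


def compare(a, b):
    """For shared configs, return (absolute recall gap, relative QPS gap)."""
    report = {}
    for params in a.keys() & b.keys():
        (r1, q1), (r2, q2) = a[params], b[params]
        report[params] = (abs(r1 - r2), abs(q1 - q2) / q1)
    return report


for params, (d_recall, d_qps) in sorted(compare(DOWNLOADED, parse_summary(SUMMARY)).items()):
    print(f"{params}: recall gap {d_recall:.4f}, QPS gap {d_qps:.1%}")
```

For these two configurations the recall values differ by less than 0.001 and throughput by a few percent, which is the kind of run-to-run variation one would expect on the same hardware.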

arron2003 commented 2 months ago

Hi @AndrewHYu
Thanks for submitting.

I wonder if you can clarify the relationship between your submission and ScaNN? It looks like your submission loads a ScaNN index: https://github.com/harsha-simhadri/big-ann-benchmarks/blob/main/neurips23/ood/hanns/hanns.py#L23-L46

The config.pb file is also identical to that of the ScaNN submission:

diff <(curl https://hanns.obs.ap-southeast-1.myhuaweicloud.com/v2/config.pb) <(curl https://storage.googleapis.com/scann/big-ann-2023/ood/scann_config.pb)

@magdalendobson FYI.
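The same byte-for-byte check as the `diff` command above can be sketched in Python by hashing both downloads. The helper names are hypothetical, and the network call is shown only as a commented usage line so that the snippet runs offline.

```python
import hashlib
import urllib.request


def fetch(url, timeout=30):
    """Download a URL and return its body as bytes (requires network access)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def digest(data: bytes) -> str:
    """SHA-256 hex digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def same_content(a: bytes, b: bytes) -> bool:
    """True when both byte strings hash to the same SHA-256 digest."""
    return digest(a) == digest(b)


# Usage (not executed here because it needs network access):
#   hanns_cfg = fetch("https://hanns.obs.ap-southeast-1.myhuaweicloud.com/v2/config.pb")
#   scann_cfg = fetch("https://storage.googleapis.com/scann/big-ann-2023/ood/scann_config.pb")
#   print(same_content(hanns_cfg, scann_cfg))
```

Comparing digests rather than text avoids any problems with treating a binary protobuf file as lines.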

AndrewHYu commented 2 months ago

Hi @arron2003, thanks for the reminder and for your contributions. We used the ScaNN clustering method, and we found that it has many excellent designs that can improve performance and accuracy. Because some configuration items are reused, we use the config.pb file directly. We will update the README with details.

harsha-simhadri commented 2 months ago

@AndrewHYu Could you please share your name, affiliation and any collaborators on this code?