harsha-simhadri / big-ann-benchmarks

Framework for evaluating ANNS algorithms on billion scale datasets.
https://big-ann-benchmarks.com
MIT License

Add ScaNN solutions in big-ann-benchmarks2023 #286

Closed arron2003 closed 1 month ago

arron2003 commented 3 months ago

ScaNN is an open source package and the results are obtained using the 1.3.1 version.

Technical details are described in our papers:

  1. "SOAR: Improved Indexing for Approximate Nearest Neighbor Search" NeurIPS2023 https://openreview.net/pdf?id=QvIvWMaQdX
  2. "Accelerating Large-Scale Inference with Anisotropic Vector Quantization" ICML2020 http://proceedings.mlr.press/v119/guo20h.html

Note that we re-ran all algorithms for the streaming and OOD tracks, and we are happy to provide a home for all the hdf5 files.

arron2003 commented 3 months ago

@harsha-simhadri @maumueller for review. Thanks!

arron2003 commented 3 months ago

Updated the streaming results from Pinecone and Zilliz. Previously some of the runs timed out, because the submitted configs placed all configurations in query-args instead of separate run-groups, which makes the whole run-group subject to the 1-hour time limit.
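For reference, a minimal sketch of the two config layouts. The key names follow the framework's YAML config format, but the algorithm parameters and values here are illustrative, not the actual submitted configs:

```yaml
# Timeout-prone layout: every query-arg shares one run-group, so the whole
# sweep must finish within a single 1-hour run-group budget.
run-groups:
  base:
    args: |
      [{"tree_size": 40000}]
    query-args: |
      [{"leaves": 27}, {"leaves": 35}, {"leaves": 42}]
---
# Safer layout: one run-group per configuration, each with its own 1-hour budget.
run-groups:
  leaves-27:
    args: |
      [{"tree_size": 40000}]
    query-args: |
      [{"leaves": 27}]
  leaves-35:
    args: |
      [{"tree_size": 40000}]
    query-args: |
      [{"leaves": 35}]
```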

maumueller commented 3 months ago

This looks good to me.

arron2003 commented 3 months ago

A gentle ping :)

Let me know if there are issues that need to be addressed.

ingberam commented 3 months ago

@arron2003 nice contribution. A few questions: how exactly did you run everything? Is this on the official Azure machine or on a similar GCP one? (If it is not the exact machine, then the OOD results, for instance, are not quite comparable.)

Re the streaming track: the results you show are quite different from the ones @harsha-simhadri and I got. We can't merge this without understanding what is going on there.

I have a proposal - can you issue a separate PR for OOD and Streaming so that we can discuss both issues separately? For OOD, I can try and reproduce your result later this week.

arron2003 commented 3 months ago
  1. Everything is run on an Azure "Standard D8lds v5" machine. Let me know if you run into any issues reproducing the results.

  2. The change in the streaming results is due to #279 - previously, because of a recall-caching bug, the recall of the first step was reused for all later steps, which is why everyone had very high recall (99%+). After that was fixed in #280, the streaming results changed quite drastically.
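A minimal illustration of the bug class described in point 2 (this is not the actual framework code, just a sketch of the failure mode): a cache keyed too coarsely returns the first step's recall for every later step of a streaming run.

```python
# Shared cache used by both variants below.
_recall_cache = {}

def recall_buggy(dataset, step, compute):
    # BUG: the cache key omits the step, so the value computed at the
    # first step is returned for every subsequent step of the run.
    if dataset not in _recall_cache:
        _recall_cache[dataset] = compute(step)
    return _recall_cache[dataset]

def recall_fixed(dataset, step, compute):
    # FIX: include the step in the cache key, so each streaming step
    # gets its recall computed (and cached) independently.
    key = (dataset, step)
    if key not in _recall_cache:
        _recall_cache[key] = compute(step)
    return _recall_cache[key]
```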

Regarding separating OOD & Streaming - sounds good to me. I will move the third commit to a separate PR.

arron2003 commented 3 months ago

Moved the third commit to #288.

Please go ahead to review the OOD part. Thanks!

arron2003 commented 2 months ago

For posterity, the hdf5 files are temporarily located here:

OOD hdf5 files
Streaming hdf5 files

ingberam commented 2 months ago

@arron2003 I have good news and bad news:

For a pre-built index, I was able to successfully reproduce your results on the standard eval machine:

scann,"ScaNN,tree=27/40000,AH2,reorder=140",text2image-10M,10,49132.05737468046,0.0,3.0040740966796875e-05,5178984.0,0,0,ood,0.8800800000000001
scann,"ScaNN,tree=35/40000,AH2,reorder=150",text2image-10M,10,41836.628535320415,0.0,2.7894973754882812e-05,5194068.0,0,0,ood,0.90015
scann,"ScaNN,tree=42/40000,AH2,reorder=160",text2image-10M,10,37371.3870150347,0.0,2.5987625122070312e-05,5194060.0,0,0,ood,0.9140330000000001

However, when I tried to actually build the index (setting "download"=false in the config file), I hit an OOM on the eval machine:

2024-04-14 15:14:25,641 - annb.37e72b66c572 - ERROR - Child process for container 37e72b66c572 returned exit code 137 with message

According to the message in the code ("... use a higher RAM VM"), this is expected. The problem is that the competition generally requires the index to be built on the eval machine (with 16 GB of RAM). Any chance you can make the training work within the memory limits?
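As background on the error above, exit code 137 decodes as 128 + 9, i.e. the container was killed with SIGKILL, which is the usual signature of the kernel OOM-killer on a memory-limited machine:

```shell
# Docker reports 128 + signal number when a container dies from a signal.
exit_code=137
signal=$((exit_code - 128))
kill -l "$signal"    # prints the signal name: KILL

# To confirm an OOM kill on a finished container, Docker also records it:
#   docker inspect <container> --format '{{.State.OOMKilled}}'
```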

@harsha-simhadri @maumueller thoughts on the memory issue? Also, @harsha-simhadri, I'll let you comment on the OOD issue (the evaluation bug etc.).

arron2003 commented 2 months ago

@ingberam thanks for confirming the results on the Azure machine.

Regarding using a pre-built index - I think this should not be a huge concern, since some of the existing submissions also use a downloaded index.

arron2003 commented 2 months ago

@maumueller @harsha-simhadri - any thoughts? I think this PR should be OK to merge as is, since the results have already been confirmed on the Azure D8lds v5?

arron2003 commented 2 months ago

Another ping

maumueller commented 2 months ago

@ingberam It seems to me @arron2003 has a good case here, pointing at submissions that don't even allow for reproduction of the results.

@arron2003: It was my understanding that you also built the index on the Azure machine?

arron2003 commented 2 months ago

The pre-built index was trained on a higher-RAM VM - mostly we haven't prioritized memory efficiency (streaming reads of fvecs and streaming serialization) to fit into 16 GB, but that's definitely doable.

For this PR, though, I think the current status should be acceptable, since (the same kind of) pre-built evaluation index has been used for other submissions.

ingberam commented 2 months ago

Re the OOD submission - I am fine with merging the scann submission, and with the actual results (that I reproduced). I will leave it to @harsha-simhadri to decide whether the scann submission should have an asterisk saying it needs more RAM than the others to build (Harsha needs to approve the website updates anyway and I don't have permissions).

Re the streaming solution - I was not in the loop regarding the update to the evaluation framework, so I'll leave that to Harsha as well. One thing that looks odd is that the puck submissions all have very low recall now (<0.1), whereas before they were better than all the rest. Maybe something went wrong with these specific runs?

arron2003 commented 2 months ago

@harsha-simhadri FYI.

Changed OOD training to use 8M datapoints and insert the remaining 2M, in order to fit within the 16 GB RAM restriction.
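The memory-saving scheme can be sketched as follows. This is a generic illustration, not the actual ScaNN submission code: `index_factory` and `insert` are hypothetical stand-ins for the real training and insertion APIs.

```python
import numpy as np

def build_within_ram(points, train_fraction=0.8, index_factory=None, n_batches=10):
    """Train on a subsample, then stream-insert the rest.

    For text2image-10M with train_fraction=0.8 this corresponds to
    training on 8M points and inserting the remaining 2M, so the
    memory-hungry training pass never sees the full dataset.
    """
    n_train = int(len(points) * train_fraction)
    # Training pass: build the index structure from the subsample only.
    index = index_factory(points[:n_train])
    # Insertion pass: add the held-out points in small batches to keep
    # peak memory bounded.
    for batch in np.array_split(points[n_train:], n_batches):
        index.insert(batch)
    return index
```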

ingberam commented 2 months ago

Hi, I confirm that with the new config the index gets built successfully.

Here are the results that I see (it took about 10 hours on the standard Azure machine):

scann,"ScaNN,tree=27/40000,AH2,reorder=140",text2image-10M,10,47844.74694997586,0.0,12088.404091835022,6575684.0,0,0,ood,0.8831709999999999
scann,"ScaNN,tree=35/40000,AH2,reorder=150",text2image-10M,10,39796.43992847787,0.0,12067.16537451744,6560288.0,0,0,ood,0.89968
scann,"ScaNN,tree=42/40000,AH2,reorder=160",text2image-10M,10,39665.922519551124,0.0,12043.357325553894,6572808.0,0,0,ood,0.910122

There are two things here: first, the result is a bit slower than before (might be due to noise or some other reason), and second, the recall is now slightly below the 0.9 bar. I don't think the difference between 0.8997 and 0.9 is material at all, but the top results are generated automatically by a script that would filter this result out.

I propose the following: @arron2003, can you add a few more query-args points to the plot? Formally, each entry is allowed up to 10. That would show whether the QPS drop was by chance, and will also most likely yield a point with recall just slightly above 0.9.
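The filtering behavior being worked around can be sketched like this (a hypothetical stand-in for the leaderboard script, not its actual code): operating points below the recall bar are dropped, so a run at 0.8997 disappears while 0.9006 survives.

```python
RECALL_BAR = 0.90

def operating_points(rows, bar=RECALL_BAR):
    """Keep only (config, qps, recall) rows that clear the recall bar."""
    return [r for r in rows if r[2] >= bar]
```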

arron2003 commented 2 months ago
  1. I am pretty sure that the drop in QPS is due to measurement noise on the cloud VM. I am using "South Central US (Zone 3)" FWIW.
  2. The drop in recall is in the fourth decimal place, so this may be due to randomness in training the tree.

I have just changed the reordering number from 150 to 155 to allow slightly higher recall, to make sure we have a safe margin above 90% (even with sampling during training). Below is the result I expect:

scann,"ScaNN,tree=35/40000,AH2,reorder=155",text2image-10M,10,41528.078976443685,0.0,12210.711037635803,6446548.0,0,0,ood,0.9006

arron2003 commented 2 months ago

Actually, we just updated the ScaNN version on https://pypi.org/project/scann/ so the recall is even better :)

  1. Updated the Dockerfile - please rerun the installation with python3.10 install.py --neurips23track ood --algorithm scann
  2. Added more query args datapoints. Below are the expected results:
    algorithm,parameters,dataset,count,qps,distcomps,build,indexsize,mean_ssd_ios,mean_latency,track,recall/ap
    scann,"ScaNN,tree=34/40000,AH2,reorder=155",text2image-10M,10,43764.153975216715,0.0,11931.600163459778,6682692.0,0,0,ood,0.900396
    scann,"ScaNN,tree=35/40000,AH2,reorder=150",text2image-10M,10,42507.63490838009,0.0,11931.600163459778,6682692.0,0,0,ood,0.9010999999999999
    scann,"ScaNN,tree=35/40000,AH2,reorder=155",text2image-10M,10,42130.863603028665,0.0,11931.600163459778,6682692.0,0,0,ood,0.901859
    scann,"ScaNN,tree=36/40000,AH2,reorder=150",text2image-10M,10,42312.46164008728,0.0,11931.600163459778,6682692.0,0,0,ood,0.902675
    scann,"ScaNN,tree=37/40000,AH2,reorder=145",text2image-10M,10,42563.98037124791,0.0,11931.600163459778,6682692.0,0,0,ood,0.9032030000000001
    scann,"ScaNN,tree=38/40000,AH2,reorder=140",text2image-10M,10,42268.06450884723,0.0,11931.600163459778,6682692.0,0,0,ood,0.9036339999999999

ingberam commented 2 months ago

Confirming the following result on a fresh index build on the standard 16GB machine:

scann,"ScaNN,tree=27/40000,AH2,reorder=140",text2image-10M,10,48296.886642618236,0.0,12124.119821548462,6587528.0,0,0,ood,0.8840809999999999
scann,"ScaNN,tree=34/40000,AH2,reorder=155",text2image-10M,10,43372.02235782551,0.0,12131.287629842758,6612140.0,0,0,ood,0.899724
scann,"ScaNN,tree=35/40000,AH2,reorder=150",text2image-10M,10,41897.94105624706,0.0,12131.287629842758,6612140.0,0,0,ood,0.9004429999999999
scann,"ScaNN,tree=35/40000,AH2,reorder=155",text2image-10M,10,41987.20634235152,0.0,12131.287629842758,6612140.0,0,0,ood,0.9012399999999999
scann,"ScaNN,tree=36/40000,AH2,reorder=150",text2image-10M,10,42285.962239376444,0.0,12131.287629842758,6612140.0,0,0,ood,0.901969
scann,"ScaNN,tree=37/40000,AH2,reorder=145",text2image-10M,10,42854.013476886255,0.0,12131.287629842758,6612140.0,0,0,ood,0.90249
scann,"ScaNN,tree=38/40000,AH2,reorder=140",text2image-10M,10,42326.04841208092,0.0,12131.287629842758,6612140.0,0,0,ood,0.902833
scann,"ScaNN,tree=42/40000,AH2,reorder=160",text2image-10M,10,38892.123997724135,0.0,12105.756680965424,6689552.0,0,0,ood,0.911592

In order to move forward and merge everything related to OOD, let's do the following. @arron2003, can you please:

  1. update neurips23/ongoing_leaderboard/ood/res_public_queries_AzureD8lds_v5.csv with these results, and update the best scann result of 42,854 QPS in the operating_points file
  2. update the plot with the new data points for scann

arron2003 commented 2 months ago

Updated. Now this should be ready to merge. Thanks for confirming the results!