Closed: arron2003 closed this issue 8 months ago
@harsha-simhadri
@maumueller as well.
I believe this may have implications for the correctness of the evaluation numbers for the streaming track.
@arron2003 I'll point @harsha-simhadri to it since he was running the streaming evaluation. In what sense did the results differ? Can you give a concrete example?
Using data_export.py is indeed the correct way to run the evaluation.
@arron2003 Please do file a PR and we can validate it with experiments. Did making this change alter the leaderboard results?
Just sent the PR.
Without the fix (clear_cache=False), you can see that the results from the first search step are being cached and reused; the log contains lines such as:
Found cached result
Found cached result
Found cached result
Found cached result
Recall: [0.88368, 0.88368, 0.88368, 0.88368, 0.88368,
...
Note that the number in res.csv matches the one in the leaderboard CSV: https://gist.github.com/arron2003/46658837a80aed5d9a61f49aef385b1f#file-before-log-L2417 https://github.com/arron2003/big-ann-benchmarks/blob/77c53b00267a7cb811ade04b26bbf3772a138811/neurips23/streaming/res_final_runbook_AzureD8lds_v5.csv#L6
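To make the failure mode concrete, here is a minimal, self-contained sketch of this kind of caching bug; the function and variable names are illustrative and not the actual code in benchmark/plotting/:

# Minimal sketch of the caching pattern described above; names are
# illustrative, not the actual identifiers in benchmark/plotting/.

def knn_recall(true_nn, run_nn, metrics_cache, count=2):
    """Fraction of the true top-`count` neighbours returned by the run."""
    if "knn" in metrics_cache:
        # A value stored by an earlier step is returned unchanged, even
        # though true_nn / run_nn belong to a different streaming step now.
        print("Found cached result")
        return metrics_cache["knn"]
    hits = sum(len(set(t[:count]) & set(r[:count]))
               for t, r in zip(true_nn, run_nn))
    metrics_cache["knn"] = hits / (count * len(true_nn))
    return metrics_cache["knn"]

# Toy "streaming" run: two search steps with different ground truth.
steps = [
    ([[1, 2], [3, 4]], [[1, 2], [3, 4]]),   # step 1: perfect recall
    ([[5, 6], [7, 8]], [[1, 2], [3, 4]]),   # step 2: true recall is 0.0
]

shared_cache = {}   # one cache reused across steps, as with clear_cache=False
print([knn_recall(gt, run, shared_cache) for gt, run in steps])  # [1.0, 1.0]

Step 2 silently reports step 1's number, which is exactly the constant-recall pattern in the log excerpt above.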
With the fix (clear_cache=True), recall is computed at every step; the log contains lines such as:
Computing knn metrics
Warning: 5/10000 queries contained ties accounted for in recall
Recall: [0.88368, 0.8705299999999999, 0.8069200000000001, 0.8131299999999999, 0.81701, 0.81763,
...
I believe this explains why the recall is very high in many submissions. With the fix, the resulting numbers are quite different from the original leaderboard CSV. For example, "diskann(('R50_L50', {'Ls': 70, 'T': 8}))" now comes out at 0.72145 instead of 0.88368. https://gist.github.com/arron2003/46658837a80aed5d9a61f49aef385b1f#file-after-log-L4020-L4022
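As a rough illustration of why the summary figure moves so much (this assumes the exported figure aggregates per-step recalls, e.g. by averaging, which has not been verified against data_export.py; only the recall values quoted above are taken from the logs):

# Rough illustration only: assumes the exported figure averages per-step
# recalls; the exact aggregation in data_export.py may differ.
cached   = [0.88368] * 6                        # every step echoes step 1
per_step = [0.88368, 0.87053, 0.80692,
            0.81313, 0.81701, 0.81763]          # first recalls from the fixed run

mean = lambda xs: sum(xs) / len(xs)
print(round(mean(cached), 5))    # 0.88368 -- the stale figure
print(round(mean(per_step), 5))  # 0.83482 -- already lower over just six steps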
So I am afraid the whole leaderboard needs to be re-computed.
Thanks for the great contribution!
@harsha-simhadri Could you recompute the numbers using the PR? I don't have the raw hdf5 result files available. (I'm fine with you merging #280 as well; the debug output may actually be useful.)
When evaluating on the streaming track by running python data_export.py --out res.csv, I noticed a lot of "Found cached result" warnings. Looking at benchmark/plotting/utils.py, it seems that the recall results are being cached across the steps of the streaming evaluation. Changing benchmark/plotting/utils.py on L160 to use clear_cache=True solves the issue for me. I can send a PR if needed.
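For concreteness, a minimal sketch of what the change amounts to; apart from clear_cache, which is the flag discussed in this issue, the names below are placeholders rather than the actual code on L160 (see the PR for the real diff):

# Sketch only: `clear_cache` is the flag discussed in this issue; the function
# name and cache layout are placeholders, not the real code at
# benchmark/plotting/utils.py:160.
def get_metrics_cache(run_metrics, clear_cache=False):
    """Return the per-run metrics dict, optionally dropping stale entries."""
    if clear_cache:
        run_metrics.clear()   # force recall to be recomputed for this step
    return run_metrics

stale = {"knn": 0.88368}                  # pretend an earlier step already ran
print(get_metrics_cache(dict(stale)))                     # {'knn': 0.88368}
print(get_metrics_cache(dict(stale), clear_cache=True))   # {}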
BTW, is data_export.py the recommended way to run the streaming track evaluation?