harsha-simhadri / big-ann-benchmarks

Framework for evaluating ANNS algorithms on billion-scale datasets.
https://big-ann-benchmarks.com
MIT License

Possible bug in streaming evaluation code? #279

Closed arron2003 closed 4 months ago

arron2003 commented 4 months ago

When evaluating on the streaming track by running python data_export.py --out res.csv, I noticed many "Found cached result" warnings. Looking at benchmark/plotting/utils.py, it seems that the recall results are being cached across the steps of the streaming evaluation.

Changing benchmark/plotting/utils.py at L160 from

                assert len(true_nn_across_steps) == len(run_nn_across_steps)
                for (true_nn, run_nn) in zip(true_nn_across_steps, run_nn_across_steps):
                    v += metric["function"](true_nn, run_nn, metrics_cache, properties)
                v /= len(run_nn_across_steps)

to

                assert len(true_nn_across_steps) == len(run_nn_across_steps)
                for (true_nn, run_nn) in zip(true_nn_across_steps, run_nn_across_steps):
                    # drop the stale cached result so recall is recomputed for this step
                    if 'knn' in metrics_cache:
                        del metrics_cache['knn']
                    v += metric["function"](true_nn, run_nn, metrics_cache, properties)
                v /= len(run_nn_across_steps)

solves the issue for me. I can send a PR if needed.
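
For illustration, here is a minimal, self-contained sketch of the failure mode (toy data and hypothetical names, not the actual benchmark code): because the cache is keyed only by 'knn', with no notion of which streaming step is being scored, every step after the first silently returns the first step's recall.

    def toy_recall(true_nn, run_nn):
        # fraction of the true neighbours that the run actually returned
        return len(set(true_nn) & set(run_nn)) / len(true_nn)

    def knn_metric(true_nn, run_nn, metrics_cache):
        # mimics the cached metric: the cache key ignores the streaming step
        if 'knn' in metrics_cache:
            print("Found cached result")
            return metrics_cache['knn']
        metrics_cache['knn'] = toy_recall(true_nn, run_nn)
        return metrics_cache['knn']

    cache = {}
    steps = [([1, 2, 3, 4], [1, 2, 3, 9]),  # step 1: true recall 0.75
             ([5, 6, 7, 8], [5, 0, 0, 0])]  # step 2: true recall 0.25
    print([knn_metric(t, r, cache) for t, r in steps])  # [0.75, 0.75] -- step 2 reuses step 1's value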

BTW, is data_export.py the recommended way to run streaming track evaluation?

arron2003 commented 4 months ago

@harsha-simhadri

arron2003 commented 4 months ago

@maumueller as well.

I believe this may affect the correctness of the evaluation numbers for the streaming track.

maumueller commented 4 months ago

@arron2003 I'll point @harsha-simhadri to it since he was running the streaming evaluation. In what sense did the results differ? Can you give a concrete example?

Using data_export.py is indeed the correct way to run the evaluation.

harsha-simhadri commented 4 months ago

@arron2003 Please do file a PR and we can validate it with experiments. Did making this change alter the leaderboard?

arron2003 commented 4 months ago

Just sent the PR.

Without the fix (clear_cache=False), you can see that the first search step's result is cached and reused; the log contains lines such as:

Found cached result
Found cached result
Found cached result
Found cached result
Recall:  [0.88368, 0.88368, 0.88368, 0.88368, 0.88368,
...

Note that the number in the res.csv matches that from the leaderboard csv: https://gist.github.com/arron2003/46658837a80aed5d9a61f49aef385b1f#file-before-log-L2417 https://github.com/arron2003/big-ann-benchmarks/blob/77c53b00267a7cb811ade04b26bbf3772a138811/neurips23/streaming/res_final_runbook_AzureD8lds_v5.csv#L6

With the fix (clear_cache=True), recall is computed at every step, and the log contains lines such as:

Computing knn metrics
Warning: 5/10000 queries contained ties accounted for in recall
Recall:  [0.88368, 0.8705299999999999, 0.8069200000000001, 0.8131299999999999, 0.81701, 0.81763, 
...

I believe this explains why the recall is very high in many submissions. With the fix, the resulting numbers are quite different from the original leaderboard csv. For example, "diskann(('R50_L50', {'Ls': 70, 'T': 8}))" now becomes 0.72145 instead of 0.88368. https://gist.github.com/arron2003/46658837a80aed5d9a61f49aef385b1f#file-after-log-L4020-L4022

So I am afraid the whole leaderboard needs to be re-computed.
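
To make the effect on the averaged metric concrete, here is a small sketch using the first few per-step values from the logs above (both lists are truncated; the variable names are mine): with the stale cache the average collapses to the first step's recall, while recomputing per step gives a noticeably lower mean.

    # first few per-step recalls from the logs above (truncated lists)
    cached_recalls     = [0.88368, 0.88368, 0.88368, 0.88368, 0.88368]
    recomputed_recalls = [0.88368, 0.87053, 0.80692, 0.81313, 0.81701, 0.81763]

    print(sum(cached_recalls) / len(cached_recalls))          # 0.88368 -- what the old leaderboard reports
    print(sum(recomputed_recalls) / len(recomputed_recalls))  # already lower; over all steps it drops to 0.72145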

maumueller commented 4 months ago

Thanks for the great contribution!

@harsha-simhadri Could you recompute the numbers using the PR? I don't have the raw hdf5 result files available. (I'm fine with you merging #280 as well; maybe the debug output is actually useful.)