stanford-futuredata / ColBERT

ColBERT: state-of-the-art neural search (SIGIR'20, TACL'21, NeurIPS'21, NAACL'22, CIKM'22, ACL'23, EMNLP'23)
MIT License
2.68k stars 355 forks source link

Fix issue #270: Duplicate search results when k is a high value #283

Closed paul7Junior closed 6 months ago

paul7Junior commented 6 months ago

This PR addresses the bug "Duplicate search results when k is a high value #270"

Based on what I can observe, the issue arises when global_approx_scores is empty, but nfiltered_docs still expects more pids than available.

Original Issue: in filter_pids_helper function in filter_pids.cpp file

for (int i = 0; i < nfiltered_docs; i++) {
        std::pair<float, int> score_and_pid = global_approx_scores.top();
        filtered_pids[i] = score_and_pid.second;
        if (!global_approx_scores.empty()) {
            global_approx_scores.pop();
        }
    }

Proposed solution:

    std::vector<int> filtered_pids;
    for (int i = 0; i < nfiltered_docs; i++) {
        if (global_approx_scores.empty()) {
            break;
        }
        std::pair<float, int> score_and_pid = global_approx_scores.top();
        global_approx_scores.pop();
        filtered_pids.push_back(score_and_pid.second);
    }

For the final run to filter pids, there is no condition to stop if global_approx_scores priority_queue is empty. If in the previous filtering operations the # of pids is less than nfiltered_docs then you'd keep looking for pids up to nfiltered_docs even tho there isnt enough pids to fill your final result.

Also consequence of this is you don't know beforehand the final size of your final_filtered_pids, we only know its maximum nfiltered_docs.

To address that, I removed the fixed sized assigned array and used vectors that are returned from filter_pids_helper function, its a bit more functional style programming, so eventually safer as well.

I welcome any feedback.

okhat commented 6 months ago

Man this is incredible, what a deep dive. This part of the code is probably most familiar to @santhnm2 , Keshav let me know if I should do the review

santhnm2 commented 6 months ago

LGTM, @okhat if you want to just do a quick check to make sure it works then we can merge