pisa-engine / pisa

PISA: Performant Indexes and Search for Academia
https://pisa-engine.github.io/pisa/book
Apache License 2.0
930 stars 64 forks source link

stability of document order when tied #508

Open seanmacavaney opened 1 year ago

seanmacavaney commented 1 year ago

Describe the bug Since sort_heap is not stable and topk_queue only sorts on score, the order of results that have the same scores can differ between runs. This is harmful for the repeatbility of results.

To Reproduce Steps to reproduce the behavior:

  1. Follow the steps to index and execute queries. Be sure to include a query that will result in a score tie for the ranking model. For instance, the query "chemical reactions" over the vaswani dataset.
  2. Observe the order of documents with tied scores.
  3. Repeat and re-observe several times.

Error message No error message.

Expected behavior I expect results to be ordered ascending by DocId as a secondary sort when scores result in ties.

Environment info Operating System: Ubuntu 22.04.1 Compiler name: gcc Compiler version: 9

seanmacavaney commented 1 year ago

I'd expect that modifying the following topk_queue method to consider DocId would solve this issue, but I am not sure about the efficiency implications, or if other places need to be changed as well.

    [[nodiscard]] constexpr static auto
    min_heap_order(entry_type const& lhs, entry_type const& rhs) noexcept -> bool
    {
        return lhs.first > rhs.first || lhs.second < rhs.second;
    }
seanmacavaney commented 1 year ago

Whoops, actually more like:

return lhs.first == rhs.first ? lhs.second < rhs.second : lhs.first > rhs.first;
elshize commented 1 year ago

This is a tricky issue because of efficiency considerations.

But if the problem is the stability of of the final sort, how about we add this additional sort condition only at the very end? This, I believe, would still be reproducible, at least if we use the same algorithm, and add documents in the same order.

What do you think about this?

I would like to avoid adding another condition in the heap insert itself, because that might slow thinigs down unnecessarily.

seanmacavaney commented 1 year ago

That sounds fair to me; I agree that every statement matters during the query processing itself.

The one caveat is that if there happens to be a tied score at the end of the ranked list (e.g., rank 1000 and rank 1001 have the same scores), the document returned at rank 1000 is non-deterministic. This feels like a fair concession, though.

elshize commented 1 year ago

BTW, I have this PR pending: https://github.com/pisa-engine/pisa/pull/507 where I implement a custom pop-push operation instead of using std::pop_heap.

We could also implement our own pop_push (used only before the heap fills up), so that we have a fixed implementation across different compilation environments.

With those, at least the logic could be well-defined and documented, and it would be deterministic in the sense that if you run the same algorithm on the same data but different environments (say, me compiling and running with gcc/libstdc++ and you with clang/libc++, for example), the results should be the same.

But we wouldn't be able to guarantee that any algorithm would return the exact same results in that edge case of the same score around kth position, because they don't have the same exact sequence of heap inserts.

Does that make sense or am I missing something important?

seanmacavaney commented 1 year ago

Sounds good to me! Thanks @elshize!

elshize commented 1 year ago

@seanmacavaney FYI, I opned a PR for this, if you'd like to give it a look.