Open seanmacavaney opened 1 year ago
I'd expect that modifying the following topk_queue
method to consider DocId would solve this issue, but I am not sure about the efficiency implications, or if other places need to be changed as well.
[[nodiscard]] constexpr static auto
min_heap_order(entry_type const& lhs, entry_type const& rhs) noexcept -> bool
{
return lhs.first > rhs.first || lhs.second < rhs.second;
}
Whoops, actually more like:
return lhs.first == rhs.first ? lhs.second < rhs.second : lhs.first > rhs.first;
This is a tricky issue because of efficiency considerations.
But if the problem is the stability of of the final sort, how about we add this additional sort condition only at the very end? This, I believe, would still be reproducible, at least if we use the same algorithm, and add documents in the same order.
What do you think about this?
I would like to avoid adding another condition in the heap insert itself, because that might slow thinigs down unnecessarily.
That sounds fair to me; I agree that every statement matters during the query processing itself.
The one caveat is that if there happens to be a tied score at the end of the ranked list (e.g., rank 1000 and rank 1001 have the same scores), the document returned at rank 1000 is non-deterministic. This feels like a fair concession, though.
BTW, I have this PR pending: https://github.com/pisa-engine/pisa/pull/507 where I implement a custom pop-push operation instead of using std::pop_heap
.
We could also implement our own pop_push
(used only before the heap fills up), so that we have a fixed implementation across different compilation environments.
With those, at least the logic could be well-defined and documented, and it would be deterministic in the sense that if you run the same algorithm on the same data but different environments (say, me compiling and running with gcc/libstdc++ and you with clang/libc++, for example), the results should be the same.
But we wouldn't be able to guarantee that any algorithm would return the exact same results in that edge case of the same score around k
th position, because they don't have the same exact sequence of heap inserts.
Does that make sense or am I missing something important?
Sounds good to me! Thanks @elshize!
@seanmacavaney FYI, I opned a PR for this, if you'd like to give it a look.
Describe the bug Since
sort_heap
is not stable andtopk_queue
only sorts on score, the order of results that have the same scores can differ between runs. This is harmful for the repeatbility of results.To Reproduce Steps to reproduce the behavior:
Error message No error message.
Expected behavior I expect results to be ordered ascending by DocId as a secondary sort when scores result in ties.
Environment info Operating System: Ubuntu 22.04.1 Compiler name: gcc Compiler version: 9