Closed jpountz closed 11 months ago
Lucene 9.8 is released, this PR is now ready from my point of view.
Thanks! How does the index reordering work?
This implements an algorithm called "Recursive graph bisection", which is the same algorithm that the pisa-0.8.2
engine uses. Here is the Lucene PR where it was implemented: https://github.com/apache/lucene/pull/12489.
This algorithm tries to reorder doc IDs to minimize the sum of the logs of gaps between consecutive doc IDs in postings. This helps make postings more compact, but this also has the interesting side-effect of clustering documents that have similar postings together, which in-turn helps conjunctions.
The original paper is good but I found this follow-up reproducibility study to be a better introduction to this algorithm.
That's pretty interesting. I think it would be nice to make this configurable
Do you have opinions on how to make this configurable? FWIW testing performance without reordering is as simple as replacing idx-reordered
with idx
in the Makefile target that runs queries. If we want to make it easier to test with and without doc ID reordering, I think I could easily split the Lucene 9.8 engine into lucene-9.8
and lucene-9.8-bp
or something like that.
For now I think it would be good to have lucene-9.8
and lucene-9.8-reorder
@PSeitz It's now split, I used the name lucene-9.8.0-bp
if you don't mind to be more explicit about what reordering is implemented (e.g. another option is to sort the index by a min_hash of the terms of each document).
I'm not too familiar with make
so there might be better ways of doing what I tried to do...
Thanks! There may be better ways with make
, but the duplication variant is unamibigious
This PR does a few changes:
Closes #44