quickwit-oss / search-benchmark-game

Search engine benchmark (Tantivy, Lucene, PISA, ...)
https://tantivy-search.github.io/bench/
MIT License
75 stars 36 forks source link

Add a Lucene 9.8 engine. #48

Closed jpountz closed 11 months ago

jpountz commented 11 months ago

This PR does a few changes:

Closes #44

jpountz commented 11 months ago

Lucene 9.8 is released, this PR is now ready from my point of view.

PSeitz commented 11 months ago

Thanks! How does the index reordering work?

jpountz commented 11 months ago

This implements an algorithm called "Recursive graph bisection", which is the same algorithm that the pisa-0.8.2 engine uses. Here is the Lucene PR where it was implemented: https://github.com/apache/lucene/pull/12489.

This algorithm tries to reorder doc IDs to minimize the sum of the logs of gaps between consecutive doc IDs in postings. This helps make postings more compact, but this also has the interesting side-effect of clustering documents that have similar postings together, which in-turn helps conjunctions.

The original paper is good but I found this follow-up reproducibility study to be a better introduction to this algorithm.

PSeitz commented 11 months ago

That's pretty interesting. I think it would be nice to make this configurable

jpountz commented 11 months ago

Do you have opinions on how to make this configurable? FWIW testing performance without reordering is as simple as replacing idx-reordered with idx in the Makefile target that runs queries. If we want to make it easier to test with and without doc ID reordering, I think I could easily split the Lucene 9.8 engine into lucene-9.8 and lucene-9.8-bp or something like that.

PSeitz commented 11 months ago

For now I think it would be good to have lucene-9.8 and lucene-9.8-reorder

jpountz commented 11 months ago

@PSeitz It's now split, I used the name lucene-9.8.0-bp if you don't mind to be more explicit about what reordering is implemented (e.g. another option is to sort the index by a min_hash of the terms of each document).

I'm not too familiar with make so there might be better ways of doing what I tried to do...

PSeitz commented 11 months ago

Thanks! There may be better ways with make, but the duplication variant is unamibigious