mikemccand / luceneutil

Various utility scripts for running Lucene performance tests
Apache License 2.0

Consider contributing to the search benchmark game #150

Open LifeIsStrange opened 3 years ago

LifeIsStrange commented 3 years ago

Hi @mikemccand! I recently discovered your blog, and I have to say I am a huge fan. I do not (yet) have decent knowledge of Lucene (in fact I have never used it), but it's a technology I use indirectly (through ELK/OpenSearch), and like you I have a great interest in the idea of progress. And what better way to quantify progress than performance optimization and benchmarking?

The (probably) most famous microbenchmark for comparing programming languages is the Benchmarks Game. In homage to its name, people have developed a Search Benchmark Game.

I recently stumbled upon luceneutil, and it seems to be a great set of benchmarks, useful for catching Lucene regressions and improvements. However, the Search Benchmark Game has complementary value, despite being (currently) simpler in the kinds of queries/datasets it tests, as it allows comparing Lucene's performance with competitors. Among those, the Rust library Tantivy stands out as being on average two times faster than Lucene!

I'm sure that Lucene has some specific, advanced optimizations that would make it faster for certain kinds of queries, and of course performance is only one criterion (along with user friendliness, correctness, and feature completeness) in the choice of a search library.

But Tantivy seems to be significantly and consistently faster for generic queries (at least on this dataset). Of course, being Rust-based is an advantage over Java, but it's possible that their advantage also comes from software/algorithmic optimizations that Lucene could take inspiration from (and conversely!)

I have neither the qualifications nor the time, but maybe you could try to contribute to the benchmark in two ways: 1) make an optimized Lucene variant, e.g.:

2) augment the benchmark game (adding indexing-time benchmarks, benchmarking facets, etc.)

Those are just suggestions from an enthusiast; feel free to ignore this issue! :)

jpountz commented 3 years ago

Tantivy actually looks pretty good. Rust allows it to be closer to the hardware than Lucene, and Tantivy seems to have some specialization for certain common cases, like pure disjunctions of term queries vs. disjunctions of arbitrary queries. It used to lack significant features, but it's catching up impressively quickly; e.g., it supports doc values and block-max WAND nowadays.
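For concreteness, here is a minimal sketch of that "pure disjunction of term queries" case on the Lucene side (the field and term names are made up). It assumes Lucene 8+, where top-k search can apply block-max WAND as long as an exact total hit count is not requested:

```java
import java.io.IOException;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

class DisjunctionSketch {
    // A pure disjunction of term queries: every clause is SHOULD, every clause
    // is a TermQuery. Since Lucene 8, search(query, n) counts hits exactly only
    // up to a threshold, which lets block-max WAND skip whole blocks of
    // documents whose maximum possible score cannot reach the current top n.
    static TopDocs topTen(IndexSearcher searcher) throws IOException {
        BooleanQuery query = new BooleanQuery.Builder()
            .add(new TermQuery(new Term("body", "search")), BooleanClause.Occur.SHOULD)
            .add(new TermQuery(new Term("body", "engine")), BooleanClause.Occur.SHOULD)
            .build();
        return searcher.search(query, 10);
    }
}
```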

We're very much looking forward to Panama and the Vector API; hopefully this will allow us to reduce the difference between Tantivy and Lucene. However, I don't think io_uring is something we can leverage; I fear it would make our (already complicated) search abstractions even more complicated (except maybe for our simpler, I/O-heavy APIs like stored fields?).

mikemccand commented 3 years ago

Thank you for all the enthusiasm @LifeIsStrange! I could not agree more that making benchmarking easier, more standardized, etc. is an excellent goal to allow true comparisons of the world's search engines. Tantivy (and Rust) look awesome! I love that Rust allows direct access to vectorized SIMD instructions, whereas in javaland we have to play silly games by carefully writing our Java sources just so, so that HotSpot successfully just-in-time compiles them to SIMD.
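To illustrate the kind of "just so" Java that HotSpot's superword (SLP) auto-vectorizer rewards, here is an illustrative sketch (mine, not code from Lucene): a plain counted loop over arrays with a branch-free, call-free body is the shape the JIT can reliably turn into SIMD instructions:

```java
class AutoVecSketch {
    // A simple counted loop, no branches or method calls in the body:
    // a good candidate for HotSpot's superword auto-vectorization.
    static void axpy(float a, float[] x, float[] y) {
        for (int i = 0; i < x.length; i++) {
            y[i] = a * x[i] + y[i];
        }
    }
}
```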

LifeIsStrange commented 3 years ago

Indeed, Tantivy mentions using:

SIMD integer compression when the platform/CPU includes the SSE2 instruction set
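As background, here is a scalar sketch (my own, not Tantivy's code) of the data this compression targets: postings lists are sorted doc IDs, which are delta-encoded into small gaps and then bit-packed, and the SIMD work vectorizes exactly these per-block loops:

```java
class DeltaSketch {
    // Turn sorted doc IDs into small non-negative gaps; small integers then
    // bit-pack into few bits per value, which is where SIMD pays off.
    static int[] deltaEncode(int[] sortedDocIds) {
        int[] gaps = new int[sortedDocIds.length];
        int prev = 0;
        for (int i = 0; i < sortedDocIds.length; i++) {
            gaps[i] = sortedDocIds[i] - prev;
            prev = sortedDocIds[i];
        }
        return gaps;
    }
}
```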

See this blog post where the Tantivy founder talks about this optimization. The author even mentioned you!

lucene core devs (yeah it is a very select club) who might be interested in a possible (unconfirmed) optimization opportunity.

But in fact, Java has supported a clean, explicit vector API since OpenJDK 16 (first incubated as JEP 338; second incubator: https://openjdk.java.net/jeps/414)!

An intro to Java SIMD

I have played with it a bit, and the API looks: 1) much simpler than classical low-level SIMD; 2) superior in that it allows cross-platform SIMD (works on ARM out of the box); and 3) since the JVM has a runtime, it can select modern instructions (AVX vs. SSE, SVE vs. NEON) and a higher SIMD lane length (I'm not sure about the last one, but maybe they can downscale AVX-512 to 256-bit AVX automatically? Edit: about the latter point, they do not downscale if we hardcode the lane length, but they can use the optimal length if we use SPECIES_PREFERRED). IMHO this makes the JVM much superior to compiled languages regarding SIMD programming, although the Vector API is not yet 100% feature complete for advanced uses (contrary to .NET), and the runtime advantages might be leverageable in compiled languages by using specific runtime libraries (e.g. maybe what the Intel MKL does?). Note, BTW, that TornadoVM is an extremely interesting runtime for Java (GPU auto-offloading, SIMD, AMX, other accelerators...).
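For a taste of the API, a minimal sketch (the class and method names below are mine, but the jdk.incubator.vector calls are the real incubator API, so it needs --add-modules jdk.incubator.vector). SPECIES_PREFERRED is the point made above: it lets the runtime pick the widest vector shape the CPU supports, e.g. 512-bit on AVX-512 or 128-bit on NEON:

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorSpecies;

class VectorApiSketch {
    // The runtime chooses the optimal lane count for the current CPU.
    static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    static void add(float[] a, float[] b, float[] out) {
        int i = 0;
        int upper = SPECIES.loopBound(a.length);
        // Vectorized main loop: processes SPECIES.length() floats per step.
        for (; i < upper; i += SPECIES.length()) {
            FloatVector va = FloatVector.fromArray(SPECIES, a, i);
            FloatVector vb = FloatVector.fromArray(SPECIES, b, i);
            va.add(vb).intoArray(out, i);
        }
        // Scalar tail for the leftover elements.
        for (; i < a.length; i++) {
            out[i] = a[i] + b[i];
        }
    }
}
```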

LifeIsStrange commented 3 years ago

Besides SIMD integer compression in Tantivy, another interesting software optimization is its take on Levenshtein automata. Levenshtein automata are key to the performance of fuzzy search queries, and fuzzy search is a common, important task. Indeed, @mikemccand, you wrote a great and fun blog post about it. Since the blog post, some improvements have been made, such as a simplification made by yourself. And there's something very strange: I cannot find this commit (2014) in the new Lucene repository despite it being the same file. Why?
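For reference, a small usage sketch of Lucene's own LevenshteinAutomata class (the class and method names are from Lucene's util.automaton package; exact signatures may vary across versions):

```java
import org.apache.lucene.util.automaton.Automaton;
import org.apache.lucene.util.automaton.CharacterRunAutomaton;
import org.apache.lucene.util.automaton.LevenshteinAutomata;

class FuzzySketch {
    public static void main(String[] args) {
        // Build a DFA accepting every string within edit distance 2 of "lucene",
        // counting a transposition as a single edit (Damerau-Levenshtein).
        Automaton a = new LevenshteinAutomata("lucene", true).toAutomaton(2);
        CharacterRunAutomaton matcher = new CharacterRunAutomaton(a);
        System.out.println(matcher.run("lucine"));   // true: one substitution
        System.out.println(matcher.run("tantivy"));  // false: too far away
    }
}
```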

Anyway, in 2015 someone on Hacker News wrote a blog post presenting a much simpler implementation of the automaton, although it might be less performant.

Then in 2019, the Tantivy founder wrote a blog post about Levenshtein automata. In it, he mentions the flaws of the 2015 blog post, but more importantly, performance-wise, he mentions a limitation of the Lucene implementation!

Rather than just browsing the reachable states of the parametric automaton, it shoves all of the parametric states and all of their transitions. This is hurting performance pretty badly, but I assume automaton creation is already fast enough for most user’s need.

So maybe Lucene could prune the states and try to be smarter about it, as explained in his next blog post.

Independently of that, this repository mentions a significant performance advantage over Lucene's DFA:

From 4.0 they were pre-calculating a Levenshtein automaton, that accepts only the terms within edit distance N. And then at query time they were intersecting that automaton with all the terms in the dictionary. This would mean that they save time, because they don't need to calculate the Levenshtein distance for every single term, but it seems they still need to intersect the pre-built Levenshtein automaton with every single term in the dictionary, whereas SymSpell has to calculate the Levenshtein distance only for 0.0042% to 0.016% of the vocabulary.
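For context, a tiny sketch (mine, not SymSpell's code) of the symmetric-delete idea behind that claim: index every term under all of its single-character deletions, then at query time generate the query's deletions and look those up, so only a small fraction of the vocabulary ever gets a real edit-distance check:

```java
import java.util.HashSet;
import java.util.Set;

class SymmetricDeleteSketch {
    // All strings obtained by deleting exactly one character (edit distance 1
    // deletes). SymSpell precomputes these for every dictionary term, so a
    // fuzzy lookup becomes a handful of hash-map probes instead of a scan.
    static Set<String> deletes(String term) {
        Set<String> out = new HashSet<>();
        for (int i = 0; i < term.length(); i++) {
            out.add(term.substring(0, i) + term.substring(i + 1));
        }
        return out;
    }
}
```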

Finally, this recent paper might be interesting.

So to conclude, it seems there are interesting optimizations to explore regarding Lucene's fuzzy search performance :)

Digression: you might find this blog post about finite state transducers and their various use cases interesting, although, as always, you were the first to introduce those notions ;)

@fulmicoton friendly tag since this comment is about some of your contributions