mattico / elasticlunr-rs

A partial port of elasticlunr to Rust. Intended to be used for generating compatible search indices.
Apache License 2.0
52 stars 23 forks source link

Faster English Porter's Stemmer #48

Closed thecodrr closed 2 months ago

thecodrr commented 1 year ago

This gets rid of the extremely slow regex based implementation & introduces an all new implementation taken from rust-stem. I had to make a few changes to make it compatible with elasticlunr.

How fast is it exactly?

Old:

create_index            time:   [3.4562 ms 3.4664 ms 3.4838 ms]

New:

create_index            time:   [509.29 µs 511.17 µs 513.40 µs]                         
                        change: [-85.031% -84.680% -84.282%] (p = 0.00 < 0.05)
                        Performance has improved.

All tests are passing without any changes to them or the outputs so this is 100% compatible. I bet this can be further improved but it's a good start.

mattico commented 1 year ago

I remember working on something similar a long time ago but came to the conclusion that it was already fast enough. Would anybody be bothered if it took 100ms to build the search index? I could be wrong.

We'd want to at least add the license header from rust-stem if we went this route.

thecodrr commented 1 year ago

Would anybody be bothered if it took 100ms to build the search index

No harm in having better performance if it's possible without any compromises. The stemmer was one main bottleneck which this PR aims to remove. Having better performance means you can do other more important stuff in the same amount of time. A few ms here and there quickly add up to create a very clunky & slow UX.

If we talk especially about index building then it doesn't matter if you make the Index only once but if you are doing it every few seconds (e.g. in case of generating docs) then it quickly starts to make a difference in your productivity. If we had incremental indexing that'd help alot in that use case but since the index is built from scratch every single time, it needs to be as fast as possible.

We'd want to at least add the license header from rust-stem if we went this route.

Of course. I will make the necessary changes ASAP.

Keats commented 1 year ago

That would be nice for Zola as well

mattico commented 2 months ago

Sorry about the delay, this fell off my radar.