jeancroy / FuzzySearch

:mag: Fast autocomplete suggestion engine using approximate string matching
MIT License

Index Structure #15

Closed jeancroy closed 6 years ago

jeancroy commented 6 years ago

Current Situation

The classic edit distance algorithm is O(N*M*K), where N is the number of items, M is the average item length, and K is the length of the search term.

With bit parallelism we are able to shave off the factor of K, which brings us closer to O(N*M) for a single-word query.

We still have to visit every character of every item!

We would like an index structure that can reduce N, that is, pre-select some good candidates on which we then do the slower but higher-quality scoring.
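
Roughly, in code (a conceptual sketch; score and index.lookup are placeholders, not FuzzySearch's API):

// Without an index, the expensive scorer visits every item:
function bruteForce(items, query, score) {
  return items.map(item => score(query, item)); // O(N*M) even with bit parallelism
}

// With an index, only pre-selected candidates reach the expensive scorer:
function withIndex(index, query, score) {
  const candidates = index.lookup(query); // hopefully much smaller than N
  return candidates.map(item => score(query, item));
}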

Index structure

We can improve on this using an index data structure.

Such structures basically amount to pre-generating a number of spelling errors we are willing to accept, and organizing the results in an efficient way.

Symmetric Delete.

Of the misspellings we can pre-compute, deletion is one of the easiest, in particular because it does not require knowledge of the alphabet the word exists in.

The symmetric delete algorithm applies the delete-variation computation to both the candidate and the search term.

3 out of 5 strategy.

One way to understand symmetric delete is that we are trying to match X out of the first Y characters of the word. To do so we generate every subset of X characters at a given delete-distance.

For example, the two-delete variations of "hello" and "hills" are:

// del2("hello")
[ 'llo', 'elo', 'ell', 'hlo', 'hll', 'heo', 'hel' ]

// del2("hills")
[ 'lls', 'ils', 'ill', 'hls', 'hll', 'his', 'hil' ]

They share the key hll, as they indeed match from a 3-out-of-5 standpoint.

// del2("jello")
[ 'llo', 'elo', 'ell', 'jlo', 'jll', 'jeo', 'jel' ]

"jello" and "hello" share the keys llo, elo, and ell, and are thus a better match than "hello" and "hills".
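
To make the key generation concrete, here is a minimal sketch of delete-variation generation (the function name is illustrative, not the actual implementation):

// Returns the unique subsequences of `word` obtained by deleting exactly `d` characters.
function deleteVariations(word, d) {
  let current = new Set([word]);
  for (let step = 0; step < d; step++) {
    const next = new Set();
    for (const w of current) {
      for (let i = 0; i < w.length; i++) {
        next.add(w.slice(0, i) + w.slice(i + 1)); // drop the i-th character
      }
    }
    current = next;
  }
  return [...current];
}

// deleteVariations("hello", 2)
// => [ 'llo', 'elo', 'ell', 'hlo', 'hll', 'heo', 'hel' ]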

Prefix and fallback

For longer words we simply take the first Y characters.

For shorter words we have fallback X out of Y strategies.

We also compute smaller-X strategies for long words. (This allows shorter search queries to match, and it is a difference between spell-check and autocomplete.)

Strategies

Following an "editorial choice" principle, we consider the following X out of Y strategies:

(XoY)
3o6, 3o5, 3o4, 3o3
2o4, 2o3
1o1

Note that, for example, 2o4 includes every variation included in 2o3 and more. So for each X, we only compute one XoY strategy, with the largest Y available.

1o1 indexes the word by its first letter, and 2o2 is not included because we only consider words longer than 3 characters.

Those rules are subject to change as we gain more feedback; for example, 3o6 might generate too many variations.

For the search term we compute a single X out of Y strategy, so that we use as many typed characters as possible.

Absolute maximum number of keys with the above strategies and an a-z alphabet: 26^3 + 26^2 + 26 = 18278 keys.
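
A sketch of how a word's index keys could be generated under these rules, reusing deleteVariations from the sketch above (the strategy table and helper names are illustrative):

// For each X we keep a single strategy, with the largest Y available (capped).
const STRATEGIES = [[3, 6], [2, 4], [1, 1]]; // [X, cap on Y]

function indexKeys(word) {
  const keys = new Set();
  for (const [x, yCap] of STRATEGIES) {
    const prefix = word.slice(0, Math.min(yCap, word.length)); // prefix rule for long words
    if (prefix.length < x) continue; // fallback: word too short for this X
    for (const key of deleteVariations(prefix, prefix.length - x)) keys.add(key);
  }
  return [...keys];
}

// For the search term, a single strategy: use as many typed characters as possible.
function queryKeys(term) {
  const x = Math.min(3, term.length);
  const yCap = x === 3 ? 6 : x === 2 ? 4 : 1;
  const prefix = term.slice(0, Math.min(yCap, term.length));
  return deleteVariations(prefix, prefix.length - x);
}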

Multi word

The multi-word strategy consists of splitting the input into words and looping the key-generation strategy over each of those words. All of the collected word parts point to the same index entry.

This means we'll have false positives, where an entry can match the query using a mix-and-match of word parts.

For now we accept this deficiency because it simplifies the index structure, and as a pre-selector we prefer to err toward false positives rather than false negatives.

A proper fix might involve two maps: key => word and word => entry.

This might be needed if each entry is a large corpus and the pre-selection step is rendered useless by word-part soup.

In which case:

For every word of the query:

word-part => word => filter for minimum quality.

Put those words together and retrieve entries:

word => entry => filter for minimum quality.
entry => refined score => filter for minimum quality.
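
A sketch of that two-map pipeline (all names are illustrative; the key generation and scorers are supplied by the caller):

// keyToWords:    Map(delete-key => Set(word))
// wordToEntries: Map(word => Set(entry))
function preselect(query, keyToWords, wordToEntries,
                   { wordKeys, wordScore, entryScore, minWord, minEntry }) {
  const entries = new Set();
  for (const qword of query.toLowerCase().split(/\s+/)) {
    // word-part => word => filter for minimum quality
    const words = new Set();
    for (const key of wordKeys(qword)) {
      for (const w of keyToWords.get(key) || []) {
        if (wordScore(qword, w) >= minWord) words.add(w);
      }
    }
    // word => entry
    for (const w of words) {
      for (const e of wordToEntries.get(w) || []) entries.add(e);
    }
  }
  // entry => refined score => filter for minimum quality
  return [...entries]
    .map(e => [e, entryScore(query, e)])
    .filter(([, score]) => score >= minEntry)
    .sort((a, b) => b[1] - a[1]);
}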

Additional reading

This was inspired by the blog post below; in particular, they make very bold speed claims and the algorithm is relatively easy to understand.

http://blog.faroo.com/2012/06/07/improved-edit-distance-based-spelling-correction/

Considerations.

JavaScript comes with a good built-in hashmap implementation, and we are taking advantage of that.

The data structure is mostly flat compared to, say, a radix tree; that can help with import/export and thus offset some of the preparation time.

As implemented, the index structure has some measure of quality on its own. It might be enough for certain uses (i.e. no need to fold it into FuzzySearch).

If this is implemented (and turned on), FuzzySearch will become less fuzzy. That may be a good thing; if not, it may still be a preferable trade-off compared to something like max_inner.

Implementation.

I've made a proof of concept of the above. https://gist.github.com/jeancroy/9187627975a2632a8f8e4aaae17d358e

The work of integrating something like this into FuzzySearch only makes sense if it is both much faster than the current implementation and fast enough for our needs.

jeancroy commented 6 years ago

Hi @aguynamedben, I think I have a way to speed up FuzzySearch with a more acceptable compromise than max_inner.

It is based on this idea: http://blog.faroo.com/2012/06/07/improved-edit-distance-based-spelling-correction/

And I have a proof of concept here. https://gist.github.com/jeancroy/9187627975a2632a8f8e4aaae17d358e

I can fold this into FuzzySearch as a way to pre-select entries whose scores will then be refined.

However, this only makes sense if we can make the proof of concept both much faster than FuzzySearch and fast enough for (y)our needs.

Would you mind benchmarking this on your particular dataset?

aguynamedben commented 6 years ago

This is awesome. Tonight I will port a benchmarking setup we use in development into FuzzySearch tooling so it’s easy to use and reference. It creates 100k fake records using Faker. I can also write some simple tests for “deep” searches that were failing on maxInners (especially with tagged search also being used).

I’ll focus on benchmarking and testing for now, and later I’ll take the time to grasp the algorithm. I’m excited. We’re also attempting to index large address books, and most of our records are documents with a first name and last name as the main part of the title. We use tagged search to find users from only certain sources (i.e. “source:twitter.com title:ben s” should pull up a document with a title of “Ben Standefer (@aguynamedben)” and a domain of twitter.com).

Rad, I’ll hack on it tonight and send something over. 🕺

aguynamedben commented 6 years ago

I created a project to test IndexStore, assuming I'll test it against the current implementation of FuzzySearch: https://github.com/aguynamedben/fuzzracer

Benchmark Output

Full output

Includes output for 100k, 500k, and 1M records (very interesting). Includes 10 sample searches for each dataset for quality checking.

Note: all times are in ms.

| Records | Average response time (1000 queries) | Average results length (1000 queries) |
| --- | --- | --- |
| 100k | 12.51 | 584.847 |
| 500k | 90.328 | 2281.039 |
| 1M | 151.343 | 3878.813 |

Initial Thoughts

Overall, this is awesome. I'll test it with a UI next to verify that it's working.

FWIW, the only other feature of FuzzySearch we really like and want to keep is the tagged search. Do you know if this algorithm would still be able to include tagged search?

I'm happy to keep helping, let me know what to do next! I can also help with tooling/modernizing things in FuzzySearch, or writing/porting tests and benchmarking methods into FuzzySearch.

Thanks again, you rock!

aguynamedben commented 6 years ago

Re: my comment on output_limit: after re-reading the description, I understand now that this is to select a subset of the large dataset as candidates for more expensive scoring. That makes sense now 👍 The resulting dataset from this process will be further refined by more expensive scoring (instead of using the max_inner method, which has problems with large datasets when max_inners is set too small and the data isn't optimally ordered). Cool.

jeancroy commented 6 years ago

Yeah, the idea could be 200ms => 12ms (pre-select using the store) + 12ms (all FuzzySearch options), and that is still an order of magnitude improvement.

> Do you know if this algorithm would still be able to include tagged search?

Right now the algorithm is blind to the fact that multiple words exist and multiple fields exist, and it is blind to any character after the 6th of a word. And that's A-OK for the pre-selector use I imagined.

Yes, it would be possible to add back features, but for now I want to try to reuse what is already built. A lot of FuzzySearch is the input/output stage.

But yeah, it's an interesting algorithm that could have its own spin-off project if I find a unique use for it.

> The number of results is very high for some searches

One thing to keep in mind is that the score computed here is a low-range integer, say 0-20, so I expect a lot of ties. Part of what a higher-quality score does is break those ties, so it makes sense to have a low cut-off here.

If the very high number of results is close to the original size, then we have failed as a pre-selector. In that case maxInner becomes an interesting option: we have just done the pre-sorting, which was the tough pill to swallow regarding maxInner.

jeancroy commented 6 years ago

This is now integrated into FuzzySearch; I'll call it a beta feature.

It is disabled by default. To use it, set:

options.use_index_store = true
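
For example (a hedged usage sketch; the option name comes from this thread, while the constructor and search shape assume the usual FuzzySearch setup):

var fuzzyhound = new FuzzySearch({
    source: records,        // your dataset
    keys: ["title"],        // fields to match against
    use_index_store: true   // enable the index-store pre-selector (beta)
});
var results = fuzzyhound.search("ben s");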

The autocomplete.html example has this feature as a checkbox (as well as 30k and 100k fakerjs name datasets).

Once enabled, you can expect the following:

Note 1:

The one "worst case scenario" is when query match almost everything. This can be controlled using the store_max_results & store_thresh options.

Note 2:

Building the index store can take a few seconds. In production we would like to avoid multi-second computations on the UI thread. Different work can be done to mitigate this, e.g. async batches, a web worker, or loading precomputed data.

aguynamedben commented 6 years ago

Awesome, thanks for working on this, I'll try it out tonight!