jeancroy / FuzzySearch

:mag: Fast autocomplete suggestion engine using approximate string matching

Implement max_inners argument for large datasets #14

Open aguynamedben opened 6 years ago

aguynamedben commented 6 years ago

This mirrors the maxInners argument in fuzzaldrin-plus.

@jeancroy guided me on the implementation here: #7

@jeancroy, the only thing I think could improve this would be moving the inserted for-loop bail-out higher up, but it also kind of makes sense to have it down near what I assume is the more expensive part of the code. It's working great for me with debounce and max_inners of 200 on 100k names.

I see what you were saying in the earlier issue about setting it way too small (e.g. 25) and then not being able to find documents late in the list. I tweaked a few other variables and have it working well with debounce and these settings:

For 100k names, using the debounce API:

        this.fuzzySearch = FuzzySearch({
          source: docs,
          keys: fuzzySearchKeys,
          identify_item: identifyItem,
          max_inners: 200,
          bonus_match_start: 0.8,
          bonus_token_order: 2.5,
          bonus_position_decay: 0.5,
          output_limit: 15,
          token_query_min_length: 3, // maybe not?
        });

Thanks so much for working on this library and helping me understand it. 👍

aguynamedben commented 6 years ago

Oh, the other obvious improvement is to add tests for max_inners. I may do that when this sprint is over... and add more tests in general. I'm happy to help with this codebase wherever necessary. I want to understand the algorithm better and need to read some of the whitepapers, but I can also do grunt work.
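
Something like this is what I have in mind (a rough sketch in mocha/chai style, assuming that's what the repo uses, and assuming the scan accepts at most max_inners matches):

    // Rough test sketch: with a deliberately non-specific query, the scan
    // should bail out early, so we never get more results than the cap allows.
    const { expect } = require("chai");                // assuming chai is available
    const FuzzySearch = require("../src/FuzzySearch"); // path is a guess

    it("stops scanning after max_inners matches", () => {
      const docs = [];
      for (let i = 0; i < 1000; i++) docs.push("item " + i);
      const searcher = FuzzySearch({ source: docs, max_inners: 10 });
      const results = searcher.search("item"); // matches every doc
      expect(results.length).to.be.at.most(10);
    });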

aguynamedben commented 6 years ago

Note: the first commit adds a test (something I was getting mixed up about with the options) and also upgrades some packages based on warnings I saw during yarn install.

jeancroy commented 6 years ago

Thanks for this, I'll review later today or later in the week. For now, here's a brain dump of what I think I know on this topic.

What is this?

This is a tradeoff where we accept (partially) incorrect results in exchange for better speed under some circumstances.

Rationale

Some queries might not be specific enough to properly filter the list. A query is deemed non-specific if it matches a large fraction of the results. (For example, searching for just the word "and".)

It's possible to produce such non-specific queries by accident. (For example, in a search-as-you-type scenario, the intermediate query "and" is sent while typing "Andrew".)

It can be argued that in such a case the best result is ill-defined, or of such low utility that a new query will be typed anyway. (For example, the user simply continues typing "Andrew".)

Difficulty

So we need to identify non-specific queries. We assume those occur when the query matches a large number of results. From this, two things need to be defined: "large" and "match".

"Large" can be a fraction of the results, say in the 30-50% range, or an absolute count, say 10k results.

"Match" is more complicated:

A possible criterion is to accept as a match anything at least half as good as the best result so far (or any other fraction). The theory behind this is that, for a query with good specificity, scores decay in an exponential-like fashion, while for a non-specific query the score stays almost constant across all results.

Because this acceptance criterion is more lenient, "large" in FuzzySearch might need to be bigger than an equivalent "large" in fuzzaldrin.
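
Putting it together, a minimal sketch of the idea (not the actual inner loop; scoreOf stands in for the scoring routine):

    // Minimal sketch of the bail-out: accept anything at least half as good
    // as the best score so far, and stop scanning once we've accepted
    // maxInners matches (the query is deemed non-specific at that point).
    function searchWithMaxInners(candidates, query, maxInners, scoreOf) {
      const results = [];
      let best = 0;
      for (const item of candidates) {
        const score = scoreOf(item, query);
        if (score <= 0) continue;               // not a match at all
        if (score > best) best = score;
        if (score >= 0.5 * best) results.push({ item, score });
        if (results.length >= maxInners) break; // "large" reached: bail out
      }
      return results.sort((a, b) => b.score - a.score);
    }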

Bias

The unprocessed entries will always lie toward the end of the list. Because of this, the max_inners feature will work best if entries are pre-sorted in decreasing order of some general-purpose quality signal, such as popularity, recency, average satisfaction score across all keywords, etc.
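
For example (a sketch; the popularity field is hypothetical):

    // Pre-sort so the most valuable entries are scanned first; the bail-out
    // then only ever skips the long tail of low-value entries.
    docs.sort((a, b) => b.popularity - a.popularity);
    const fuzzySearch = FuzzySearch({
      source: docs,
      max_inners: 200,
    });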

Ease of use

If this gets implemented, it would be nice to have sensible, "battle-tested" defaults. A benchmark showing whether it is worth it would be nice too.
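
Something simple would do (a sketch; using a huge cap to approximate "no cap" is an assumption):

    // Compare a deliberately non-specific query with and without the cap.
    for (const cap of [1e9 /* effectively uncapped */, 200]) {
      const searcher = FuzzySearch({ source: docs, max_inners: cap });
      const t0 = Date.now();
      searcher.search("and");
      console.log(`max_inners=${cap}: ${Date.now() - t0}ms`);
    }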

Alternative design

aguynamedben commented 6 years ago

@jeancroy This is a great summary, I'll digest it and get back with something thoughtful. I realize the value of sorting now: things getting added via .add() are not showing up because max_inners isn't big enough to reach that far in the list.

Example:
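
Something like this (hypothetical; newDoc is made up, and I'm assuming .add() appends to the end of the source list):

    // The new doc lands after the original 100k entries, so with
    // max_inners = 200 the scan bails out long before reaching it.
    const fuzzySearch = FuzzySearch({ source: docs, max_inners: 200 });
    fuzzySearch.add(newDoc);
    fuzzySearch.search(newDoc.name); // may miss newDoc entirely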

The debounce is epic, and I'm also interested in cancelable queries. I'll try upping max_inners now that I have the debounce working well.

Thanks so much for the thoughtful feedback!

jeancroy commented 6 years ago

Yeah, the debounce API is a bit hidden, badly documented, and tightly knit to a particular select-list implementation. Sorry about that. It is, however, an important part of faking performance by doing less work.
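
The underlying idea is just this (a generic sketch, not the actual API; renderResults is a placeholder):

    // Postpone the search until typing pauses, so fewer full scans run.
    function debounce(fn, waitMs) {
      let timer = null;
      return (...args) => {
        clearTimeout(timer);
        timer = setTimeout(() => fn(...args), waitMs);
      };
    }

    const debouncedSearch = debounce((query) => {
      renderResults(fuzzySearch.search(query));
    }, 150);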

I think we have similar uses for this library (large address book, mentions), so I'll be able to free up some time to improve performance in the semi-near future. Maybe even play with cancellation tokens / web workers and whatnot.