aguynamedben opened 6 years ago
Oh, the other obvious improvement is to add tests for max_inners. I may do that when this sprint is over... and add more tests in general. I'm happy to help with this codebase wherever necessary. I want to understand the algorithm better and need to read some of the whitepapers, but I can also do grunt work.
Note: the first commit adds a test (something I was getting mixed up about with the options) and also upgrades some packages based on warnings I saw during yarn install.
Thanks for this, I'll review later today or later in the week. For now I've made a brain dump of what I think I know on this topic.
This is a tradeoff where we accept (partially) incorrect results in exchange for better speed under some circumstances.
Some queries might not be specific enough to properly filter the list. A query is deemed non-specific if it matches a large fraction of the results. (For example, searching for only the word "and".)
It's possible to produce such non-specific queries by accident. (For example, in a search-as-you-type scenario while typing "Andrew".)
It can be argued that in such a case the best result is ill-defined, or of such low utility that a new query will be typed anyway. (For example, continuing to type "Andrew".)
So we need to identify non-specific queries. We assume they occur when we match a large number of results. From this, two things need to be defined: "large" and "match".
"Large" can be a fraction of the results, say in the 30–50% range, or even an absolute budget of effort, say 10k results.
"Match" is more complicated:
A possible criterion is to accept as a match anything at least half as good as the best result so far (or any other fraction). The theory behind this is that, for a query with good specificity, the score decays in an exponential-like fashion, while for a non-specific query the score is almost constant across all results.
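To make the idea concrete, here is a minimal sketch of that early-stop loop. This is not FuzzySearch's actual implementation; names like `maxInners`, `matchRatio`, and `scoreFn` are illustrative assumptions.

```javascript
// Sketch: scan entries, count "matches" (anything at least `matchRatio`
// as good as the best score so far), and bail out once the match count
// exceeds an absolute budget — the query is deemed non-specific.
function searchWithEarlyStop(entries, scoreFn, { maxInners = 10000, matchRatio = 0.5 } = {}) {
  const results = [];
  let bestScore = 0;
  let inners = 0; // how many entries counted as a "match"
  for (const entry of entries) {
    const score = scoreFn(entry);
    if (score <= 0) continue; // no match at all
    if (score > bestScore) bestScore = score;
    results.push({ entry, score });
    // "Match": at least half as good as the best result so far.
    if (score >= matchRatio * bestScore) inners++;
    // "Large": an absolute budget; stop scanning once it is spent.
    if (inners >= maxInners) break;
  }
  return results.sort((a, b) => b.score - a.score);
}
```

With a specific query, scores decay quickly, few entries clear the half-best bar, and the full list gets scanned; with a flat, non-specific query the budget is spent early and the scan stops.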
Because the process is more lenient, "large" in FuzzySearch might be bigger than an equivalent "large" in fuzzaldrin
The unprocessed entries will always lie toward the end of the list. Because of this, the maxInner feature works best if entries are pre-sorted in decreasing order of some general-purpose quality signal, such as popularity, recency, or average satisfaction score across all keywords.
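That pre-sort could look something like the sketch below; the `popularity` and `lastUsed` fields are hypothetical, stand-ins for whatever quality signal your data actually carries.

```javascript
// Sort high-quality entries first, so an early-stopped scan still covers
// the candidates most likely to be wanted. Ties on popularity fall back
// to recency.
function preSortByQuality(entries) {
  return entries
    .slice() // don't mutate the caller's array
    .sort((a, b) => (b.popularity - a.popularity) || (b.lastUsed - a.lastUsed));
}
```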
If this gets implemented, it would be nice to have sensible, battle-tested defaults. Some benchmarks showing whether it is worth it would be nice too.
Debounce: This assumes the user types fast while entering an incomplete word, then pauses to see some results. The idea is to prevent searches from being issued while the user is typing fast.
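A generic debounce (not FuzzySearch's built-in debounce API, which is discussed further down) is just a timer reset on every keystroke:

```javascript
// Delay `fn` until the caller has been quiet for `wait` ms; every new call
// within that window cancels the pending one.
function debounce(fn, wait) {
  let timer = null;
  return function (...args) {
    clearTimeout(timer);
    timer = setTimeout(() => fn.apply(this, args), wait);
  };
}

// Usage: only the last keystroke in a fast burst triggers a search.
// const onInput = debounce(query => runSearch(query), 200);
```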
Stop words: Language-specific words that are known garbage do not trigger a search. Some expressions are made mostly of stop words, e.g. "to be or not to be".
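A stop-word guard can be a one-liner check before the search is even started; the word list below is a tiny illustrative sample, not an exhaustive one.

```javascript
// Skip the search entirely when every word in the query is a known
// low-information word.
const STOP_WORDS = new Set(["a", "an", "and", "or", "not", "to", "be", "the"]);

function isStopWordQuery(query) {
  const words = query.toLowerCase().split(/\s+/).filter(Boolean);
  return words.length > 0 && words.every(w => STOP_WORDS.has(w));
}
```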
Cancelable query: Another early-stop strategy, where we know the search result will be discarded anyway because the user has typed another character.
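One way to sketch this is a cancelation token checked between chunks of work; this is an assumption about how it could be done, not an existing FuzzySearch feature, and all names here are hypothetical.

```javascript
// Each keystroke cancels the previous in-flight search, so its results
// are never rendered and its remaining work is abandoned.
function makeToken() {
  return { canceled: false, cancel() { this.canceled = true; } };
}

function searchChunked(entries, scoreFn, token, onDone, chunkSize = 1000) {
  const results = [];
  let i = 0;
  (function step() {
    if (token.canceled) return; // stale query: discard everything
    const end = Math.min(i + chunkSize, entries.length);
    for (; i < end; i++) {
      const score = scoreFn(entries[i]);
      if (score > 0) results.push({ entry: entries[i], score });
    }
    if (i < entries.length) setTimeout(step, 0); // yield between chunks
    else onDone(results);
  })();
}
```

A web worker would achieve the same effect with true parallelism, at the cost of serializing the index across the worker boundary.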
@jeancroy This is a great summary, I'll digest this and get back with something thoughtful. I realize the value of sorting now. Things getting added via .add() are not showing up because maxInners isn't big enough to get that far in the list.
Example:
The debounce is epic, and I'm also interested in cancelable query. I'll try upping maxInners now that I have the debounce working well.
Thanks so much for the thoughtful feedback!
Yeah, the debounce API is a bit hidden, badly documented, and tightly knit to a particular select-list implementation. Sorry about that. It is, however, an important part of faking performance by doing less work.
I think we have similar uses for this library (large address book, mentions), so I'll be able to free some time to improve performance in the semi-near future. Maybe even play with cancelation tokens / web workers and whatnot.
This mirrors the maxInners argument in fuzz-aldrin-plus.
@jeancroy guided me on the implementation here: #7
@jeancroy, the only thing I think could improve this would be moving the inner for-loop bail-out higher, but it also kind of makes sense to have it down near what I assume is the more expensive part of the code. It's working great for me with debounce and a max_inners of 200 on 100k names.
I see what you were saying in the issue a while back about setting it way too small (e.g. 25) and not being able to find documents late in the list. I tweaked a few other variables and have it working well with debounce and these settings:
For 100k names, using the debounce API:
Thanks so much for working on this library and helping me understand it. 👍