jeancroy / FuzzySearch

:mag: Fast autocomplete suggestion engine using approximate string matching
MIT License

Tuning for large datasets #7

Closed · aguynamedben closed this issue 6 years ago

aguynamedben commented 7 years ago

Hi. First, thank you for building this promising library, and thanks for your work on fuzzaldrin-plus as well.

My data set is 1mm records (natural language document titles), and I'm providing an auto-complete search that only needs to return 15 results. fuzzaldrin-plus has a setting, maxInners, which is detailed here. I've found that with 1mm records, this fuzzaldrin-plus config works amazingly well, returning 15 results from 1mm records in 1-2ms.

let fuzzaldrinOptions = {
  debug: false,
  maxResults: 15,
  usePathScoring: false,
  maxInners: 50,
  key: 'title',
};

This is due to the maxInners option and its "High positive count mitigation" as detailed in the README. Basically, my interpretation is that maxInners tells the algorithm to abort scoring early in the process if there are a ton of potential matches (e.g. searching for "a" across 1mm records).

I'm looking at the tuning options for FuzzySearch. Is there a similar mitigation in FuzzySearch that lets me efficiently escape from a large number of matches early? I tried playing with the thresh_relative_to_best setting, thinking that setting it higher (e.g. 0.8) would select from a smaller set of records, but that didn't change the performance significantly.

Any advice you could provide on tuning for a search across a large number of records would be very helpful. Maybe FuzzySearch doesn't have that mechanism, as it looks like maxInners was added as a later performance enhancement PR to fuzzaldrin-plus, but I figured it was worth asking.

Thanks again for your work on this library.

jeancroy commented 7 years ago

I can implement a maxInners in this library too. I had no idea that feature had fans!

It would probably go there, breaking the outer for loop when we reach the count (maybe by simply returning). Default options are set up about here.
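A minimal sketch of that early-abort idea (not the library's actual code; `collectCandidates` and `scoreFn` are stand-ins for illustration):

```javascript
// Sketch of the early-abort idea: stop scanning the source once we have
// collected `max_inners` candidates that pass the inclusion threshold.
// `scoreFn` is a stand-in for the library's scoring routine.
function collectCandidates(source, scoreFn, thresh_include, max_inners) {
  var candidates = [];
  for (var i = 0; i < source.length; i++) {
    var score = scoreFn(source[i]);
    if (score >= thresh_include) {
      candidates.push({ item: source[i], score: score });
      // High-positive-count mitigation: abort the outer loop early.
      if (candidates.length >= max_inners) break;
    }
  }
  return candidates;
}
```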

Keep in mind this library only includes promising results (relative to the current best) instead of everything > 0, so an appropriate maxInners might be smaller.

Those two won't stop processing, but they may make the processing inexpensive:

- minimum_match: minimum score to consider that two tokens are not unrelated.
- thresh_include: to be a candidate, an item's score must be at least this.
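For reference, a hedged example of passing those two options at construction time (the threshold values here are purely illustrative, not recommended defaults):

```javascript
// Illustrative values only; consult the library's default options
// for sensible starting points.
var books = [{ title: "Dune", author: "Herbert" }]; // sample data
var options = {
  source: books,
  keys: ["title", "author"],
  minimum_match: 1.0,  // below this, two tokens are considered unrelated
  thresh_include: 2.0  // an item must score at least this to be a candidate
};
// var fs = new FuzzySearch(options);
```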

jeancroy commented 7 years ago

I tried playing with the thresh_relative_to_best setting, thinking that putting it higher (i.e. 0.8) would select from a smaller set of records, but that didn't change the performance significantly.

It does select a smaller set of records. However, by the time you know the score (and thus can compare it to the best), you have already spent the computational power, so indeed it should not affect performance (except, say, for the final sorting and output transform).
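A toy illustration of why that is: the scoring is the expensive part, and the threshold only filters scores that already exist.

```javascript
// thresh_relative_to_best prunes *after* scoring: every record has
// already been scored before any comparison against the best score.
var scored = [
  { title: "alpha", score: 9.0 },
  { title: "beta",  score: 7.5 },
  { title: "gamma", score: 2.0 }
];
var best = Math.max.apply(null, scored.map(function (r) { return r.score; }));
var thresh_relative_to_best = 0.8; // the value tried in the question
var kept = scored.filter(function (r) {
  return r.score >= best * thresh_relative_to_best;
});
// Fewer results are kept, but all three items were scored regardless.
```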

aguynamedben commented 7 years ago

@jeancroy Thank you for the tuning information. That makes sense.

I'll take a crack at the maxInners implementation based on the places in the code you pointed me to. It seems pretty straightforward. Thanks so much for helping me understand. I'll keep ya posted on how it progresses, hopefully in the next day or two.

aguynamedben commented 7 years ago

@jeancroy I got it working! 1mm documents, output_limit set to 15, max_inners set to 25. For a 3-letter query it's returning 15 results in 4ms from 1mm documents. Whooooo!

I'll submit a PR later tonight once I finish manually testing it.

jeancroy commented 7 years ago

max_inners: 25

Is this a percentage or an absolute value? 25 out of a million seems low. I'd do some tests to ensure I can reach documents near the end of the list. Do that test for both short and long/multi-word queries. Maybe a histogram of popular word frequencies.

In any case, pre-sorting by a relevancy estimate can help with a low max_inners. Maybe by decreasing popularity.
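For example, if records carry some popularity signal (the `popularity` field below is hypothetical, not part of the library), the source can be sorted before indexing:

```javascript
// Pre-sort the source by decreasing popularity so an early-abort cap
// (max_inners) still reaches the most relevant documents first.
// `popularity` is a hypothetical field on the records.
var books = [
  { title: "Rarely opened notes", popularity: 3 },
  { title: "Team handbook",       popularity: 120 },
  { title: "Meeting notes",       popularity: 45 }
];
books.sort(function (a, b) { return b.popularity - a.popularity; });
// Then index the pre-sorted list:
// var fs = new FuzzySearch({ source: books, keys: ["title"] });
```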

Once this lands I'll try to properly package / version & publish this. It's long overdue :)

jeancroy commented 7 years ago

Another issue is fooling the user into thinking the search is instant, when in reality we just suspend the search operation until the user pauses typing. The idea is that some search results will be immediately discarded by the next character typed, so we may as well prevent the search from happening.

This is called debounce, and one challenge is finding the proper debounce delay across machines and search tasks. My preferred approach is trying to learn the proper delay (adaptive).
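For illustration, here is a minimal fixed-delay debounce (an adaptive version would tune `delay` from measured search durations; this sketch is not the library's implementation):

```javascript
// Minimal fixed-delay debounce: repeated calls within `delay` ms collapse
// into a single trailing invocation with the latest arguments.
function debounce(fn, delay) {
  var timer = null;
  return function () {
    var self = this, args = arguments;
    clearTimeout(timer);
    timer = setTimeout(function () { fn.apply(self, args); }, delay);
  };
}

// Usage: run the search only after the user pauses typing.
// var onKeyPress = debounce(function (query) { fs.search(query); }, 150);
```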

I do have an async API, and instead of

    var fs = new FuzzySearch({source:books, keys:["title","author"] });
    var results = fs.search(query);

You can use something like

    var fs = new FuzzySearch({source:books, keys:["title","author"] });
    var interactive_search = FuzzySearch.getInteractive();

    var callback = function (results) { /* do something with results */ };
    var suppressed_cb = function (cached_results) { /* the search was cancelled; you can do something with the previous results */ };
    var immediate_cb = callback, delayed_cb = callback; /* distinguishes searches that happen immediately from those that happen after the keyboard returns to calm */

    // on key press
    interactive_search(query, immediate_cb, suppressed_cb, delayed_cb);

The callback API is a bit strange, but it follows the Twitter typeahead convention.

There's a callback, immediate_cb, if the search was immediate, and another one, delayed_cb, if we waited for the keyboard to calm down before we did the search. It's OK for those two to be the same.

There is also a callback to be notified when we skip a search. It's OK for it to be a do-nothing function, but it must be a function. It receives the previous results as an argument.

I may allow a simpler API with a single callback given, and I probably should document that part.
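One way such a simpler API could look (a hypothetical wrapper over the existing four-argument form, not part of the library):

```javascript
// Hypothetical convenience wrapper: funnel the three callbacks of
// interactive_search(query, immediate_cb, suppressed_cb, delayed_cb)
// into a single result callback; suppressed searches are ignored.
function makeSimpleSearch(interactive_search, callback) {
  var noop = function () {};
  return function (query) {
    // Immediate and delayed searches both report through `callback`.
    interactive_search(query, callback, noop, callback);
  };
}
```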

aguynamedben commented 7 years ago

@jeancroy Good call on those issues. It's going to take me a bit more time to get the backend of my application to elegantly generate the 1mm records to test with, but I'm working on it full-force.

Quick question: Is there currently a way to serialize the FuzzySearch data structures so that I can efficiently save/reload the FuzzySearch object to disk? I'm building an Electron app, so once I index the 1mm records in FuzzySearch, it would be nice to persist the indexes across runs of the app instead of rebuilding them from scratch every time. I've seen other libraries instruct users to call JSON.stringify() on the object, or a certain part of the object, in order to safely export it.

I've been poking around the code, maybe it doesn't exist right now, but I thought I'd ask. Thanks again for all your guidance.

jeancroy commented 7 years ago

I've seen other libraries instruct users to call JSON.stringify() on the object or a certain part of the object in order to safely export it.

Assuming something like

    var fs = new FuzzySearch({source:books, keys:["title","author"] });

The main content exists in fs.index, and maybe the corresponding fs.nb_indexed. So you could do something like this:

    var serialized = JSON.stringify(fs.index);
    // ...
    var fs2 = new FuzzySearch({keys: ["title", "author"]}); // do not set a source
    fs2.index = JSON.parse(serialized);
    fs2.nb_indexed = fs2.index.length;

Things like nb_indexed are an implementation detail and were introduced just last week. So I'd be in favor of adding an import / export index public API to fix up private things. Also, the index is built using a specific set of keys, so it makes little sense not to package them together and just hope the user provides the same set of keys.
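A sketch of what such a public API might look like, packaging the keys with the index (the helper names are hypothetical, and the constructor is passed in as `FuzzySearchCtor` to keep the sketch library-agnostic):

```javascript
// Hypothetical export/import helpers (not part of the library): package
// the index together with the keys it was built from, so a restored
// index can never be paired with a mismatched key set.
function exportIndex(fs, keys) {
  return JSON.stringify({
    keys: keys,
    index: fs.index,
    nb_indexed: fs.nb_indexed
  });
}

function importIndex(serialized, FuzzySearchCtor) {
  var data = JSON.parse(serialized);
  var fs = new FuzzySearchCtor({ keys: data.keys }); // no source: skip re-indexing
  fs.index = data.index;
  fs.nb_indexed = data.nb_indexed;
  return fs;
}
```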

aguynamedben commented 6 years ago

There's a partial implementation of fuzzaldrin-plus's maxInners setting here: https://github.com/jeancroy/FuzzySearch/pull/8 if anybody is interested in this, but for now I'm going to punt on it.