rschmitt / heatseeker

A high-performance Selecta clone, written in Rust.
MIT License
214 stars 10 forks source link

Pull out the matcher and make it a standalone crate #14

Closed SirVer closed 7 years ago

SirVer commented 9 years ago

Sorry, this is kind of a brain dump. I hope you'll help me turning your tool into the core fuzzy matcher of my editor WIP. Otherwise I need to reinvent the wheel again :)

Background

Fuzzy matching is a common task and useful in many places. My use case is a better Vim like editor (https://github.com/swiboe/swiboe) which I feel driven to write after being fed up with Vim's (and other editors) shortcomings over the last 7 years writing and maintaining one of Vim's most popular plugins.

Suggested design changes

I outlined an implementation for a fuzzy matcher to test the basic design of Swiboe here. The lessons I learned so far are the following:

What you match is not what you want.

What I mean is that a potential match candidate has a text - displayed to the user and used for scoring. But when it is selected, it might return another value to the caller of the matcher - for example I might match for filenames under a certain directory, but want a fully sanitized absolute path back.

Related, a matching candidate should contain additional information - for example for a UI to display. Continuing with the file path example, you might want to attach if the file is new, added or edited inside a git repository - since the gui will draw it differently.

I think the easiest way of achieving this is to introduce a Candidate struct (as I did in my toy implementation), but make it generic over some user data T that can be queried and contain all this additional information.

Multi threaded matching and optimizations

I tested my toy matcher on ~300k of files and realized that in such cases parallel matching on many CPUs is needed. I think the library should take an optional thread pool on construction or query time and use it to match on many cores.

Related: A big optimization that i stole from YCM's matcher and I am not sure if heatseeker implements is to do a cheap filtering on the characters for each match, before doing the full subsequence match. This works like this: we use a bitset with 127 entries (ASCII only, but that does not mean it does not work for utf, it just means that all strings with utf have to be matched all the time) and set each bit for each character in the string. If a candidate has a zero where a query has a one, the candidate does not need to be matched. This comparison is super fast. If you are interested, this optimization is implemented in my toy matchers.

rschmitt commented 9 years ago

The only issue with extracting the matcher is the fact that I'm not sure what level of stability I can guarantee. Source compatibility isn't too hard, but there are lots of ways to tweak a fuzzy matcher (Heatseeker's smart case, Selecta's boundary matching): I'd have to do one of the following:

  1. Create a very general, highly configurable matching engine
  2. Stop changing Heatseeker's matching
  3. Regularly change the matching crate's scoring algorithm
  4. Not actually use the matching crate in Heatseeker, but rather essentially fork my own library

There are also some stability issues with Rust's scoped threads, which were recently removed from the standard library after they were discovered to be unsound. I'm not sure what the story will be in the future for data parallelism, which is why it's not as simple as "take an optional thread pool."

SirVer commented 9 years ago
  1. seems reasonable to me. I think a fuzzy matcher is by definition used by a human to select something from a big list fast. I personally use youcompleteme, LaunchBar, unite and Ctrl-P fuzzy matchers and am not bothered by their different algorithms too much. What I am getting at is that changing the way the selection is done is an implementation detail and perfectly fine to change all the time, only source compatibility is needed.

About the thread pools: There is a really fresh new crate that implements safe scoped threads on a thread pool: https://crates.io/crates/scoped_threadpool. Otherwise just taking an instance of https://github.com/rust-lang/threadpool should be fine too.

rschmitt commented 9 years ago

I'm vaguely aware of those crates, but I haven't looked at them closely yet. But I am following the relevant RFCs for scoped threads (although what I actually want is data parallelism, which is much higher-level than thread pools).

On Sun, Aug 23, 2015 at 10:22 PM, Holger Rapp notifications@github.com wrote:

  1. seems reasonable to me. I think a fuzzy matcher is by definition used by a human to select something from a big list fast. I personally use youcompleteme, LaunchBar https://www.obdev.at/products/launchbar/index.html, unite and Ctrl-P fuzzy matchers and am not bothered by their different algorithms too much. What I am getting at is that changing the way the selection is done is an implementation detail and perfectly fine to change all the time, only source compatibility is needed.

About the thread pools: There is a really fresh new crate that implements safe scoped threads on a thread pool: https://crates.io/crates/scoped_threadpool. Otherwise just taking an instance of https://github.com/rust-lang/threadpool should be fine too.

— Reply to this email directly or view it on GitHub https://github.com/rschmitt/heatseeker/issues/14#issuecomment-134039925.

rschmitt commented 7 years ago

I think the fst crate will do what you want: https://github.com/BurntSushi/fst

focusaurus commented 4 years ago

Just a note that I discovered this issue hoping to use the heatseeker matcher as a library. I tried fst but fst seems to have regex matching and Levenshtein distance, neither of which are the right algorithm for the kind of "fuzzy" matching I'm looking for.