lewang / flx

Fuzzy matching for Emacs ... a la Sublime Text.
GNU General Public License v3.0
518 stars 37 forks source link

Use O(n*m) sequence alignment algorithm #98

Open MaskRay opened 6 years ago

MaskRay commented 6 years ago

tl;dr Using a sequence alignment algorithm can improve the fuzzy matching quality, as what fzf does. But this approach is much simpler without sacrifice of quality. See the following example:

    //////////// Ranks(pattern,  {best_candidate, ........, worst_candidate})
    // case
    CHECK(Ranks("monad", {"monad", "Monad", "mONAD"}));
    // initials
    CHECK(Ranks("ab", {"ab", "aoo_boo", "acb"}));
    CHECK(Ranks("CC", {"CamelCase", "camelCase", "camelcase"}));
    CHECK(Ranks("cC", {"camelCase", "CamelCase", "camelcase"}));
    CHECK(Ranks("c c", {"camel case", "camelCase", "CamelCase", "camelcase",
                        "camel ace"}));
    CHECK(Ranks("Da.Te",
                {"Data.Text", "Data.Text.Lazy", "Data.Aeson.Encoding.text"}));
    CHECK(Ranks("foo bar.h", {"foo/bar.h", "foobar.h"}));
    // prefix
    CHECK(Ranks("is", {"isIEEE", "inSuf"}));
    // shorter
    CHECK(Ranks("ma", {"map", "many", "maximum"}));
    CHECK(Ranks("print", {"printf", "sprintf"}));
    // score(PRINT) = kMinScore
    CHECK(Ranks("ast", {"ast", "AST", "INT_FAST16_MAX"}));
    // score(PRINT) > kMinScore
    CHECK(Ranks("Int", {"int", "INT", "PRINT"}));

What helm-M-x returns for the pattern c m:

0. c++-mode          ; deserved champion
1. customize
2. eldoc-mode
4. company-mode   ;;this should be the runner-up, because `c m` are Head-position matching

counsel-M-x:

0. customize-group   ; m is Tail-position matching, this should be given lower weight
1. chmod                ; this is short so it comes first, but `m` is a Tail position
4. c++-mode        ; our champion shouldn't be here
5. css-mode
..........

I have read many fuzzy matching algorithms, including the one used by fzf. It is too complex and also messes about heuristics.

I adapted and simplified the fuzzy matching algorithm used in clangd and created https://github.com/MaskRay/cquery/blob/fuzzy/src/fuzzy_match.cc

It takes many factors into account:

While keeps it conceptually simple in the main loop:

  dp[0][0][0] = dp[0][0][1] = 0;
  for (int j = 0; j < n; j++) {
    dp[0][j + 1][0] = dp[0][j][0] + MissScore(j, false);
    dp[0][j + 1][1] = kMinScore * 2;
  }
  for (int i = 0; i < int(pat.size()); i++) {
    int(*pre)[2] = dp[i & 1];
    int(*cur)[2] = dp[(i + 1) & 1];
    cur[i][0] = cur[i][1] = kMinScore;
    for (int j = i; j < n; j++) {
      cur[j + 1][0] = std::max(cur[j][0] + MissScore(j, false),
                               cur[j][1] + MissScore(j, true));
      // For the first char of pattern, apply extra restriction to filter bad
      // candidates (e.g. |int| in |PRINT|)
      if (low_pat[i] == low_text[j] &&
          (i || text_role[j] != Tail || pat[i] == text[j])) {
        cur[j + 1][1] = std::max(pre[j][0] + MatchScore(i, j, false),
                                 pre[j][1] + MatchScore(i, j, true));
      } else
        cur[j + 1][1] = kMinScore * 2;
    }
  }

The heuristics are gathered and coded in one place FuzzyMatcher::MatchScore, instead of scattering all over the code (which is unfortunately the case for many other fuzzy matching algorithms). I have also done something similar for rofi which you can experience by appending -sort -matching fuzzy to its command line.

I'll explain briefly about the current heuristics. Similar rules can be added.

int FuzzyMatcher::MatchScore(int i, int j, bool last) {
  int s = 0;
  // Bonus for case match
  if (pat[i] == text[j]) {
    s++;
    // Bonus for prefix match or case match when the pattern contains upper-case letters
    if ((pat_set & 1 << Upper) || i == j)
      s++;
  }
  // For initial positions of pattern words
  if (pat_role[i] == Head) {
    // Bonus if it is matched to an initial position of some text word
    if (text_role[j] == Head)
      s += 30;
    // Penalty for non-initial positions
    else if (text_role[j] == Tail)
      s -= 10;
  }
  if (text_role[j] == Tail && i && !last)
    s -= 30;
  // The first character of pattern is matched to a non-initial position in the text
  if (i == 0 && text_role[j] == Tail)
    s -= 40;
  return s;
}

One caveat is that for space complexity I use compressed two-row dp[2][n][2] instead of full dp[m][n][2]. This way, we lose the ability to reconstruct the optimal path (matching positions).

Given pattern cm, the sequence alignment algorithm will prefer [c]ompany-[m]ode over [c]o[m]pany-mode, as the initial positions of words are given more weight. However, I do not store the full sequence alignment table dp[m][n][2] so it is not possible to reconstruct the optimal path. But this should only cause a minor visual glitch.

I'm not that good at elisp (implying that I cannot implement this ....) but I believe it is a huge improvement. Appreciated if someone could implement this..........

If you speak Chinese, you may read my complain in https://emacs-china.org/t/topic/5368 ............

ivy integration

PythonNut commented 4 years ago

Sorry if this wasn't clear, but I believe flx already implements a O(nm) matching algorithm. The implementation is a bit non-traditional, but the asymptotic complexity should be fine (as far as I could tell when I wrote it). There's definitely room to do fuzzy matching more quickly by a constant factor, and I wouldn't be surprised if a large constant speedup could be achieved.