Return a matching list of words for a prefix

huahaiy commented 1 year ago

maybe sorted by length then frequency.

den1k commented 1 year ago

Here's an initial shot at this function. It's a modification of match-prefix?:

(defn- add-to-trie ;; copied becuase it's private
  [trie s]
  (assoc-in trie s (merge (get-in trie s) {:end true})))
=> #'yam.db/add-to-trie

(defn prefix-matches
  [trie s]
  (when-not (clojure.string/blank? s)
    (loop [t trie ks (seq s)]
      (if ks
        (let [k (first ks)]
          (if-let [t (t k)]
            (recur t (next ks))
            false))
        (letfn [(completions [t s]
                  (reduce-kv (fn [out k v]
                               (if (= k :end)
                                 out
                                 (let [completion (str s k)]
                                   (-> out
                                       (conj completion)
                                       (into (completions v completion))))))
                             #{}
                             t))]
          (completions t s))))))
=> #'yam.db/prefix-matches

(def t (-> (reduce add-to-trie {} ["autocomplete" "automobile"]))) ;; mini trie for testing
=> #'yam.db/t

(prefix-matches t "auto")
=>
#{"autom"
  "autocomple"
  "autocomplete"
  "autocomplet"
  "autocomp"
  "autocompl"
  "autocom"
  "automobil"
  "automo"
  "automobile"
  "autoco"
  "automob"
  "autoc"
  "automobi"}

One issue is that it is unaware of what constitutes a valid word (word in symspell's unigram) which means that on it's own it would not be sufficient for an autocomplete feature as mentioned in https://github.com/juji-io/datalevin/issues/165. There is likely an easy solution (though I'm not sure about performance) where completions are looked up in symspell's unigram. Here's a sketch of that:

(require '[symspell-clj.core :as sp])

(def spc
  (sp/new-spellchecker
    "en_unigrams.txt" "en_bigrams.txt"
    {:custom-dictionary ["autocomplete" "automobile"]}))

(let [unigram-set (.keySet (.getUnigramLexicon (.symspell spc)))]
  (filter #(.contains unigram-set %)
          (prefix-matches t "auto")))
=> ("autocomplete" "automobile")

@huahaiy what do you think, is this the right approach?

One other thing to ponder, the above should work well for autocomplete but not fuzzy autocomplete 🤔

huahaiy commented 1 year ago

Looks good.

new-spellchecker already loaded all the words in the unigram into prefix-trie fields of the SpellChecker object. If you add a method prefix-matched-words to the ISpellChecker protocol, your implementation can use that field directly.

den1k commented 1 year ago

Sounds good. What about performance? The small trie of only two words already yielded 14 suggestions. A real world trie will likely yield 10 thousand+ for some prefixes. Additionally it does not work for fuzzy input.

huahaiy commented 1 year ago

I am not sure it is necessary to support fuzzy input for autocomplete though. Autocomplete is to help people input the correct words, as such it should return correct words in the dictionary.

I won't start autocomplete until after 3 characters already being typed, that would reduce the returned list significantly. In addition, I would sort the list based on frequency (most words in the unigram has that information) and maybe length, and only show the top ones.

juji-io / symspell-clj

Return a matching list of words for a prefix #1