Open huahaiy opened 1 year ago
Here's an initial shot at this function. It's a modification of match-prefix?
:
(defn- add-to-trie ;; copied becuase it's private
[trie s]
(assoc-in trie s (merge (get-in trie s) {:end true})))
=> #'yam.db/add-to-trie
(defn prefix-matches
[trie s]
(when-not (clojure.string/blank? s)
(loop [t trie ks (seq s)]
(if ks
(let [k (first ks)]
(if-let [t (t k)]
(recur t (next ks))
false))
(letfn [(completions [t s]
(reduce-kv (fn [out k v]
(if (= k :end)
out
(let [completion (str s k)]
(-> out
(conj completion)
(into (completions v completion))))))
#{}
t))]
(completions t s))))))
=> #'yam.db/prefix-matches
(def t (-> (reduce add-to-trie {} ["autocomplete" "automobile"]))) ;; mini trie for testing
=> #'yam.db/t
(prefix-matches t "auto")
=>
#{"autom"
"autocomple"
"autocomplete"
"autocomplet"
"autocomp"
"autocompl"
"autocom"
"automobil"
"automo"
"automobile"
"autoco"
"automob"
"autoc"
"automobi"}
One issue is that it is unaware of what constitutes a valid word (word in symspell's unigram) which means that on it's own it would not be sufficient for an autocomplete feature as mentioned in https://github.com/juji-io/datalevin/issues/165. There is likely an easy solution (though I'm not sure about performance) where completions are looked up in symspell's unigram. Here's a sketch of that:
(require '[symspell-clj.core :as sp])
(def spc
(sp/new-spellchecker
"en_unigrams.txt" "en_bigrams.txt"
{:custom-dictionary ["autocomplete" "automobile"]}))
(let [unigram-set (.keySet (.getUnigramLexicon (.symspell spc)))]
(filter #(.contains unigram-set %)
(prefix-matches t "auto")))
=> ("autocomplete" "automobile")
@huahaiy what do you think, is this the right approach?
One other thing to ponder, the above should work well for autocomplete but not fuzzy autocomplete 🤔
Looks good.
new-spellchecker
already loaded all the words in the unigram into prefix-trie
fields of the SpellChecker
object. If you add a method prefix-matched-words
to the ISpellChecker
protocol, your implementation can use that field directly.
Sounds good. What about performance? The small trie of only two words already yielded 14
suggestions. A real world trie will likely yield 10 thousand+ for some prefixes. Additionally it does not work for fuzzy input.
I am not sure it is necessary to support fuzzy input for autocomplete though. Autocomplete is to help people input the correct words, as such it should return correct words in the dictionary.
I won't start autocomplete until after 3 characters already being typed, that would reduce the returned list significantly. In addition, I would sort the list based on frequency (most words in the unigram has that information) and maybe length, and only show the top ones.
maybe sorted by length then frequency.