pluots / zspell

A spellchecking library and executable written in Rust

Misspelled word suggestions #45

Open tgross35 opened 1 year ago

tgross35 commented 1 year ago

It may be good to provide a SuggestionConfig (or similarly named) struct that could be passed as an argument to our .suggest function or its equivalent. There are a few different pieces of functionality it could cover:

Possible API (from #16)

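fn check_with_suggestions<'a>(&'a self, s: &str) -> Suggestions<'a>

enum Suggestions<'a> {
    // The word was found in the dictionary
    Correct,
    // The word was not found; holds candidate replacements borrowed from the dictionary
    Incorrect(Vec<&'a str>),
}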
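
To make the proposed shape concrete, here is a minimal sketch of how a config struct and the Suggestions enum could fit together. Everything in it is an assumption for illustration: ToyDictionary, the max_suggestions field, and the "same length, one character different" matching rule are hypothetical and are not zspell's actual API or algorithm.

```rust
use std::collections::HashSet;

/// Hypothetical knobs for suggestion generation (not part of zspell).
#[derive(Debug, Clone)]
pub struct SuggestionConfig {
    /// Upper bound on the number of suggestions returned.
    pub max_suggestions: usize,
}

/// Result of checking a single word, following the shape proposed above.
#[derive(Debug, PartialEq)]
pub enum Suggestions<'a> {
    Correct,
    Incorrect(Vec<&'a str>),
}

/// Toy stand-in for a real dictionary: just a set of known words.
pub struct ToyDictionary {
    words: HashSet<String>,
}

impl ToyDictionary {
    pub fn new(words: &[&str]) -> Self {
        Self {
            words: words.iter().map(|w| w.to_string()).collect(),
        }
    }

    /// Returns `Correct` for known words; otherwise returns dictionary words
    /// that differ from `s` in exactly one character (a placeholder strategy).
    pub fn check_with_suggestions(&self, s: &str, cfg: &SuggestionConfig) -> Suggestions<'_> {
        if self.words.contains(s) {
            return Suggestions::Correct;
        }
        let mut found: Vec<&str> = self
            .words
            .iter()
            .map(String::as_str)
            .filter(|w| differs_by_one(w, s))
            .collect();
        // Sort so the output order does not depend on hash iteration order.
        found.sort_unstable();
        found.truncate(cfg.max_suggestions);
        Suggestions::Incorrect(found)
    }
}

/// True when both words have the same length and exactly one differing character.
fn differs_by_one(a: &str, b: &str) -> bool {
    a.chars().count() == b.chars().count()
        && a.chars().zip(b.chars()).filter(|(x, y)| x != y).count() == 1
}
```

Returning borrowed &str values tied to the dictionary avoids allocating a String per suggestion, at the cost of only being able to suggest words that are stored verbatim.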

tgross35 commented 1 year ago

We should probably look at how hunspell does this

tgross35 commented 1 year ago

This post describes how Hunspell generates suggestions: https://zverok.space/blog/2021-01-28-spellchecker-5.html

  1. Change the word to the uppercase (see also “Word case” sub-section below);
  2. Replace common misspellings, like “f”→“ph” and vice versa, defined by REP table from aff-file;
  3. Split the word in two parts in every position (with space or dash), to be tested as a single dictionary entry, like “ad hoc” (see also #13 below);
  4. Replace related chars, like “a”, “å”, “ä”, defined by MAP table from aff-file;
  5. Swap every two adjacent letters, oh, and for 4- and 5-letter words also try two swaps: “ahev” → “have”;
  6. Swap two non-adjacent letters (up to distance 4);
  7. Replace every letter with the adjacent on the keyboard, e.g. “miraclw” → “miracle”. The keyboard layout is defined by KEY directive in aff-file; and, on the same step, with the capitalized version of the character (“paris” → “Paris”, but not vice versa), also considered as a possible keyboard-related error;
  8. Remove every letter in turn;
  9. Insert every letter from the language’s alphabet (defined by TRY directive in aff-file) into every position;
  10. Move every letter forward and backward into all possible positions;
  11. Replace every letter with every other letter from the language’s alphabet;
  12. Find a duplicated pair of letters and remove it: “chicicken” → “chicken”;
  13. Split the word in two in every position, to be tested as two separate words (see also #3 above).
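
As a rough illustration of the edit-based steps in the list above, the sketch below covers only the single adjacent swap from step 5 plus steps 8, 9 and 11, and keeps the candidates that appear in a word set. The function names edit_candidates and suggest are hypothetical, the alphabet argument stands in for the TRY directive, and Hunspell's real ordering, ranking and limits are not modelled.

```rust
use std::collections::{BTreeSet, HashSet};

/// Generate single-edit candidates for `word`.
/// `alphabet` plays the role of the TRY directive from the aff-file.
pub fn edit_candidates(word: &str, alphabet: &str) -> BTreeSet<String> {
    let chars: Vec<char> = word.chars().collect();
    let mut out: BTreeSet<String> = BTreeSet::new();

    // Step 5 (single swap only): swap every pair of adjacent letters.
    for i in 0..chars.len().saturating_sub(1) {
        let mut c = chars.clone();
        c.swap(i, i + 1);
        out.insert(c.iter().collect::<String>());
    }

    // Step 8: remove every letter in turn.
    for i in 0..chars.len() {
        let mut c = chars.clone();
        c.remove(i);
        out.insert(c.iter().collect::<String>());
    }

    for letter in alphabet.chars() {
        // Step 9: insert every alphabet letter into every position.
        for i in 0..=chars.len() {
            let mut c = chars.clone();
            c.insert(i, letter);
            out.insert(c.iter().collect::<String>());
        }
        // Step 11: replace every letter with every other alphabet letter.
        for i in 0..chars.len() {
            if chars[i] != letter {
                let mut c = chars.clone();
                c[i] = letter;
                out.insert(c.iter().collect::<String>());
            }
        }
    }

    // The input itself is not a useful suggestion.
    out.remove(word);
    out
}

/// Keep only the candidates that are present in `dictionary`,
/// in alphabetical order (the BTreeSet's iteration order).
pub fn suggest(word: &str, alphabet: &str, dictionary: &HashSet<&str>) -> Vec<String> {
    edit_candidates(word, alphabet)
        .into_iter()
        .filter(|c| dictionary.contains(c.as_str()))
        .collect()
}
```

Collecting into a BTreeSet deduplicates candidates produced by more than one step and gives a stable order; a real implementation would more likely preserve the step order, since earlier steps are meant to yield more plausible corrections.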

tgross35 commented 1 year ago

http://aspell.net/test/cur/

cpu commented 11 months ago

:wave: I would be interested in using this crate to replace hunspell-rs/hunspell-sys with a memory safe alternative, but my use case requires misspelled word suggestions. Just leaving a comment here in case knowing it would be useful to a downstream project helps motivate the work.

Thank you!

tgross35 commented 11 months ago

@cpu Thanks for the feedback! I do indeed have prototype suggestions working, but some changes are needed before it is reliable :) in particular, I probably have to unblock https://github.com/pluots/zspell/issues/54 first