feat(common): Add Support for Simple Misspelling Aliases to lexical models

dyacob commented 1 year ago

While the Unilex project is a valuable resource of real-world, practical lexicons, it is also a database of misspelled words and their frequencies. Unfortunately, correctly spelled and misspelled words are not marked, so they are indistinguishable in its datasets. The task of eliminating misspelled words, and adding their frequency counts to the correctly spelled word, is left to the Keyman Lexical Model maintainer, which is fine.

I propose here that the lexical model, along with Keyman's predictive text features, support a soft correction of misspelled words. Thus, when a misspelled word is typed by a user, the correctly spelled word will be offered for selection. I believe this can be done with a data-driven approach thus avoiding the need to add language-specific spelling logic to the predictive text engine. The simplest approach that I can think of is to list the misspelled words after the 2nd column of the lexicon .tsv files. For example:

correctWord     54321    badSpelling1    badSpelling2    badSpelling3 ...

Beginning from the 3rd column, all listed words are common misspellings that are "aliased" to the correct spelling. So, for example, when a user types part or all of badSpelling3, the predictive text feature will offer the correctly spelled correctWord in its place, and will give the bad spelling the same frequency weight as the correct spelling (by virtue of its being an "alias").
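
To make the row format above concrete, here is a minimal sketch of how such a file might be read and queried. Everything in it is hypothetical: `AliasedLexicon`, `parseAliasedWordlist` and `suggest` are illustrative names, not part of any Keyman API, and the plain prefix match stands in for whatever the real predictive text engine does. It also assumes that a missing or unparsable count is treated as 1.

```typescript
// Hypothetical types and function names, for illustration only.
interface AliasedLexicon {
  /** Correct spelling -> frequency count. */
  counts: Map<string, number>;
  /** Misspelling -> the correct spelling it is aliased to. */
  aliases: Map<string, string>;
}

/** Reads rows of the form: word <TAB> count <TAB> misspelling1 <TAB> misspelling2 ... */
function parseAliasedWordlist(tsv: string): AliasedLexicon {
  const counts = new Map<string, number>();
  const aliases = new Map<string, string>();
  for (const rawLine of tsv.split(/\r?\n/)) {
    if (rawLine.trim() === "") continue; // skip blank lines
    const [word = "", countText = "", ...misspellings] = rawLine
      .split("\t")
      .map((field) => field.trim());
    if (word === "") continue;
    const parsed = Number.parseInt(countText, 10);
    // Assumption: a missing or unparsable count is treated as 1.
    const count = Number.isNaN(parsed) ? 1 : parsed;
    counts.set(word, (counts.get(word) ?? 0) + count);
    for (const bad of misspellings) {
      if (bad !== "") aliases.set(bad, word);
    }
  }
  return { counts, aliases };
}

/** Offers correct spellings whose own text, or one of whose aliases, starts with `typed`. */
function suggest(
  lexicon: AliasedLexicon,
  typed: string
): Array<{ word: string; count: number }> {
  const matches = new Map<string, number>();
  lexicon.counts.forEach((count, word) => {
    if (word.startsWith(typed)) matches.set(word, count);
  });
  lexicon.aliases.forEach((word, bad) => {
    // The alias inherits the frequency weight of the correct spelling.
    if (bad.startsWith(typed)) matches.set(word, lexicon.counts.get(word) ?? 0);
  });
  return Array.from(matches, ([word, count]) => ({ word, count })).sort(
    (a, b) => b.count - a.count
  );
}
```

With the example row above loaded, a partial input such as `badSp` would match only through the aliases, so the sketch would offer `correctWord` with its full count.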

Again, this is a generic approach that avoids language-specific logic. The lexical model maintainer can use Unilex data and other resources to create the list of incorrect spellings.

Alternatively, the misspelling aliases could reside in a companion file that is referenced from the .ts file and loaded at build time.
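
As a rough sketch of the companion-file idea, assuming a purely hypothetical format of one `misspelling<TAB>correctSpelling` pair per line (nothing in this thread defines the format), the same alias table could also drive the clean-up step of adding each misspelling's count to the correctly spelled word. `parseAliasFile` and `foldMisspellings` are illustrative names, not existing Keyman functions.

```typescript
/**
 * Hypothetical companion-file format (an assumption, not a defined spec):
 * one "misspelling<TAB>correctSpelling" pair per line.
 */
function parseAliasFile(text: string): Map<string, string> {
  const aliases = new Map<string, string>();
  for (const rawLine of text.split(/\r?\n/)) {
    const [bad = "", good = ""] = rawLine.split("\t").map((field) => field.trim());
    if (bad !== "" && good !== "") aliases.set(bad, good);
  }
  return aliases;
}

/**
 * Returns a new word -> count table in which every misspelled entry has been
 * removed and its count added to the correctly spelled word.
 */
function foldMisspellings(
  counts: Map<string, number>,
  aliases: Map<string, string>
): Map<string, number> {
  const cleaned = new Map<string, number>();
  counts.forEach((count, word) => {
    const target = aliases.get(word) ?? word; // unlisted words are kept as-is
    cleaned.set(target, (cleaned.get(target) ?? 0) + count);
  });
  return cleaned;
}
```

Keeping the aliases out of the wordlist rows means the .tsv itself stays a plain word and count list, and the folding step could run at build time.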

mhosken commented 1 year ago

That blocks any further extensibility of that file format. This is fine if you can guarantee you won't ever want to add anything else to the row. I'm chiming in because we anticipate SLDR pointing at some of these files as wordlists for people to use. So I would vote for a separate file for them.

dyacob commented 1 year ago

Yet another option would be to have the 3rd column be a comma-separated list of words. But I also lean to the companion file approach.

DavidLRowe commented 1 year ago

@mcdurdin Should this feature request be in the keyman repo? I can imagine it being part of a larger discussion on the future of lexical models.

mcdurdin commented 1 year ago

Yes, moving this to the Keyman repo; it will go into the bucket of feature requests for future versions of lexical models. A companion file of misspellings is probably how we'd implement it, because it is something that would be applicable to a wide variety of lexical models, not just to wordlists.

jahorton commented 12 months ago

> I propose here that the lexical model, along with Keyman's predictive text features, support a soft correction of misspelled words. Thus, when a misspelled word is typed by a user, the correctly spelled word will be offered for selection. I believe this can be done with a data-driven approach thus avoiding the need to add language-specific spelling logic to the predictive text engine.

Just to be clear... there is no language-specific spelling logic. It's all generalized and language-independent. And I agree with @mcdurdin's take above - something to facilitate cleaning up in-corpus misspellings would be a nice feature.

keymanapp / keyman

feat(common): Add Support for Simple Misspelling Aliases to lexical models #8836