Open dyacob opened 1 year ago
That blocks any further extensibility of that file format. This is fine if you can guarantee you won't ever want to add anything else to a row. I'm chiming in because we anticipate SLDR pointing to some of these files as wordlists for people to use, so I would vote for a separate file for them.
Yet another option would be to have the 3rd column be a comma-separated list of words. But I also lean toward the companion-file approach.
@mcdurdin Should this feature request be in the keyman repo? I can imagine it being part of a larger discussion on the future of lexical models.
Yes, moving this to the Keyman repo; it will go into the bucket of feature requests for future versions of lexical models. A companion file of misspellings is probably how we'd implement it, because it is something that would be applicable to a wide variety of lexical models, not just to wordlists.
Just to be clear... there is no language-specific spelling logic. It's all generalized and language-independent. And I agree with @mcdurdin's take above - something to facilitate cleaning up in-corpus misspellings would be a nice feature.
While the Unilex project is a valuable resource of real-world, practical lexicons, it is also a database of misspelled words and their frequencies. Unfortunately, correctly spelled and misspelled words are not marked and so are indistinguishable in its datasets. The task of eliminating misspelled words, and adding their frequency counts to the correctly spelled word, is left to the Keyman lexical model maintainer. Which is fine.
I propose here that the lexical model, along with Keyman's predictive text features, support a soft correction of misspelled words. Thus, when a misspelled word is typed by a user, the correctly spelled word will be offered for selection. I believe this can be done with a data-driven approach, avoiding the need to add language-specific spelling logic to the predictive text engine. The simplest approach that I can think of is to list the misspelled words after the 2nd column of the lexicon `.tsv` files. For example, beginning from the 3rd column, all listed words would be the common misspellings that are "aliased" to the correct spelling. So, when a user partly or fully types `badSpelling3`, the predictive text feature will offer in its place the correctly spelled `correctWord`, and apply the same frequency weight to the bad spelling as to the correct spelling (by virtue of it being an "alias"). Again, this is a generic approach that avoids language-specific logic; the lexical model maintainer can use Unilex data and other resources to create the list of incorrect spellings.
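A minimal sketch of what the extra-columns scheme could look like at build time, assuming a hypothetical row layout of `word <TAB> count <TAB> misspelling…` (the function name `expandAliases`, the count `42`, and the `badSpelling*` identifiers are illustrative, not Keyman's actual API):

```typescript
// Hypothetical .tsv row: word \t count \t misspelling1 \t misspelling2 ...
// Each misspelling is mapped to the same entry object as the correct word,
// so the prediction engine would offer the correct spelling at equal weight.

interface Entry { word: string; count: number; }

function expandAliases(tsv: string): Map<string, Entry> {
  const lexicon = new Map<string, Entry>();
  for (const line of tsv.trim().split("\n")) {
    const [word, countStr, ...misspellings] = line.split("\t");
    const entry: Entry = { word, count: Number(countStr) };
    lexicon.set(word, entry);
    for (const bad of misspellings) {
      // Alias: the misspelling resolves to the correctly spelled entry.
      lexicon.set(bad, entry);
    }
  }
  return lexicon;
}

const sample = "correctWord\t42\tbadSpelling1\tbadSpelling2\tbadSpelling3";
const lex = expandAliases(sample);
// lex.get("badSpelling3") → { word: "correctWord", count: 42 }
```

Because every alias shares the same entry object as the correct spelling, the frequency weight transfers automatically, which is the "alias" behavior described above.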
Alternatively, the misspelling aliases could reside in a companion file that is referenced from the `.ts` file and loaded at build time.
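The companion-file variant could be sketched like this, assuming a hypothetical `misspelling <TAB> correctWord` row format and an `applyAliases` helper (both names are illustrative; this is not Keyman's actual build pipeline):

```typescript
// Wordlist: word -> frequency, as would come from the lexicon .tsv.
type Wordlist = Map<string, number>;

// Fold a hypothetical companion file of "misspelling \t correctWord" rows
// into the wordlist, so each misspelling inherits its target's frequency.
function applyAliases(
  words: Wordlist,
  companionTsv: string
): Map<string, { word: string; count: number }> {
  const out = new Map<string, { word: string; count: number }>();
  for (const [word, count] of words) out.set(word, { word, count });
  for (const line of companionTsv.trim().split("\n")) {
    const [bad, good] = line.split("\t");
    const target = out.get(good);
    // Skip aliases whose target is missing from the wordlist.
    if (target) out.set(bad, target);
  }
  return out;
}

const words: Wordlist = new Map([["the", 1000]]);
const merged = applyAliases(words, "teh\tthe");
// merged.get("teh") → { word: "the", count: 1000 }
```

Keeping the aliases in a separate file preserves the two-column wordlist format, which addresses the extensibility concern raised in the comments above.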