feat(developer, common/lmlayer): predictive modeling for morphologically complex languages

jahorton commented 4 years ago

From the Community forums at https://community.software.sil.org/t/keyman-roadmap-march-2020/822/29:

Being able to identify in the tsv file what types of words can take what types of prefixes and in our language’s case clitic pronouns would allow for prediction of at least suffixes, but also better predictions on verbs, nouns, or whatever should follow.

Also, later in the same post:

Hunspell has at least been barking up this tree for more than a decade and Fieldworks grammar data can be used to help with word prediction.

Many languages will require specific greetings to be the way you open up a text message. Having those float to the top of suggestions for a new text would be useful. Things like Hola, Assalam Alaykum, Good Evening. These type of message openers could be designated in a TSV file.

MayuraVerma commented 4 years ago

Instead of Hunspell, please consider Nuspell. It is a newer version of Hunspell, written in C++, and claims to be three times faster than Hunspell.

https://github.com/nuspell/nuspell

jahorton commented 1 year ago

I just wanted to note that compatibility of some sort with Hunspell dictionaries is probably one of our more common requests. Upon investigation, it unfortunately looks anything but straightforward, but this is something we may need to tackle due to its popularity at some point. Just had the topic come up in an email conversation with a user today, and the topic of agglutinative languages - which hunspell is specialized for - came up in our recent team planning as well.

The current state of Hunspell+JS, so far as I can tell: there are wrapper libraries for use on npm and a few lighter-weight libraries that target various levels of compatibility with Hunspell. The purely-JS libraries seem to be seldom used and seldom maintained, which isn't exactly promising.

That said, Hunspell itself is open source (and written in C++, with some reimplementations in other languages) at https://github.com/hunspell/hunspell. It offers MPL licensing, which isn't quite MIT, but it's reasonably permissive and shouldn't impede any attempts to integrate it or convert parts of it if absolutely necessary.

The file formats of Hunspell dictionaries are plain text, rather than binary. There are some 'codes' of sorts included in those files; some effort would be required to parse them effectively, but dictionary data is at least reasonably accessible and interpretable. Also, one of the two backing file types actually somewhat resembles our wordlist .tsv files - though without frequency data. Here's a link to the data backing an en (en-US) Hunspell dictionary: https://github.com/wooorm/dictionaries/tree/main/dictionaries/en. The .dic file looks to be a pretty comfortable parse, just with an extra metadata tag, while the .aff file... would require some investigation.
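
For a sense of scale on the .dic side, here is a minimal TypeScript sketch of that parse. It assumes the conventional layout (an optional entry count on the first line, then one `word/FLAGS` entry per line, with `\/` escaping a literal slash) and the default one-character flag encoding. It ignores the .aff file entirely, so flags come back uninterpreted. `DicEntry`, `parseDic`, and `toWordlistRows` are hypothetical names, the word-then-count column order for the .tsv is an assumption, and the placeholder count is there only because the .dic file carries no frequency data.

```typescript
// Hypothetical shape for one parsed .dic entry; not an existing Keyman type.
interface DicEntry {
  word: string;
  flags: string[]; // raw affix flags, meaningless without the .aff file
}

// Parses the text of a Hunspell .dic file.
// Assumptions: single-character flags, and anything after the first
// whitespace on a line (morphological fields) can be discarded.
function parseDic(text: string): DicEntry[] {
  const lines = text.split(/\r?\n/);
  // The first line is conventionally an approximate entry count.
  const start = /^\d+\s*$/.test(lines[0]) ? 1 : 0;
  const entries: DicEntry[] = [];
  for (const raw of lines.slice(start)) {
    const line = raw.trim();
    if (line === "") continue;
    const token = line.split(/\s+/)[0];
    // Group 1: the word (allowing escaped slashes); group 2: the flags.
    const m = /^((?:\\\/|[^\/])+)(?:\/(.*))?$/.exec(token);
    if (!m) continue;
    entries.push({
      word: m[1].replace(/\\\//g, "/"),
      flags: m[2] ? m[2].split("") : [],
    });
  }
  return entries;
}

// Emits tab-separated "word<TAB>count" rows with a placeholder count.
function toWordlistRows(entries: DicEntry[], placeholderCount = 1): string[] {
  return entries.map((e) => `${e.word}\t${placeholderCount}`);
}
```

This only lists the stems as written. Expanding `walk/DGS` into its inflected forms would still need the affix rules from the .aff file.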

MayuraVerma commented 1 year ago

Hunspell is the older, legacy version.

Nuspell (https://nuspell.github.io) is newer and faster (3.5x faster than Hunspell).

https://github.com/nuspell/nuspell

The same team developed Nuspell. If you want to implement Hunspell support, please explore Nuspell; it's a new library and much faster.

Nuspell uses Hunspell dictionaries.

mcdurdin commented 1 year ago

Nuspell looks significantly more limited than Hunspell? However, the point is moot because, AFAICT, we wouldn't want to use either library directly inside Keyman, due to licensing, dependency requirements, implementation language (C++ vs web technologies), and not being a precise fit for our needs.

We may consider using the dictionary format, or providing conversion from them to Keyman's dictionary format. Understanding their support for agglutinative languages may help us as well.

jahorton commented 1 year ago

Ooh, the nuspell wiki offers some great information about the file formats: https://github.com/nuspell/nuspell/wiki#dictionary-maintenance

That, at least, looks to be very useful.

jahorton commented 8 months ago

A notable JS implementation of Hunspell: https://github.com/cfinke/Typo.js

I also found https://github.com/GitbookIO/hunspell-spellchecker, but it's basically abandoned.
https://github.com/wooorm/nspell also exists and is explicitly MIT-licensed, but also seems (more recently) abandoned.

I am currently unable to find a JS/TS version of nuspell.

Also notable: it appears that Hunspell dictionaries lack any notion of word-frequency or word-weighting. We'd probably need to specify extra source files of our own custom format in order to permit word-weighting - I don't believe that word entries in Hunspell dictionaries have any 'reserved space' we could use to insert related values.
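
As a rough illustration of what such a side file could look like, the sketch below reads a hypothetical `word<TAB>count` frequency file and attaches weights to a list of dictionary words, falling back to a default weight for words the file doesn't mention. The file layout, the `#` comment convention, and the names `parseFrequencies` and `attachWeights` are all assumptions made for this example, not an existing format.

```typescript
interface WeightedWord {
  word: string;
  weight: number;
}

// Reads "word<TAB>count" lines (a hypothetical side-file format).
// Blank lines, lines starting with '#', and malformed counts are skipped;
// repeated words have their counts summed.
function parseFrequencies(tsv: string): Record<string, number> {
  const freq: Record<string, number> = Object.create(null);
  for (const line of tsv.split(/\r?\n/)) {
    if (line.trim() === "" || line.charAt(0) === "#") continue;
    const parts = line.split("\t");
    const word = parts[0];
    const count = parseInt(parts[1], 10);
    if (word === "" || isNaN(count) || count < 0) continue;
    freq[word] = (word in freq ? freq[word] : 0) + count;
  }
  return freq;
}

// Pairs each dictionary word with its weight, highest weight first.
function attachWeights(
  words: string[],
  freq: Record<string, number>,
  defaultWeight = 1
): WeightedWord[] {
  return words
    .map((word) => ({ word, weight: word in freq ? freq[word] : defaultWeight }))
    .sort((a, b) => b.weight - a.weight);
}
```

Keeping the weights in a separate file would leave the Hunspell sources untouched, at the cost of having to keep the two files in sync.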

Also notable: I don't see where implementations offer any sort of abstraction similar to the LexiconTraversal interface type we're using to help optimize corrections from keystroke to keystroke; that abstraction really helps us keep performance decent.
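
To make concrete the kind of abstraction I mean, here is a deliberately simplified, hypothetical cursor over a plain trie. It is not our actual LexiconTraversal definition; `PrefixCursor`, `buildTrie`, and `cursorAt` are names invented for this sketch. The point is only that a search can step forward one character at a time from wherever it already is, instead of restarting from the root of the lexicon.

```typescript
interface TrieNode {
  children: Record<string, TrieNode>;
  isWord: boolean;
}

// Hypothetical cursor: a position in the lexicon reached by some prefix.
interface PrefixCursor {
  readonly prefix: string;
  readonly isWord: boolean; // true if the prefix itself is a lexicon entry
  nextChars(): string[]; // characters that can extend this prefix
  advance(ch: string): PrefixCursor | null; // null if no entry continues with ch
}

function newNode(): TrieNode {
  return { children: Object.create(null), isWord: false };
}

// Builds a trie keyed by UTF-16 code units (a simplification).
function buildTrie(words: string[]): TrieNode {
  const root = newNode();
  for (const word of words) {
    let node = root;
    for (const ch of word.split("")) {
      if (!node.children[ch]) node.children[ch] = newNode();
      node = node.children[ch];
    }
    node.isWord = true;
  }
  return root;
}

// Wraps a trie node as a cursor; advancing costs one child lookup.
function cursorAt(node: TrieNode, prefix = ""): PrefixCursor {
  return {
    prefix,
    isWord: node.isWord,
    nextChars: () => Object.keys(node.children).sort(),
    advance: (ch: string) => {
      const child = node.children[ch];
      return child ? cursorAt(child, prefix + ch) : null;
    },
  };
}
```

A correction search built on something like this can keep one cursor per candidate prefix and call `advance` once per keystroke. A Hunspell-style engine that only answers "is this whole word valid?" gives us nothing comparable to hold on to between keystrokes.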

keymanapp / keyman

feat(developer, common/lmlayer): predictive modeling for morphologically complex languages #3058