Open jahorton opened 4 years ago
Instead of Hunspell, please consider Nuspell. It is a newer version of Hunspell written in C++ and claims to be three times faster than Hunspell.
I just wanted to note that compatibility of some sort with Hunspell dictionaries is probably one of our more common requests. Upon investigation, it unfortunately looks anything but straightforward, but this is something we may need to tackle at some point due to its popularity. Just had the topic come up in an email conversation with a user today, and the topic of agglutinative languages, which hunspell is specialized for, came up in our recent team planning as well.
The current state of hunspell+JS, so far as I can tell: there are wrapper libraries for use on npm and a few lighter-weight libraries that target various levels of compatibility with hunspell. The purely-JS libraries seem to be seldom used and seldom maintained, which isn't exactly promising.
That said, Hunspell itself is open source (and written in C++, with some reimplementations in other languages) at https://github.com/hunspell/hunspell. It offers MPL licensing, which isn't quite MIT, but it's reasonably permissive and shouldn't impede any attempts to integrate it or convert parts of it if absolutely necessary.
The file formats of Hunspell dictionaries are plain text, rather than binary. There are some 'codes' of sorts included in those files; some effort would be required to parse them effectively, but dictionary data is at least reasonably accessible and interpretable. Also, one of the two backing file types actually somewhat resembles our wordlist .tsv files, though without frequency data. Here's a link to the data backing an en (en-US) Hunspell dictionary: https://github.com/wooorm/dictionaries/tree/main/dictionaries/en. The .dic file looks to be a pretty comfortable parse, just with an extra metadata tag, while the .aff file... would require some investigation.
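To make the ".dic is a comfortable parse" point concrete, here is a minimal TypeScript sketch. It assumes the commonly documented layout: an optional first line holding an approximate entry count, then one entry per line in the form word/FLAGS. The names DicEntry and parseDic are hypothetical, and the sketch deliberately simplifies: it drops anything after the first whitespace (so multi-word entries and morphological fields are not handled), does not handle escaped slashes, and leaves the flags uninterpreted, since their meaning comes from the .aff file.

```typescript
// Hypothetical types and names, for illustration only.
interface DicEntry {
  word: string;
  flags: string; // raw affix flags; their meaning is defined in the .aff file
}

function parseDic(text: string): DicEntry[] {
  const lines = text.split(/\r?\n/);
  const entries: DicEntry[] = [];

  // Assumption: a purely numeric first line is the entry-count header.
  const start = /^\d+\s*$/.test(lines[0] ?? "") ? 1 : 0;

  for (const raw of lines.slice(start)) {
    const line = raw.trim();
    if (line === "") {
      continue; // skip blank lines
    }
    // Simplification: keep only the text before the first whitespace.
    const head = line.split(/\s+/)[0];
    const slash = head.indexOf("/");
    if (slash === -1) {
      entries.push({ word: head, flags: "" });
    } else {
      entries.push({ word: head.slice(0, slash), flags: head.slice(slash + 1) });
    }
  }
  return entries;
}
```

A real converter would still need the .aff rules to expand each flagged entry into its inflected forms; this only recovers the base words.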
Hunspell is the older, legacy version.
Nuspell (https://nuspell.github.io) is new and faster (3.5x faster than Hunspell).
https://github.com/nuspell/nuspell
The same team developed Nuspell, so if you want to implement Hunspell, please explore Nuspell. It's a new library and much faster.
Nuspell uses Hunspell dictionaries.
Nuspell looks significantly more limited than Hunspell? However, the point is moot because, AFAICT, we wouldn't want to use either library directly inside Keyman, due to licensing, dependency requirements, implementation language (C++ vs web technologies), and not being a precise fit for our needs.
We may consider using the Hunspell dictionary format, or providing conversion from it to Keyman's dictionary format. Understanding their support for agglutinative languages may help us as well.
Ooh, the nuspell wiki offers some great information about the file formats: https://github.com/nuspell/nuspell/wiki#dictionary-maintenance
That, at least, looks to be very useful.
A notable JS implementation of Hunspell: https://github.com/cfinke/Typo.js
I am currently unable to find a JS/TS version of nuspell.
Also notable: it appears that Hunspell dictionaries lack any notion of word-frequency or word-weighting. We'd probably need to specify extra source files of our own custom format in order to permit word-weighting - I don't believe that word entries in Hunspell dictionaries have any 'reserved space' we could use to insert related values.
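As a rough sketch of what such a side-file approach could look like: the function below combines a list of words (for example, taken from a .dic file) with a separately supplied table of counts and emits tab-separated lines. Everything here is an assumption for illustration: the name toWordlistTsv, the shape of the counts table, the fallback count of 1, and the word-then-count column order, which should be checked against the actual wordlist .tsv format.

```typescript
// Hypothetical helper: pair each word with a count from a separate source.
// `counts` stands in for whatever custom frequency file we might define.
function toWordlistTsv(
  words: string[],
  counts: { [word: string]: number },
  defaultCount: number = 1
): string {
  const rows: string[] = [];
  for (const word of words) {
    // Use an own-property check so words like "constructor" are not
    // mistaken for inherited object members.
    const known = Object.prototype.hasOwnProperty.call(counts, word);
    const count = known ? counts[word] : defaultCount;
    rows.push(word + "\t" + count);
  }
  return rows.join("\n");
}
```

Words missing from the counts table fall back to a uniform weight, which keeps them usable for correction even though they cannot be ranked meaningfully.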
Also notable: I don't see where implementations offer any sort of abstraction similar to the LexiconTraversal interface type we're using to help optimize corrections from keystroke to keystroke; that abstraction really helps us keep performance decent.
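For readers unfamiliar with the idea, here is a simplified, hypothetical illustration of a traversal-style abstraction over a trie. It is not the actual LexiconTraversal definition; PrefixCursor, TrieNode, buildTrie, and cursorFor are names invented for this sketch. The point is that each keystroke advances a cursor by one step from where the previous keystroke left off, rather than re-searching the whole lexicon from the root.

```typescript
// Hypothetical illustration only; not Keyman's real interface.
interface PrefixCursor {
  entries: string[]; // words that end exactly at this prefix
  child(ch: string): PrefixCursor | null; // advance by one character
}

interface TrieNode {
  children: { [ch: string]: TrieNode };
  entries: string[];
}

function newNode(): TrieNode {
  return { children: Object.create(null), entries: [] };
}

function buildTrie(words: string[]): TrieNode {
  const root = newNode();
  for (const word of words) {
    let node = root;
    // Simplification: steps by UTF-16 code unit, not by code point.
    for (let i = 0; i < word.length; i++) {
      const ch = word.charAt(i);
      let next = node.children[ch];
      if (!next) {
        next = newNode();
        node.children[ch] = next;
      }
      node = next;
    }
    node.entries.push(word);
  }
  return root;
}

function cursorFor(node: TrieNode): PrefixCursor {
  return {
    entries: node.entries,
    child(ch: string): PrefixCursor | null {
      const next = node.children[ch];
      return next ? cursorFor(next) : null;
    }
  };
}
```

A Hunspell-style dictionary, where many surface forms exist only implicitly through affix rules, has no obvious equivalent of this cursor unless the forms are expanded ahead of time.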
From the Community forums at https://community.software.sil.org/t/keyman-roadmap-march-2020/822/29:
Also, later in the same post: