I can assist you with a sample and testing. Should I simply create a PR with a sample, similar to the PR that added Norwegian? I was also thinking about using the intro from the article on Estonia in the Estonian Wikipedia, but it's rather short and the longest paragraph is simply a collection of dates and names of international organisations. Maybe the first few paragraphs from the History section of the same article would be better.
@igor-krupenja,
That would be wonderful! As long as it isn't too political.
Yes, a PR like the Norwegian one would be great.
I have created a draft PR for this, following the Norwegian PR as closely as I could. I may have missed something, so please have a look. The history sample text from Wikipedia deals with ancient history (5000+ years ago), so it should not be political.
However, I ran into some issues with testing the extension. With the sample history text, it seems that there are too many words that are marked as misspelled:
I decided to try the dictionary files from the cspell-dicts repo with the hunspell CLI and macOS's TextEdit. Neither was able to correctly detect the dictionaries. I think the problem is that the Estonian dictionary in the cspell-dicts repo is in ISO 8859-13 encoding and not UTF-8. I quickly tried hunspell and TextEdit with the same dictionary in UTF-8 and got reasonable results, screenshot from TextEdit:
The README for the Estonian dictionary says:
All dictionaries are encoded in the ISO-8859-15 (Latin-9) character-set, which is absolutely necessary to accommodate the plethora of foreign words featuring S- and Z-caron that see daily usage in the Estonian language.
But this is plain wrong: the characters š and ž have been supported by UTF-8 for ages. I have also specifically tried spell-checking Estonian words containing those characters with the hunspell CLI and TextEdit, using a UTF-8 encoded Estonian dictionary, and it works well. Shall I convert the Estonian dictionary in the cspell-dicts repo to UTF-8? I will then test whether it actually works better and whether we are hitting some other problem.
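For reference, here is a rough sketch of the conversion I have in mind. It is only an illustration, not what the repo actually does: the function name, the `is_aff` flag and the sample words are my own, and the source encoding is a parameter because I have not confirmed which legacy encoding the files really use. The only hunspell-specific part is that the `.aff` file declares its encoding on a `SET` line, which has to be updated along with the bytes.

```python
import re


def convert_hunspell_to_utf8(
    data: bytes,
    source_encoding: str = "iso-8859-15",  # assumption: pass the real encoding
    is_aff: bool = False,
) -> bytes:
    """Re-encode the raw bytes of a hunspell .dic/.aff file as UTF-8.

    Hypothetical helper for illustration only.
    """
    text = data.decode(source_encoding)
    if is_aff:
        # The .aff file names its encoding on a "SET <encoding>" line;
        # it must match the new encoding or hunspell will misread the file.
        text = re.sub(r"^SET\s+\S+", "SET UTF-8", text, count=1, flags=re.MULTILINE)
    return text.encode("utf-8")


# Tiny made-up samples containing š and ž, encoded as Latin-9.
sample_aff = "SET ISO8859-15\nTRY abc\n".encode("iso-8859-15")
sample_dic = "2\nšokk\nžanr\n".encode("iso-8859-15")

utf8_aff = convert_hunspell_to_utf8(sample_aff, is_aff=True)
utf8_dic = convert_hunspell_to_utf8(sample_dic)

print(utf8_aff.decode("utf-8"))
print(utf8_dic.decode("utf-8"))
```

For real files the bytes would be read and written in binary mode, so that Python does not apply any encoding of its own on the way in or out.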
@igor-krupenja,
Looks like the dictionary needs to be updated. I'm not sure where @magnushiie got it from. I'm all for using a UTF-8 dictionary.
@Jason3S Good, I will try to look into it in the next couple of days.
This issue is to keep track of the things that need to be done to publish the Estonian extension.