Publish Estonian - Githubissues

Jason3S commented 3 years ago

This issue is to keep track of the things that need to be done to publish the Estonian extension.

Add sample files
Test extension
Publish

IgorKrupenja commented 3 years ago

I can assist you with a sample and testing. Should I simply create a PR with a sample, similar to the PR that added Norwegian? I was also thinking about using the intro from the article on Estonia in the Estonian Wikipedia, but it's rather short and the longest paragraph is simply a collection of dates and names of international organisations. Maybe the first few paragraphs from the History section of the same article would be better.

Jason3S commented 3 years ago

@igor-krupenja,

That would be wonderful! As long as it isn't too political.

Yes, a PR like the Norwegian one would be great.

IgorKrupenja commented 3 years ago

I have created a draft PR for this, following the Norwegian PR as closely as I could. I may have missed something, though, so please have a look. The history sample text from Wikipedia deals with ancient history (5000+ years ago), so it should not be political.

However, I ran into some issues when testing the extension. With the sample history text, too many words are marked as misspelled:

I decided to try the dictionary files from the cspell-dicts repo with the hunspell CLI and macOS's TextEdit. Neither was able to correctly detect the dictionaries. I think the problem is that the Estonian dictionary in the cspell-dicts repo is in ISO 8859-13 encoding and not UTF-8. I quickly tried hunspell and TextEdit with the same dictionary in UTF-8 and got reasonable results, screenshot from TextEdit:
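A quick way to confirm the encoding suspicion is to check whether the dictionary bytes decode as UTF-8 at all. The snippet below is only a minimal sketch: the helper name `is_valid_utf8` is made up for illustration, and it checks raw bytes rather than any particular file in the repo.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# "š" encoded in a single-byte Latin charset is not a valid UTF-8 sequence,
# while the same character encoded as UTF-8 is.
latin_bytes = "šokk".encode("iso-8859-15")
utf8_bytes = "šokk".encode("utf-8")
```

Calling `is_valid_utf8(path.read_bytes())` on a dictionary file (with `path` being a `pathlib.Path` to whichever `.dic` or `.aff` file is under suspicion) would tell you whether it is already UTF-8. A file that passes could still be plain ASCII, so this only rules UTF-8 out; it does not identify the actual legacy encoding.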

The README for the Estonian dictionary says:

All dictionaries are encoded in the ISO-8859-15 (Latin-9) character-set, which is absolutely necessary to accommodate the plethora of foreign words featuring S- and Z-caron that see daily usage in the Estonian language.

But this is plain wrong: the characters š and ž have been supported by UTF-8 for ages. I have also specifically tried spell-checking Estonian words containing those characters with the hunspell CLI and TextEdit, using a UTF-8 encoded Estonian dictionary, and it works well. Shall I convert the Estonian dictionary in the cspell-dicts repo to UTF-8? I will test whether it actually works better and that we are not hitting some other problem.
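For reference, the conversion itself could look roughly like the sketch below. It is an assumption-laden example, not the script actually used for the repo: the function names are hypothetical, the source encoding is a parameter because the thread mentions both ISO 8859-13 and ISO-8859-15, and it relies on the hunspell convention that the `.aff` file declares its encoding in a `SET` line, which has to be updated together with the bytes.

```python
import re
from pathlib import Path


def convert_to_utf8(data: bytes, source_encoding: str = "iso-8859-15") -> bytes:
    """Re-encode hunspell dictionary bytes as UTF-8.

    source_encoding is an assumption; pass the encoding the file really uses.
    """
    text = data.decode(source_encoding)
    # A hunspell .aff file names its encoding in a "SET <encoding>" line.
    # Rewrite the first such line so it matches the new bytes; .dic files
    # have no SET line and pass through unchanged.
    text = re.sub(r"^SET[ \t]+\S+", "SET UTF-8", text, count=1, flags=re.MULTILINE)
    return text.encode("utf-8")


def convert_file(src: Path, dst: Path, source_encoding: str = "iso-8859-15") -> None:
    """Read src in the legacy encoding and write a UTF-8 copy to dst."""
    dst.write_bytes(convert_to_utf8(src.read_bytes(), source_encoding))
```

Both the `.aff` and the `.dic` file need converting, for example `convert_file(Path("in.aff"), Path("out.aff"))` with placeholder file names. Writing to a separate output path keeps the original around for comparison in hunspell before anything is replaced.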

Jason3S commented 3 years ago

@igor-krupenja,

Looks like the dictionary needs to be updated. I'm not sure where @magnushiie got it from. I'm all for using a UTF-8 dictionary.

IgorKrupenja commented 3 years ago

@Jason3S Good, I will try to look into it in the next couple of days.

streetsidesoftware / vscode-cspell-dict-extensions

Publish Estonian #165