streetsidesoftware / vscode-cspell-dict-extensions

VS Code Spelling Dictionary Extensions
MIT License

Publish Estonian #165

Closed Jason3S closed 2 years ago

Jason3S commented 3 years ago

This issue is to keep track of the things that need to be done to publish the Estonian extension.

  1. Add sample files.
  2. Test the extension.
  3. Publish.

IgorKrupenja commented 3 years ago

I can assist you with a sample and testing. Should I simply create a PR with a sample, similar to the PR that added Norwegian? I was also thinking about using the intro from the article on Estonia in the Estonian Wikipedia, but it's rather short and the longest paragraph is simply a collection of dates and names of international organisations. Maybe the first few paragraphs from the History section of the same article would be better.

Jason3S commented 3 years ago

@igor-krupenja,

That would be wonderful! As long as it isn't too political.

Yes, a PR like the Norwegian one would be great.

IgorKrupenja commented 3 years ago

I have created a draft PR for this, trying to follow the Norwegian PR. I could have missed something, so please have a look. The history sample text from Wikipedia deals with ancient history (5000+ years ago), so it should not be political.

However, I ran into some issues with testing the extension. With the sample history text, it seems that there are too many words that are marked as misspelled:

[Screenshot: sample history text with many words marked as misspelled]

I decided to try the dictionary files from the cspell-dicts repo with the hunspell CLI and macOS's TextEdit. Neither was able to correctly detect the dictionaries. I think the problem is that the Estonian dictionary in the cspell-dicts repo is in ISO 8859-13 encoding and not UTF-8. I quickly tried hunspell and TextEdit with the same dictionary converted to UTF-8 and got reasonable results. Here is a screenshot from TextEdit:

[Screenshot: TextEdit spell check using the UTF-8 dictionary]
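As a rough illustration of why an encoding mismatch could produce this many false positives, here is a minimal sketch. It assumes the word list is stored in a legacy single-byte encoding but read as UTF-8. The sample word and the codec name are chosen for illustration only and are not taken from the dictionary files.

```python
# Illustrative only: "šokolaad" is a sample word picked for this sketch.
word = "šokolaad"

# How a file in a Baltic single-byte encoding would store the word.
legacy_bytes = word.encode("iso8859_13")
# How a UTF-8 file would store the same word.
utf8_bytes = word.encode("utf-8")

# Reading the legacy bytes as UTF-8 cannot recover "š"; with
# errors="replace" the invalid byte becomes U+FFFD, so the entry
# no longer matches the word as it appears in the text being checked.
misread = legacy_bytes.decode("utf-8", errors="replace")

print(legacy_bytes)
print(utf8_bytes)
print(misread == word)
```

Any dictionary entry containing a non-ASCII letter would be affected the same way, which would be consistent with many ordinary Estonian words being flagged.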

The README for the Estonian dictionary says:

All dictionaries are encoded in the ISO-8859-15 (Latin-9) character-set, which is absolutely necessary to accommodate the plethora of foreign words featuring S- and Z-caron that see daily usage in the Estonian language.

But this is plain wrong: the characters š and ž have been supported by UTF-8 for ages. I have also specifically tried spell-checking Estonian words containing those characters with the hunspell CLI and TextEdit, using a UTF-8 encoded Estonian dictionary, and it works well. Shall I convert the Estonian dictionary in the cspell-dicts repo to UTF-8? I will test whether it actually works better and that we are not hitting some other problem.
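The proposed conversion could look roughly like the sketch below. It assumes the dictionary is a standard Hunspell `.aff`/`.dic` pair and that the `.aff` file declares its encoding with a `SET` line, which would need to be updated along with the bytes. The function name `convert_to_utf8` is hypothetical, and the source encoding is left as a parameter because it has to be confirmed against the actual files first.

```python
import re
from pathlib import Path


def convert_to_utf8(src: Path, dst: Path, source_encoding: str) -> None:
    """Re-encode a Hunspell .aff or .dic file as UTF-8 (hypothetical helper)."""
    # Decode using the encoding the file is actually stored in.
    text = src.read_text(encoding=source_encoding)
    # .aff files declare their encoding with a SET line; point it at UTF-8.
    # .dic files have no SET line, so this substitution is a no-op for them.
    text = re.sub(r"^SET\s+\S+", "SET UTF-8", text, count=1, flags=re.MULTILINE)
    dst.write_text(text, encoding="utf-8")
```

A call might look like `convert_to_utf8(Path("et_EE.aff"), Path("out/et_EE.aff"), "iso8859_15")`, where the file names and the codec are placeholders. Decoding with the wrong source encoding would silently produce wrong letters rather than an error, so checking a few words with š and ž after conversion is a sensible sanity test.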

Jason3S commented 3 years ago

@igor-krupenja,

Looks like the dictionary needs to be updated. I'm not sure where @magnushiie got it from. I'm all for using a UTF-8 dictionary.

IgorKrupenja commented 3 years ago

@Jason3S Good, I will try to look into it in the next couple of days.