tmalsburg / guess-language.el

Emacs minor mode that detects the language you're typing in. Automatically switches spell checker. Supports multiple languages per document.
Add Serbian Latin trigram. #30

Closed djolereject closed 4 years ago

djolereject commented 4 years ago

Manually created from the sr Cyrillic trigrams and tested on a small set of files.

As mentioned in comment

tmalsburg commented 4 years ago

Great. How did you handle characters that are transliterated to multiple Latin characters? Did you just drop any Latin characters past the third?

Some questions before I merge:

  1. Which default dictionary shall we specify? I'm using Aspell as my spellchecker and as far as I can see it only has a dictionary for Cyrillic Serbian. See here: https://ftp.gnu.org/gnu/aspell/dict/0index.html (The Ubuntu system actually seems to have no Serbian dictionary at all: https://packages.ubuntu.com/xenial/aspell-dictionary)

  2. Which spell checker did you use for testing?

  3. Are the quoting characters the same for Cyrillic and Latin Serbian? As you noticed, I specified German for typo-mode, since that appears to be the closest available rule set.

djolereject commented 4 years ago

It happened only 3 times in the whole file. I dropped the surplus letters in two of them and completely removed the third trigram because it wouldn't make sense after the edit (it had a space and a letter that transliterates to two letters, so it would be just one Serbian Latin letter).
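The truncation described above can be sketched in Emacs Lisp. This is only an illustration, not how the trigram file was actually produced (that was done by hand): the names `my/sr-cyrillic-to-latin` and `my/sr-trigram-to-latin` are hypothetical and not part of guess-language.el, and the table covers lowercase letters only.

```elisp
;; Hypothetical helper, not part of guess-language.el.
;; Lowercase Serbian Cyrillic letters and their Latin equivalents.
;; Three letters (љ, њ, џ) map to two Latin characters each.
(defconst my/sr-cyrillic-to-latin
  '((?а . "a") (?б . "b") (?в . "v") (?г . "g") (?д . "d") (?ђ . "đ")
    (?е . "e") (?ж . "ž") (?з . "z") (?и . "i") (?ј . "j") (?к . "k")
    (?л . "l") (?љ . "lj") (?м . "m") (?н . "n") (?њ . "nj") (?о . "o")
    (?п . "p") (?р . "r") (?с . "s") (?т . "t") (?ћ . "ć") (?у . "u")
    (?ф . "f") (?х . "h") (?ц . "c") (?ч . "č") (?џ . "dž") (?ш . "š"))
  "Alist mapping Serbian Cyrillic characters to Latin strings.")

(defun my/sr-trigram-to-latin (trigram)
  "Transliterate TRIGRAM to Latin and drop any characters past the third.
Characters missing from the table (spaces, for example) are kept as-is."
  (let ((latin (mapconcat
                (lambda (ch)
                  (or (cdr (assq ch my/sr-cyrillic-to-latin))
                      (char-to-string ch)))
                trigram "")))
    (substring latin 0 (min 3 (length latin)))))
```

With this sketch, `(my/sr-trigram-to-latin "љуб")` transliterates to "ljub" and keeps only "lju". A trigram that collapses to something uninformative after truncation would still need a manual decision, as in the case mentioned above.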

  1. I used hunspell, which is why I named it sr_LAT: it's the same name they use for their dictionaries. Hunspell has sr and sr_LAT, but I believe they pull the dictionaries from the OpenOffice package. I tried using this with aspell and I think the dictionary names were sr-latn and sr-cyrl, but I'm not sure if that's specific to macOS.

  2. I used flyspell-correct and tried a few files that I had written, and it worked great after I connected it to the dictionaries. This is how the setup looked in the end:

    (use-package guess-language
      :after flyspell-correct
      :load-path "elpa/guess_tmp/" ;; Temporary line because updated package is still not on MELPA
      :config
      (add-hook 'flyspell-mode-hook 'guess-language-mode)
      (setq guess-language-languages '(en sr sr_LAT))
      (setq guess-language-langcodes
            '((en . ("en_US" "English"))
              (sr . ("sr" "Српски"))
              (sr_LAT . ("sr_LAT" "Srpski")))))

  3. Quotes are the same, I believe German would be good enough for both.

tmalsburg commented 4 years ago

> removed the third trigram because it wouldn't make sense after the edit (it had a space and a letter that transliterates to two letters, so it would be just one Serbian Latin letter)

I'm not sure this is actually a problem (spaces are informative, too), and in principle it's best to have the same number of trigrams for all languages. If one language has fewer trigrams, it's a bit disadvantaged because the algorithm just counts matches. However, one trigram more or less will likely not make a practical difference.
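To make the point about match counting concrete, here is a minimal sketch of that kind of scoring. It is not the code from guess-language.el; the function names and the toy trigram lists are made up for illustration. A language with a shorter trigram list simply has fewer chances to score.

```elisp
(require 'cl-lib)

;; Hypothetical sketch of "count the matches" scoring.
(defun my/text-trigrams (text)
  "Return the list of all character trigrams in TEXT."
  (cl-loop for i from 0 to (- (length text) 3)
           collect (substring text i (+ i 3))))

(defun my/count-matches (text trigrams)
  "Count how many trigrams of TEXT occur in the list TRIGRAMS."
  (cl-count-if (lambda (tri) (member tri trigrams))
               (my/text-trigrams text)))

(defun my/guess (text models)
  "Return the language in MODELS whose trigram list matches TEXT most often.
MODELS is a non-empty alist of (LANGUAGE . TRIGRAM-LIST).  On a tie the
earlier entry wins."
  (car (cl-reduce (lambda (best model)
                    (if (> (my/count-matches text (cdr model))
                           (my/count-matches text (cdr best)))
                        model
                      best))
                  models)))

;; Toy models, far smaller than any real trigram list.
(defvar my/toy-models
  '((en . ("the" "he " "and"))
    (de . ("der" "en " "ein"))))
```

For example, `(my/guess "the end" my/toy-models)` returns `en`, because "the" and "he " both match the English list while none of the text's trigrams match the German one.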

> I think the dictionary names were sr-latn and sr-cyrl, but I'm not sure if that's specific to macOS.

Yes, it appears that the names differ from OS to OS. Annoying but there's not much we can do I'm afraid.

> (sr . ("sr" "Српски")) (sr_LAT . ("sr_LAT" "Srpski"))

Typo mode doesn't have support for Serbian (yet). So that would have to be (sr . ("sr" "German")) (sr_LAT . ("sr_LAT" "German")) for the time being. If you don't use typo-mode, it doesn't matter, but if you do, your config might result in an error.
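Spelled out in full, the suggested setting would look roughly like this. The dictionary names "sr" and "sr_LAT" are the hunspell names used earlier in this thread and may differ on other systems or with other spell checkers.

```elisp
;; Each entry is (LANGUAGE . (DICTIONARY TYPO-MODE-LANGUAGE)).
;; "German" is a stand-in until typo-mode has Serbian rules.
(setq guess-language-langcodes
      '((en     . ("en_US"  "English"))
        (sr     . ("sr"     "German"))
        (sr_LAT . ("sr_LAT" "German"))))
```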

> Quotes are the same, I believe German would be good enough for both.

Adding support for Serbian to typo-mode would be trivial, see here. Perhaps make a PR there? Would be cool to have more complete support for Serbian in Emacs.

Thanks for your contributions!

tmalsburg commented 4 years ago

I just noticed that we have duplicated trigrams (e n and ra). I guess one day, someone will have to compute proper Latin trigrams for Serbian from Latin texts. But for now it probably works well enough.
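A quick way to spot such duplicates in a trigram list is sketched below. The function name is hypothetical and the sample list is invented; it does not reproduce the actual Serbian trigram data.

```elisp
(require 'cl-lib)

;; Hypothetical helper for checking a trigram list for repeated entries.
(defun my/duplicate-trigrams (trigrams)
  "Return the entries that occur more than once in the list TRIGRAMS.
Each duplicate is reported once, in the order its second occurrence appears."
  (let ((seen (make-hash-table :test #'equal))
        (dups '()))
    (dolist (tri trigrams)
      ;; Record the entry exactly when its count reaches 2.
      (when (= (cl-incf (gethash tri seen 0)) 2)
        (push tri dups)))
    (nreverse dups)))
```

Calling `(my/duplicate-trigrams '("je " "e n" " ra" "e n" " ra"))` returns `("e n" " ra")`.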

djolereject commented 4 years ago

> I'm not sure this is actually a problem (spaces are informative, too), and in principle it's best to have the same number of trigrams for all languages. If one language has fewer trigrams, it's a bit disadvantaged because the algorithm just counts matches. However, one trigram more or less will likely not make a practical difference.

I thought it worked differently, so it would probably have been a better solution to repeat something. Anyway, I have had more chances to try it out and never found any mistake in recognizing Serbian. I guess it's good enough for now.

> Typo mode doesn't have support for Serbian (yet). So that would have to be (sr . ("sr" "German")) (sr_LAT . ("sr_LAT" "German")) for the time being. If you don't use typo-mode, it doesn't matter, but if you do, your config might result in an error.

I thought that last part was what the user sees as a title. Good catch, thanks.

> Adding support for Serbian to typo-mode would be trivial, see here. Perhaps make a PR there? Would be cool to have more complete support for Serbian in Emacs.

I might do that in the next few days, and when I do I will also make a PR here to acknowledge the change.

> Thanks for your contributions!

The pleasure was mine, thanks for a great package.

tmalsburg commented 4 years ago

> I might do that in the next few days, and when I do I will also make a PR here to acknowledge the change.

Thank you!

djolereject commented 4 years ago

> Adding support for Serbian to typo-mode would be trivial, see here. Perhaps make a PR there? Would be cool to have more complete support for Serbian in Emacs.

Done

tmalsburg commented 4 years ago

Great! :)

tmalsburg commented 4 years ago

Changed the defaults accordingly: https://github.com/tmalsburg/guess-language.el/commit/e216c677a889b1c8740d52852c7dd9ec636164eb