Adding a language

djolereject commented 4 years ago

I wondered whether you could add a short explanation of how to add other languages. Right now it just says that it's "easy to add", but I have no idea how to do that. I'm specifically interested in Serbian, but I believe this explanation could be made general and be of value to users of many other languages. Thanks for the great package!

tmalsburg commented 4 years ago

Just added Serbian, see here: bbafdeaf380c41e4546510df7c257b898b702d65

Configuration: To use, for example, English and Serbian, put this in your config:

(setq guess-language-languages '(en sr))

Please let me know whether it works as expected.

Good idea regarding the explanation, but I will have to do this another time. Too busy :(

tmalsburg commented 4 years ago

Added the explanation: 8b029d040f75112b748a71482be2dd13aedcdfaa :)

djolereject commented 4 years ago

First of all, thanks for the quick response and for even making time for a generalized explanation.

Second, there is some problem with this: I tried M-: guess-language on the first few paragraphs of a Serbian Cyrillic Wikipedia entry and it doesn't work. It always returns Detected language: nil. I noticed that you left "German" in (sr . ("serbian" "German")) but I guess that shouldn't affect anything? Maybe it would be easier to do it with Serbian Latin; maybe it's a Cyrillic problem?

tmalsburg commented 4 years ago

It's strange that you get nil because the algorithm should always return some language. Our data for Serbian is in Cyrillic script, see here, so that should not be an issue. Are you sure that you used the latest version? Some people install from ELPA and I haven't updated the package there. For testing you have to use the GitHub version in this repository.
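One way to check which copy of the package Emacs actually loads is sketched below. locate-library is a standard Emacs function; guess-language-langcodes is assumed to be the variable that holds per-language entries such as (sr . ("serbian" "German")), and the my- name is hypothetical.

```elisp
;; Sketch: report where guess-language is loaded from and whether the
;; loaded copy knows about Serbian.  `my-guess-language-check' is a
;; hypothetical helper, not part of the package.
(defun my-guess-language-check ()
  "Return a list (LIBRARY-PATH SERBIAN-ENTRY) for the guess-language package."
  (list
   ;; Path of the file that (require 'guess-language) would load, or nil.
   (locate-library "guess-language")
   ;; Assumption: language entries live in `guess-language-langcodes'.
   (and (boundp 'guess-language-langcodes)
        (assq 'sr guess-language-langcodes))))

(my-guess-language-check)
```

If the path points into an older packaged directory, or the second element is nil, the loaded copy probably predates the commit that added Serbian.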

Regarding the German in the config, this indicates to typo mode that the quote characters used in Serbian are the same as in German. Typo mode has no support for Serbian yet, but I'm sure the author would add it if you request it.

djolereject commented 4 years ago

I downloaded this version and somehow ended up with guess-language guessing French for Serbian Cyrillic. This must be some problem with my setup. Does this look like a proper way of using the GitHub version of the package if I downloaded it to ~/.emacs.d/elpa/guess_tmp:

(add-to-list 'load-path "~/.emacs.d/elpa/guess_tmp")
(require 'guess-language)

tmalsburg commented 4 years ago

Yes, this looks like a configuration problem since French is one of the default languages. Your config looks fine but you have to add this line (for English and Serbian):

(setq guess-language-languages '(en sr))
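Put together, the setup discussed in this thread might look like the sketch below. The directory is the one mentioned above; the text-mode-hook line assumes that guess-language-mode is the package's minor mode and that you want it enabled in text buffers.

```elisp
;; Use the GitHub checkout instead of a packaged copy.
;; The directory is the one mentioned earlier in this thread; adjust as needed.
(add-to-list 'load-path "~/.emacs.d/elpa/guess_tmp")
(require 'guess-language)

;; Only these languages are considered when guessing.  Without this line
;; the defaults apply, which is why French could be detected.
(setq guess-language-languages '(en sr))

;; Assumption: enable the minor mode in all text buffers.
(add-hook 'text-mode-hook (lambda () (guess-language-mode 1)))
```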

djolereject commented 4 years ago

Oh, right, sorry about that. This works of course, thanks for the swift help. I will also try to make a PR for Serbian Latin in the near future. Thanks again!

tmalsburg commented 4 years ago

Note that we don't have Latin-Serbian language data yet. Do you think it would be enough to just transliterate the Cyrillic trigrams? We'd basically need a Latin version of this (most common trigrams in the language): https://github.com/tmalsburg/guess-language.el/blob/master/trigrams/sr

djolereject commented 4 years ago

Yes, it would be enough because it's the same thing. The only thing that might be a problem is that some Cyrillic letters correspond to more than one Latin letter (њ -> nj, љ -> lj). I can do these trigrams by hand if you want. Is that one file enough for a PR or should I add something more?
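A minimal sketch of such a transliteration in Emacs Lisp, assuming the trigram data is lowercase. The names prefixed with my- are hypothetical and not part of guess-language.el; the table follows the standard Serbian Cyrillic-to-Latin correspondence, in which љ, њ and џ become the two-letter sequences lj, nj and dž.

```elisp
;; Hypothetical helper, not part of guess-language.el.
;; Lowercase Serbian Cyrillic letters and their Latin equivalents.
(defconst my-sr-cyrillic-to-latin
  '((?а . "a") (?б . "b") (?в . "v") (?г . "g") (?д . "d") (?ђ . "đ")
    (?е . "e") (?ж . "ž") (?з . "z") (?и . "i") (?ј . "j") (?к . "k")
    (?л . "l") (?љ . "lj") (?м . "m") (?н . "n") (?њ . "nj") (?о . "o")
    (?п . "p") (?р . "r") (?с . "s") (?т . "t") (?ћ . "ć") (?у . "u")
    (?ф . "f") (?х . "h") (?ц . "c") (?ч . "č") (?џ . "dž") (?ш . "š"))
  "Alist mapping lowercase Serbian Cyrillic characters to Latin strings.")

(defun my-sr-transliterate (text)
  "Return TEXT with lowercase Serbian Cyrillic letters replaced by Latin ones.
Characters without an entry in the table (spaces, punctuation,
uppercase letters) are copied unchanged."
  (mapconcat (lambda (ch)
               (or (cdr (assq ch my-sr-cyrillic-to-latin))
                   (char-to-string ch)))
             text ""))
```

For example, (my-sr-transliterate "њег") evaluates to "njeg", so a three-letter Cyrillic trigram can turn into four or more Latin characters.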

tmalsburg commented 4 years ago

I can do these trigrams by hand if you want. Is that one file enough for a PR or should I add something more?

That would be great. This file is enough for now. Transliterate with as many characters as are needed, and then we'll see how it works. I think the algorithm doesn't really care whether it is trigrams or 4-grams, but we'll find out.
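For reference, counting the most frequent character n-grams in a sample text can be sketched as follows. This is an illustration only, not the code that produced the package's trigram files, and my-top-ngrams is a hypothetical name.

```elisp
(require 'seq)

;; Hypothetical helper, not part of guess-language.el.
(defun my-top-ngrams (text n count)
  "Return the COUNT most frequent character N-grams in TEXT.
The result is a list of strings, most frequent first."
  (let ((table (make-hash-table :test #'equal))
        (pairs nil))
    ;; Slide a window of N characters over TEXT and count each n-gram.
    (dotimes (i (max 0 (- (length text) n -1)))
      (let ((gram (substring text i (+ i n))))
        (puthash gram (1+ (gethash gram table 0)) table)))
    ;; Collect (NGRAM . FREQUENCY) pairs and sort by descending frequency.
    (maphash (lambda (gram freq) (push (cons gram freq) pairs)) table)
    (setq pairs (sort pairs (lambda (a b) (> (cdr a) (cdr b)))))
    (mapcar #'car (seq-take pairs count))))
```

Calling it with n set to 3 or 4 gives trigram or 4-gram lists from the same text, which makes it easy to compare the two on transliterated data.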

Out of interest: is Cyrillic or Latin more common for Serbian?

djolereject commented 4 years ago

It's almost 50/50. Everybody uses both, and that's a pretty unique position among Cyrillic users as far as I know. The thing with Serbian Latin is that you might get some more problems, because Croatian and Bosnian are literally the same language, separated because of political issues. So it might be really hard to distinguish Serbian Latin from Croatian Latin. With Cyrillic you don't have that kind of ambiguity because the closest languages would be Macedonian and Bulgarian, which are fairly different languages. We'll see how it works in distinguishing Srb/Cro, but I believe the package would be helpful for most purposes even if it makes that mistake.

tmalsburg commented 4 years ago

Thanks for these explanations. That's really fascinating. If Cyrillic and Latin are used equally often, how do you decide which script to use for a document or e-mail or whatever?

Regarding Bosnian and Croatian, you say they are the same language. Does it even matter then whether one or the other language is detected? Or are there small differences after all?

tmalsburg commented 4 years ago

p.s: I found this which answers most of my questions: https://en.wikipedia.org/wiki/Comparison_of_standard_Bosnian,_Croatian,_Montenegrin_and_Serbian

djolereject commented 4 years ago

There is no real algorithm that we use for picking a script. Some people even use just one for everything and they get by, but I guess it might be a problem if you can't read one of them because they are mixed everywhere. That often happens with Croats who understand everything but can't read Cyrillic, which is a problem in cities where street signs are written that way. I don't think there is an inverse problem anywhere, because it would be hard to find somebody who can't read Latin, maybe some really old people, but I never saw it.

As for the differences between our languages, it's a really complicated thing. There are differences, but they look more like different dialects than different languages. I believe the way Austrians speak German differs from standard German far more than Montenegrin differs from Serbian, for example. It was all called Serbo-Croatian, and all of the former Yugoslavia spoke it except Slovenia and Macedonia. After the country split up, every nation declared its own language as separate, but then it gets really complicated because Serbian has two dialects and the smaller one ("ijekavica") is almost the same as Croatian. You will often find people who think that it's easy to distinguish those two languages by that property: the usage of "e" instead of "ije" in a lot of common words. This is a common mistake, but still a mistake because, as I said, it's proper Serbian, just less common, and it's used in Montenegrin also. Anyway, this line of thought inevitably leads to academic talk about the difference between a language and a dialect, so I'm not sure how productive that is. I believe there could be some specific and common idioms which would make good predictors, but I think it's a pretty hard task and not worth it. We all understand each other after all, and if somebody picks the wrong dictionary it still might have the correct answer in it.

djolereject commented 4 years ago

I just saw that link and while it's mainly correct it misses some nuances also. For example, they say:

Cyrillic is the official script of the administration in Serbia and Republika Srpska, but the Latin script is most widely used in media and especially on the Internet.

There is some truth in it, but for administrative purposes you can't get rejected because of your script, so it's basically a recommendation and not a rule. On the other hand, Latin is surely more common on the Internet, but I wouldn't say it's that much more common in the media. Whatever the rule, there are so many regional specifics that they render all the rules unusable for the practical task of recognizing the language.

tmalsburg commented 4 years ago

Thanks again for these insights. This suggests that it would ultimately be good to be able to distinguish between Serbian/Croatian/Bosnian/Montenegrin, but as you noted, it may be difficult on the basis of just trigrams.

tmalsburg / guess-language.el

Adding a language #29