fabiankessler opened 10 years ago
oops forgot to type a space after 3 ^^
Hi Dennis, thanks for the information.
I personally have no intention of getting into this now; for my purposes the language detection currently works "well enough". But it's good to leave this knowledge here in case someone else has ambitions.
Your numbering for my bullets 1-6 got screwed up, but it's understandable.
1. four-grams: "I too expect 4-grams to perform better." I must have meant "give better language guesses", not "run faster". 4-grams obviously generate more diversity in n-grams; a tiny extraction sketch follows below.
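As an illustration (not from the thread; the class and method names are made up), a minimal sketch of character n-gram extraction. With an alphabet of size A there are A^4 possible 4-grams versus A^3 possible 3-grams, which is where the extra diversity comes from:

```java
import java.util.ArrayList;
import java.util.List;

public class NgramDemo {
    // Extract all character n-grams of length n from a text.
    static List<String> ngrams(String text, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= text.length(); i++) {
            out.add(text.substring(i, i + n));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(ngrams("language", 3)); // [lan, ang, ngu, gua, uag, age]
        System.out.println(ngrams("language", 4)); // [lang, angu, ngua, guag, uage]
    }
}
```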
5. lower case: I agree with lowercasing. The only two problematic cases I'm aware of are the Turkish i and the German sharp s "ß".
The ß is only an issue when converting to upper case; it is lower case already.
The Turkish i is a problem. For the letters, see https://en.wikipedia.org/wiki/Turkish_alphabet. Turkish training text can be lowercased with the Turkish locale, so that "I" correctly becomes "ı", not "i". However, when checking input text, we don't yet know whether it's Turkish, so there is a discrepancy: "I" becomes "i". Considering this, maybe it's better to purposely convert the training text with the English rules and "break" it.
To break the tie between Turkish and other Turkic languages written in the Latin alphabet (e.g. Turkmen), the i could be consulted again; to my knowledge only Turkish uses the letters İ and ı. A sketch of the locale behavior follows below.
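For illustration (not from the thread; the sample string is made up), a minimal sketch of the locale-sensitive lowercasing discrepancy described above. Training text lowercased with Turkish rules will never match input lowercased with default rules, which is exactly the motivation for "breaking" the training text:

```java
import java.util.Locale;

public class TurkishLowercasing {
    public static void main(String[] args) {
        String input = "ISTANBUL";
        // Turkish rules: uppercase I becomes dotless lowercase ı.
        System.out.println(input.toLowerCase(new Locale("tr"))); // ıstanbul
        // English rules: I becomes plain i, so the two pipelines disagree.
        System.out.println(input.toLowerCase(Locale.ENGLISH));   // istanbul
    }
}
```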
6. scoring: "'For each letter sequence, the scoring uses the 3-6 most likely languages and their quantized log probabilities.' seems like a good optimization."
The training texts don't all have the same size, and it would be impractical to force them to the same length. Therefore, if the French text was generated from 1000 pages whereas Occitan had only 20, it would be unfair to weight the extracted n-grams equally.
For a text of a certain minimal length, all languages in the system using that script (e.g. Latin) will have some matching n-grams, so there is a long list of languages with some points. It makes no sense to keep the long tail of low-chance languages; keeping only the best 3-6 per word or phrase makes sense. A sketch of both ideas follows below.
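As an illustration only (not the actual implementation; all names and numbers are made up), a sketch of both ideas: raw n-gram counts are normalized by corpus size so unequal training sets compare fairly, and only the top k languages are kept per scoring step:

```java
import java.util.*;
import java.util.stream.Collectors;

public class ScoringSketch {
    // Hypothetical: relative frequency, so a 1000-page corpus and a
    // 20-page corpus contribute comparable weights per n-gram.
    static double relativeFrequency(long ngramCount, long totalNgramsInCorpus) {
        return (double) ngramCount / totalNgramsInCorpus;
    }

    // Hypothetical: keep only the k best-scoring languages and drop
    // the long tail of low-chance candidates.
    static Map<String, Double> keepTopK(Map<String, Double> scores, int k) {
        return scores.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                .limit(k)
                .collect(Collectors.toMap(
                        Map.Entry::getKey, Map.Entry::getValue,
                        (a, b) -> a, LinkedHashMap::new));
    }

    public static void main(String[] args) {
        Map<String, Double> scores = new HashMap<>();
        scores.put("fr", relativeFrequency(1230, 100_000)); // large corpus
        scores.put("oc", relativeFrequency(24, 2_000));     // small corpus, same scale
        scores.put("it", relativeFrequency(420, 100_000));
        scores.put("ro", relativeFrequency(7, 100_000));    // long tail
        System.out.println(keepTopK(scores, 3));            // ro is dropped
    }
}
```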
(the numbering in this markup was outsmarting me too...)
1- Yes, as in: in my case I tested 100 texts for 40 languages with 1-, 2-, 3-grams and 4-grams respectively. The accuracies of 3-grams and 4-grams are very similar, both reaching about 100% when the text is long enough.
5- Maybe the occurrence of ß can indicate German, and the occurrence of ı can indicate Turkish :) And maybe all I and ı can be converted into i; like you said, break it (see the sketch after this list).
6- I see, thanks.
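For illustration only (the method name and language codes are made up), a sketch of the distinctive-letter tie-breaker suggested above:

```java
public class DistinctiveLetters {
    // Hypothetical tie-breaker: a few letters are (nearly) unique to one
    // language and can boost it when ordinary n-gram scores are close.
    static String distinctiveLetterHint(String text) {
        if (text.indexOf('ß') >= 0) return "de"; // sharp s: German
        if (text.indexOf('ı') >= 0 || text.indexOf('İ') >= 0) return "tr"; // dotless ı / dotted İ
        return null; // no strong hint
    }

    public static void main(String[] args) {
        System.out.println(distinctiveLetterHint("Straße"));     // de
        System.out.println(distinctiveLetterHint("Diyarbakır")); // tr
        System.out.println(distinctiveLetterHint("hello"));      // null
    }
}
```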
See https://code.google.com/p/cld2/. It explains how it performs language detection.
Of interest is: