optimaize / language-detector

Language Detection Library for Java
Apache License 2.0

Look at "Compact Language Detector 2" (cld2) #12

Open fabiankessler opened 10 years ago

fabiankessler commented 10 years ago

See https://code.google.com/p/cld2/. It explains how it performs the language detection.

Of interest is:

dennis97519 commented 9 years ago
  1. When using a 1-nearest-neighbour classifier (I don't understand how the Bayes classifier works, lol) with cosine distance, 3-grams performed equally well as 4-grams and took half the time 4-grams take (see the sketch at the end of this comment). You can play with the tatoeba.org sentence database. The original Japanese author Nakatani Shuyo used Wikipedia text (I remember he said "wikipedia abstract xml" or something somewhere) to create the training sample.
  2. Yup, I think script detection first then decide whether to detect language
  3. That intrigues me as well
  4. The author has explained that the three different token algorithms are: a. detect language based on script, if the script uniquely identifies the language; b. detect language based on 1-grams, if the text section is CJK text (Chinese, Japanese); c. detect language based on 4-grams otherwise (I think that's the meaning of "three different token algorithms"?)
  5. I wasn't aware that this library's n-grams are caseful O.o I thought it's common practice to lowercase everything lol
  6. I'm too simple and lazy minded to understand this ^^

oops forgot to type a space after 3 ^^
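
For what it's worth, here is a minimal Java sketch of the 1-nearest-neighbour idea from point 1 (illustrative only; the class and method names are made up, and this is not how this library or cld2 is actually implemented): build character n-gram frequency profiles and pick the training language with the highest cosine similarity.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of 1-NN language guessing over character n-gram profiles with
// cosine similarity. Names are illustrative, not this library's API.
public class CosineNGramSketch {

    // Count character n-grams of the given length in a lowercased text.
    static Map<String, Integer> profile(String text, int n) {
        Map<String, Integer> counts = new HashMap<>();
        String t = text.toLowerCase();
        for (int i = 0; i + n <= t.length(); i++) {
            counts.merge(t.substring(i, i + n), 1, Integer::sum);
        }
        return counts;
    }

    // Cosine similarity between two n-gram count vectors.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) dot += e.getValue() * (double) other;
            normA += e.getValue() * (double) e.getValue();
        }
        for (int v : b.values()) normB += v * (double) v;
        return dot == 0 ? 0 : dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // 1-nearest neighbour: return the language whose training profile
    // is most similar to the input text's profile.
    static String classify(String text, Map<String, Map<String, Integer>> trained, int n) {
        Map<String, Integer> input = profile(text, n);
        String best = null;
        double bestSim = -1;
        for (Map.Entry<String, Map<String, Integer>> e : trained.entrySet()) {
            double sim = cosine(input, e.getValue());
            if (sim > bestSim) { bestSim = sim; best = e.getKey(); }
        }
        return best;
    }
}
```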

fabiankessler commented 9 years ago

Hi Dennis, thanks for your information.

I personally have no intention of getting into this now; for my purposes the language detection currently works "well enough". But it's good to leave this knowledge here in case someone else has ambitions.

Your numbering for my bullets 1-6 got screwed up, but it's understandable.

-1. four-grams: "I too expect 4-grams to perform better." I must have meant "give better language guesses", not "run faster". 4-grams obviously generate a greater diversity of n-grams.

-5. lower case: I agree with lowercasing. The only two problematic cases regarding case that I'm aware of are the Turkish i and the German sharp s "ß".

The ß is only an issue when converting to upper case; it is lower case already.
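
For illustration (this is just the JDK's built-in case conversion, not code from this library), the ß shows why only the upper-casing direction is lossy:

```java
import java.util.Locale;

public class SharpSDemo {
    public static void main(String[] args) {
        // "ß" has no upper-case form of its own; Java expands it to "SS".
        System.out.println("straße".toUpperCase(Locale.GERMAN)); // STRASSE
        // Lower-casing again does not restore the ß.
        System.out.println("STRASSE".toLowerCase(Locale.GERMAN)); // strasse
        // Lower-casing text that already contains ß leaves it untouched.
        System.out.println("STRAßE".toLowerCase(Locale.GERMAN)); // straße
    }
}
```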

The Turkish i is a problem. For the letters see https://en.wikipedia.org/wiki/Turkish_alphabet. Turkish training text can be lowercased with the Turkish locale, so that "I" correctly becomes "ı", not "i". However, when checking input text, we don't know yet whether it's Turkish, so there's a discrepancy: "I" becomes "i". Considering this, maybe it's better to purposely convert the training text with the English rules and "break it".

To break the tie between Turkish and other Turkic languages written in the Latin alphabet (Turkmen), the i could be consulted again (to my knowledge only Turkish uses the letters İ and ı).
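
A small demo of both points using plain JDK calls (the tie-break check at the end is just a hypothetical heuristic, not something this library does):

```java
import java.util.Locale;

public class TurkishIDemo {
    public static void main(String[] args) {
        Locale turkish = new Locale("tr");

        // Lower-casing with the Turkish locale maps I -> ı and İ -> i ...
        System.out.println("I".toLowerCase(turkish));         // ı
        System.out.println("İ".toLowerCase(turkish));         // i
        // ... while the English rules map I -> i, so training text and input
        // text can end up with different n-grams for the same word.
        System.out.println("I".toLowerCase(Locale.ENGLISH));  // i

        // Hypothetical tie-break heuristic: the presence of ı or İ hints at
        // Turkish (the comment above notes that, to the author's knowledge,
        // only Turkish uses these letters among Latin-script candidates).
        String text = "İstanbul'da yaşıyorum";
        boolean hintsTurkish = text.indexOf('ı') >= 0 || text.indexOf('İ') >= 0;
        System.out.println(hintsTurkish); // true
    }
}
```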

-6. scoring: "'For each letter sequence, the scoring uses the 3-6 most likely languages and their quantized log probabilities.' seems like a good optimization."

The training texts don't all have the same size. It would be impractical to force them all to the same length. Therefore, if the French text was generated from 1000 pages whereas for Occitan there were only 20, it would be unfair to weight the extracted n-grams equally.
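
A minimal sketch of that normalization idea (illustrative only, not this library's actual code): turn raw n-gram counts into relative frequencies per language, so differently sized training corpora become comparable.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: convert raw n-gram counts into relative frequencies per language,
// so a language trained on 1000 pages is not favoured over one trained on 20.
public class ProfileNormalization {
    static Map<String, Double> toRelativeFrequencies(Map<String, Integer> counts) {
        long total = 0;
        for (int c : counts.values()) total += c;
        Map<String, Double> freqs = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            freqs.put(e.getKey(), e.getValue() / (double) total);
        }
        return freqs;
    }
}
```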

For a text of a certain minimal length, all languages in the system that use that script (e.g. Latin) will have some matching n-grams, so there is a large list of languages with some points. It makes no sense to keep the long tail of low-probability languages; only keeping the best 3-6 per word or phrase makes sense.
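
A sketch of that top-k pruning (again with made-up names, not cld2's or this library's code):

```java
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: keep only the k best-scoring languages and drop the long tail.
public class TopKPruning {
    static Map<String, Double> keepTopK(Map<String, Double> scores, int k) {
        Map<String, Double> top = new LinkedHashMap<>();
        scores.entrySet().stream()
              .sorted(Map.Entry.<String, Double>comparingByValue(Comparator.reverseOrder()))
              .limit(k)
              .forEachOrdered(e -> top.put(e.getKey(), e.getValue()));
        return top;
    }
}
```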

(the numbering in this markup was outsmarting me too...)

dennis97519 commented 9 years ago

1- Yes, as in: in my case I tested 100 texts for 40 languages with 1-, 2-, 3-grams and 4-grams respectively. The accuracies of 3-grams and 4-grams are very similar, both reaching about 100% when the text is long enough.

5- Maybe the occurrence of ß can indicate German, and the occurrence of ı can indicate Turkish :) And maybe all "I" and "ı" can be converted into "i", like you said, "break it".

6- I see, thanks.