Detection of CJK languages doesn't work with at least one non-CJK character

MaciejGorczyca commented 5 years ago

Hey.

I just wanted to let you know that detection of CJK languages doesn't work when the input contains at least one non-CJK character. For example, this will work:

languageDetector.detectLanguageOf("기모링");

While this won't work:

languageDetector.detectLanguageOf("기모링a");

or

languageDetector.detectLanguageOf("기모링~");

Instead of getting Language.KOREAN, we get Language.UNKNOWN.

Can you please confirm whether you see the same behavior on your end and whether there is an easy fix? I quickly checked the method, and it seems that the check for CJK doesn't pick the input up correctly, so "summedUpProbabilities" ends up with a value of 0 for all languages. There are JSON models for Korean etc., so I would expect it to score all languages normally based on those models, but it doesn't.

If it can't be reliably fixed without spending tens of hours on it, a boolean argument to aggressively check for CJK would be great (remove all a-zA-Z characters and all special characters, split the input into multiple sentences, and if one of them is detected as CJK, return that language, etc.).
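The stripping part of that workaround can be sketched on the caller's side as follows. This is only a rough sketch, not part of lingua: the class `CjkFallback` and the methods `keepCjkOnly` and `detectWithFallback` are hypothetical names, the choice of scripts counted as CJK is an assumption, and the detector is passed in as a plain function so that the snippet does not depend on any particular library version.

```java
import java.lang.Character.UnicodeScript;
import java.util.EnumSet;
import java.util.Objects;
import java.util.Set;
import java.util.function.Function;

// Hypothetical helper for illustration only; not part of lingua.
final class CjkFallback {

    // Assumption: these four scripts are what "CJK" means for this sketch.
    private static final Set<UnicodeScript> CJK_SCRIPTS = EnumSet.of(
            UnicodeScript.HAN,
            UnicodeScript.HIRAGANA,
            UnicodeScript.KATAKANA,
            UnicodeScript.HANGUL);

    // Keeps only the code points whose Unicode script is in CJK_SCRIPTS.
    // Latin letters, digits, punctuation and whitespace are all dropped.
    static String keepCjkOnly(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints()
                .filter(cp -> CJK_SCRIPTS.contains(UnicodeScript.of(cp)))
                .forEach(sb::appendCodePoint);
        return sb.toString();
    }

    // Runs the detector on the raw text first. Only if that yields the
    // "unknown" value is the detector retried on the CJK-only text.
    static <L> L detectWithFallback(String text,
                                    Function<String, L> detector,
                                    L unknown) {
        L first = detector.apply(text);
        if (!Objects.equals(first, unknown)) {
            return first;
        }
        String stripped = keepCjkOnly(text);
        if (stripped.isEmpty() || stripped.equals(text)) {
            return unknown; // nothing was stripped, or nothing is left to retry
        }
        return detector.apply(stripped);
    }
}
```

With the detector from the examples above, a call could look like `CjkFallback.detectWithFallback("기모링a", t -> languageDetector.detectLanguageOf(t), Language.UNKNOWN)`. The retry only happens when the first attempt is unknown, so inputs that are already detected correctly are left alone.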

pemistahl commented 5 years ago

Hi @MaciejGorczyca. Indeed, this is a bug. It only happens for input expressions that do not contain any whitespace. Input such as "기모링 a" or "기모 a링" returns Korean. I don't think it can be fixed quickly. I will need some time for this one because it affects some of the main internal logic.

I'm gonna deal with it as soon as possible. I'll let you know.

pemistahl commented 5 years ago

I have just fixed this bug by changing how probabilities of zero lead to the removal of certain languages during the detection process. Unfortunately, this decreases the overall accuracy slightly but not to a great extent.
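To illustrate why the handling of zero probabilities matters, here is a toy model. It is not lingua's actual code: the class `ZeroProbabilityDemo`, both methods, and the per-character "model" are assumptions made up for this example. The strict variant drops a language as soon as one character is unseen in its model, while the lenient variant skips unseen characters and scores the rest.

```java
import java.util.Map;

// Toy illustration only; lingua's real models and scoring are different.
final class ZeroProbabilityDemo {

    // Strict: a single unseen character (probability 0) removes the
    // language, which this sketch signals by returning 0.0.
    static double strictScore(String text, Map<Integer, Double> model) {
        double sum = 0.0;
        for (int cp : text.codePoints().toArray()) {
            double p = model.getOrDefault(cp, 0.0);
            if (p == 0.0) {
                return 0.0; // language is removed from consideration
            }
            sum += Math.log(p);
        }
        return sum;
    }

    // Lenient: unseen characters are skipped, so one stray character
    // cannot wipe out the evidence from the rest of the input.
    static double lenientScore(String text, Map<Integer, Double> model) {
        double sum = 0.0;
        for (int cp : text.codePoints().toArray()) {
            double p = model.getOrDefault(cp, 0.0);
            if (p > 0.0) {
                sum += Math.log(p);
            }
        }
        return sum;
    }
}
```

In this toy setup, a model that only knows Hangul characters gives "기모링a" a strict score of 0.0, while the lenient score equals the score of "기모링" alone. A lenient rule also lets weakly matching languages stay in the running, which is one plausible reason for a small loss in overall accuracy.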

@MaciejGorczyca can you please test whether this change works for you? If so, I will release this bug fix in the next version, 0.6.0. In case you cannot wait that long, please build the snapshot version of this repository yourself until the release is done. I will leave this issue open until then as well. Thank you.

pemistahl commented 4 years ago

This has been fixed now, so I'm closing this issue.

pemistahl / lingua, issue #12