MaciejGorczyca closed 4 years ago
Hi @MaciejGorczyca. Indeed, this is a bug. It only happens for input that does not contain any whitespace. Input such as "기모링 a" or "기모 a링" returns Korean. I don't think it can be fixed quickly. I will need some time for this one because it affects some of the main internal logic.
I'm gonna deal with it as soon as possible. I'll let you know.
I have just fixed this bug by changing how probabilities of zero lead to the removal of certain languages during the detection process. Unfortunately, this decreases the overall accuracy slightly but not to a great extent.
@MaciejGorczyca can you please test whether this change works for you? If so, I will release this bug fix in the next version, 0.6.0. In case you cannot wait that long, please build the snapshot version of this repository yourself until the release is done. I will leave this issue open until then as well. Thank you.
This has been fixed now, I'm closing this issue.
Hey.
I just wanted to let you know that detection of CJK languages doesn't work when there is at least one non-CJK character. For example, this will work:
languageDetector.detectLanguageOf("기모링");
While this won't work:
languageDetector.detectLanguageOf("기모링a");
or
languageDetector.detectLanguageOf("기모링~");
Instead of getting Language.KOREAN, we get Language.UNKNOWN.
Can you please confirm whether this is the case on your end and whether there is an easy fix? I quickly checked the method, and it seems that the check for CJK doesn't pick this input up correctly, so "summedUpProbabilities" ends up with a value of 0 for all languages. There are JSON models for Korean etc., so it should handle the input normally and give scores for all languages based on them, but it doesn't.
If it can't be reliably fixed without spending tens of hours, a boolean argument to aggressively check for CJK would be great (remove all a-zA-Z characters and all special characters, split the input into more sentences, and if one of them is detected as CJK, return that language, etc.).
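Until a fix lands, the "remove all a-zA-Z characters, all special characters" idea can be approximated on the caller's side. The following is only a rough sketch using the JDK's `Character.UnicodeScript`; the class `CjkPreFilter` and its methods are hypothetical names made up for illustration and are not part of the library, and treating only Hangul, Han, Hiragana and Katakana as "CJK" is an assumption.

```java
import java.lang.Character.UnicodeScript;

// Hypothetical helper, not part of the library: keeps only CJK code points.
final class CjkPreFilter {
    private CjkPreFilter() {}

    // Assumption: "CJK" here means the Hangul, Han, Hiragana and Katakana scripts.
    static boolean isCjk(int codePoint) {
        UnicodeScript script = UnicodeScript.of(codePoint);
        return script == UnicodeScript.HANGUL
            || script == UnicodeScript.HAN
            || script == UnicodeScript.HIRAGANA
            || script == UnicodeScript.KATAKANA;
    }

    // Drops Latin letters, digits, punctuation, whitespace and anything else
    // that is not in one of the scripts above.
    static String keepCjkOnly(String text) {
        StringBuilder filtered = new StringBuilder();
        text.codePoints()
            .filter(CjkPreFilter::isCjk)
            .forEach(filtered::appendCodePoint);
        return filtered.toString();
    }
}
```

The filtered string could then be passed to `languageDetector.detectLanguageOf(...)` when the first attempt returns `Language.UNKNOWN`. If the filtered string is empty, the input had no CJK characters and the original result should probably be kept.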