[Open] elnikkis opened this issue 1 month ago
This should definitely not happen. Thanks for letting me know. Does this happen with other languages, too? I will try to reproduce and fix it.
I noticed two additional points:
(1) Beyond the text shown in the initial issue, I found that misclassification also occurs with the following inputs. What they have in common is that they all contain emoticons, and the symbols in these emoticons include characters that are not Japanese.
ご尊顔(ㅅ˘ㅂ˘)✨✨
考え過ぎ(´・ω・`)
大往生でしょ(´・ω・`)
大正義ダンテ98(´・ω・`)
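The non-Japanese characters in these emoticons can be confirmed with the standard library alone; for example, ㅅ and ㅂ are Hangul jamo and ω is Greek. A minimal check:

```python
import unicodedata

# Symbols taken from the emoticons above; unicodedata.name()
# reveals which script each one belongs to.
# e.g. 'ㅅ' -> HANGUL LETTER SIOS, 'ω' -> GREEK SMALL LETTER OMEGA
for ch in "ㅅㅂ´ω✨":
    print(f"U+{ord(ch):04X} {ch!r}: {unicodedata.name(ch)}")
```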
(2) I tried four different ways of building the detector. Only the fourth one gives consistent results:
detector = LanguageDetectorBuilder.from_all_languages().build() # buggy
detector = LanguageDetectorBuilder.from_all_spoken_languages().build() # buggy
detector = LanguageDetectorBuilder.from_languages(*Language.all()).build() # buggy
detector = LanguageDetectorBuilder.from_languages(Language.JAPANESE, Language.CHINESE).build() # consistent result
This suggests that there might be an issue with the detection data for one or more languages.
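If a single language's model is responsible, one way to narrow it down is to bisect the candidate language set and re-test after each restriction. A library-agnostic sketch of that idea (`is_consistent` is a hypothetical stand-in for "build a detector from these languages and check that repeated runs agree"; it assumes one culprit language whose presence triggers the inconsistency):

```python
def find_culprit(candidates, is_consistent):
    """Bisect a list of candidate languages down to the one whose
    presence makes detection inconsistent. Assumes a single culprit."""
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        if not is_consistent(half):
            # The first half already reproduces the inconsistency,
            # so the culprit must be in it.
            candidates = half
        else:
            candidates = candidates[len(candidates) // 2 :]
    return candidates[0]
```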
I hope this information is helpful to you.
I'm encountering an issue with the randomness of results when using this library for language detection. The following code produces different detection results each time it's executed.
Code:
Output:
This variability in results is problematic for reproducibility. Is there a way to fix or stabilize the output so that it remains consistent across multiple runs?
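One plausible (but unconfirmed) source of run-to-run variation is Python's per-process hash randomization: any tie-breaking between equally scored languages that depends on set iteration order can differ between script executions unless PYTHONHASHSEED is pinned. A small stdlib-only demonstration of set ordering varying with the hash seed, unrelated to this library's internals:

```python
import subprocess
import sys

SNIPPET = "print(list({'JAPANESE', 'CHINESE', 'KOREAN'}))"

# Each child process gets a different hash seed, mimicking separate
# script executions run without PYTHONHASHSEED pinned; the printed
# ordering of the set elements may differ between them.
orders = []
for seed in ("1", "2", "3", "4"):
    result = subprocess.run(
        [sys.executable, "-c", SNIPPET],
        env={"PYTHONHASHSEED": seed},
        capture_output=True,
        text=True,
        check=True,
    )
    orders.append(result.stdout.strip())
print(orders)
```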
Version information: