pemistahl / lingua

The most accurate natural language detection library for Java and the JVM, suitable for long and short text alike
Apache License 2.0

Language recognition error #159

Open xujiaw opened 1 year ago

xujiaw commented 1 year ago
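import com.github.pemistahl.lingua.api.*;
import java.util.SortedMap;
import static com.github.pemistahl.lingua.api.Language.*;
LanguageDetector detector = LanguageDetectorBuilder.fromLanguages(ENGLISH, CHINESE, THAI, VIETNAMESE).build();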
SortedMap<Language, Double> languageDoubleSortedMap = detector.computeLanguageConfidenceValues("ี่มีประสิทธิภาพหลอดไฟพลังงานแสงอาทิตย์กลางแจ้งเซ็นเซอร์ตรวจจับการเคลื่อนไหวสวนกันน้ำ LED พลังงานแสงอาทิตย์โคมไฟสปอร์ตไลท์สำหรับ Garden เส้นทางถนนแบ็คดรอปเป่าลม Led Light");
System.out.println(languageDoubleSortedMap);

The following information is printed:

{ENGLISH=1.0, VIETNAMESE=0.5658177137374878}

I think the text is Thai, but the detector reports English and even Vietnamese, and Thai does not appear in the result at all.

Version: 1.2.2
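The expectation that the input is mostly Thai can be checked independently of Lingua by counting letters per Unicode script with the JDK's `Character.UnicodeScript`. The following is only a minimal sketch: the class name `ScriptCount` and the method `countScripts` are hypothetical helpers, not part of Lingua, and the sample is a short fragment of the reported string rather than the full input.

```java
import java.util.Map;
import java.util.TreeMap;

class ScriptCount {
    // Counts letters and non-spacing marks per Unicode script.
    // Whitespace, digits, and punctuation are ignored.
    static Map<Character.UnicodeScript, Integer> countScripts(String text) {
        Map<Character.UnicodeScript, Integer> counts = new TreeMap<>();
        text.codePoints()
            .filter(cp -> Character.isLetter(cp)
                       || Character.getType(cp) == Character.NON_SPACING_MARK)
            .forEach(cp -> counts.merge(Character.UnicodeScript.of(cp), 1, Integer::sum));
        return counts;
    }

    public static void main(String[] args) {
        // Fragment taken from the reported input (assumption: a short
        // fragment is enough to illustrate the idea).
        String sample = "โคมไฟ LED";
        System.out.println(countScripts(sample));
    }
}
```

For this fragment the helper counts five Thai letters and three Latin letters. Running it on the full string from the report shows how the Thai and Latin letters are distributed, which helps when judging whether a result without THAI is plausible.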

pbcornelius commented 1 year ago

I'm not sure if it's helpful, but I also encountered some fairly straightforward misclassifications:

Good Luck Sarah ... "break a leg!"

TAGALOG=1.0, ENGLISH=0.9973366856575012, GERMAN=0.9332742094993591, ...

Thank you, Krista!

FINNISH=1.0, ENGLISH=0.9905743598937988, ESPERANTO=0.9733119606971741

@Evar So exciting!

TAGALOG=1.0, ENGLISH=0.9913086891174316, ESPERANTO=0.9132570028305054

To me, these do not seem like borderline cases (e.g., shared words across languages).
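One way to quantify how close these results are is to compute the gap between the two highest confidence values. The sketch below is an illustration only: `ConfidenceMargin` and `topTwoMargin` are hypothetical names, and plain `String` keys stand in for Lingua's `Language` enum so the example runs without the library. The numbers are copied from the first result above.

```java
import java.util.Map;

class ConfidenceMargin {
    // Returns the difference between the highest and the second-highest value.
    static double topTwoMargin(Map<String, Double> confidences) {
        double[] sorted = confidences.values().stream()
            .mapToDouble(Double::doubleValue)
            .sorted()
            .toArray();
        if (sorted.length < 2) {
            throw new IllegalArgumentException("need at least two candidates");
        }
        return sorted[sorted.length - 1] - sorted[sorted.length - 2];
    }

    public static void main(String[] args) {
        // Values reported for: Good Luck Sarah ... "break a leg!"
        Map<String, Double> reported = Map.of(
            "TAGALOG", 1.0,
            "ENGLISH", 0.9973366856575012,
            "GERMAN", 0.9332742094993591);
        System.out.printf("margin = %.4f%n", topTwoMargin(reported));
    }
}
```

In all three examples above, the gap between the top language and English is below 0.01, so the wrong language wins by a very small margin.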