Chinese breaks multi language detection

pemistahl / lingua-py

The most accurate natural language detection library for Python, suitable for short text and mixed-language text

Apache License 2.0


Chinese breaks multi language detection #143

Closed LutzSteinborn closed 11 months ago

LutzSteinborn commented 1 year ago

Hello, it looks like Chinese characters in a text break multi-language detection. I know it's experimental, but it works pretty well most of the time. In the example below, the English sentence after the Chinese characters is not reported at all:

```python
from lingua import Language, LanguageDetectorBuilder

text = "Płaszczowo-rurowe wymienniki ciepła Uszczelkowe der blaue himmel über berlin 中文 the quick brown fox jumps over the lazy dog"

detector = LanguageDetectorBuilder.from_languages(
    Language.ENGLISH, Language.GERMAN, Language.POLISH
).build()
detector.detect_multiple_languages_of(text)
# Returned:
# [DetectionResult(start_index=0, end_index=48, word_count=4, language=Language.POLISH),
#  DetectionResult(start_index=48, end_index=77, word_count=5, language=Language.GERMAN)]
```
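One way to make the gap in the output above visible is to compare the reported index ranges with the length of the input. The following is only a sketch: `uncovered_spans` is a hypothetical helper, not part of lingua, and the ranges are copied by hand from the reported output so that it runs without the library installed.

```python
from typing import List, Tuple

text = (
    "Płaszczowo-rurowe wymienniki ciepła Uszczelkowe "
    "der blaue himmel über berlin 中文 "
    "the quick brown fox jumps over the lazy dog"
)

# (start_index, end_index) pairs copied from the output reported in this issue.
reported: List[Tuple[int, int]] = [(0, 48), (48, 77)]


def uncovered_spans(length: int, ranges: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Return the gaps in [0, length) that no (start, end) range covers."""
    gaps = []
    cursor = 0
    for start, end in sorted(ranges):
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < length:
        gaps.append((cursor, length))
    return gaps


# Print every part of the text that no detection result accounts for.
for start, end in uncovered_spans(len(text), reported):
    print(f"undetected [{start}:{end}]: {text[start:end]!r}")
```

For the ranges above, the only gap starts at index 77, which is exactly where the Chinese characters begin.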

pemistahl commented 1 year ago

Hi Lutz, thank you for your report. I will try to improve the multi language detection algorithm in future releases. Your example might help in this respect.

pemistahl commented 11 months ago

It turns out that this issue has the same cause as issue #154, which I've just fixed with commit 67fdebc30610cc11d0b0f02756adfe62d77df247.

The output of your code after the fix is:

[
  DetectionResult(start_index=0, end_index=48, word_count=4, language=Language.POLISH), 
  DetectionResult(start_index=48, end_index=80, word_count=7, language=Language.GERMAN), 
  DetectionResult(start_index=80, end_index=123, word_count=9, language=Language.ENGLISH)
]

POLISH Płaszczowo-rurowe wymienniki ciepła Uszczelkowe 
GERMAN der blaue himmel über berlin 中文 
ENGLISH the quick brown fox jumps over the lazy dog
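The per-language lines above correspond to slicing the input with each result's `start_index` and `end_index`. A minimal sketch of that step follows; `Span` and `split_by_language` are hypothetical names, with `Span` standing in for lingua's `DetectionResult` so the example runs without the library, and the index values are copied from the output above. With lingua installed, the same slicing should apply to the list returned by `detect_multiple_languages_of`, using `result.language.name` for the label.

```python
from typing import List, NamedTuple, Tuple


class Span(NamedTuple):
    """Hypothetical stand-in for lingua's DetectionResult (only the fields used here)."""
    start_index: int
    end_index: int
    language: str


text = (
    "Płaszczowo-rurowe wymienniki ciepła Uszczelkowe "
    "der blaue himmel über berlin 中文 "
    "the quick brown fox jumps over the lazy dog"
)

# Index values copied from the output after the fix.
results = [
    Span(0, 48, "POLISH"),
    Span(48, 80, "GERMAN"),
    Span(80, 123, "ENGLISH"),
]


def split_by_language(text: str, results: List[Span]) -> List[Tuple[str, str]]:
    """Pair each language label with the substring its index range covers."""
    return [(r.language, text[r.start_index:r.end_index]) for r in results]


for language, chunk in split_by_language(text, results):
    print(language, chunk)
```

Note that the Chinese characters end up in the German section here, since Chinese is not among the three languages the detector was built with.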