pemistahl / lingua-py

The most accurate natural language detection library for Python, suitable for short text and mixed-language text
Apache License 2.0

Offsets incorrect #192

Closed: boltonn closed this issue 7 months ago

boltonn commented 7 months ago

Below are two minimal examples where the offsets do not seem correct. In the first, it correctly identifies both languages but returns a start_index past the end of the text. In the second, the result does not match the README. On a document of about 2k characters, I have seen end_index values as high as 7k.

from lingua import Language, LanguageDetectorBuilder

# Example 1: Chinese text followed by an English sentence
detector = LanguageDetectorBuilder.from_all_spoken_languages().with_low_accuracy_mode().build()
text = "他能在多大程度上对此施加影响是很重要的,因为无论结果如何,他都将难脱干系。\n\n相关主题内容\nThis is an example English sentence."
for result in detector.detect_multiple_languages_of(text):
    print(f"{result.language.name}: '{text[result.start_index:result.end_index]}'")

# Example 2: the mixed-language example from the README
languages = [Language.ENGLISH, Language.FRENCH, Language.GERMAN]
detector = LanguageDetectorBuilder.from_languages(*languages).build()
sentence = "Parlez-vous français? " + \
    "Ich spreche Französisch nur ein bisschen. " + \
    "A little bit is better than nothing."
for result in detector.detect_multiple_languages_of(sentence):
    print(f"{result.language.name}: '{sentence[result.start_index:result.end_index]}'")

# example 1 output
CHINESE: '他能在多大程度上对此施加影响是很重要的,因为无论结果如何,他都将难脱干系。

This is an example English sentence.'

# example 2 output
FRENCH: 'Parlez-vous français? I'
GERMAN: 'ch spreche Französisch nur ein bisschen. A '
ENGLISH: 'little bit is better than nothing.'

pemistahl commented 7 months ago

Thank you for the bug report. I should have added sentences containing multi-byte characters to the unit tests. Stupid me, I forgot that Rust string indices are byte indices, whereas Python string indices are character indices, so the indices need to be converted. I'm sorry, I'm going to fix it as soon as possible.
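For readers following along, here is a minimal sketch of the kind of conversion described above. It is not the actual fix from the library; the helper name `byte_to_char_offset` is hypothetical, and the byte spans are computed by hand from the example sentence (assuming UTF-8, where `ç` and `ö` each take two bytes) rather than taken from the detector.

```python
def byte_to_char_offset(text: str, byte_offset: int) -> int:
    """Convert a UTF-8 byte offset into a Python character offset.

    Hypothetical helper for illustration only. Raises UnicodeDecodeError
    if byte_offset falls in the middle of a multi-byte character.
    """
    return len(text.encode("utf-8")[:byte_offset].decode("utf-8"))


sentence = (
    "Parlez-vous français? "
    "Ich spreche Französisch nur ein bisschen. "
    "A little bit is better than nothing."
)

# Assumed byte spans of the three sentences in the UTF-8 encoding
# (hand-computed, not output of the library).
byte_spans = [(0, 23), (23, 66), (66, 102)]

for start, end in byte_spans:
    # Slicing with raw byte offsets reproduces the shifted output from the report.
    wrong = sentence[start:end]
    # Converting to character offsets first gives the intended segment.
    right = sentence[byte_to_char_offset(sentence, start):byte_to_char_offset(sentence, end)]
    print(f"byte offsets: {wrong!r}")
    print(f"char offsets: {right!r}")
```

The same effect would be consistent with the first example: common Chinese characters occupy three bytes each in UTF-8, so byte offsets can be several times larger than the character length of the text.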

boltonn commented 7 months ago

Oh, no worries at all. This is such an awesome repository. Thanks for the great work!

pemistahl commented 7 months ago

In the meantime, you can use the latest 1.3 release. This is the pure Python implementation, where the indices are handled correctly. It's just slower than version 2.0.

pemistahl commented 7 months ago

@boltonn I've now fixed the bug in version 2.0.1. See commit if you are interested in the details. Please try again, and feel free to open a new issue if you encounter other problems. Thanks again.