pemistahl / lingua-py

The most accurate natural language detection library for Python, suitable for short text and mixed-language text
Apache License 2.0

Crash on particular emoji with detect_multiple_languages #203

Closed PalmerAL closed 7 months ago

PalmerAL commented 7 months ago

Hi, thanks for writing this library, it's really useful!

I'm seeing a crash with particular emoji input on the latest version installed from PyPI, here's a testcase:

from lingua import Language, LanguageDetectorBuilder
langdetector = LanguageDetectorBuilder.from_all_languages().build()

langdetector.detect_multiple_languages_of('test 🙈')
thread '<unnamed>' panicked at 'byte index 6 is not a char boundary; it is inside '🙈' (bytes 5..9) of `test 🙈`', src/lib.rs:436:27
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File "[...]/crash_repro.py", line 4, in <module>
    langdetector.detect_multiple_languages_of('test 🙈')
pyo3_runtime.PanicException: byte index 6 is not a char boundary; it is inside '🙈' (bytes 5..9) of `test 🙈`
pemistahl commented 7 months ago

Hi @PalmerAL,

> Hi, thanks for writing this library, it's really useful!

Nice of you to say that, thank you. :) That motivates me to keep maintaining and improving the library.

The cause of your exception is that, whenever detect_multiple_languages_of() returns exactly one DetectionResult, the end index is erroneously computed as a character offset on the Rust side. It should be a byte offset, because Rust slices strings by UTF-8 byte position; the byte offset is then converted to a character offset for the Python bindings. In your example, 'test 🙈' is 6 characters but 9 bytes long, so the end index 6 lands inside the 4-byte emoji. I'm going to release version 2.0.2 shortly, which will fix it.
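To illustrate the difference between the two kinds of offsets described above, here is a minimal Python sketch. It does not reproduce lingua's internals; the helper name `byte_to_char_offset` is hypothetical and exists only for this illustration. It shows that the test string has 6 characters but 9 UTF-8 bytes, and that treating the character count as a byte index cuts into the emoji.

```python
def byte_to_char_offset(text: str, byte_offset: int) -> int:
    """Convert a UTF-8 byte offset into a character (code point) offset.

    Hypothetical helper for illustration only, not part of lingua's API.
    Raises UnicodeDecodeError if byte_offset falls inside a multi-byte
    sequence, which mirrors the "not a char boundary" panic in Rust.
    """
    encoded = text.encode("utf-8")
    return len(encoded[:byte_offset].decode("utf-8"))


text = "test 🙈"

char_len = len(text)                  # number of code points
byte_len = len(text.encode("utf-8"))  # number of UTF-8 bytes

# A valid byte offset (the end of the string) maps back to the
# character offset that Python callers expect.
end_char_offset = byte_to_char_offset(text, byte_len)

# Using the character count as a byte index lands inside the emoji,
# which occupies bytes 5..9 of the encoded string.
try:
    byte_to_char_offset(text, char_len)
    cut_inside_emoji = False
except UnicodeDecodeError:
    cut_inside_emoji = True
```

Here `char_len` is 6 and `byte_len` is 9, which matches the byte index 6 and the range 5..9 reported in the panic message.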

pemistahl commented 7 months ago

Fixed in https://github.com/pemistahl/lingua-rs/commit/72f2d89da9be38a6c0ed0773b01c35df55c75aee. Will be released as soon as all issues in milestone 2.0.2 have been resolved.

PalmerAL commented 7 months ago

Thanks!