Closed: SteVio89 closed this issue 1 year ago
@SteVio89 You can work around this bug by rolling your own sentence splitter with an NLP model and running language detection per sentence. Sentence splitting is not as simple as splitting on punctuation; doing it well requires a trained model. A quick search turns up several sentence-splitter models available for Go.
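To make the workaround concrete, here is a minimal sketch of the "split first, detect per sentence" idea. The `splitSentences` helper is deliberately naive (it only breaks on terminal punctuation followed by whitespace) and is exactly the kind of splitter that mishandles abbreviations, which is why a trained model is recommended instead:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// sentenceEnd matches '.', '!' or '?' followed by whitespace.
// This naive rule mis-splits abbreviations like "e.g. " and is
// only meant to illustrate the per-sentence detection workflow.
var sentenceEnd = regexp.MustCompile(`([.!?])\s+`)

// splitSentences breaks text into rough sentences by inserting a
// marker after terminal punctuation and splitting on it.
func splitSentences(text string) []string {
	marked := sentenceEnd.ReplaceAllString(text, "$1\x00")
	parts := strings.Split(marked, "\x00")
	out := make([]string, 0, len(parts))
	for _, p := range parts {
		if s := strings.TrimSpace(p); s != "" {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	text := "Hallo, wie geht es dir? Hello, how are you doing!"
	for _, s := range splitSentences(text) {
		// Each sentence would then be passed to the detector
		// individually, e.g. lingua's DetectLanguageOf.
		fmt.Println(s)
	}
}
```

A real sentence splitter (trained on the target languages) would replace `splitSentences`; the per-sentence detection loop stays the same.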
Hi @SteVio89, thank you for your request and sorry for my late response.
Sentence tokenization is not as easy as it might seem. Punctuation marks are ambiguous: they do not only mark the end of a sentence but also appear in abbreviations and the like, and every language handles this a bit differently. That is why separate sentence-splitting libraries exist, as @diegomontoya has pointed out. My library concentrates on language detection only. If you need preprocessing steps before detecting the language, please use additional libraries that specialize in them. Thank you.
I ran a few tests with lingua, as I am interested in the “Detection of multiple languages in mixed-language texts” feature.
I checked the following texts:
Lingua returned the following to me:
First text:
Second text:
However, neither should “Hello” be classified as a German word, nor “Hallo” as an English one.
Perhaps a parameter could be added to DetectMultipleLanguagesOf that takes punctuation marks into account, so that only one language is returned per sentence.