Closed: SteVio89 closed this issue 1 year ago
@SteVio89 You can work around this bug by rolling your own sentence splitter with an NLP model and running language detection per sentence. Sentence splitting is not as simple as splitting on punctuation; doing it well requires a trained model. A quick search turns up several sentence-splitter models available for Go.
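To make the workaround concrete, here is a minimal sketch of the "split first, detect per sentence" idea. The `splitSentences` helper is deliberately naive (it only breaks on terminal punctuation followed by whitespace) and is exactly the kind of splitter that mishandles abbreviations, which is why a trained model is recommended instead:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// sentenceEnd matches '.', '!' or '?' followed by whitespace.
// This naive rule mis-splits abbreviations like "e.g. " and is
// only meant to illustrate the per-sentence detection workflow.
var sentenceEnd = regexp.MustCompile(`([.!?])\s+`)

// splitSentences breaks text into rough sentences by inserting a
// marker after terminal punctuation and splitting on it.
func splitSentences(text string) []string {
	marked := sentenceEnd.ReplaceAllString(text, "$1\x00")
	parts := strings.Split(marked, "\x00")
	out := make([]string, 0, len(parts))
	for _, p := range parts {
		if s := strings.TrimSpace(p); s != "" {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	text := "Hallo, wie geht es dir? Hello, how are you doing!"
	for _, s := range splitSentences(text) {
		// Each sentence would then be passed to the detector
		// individually, e.g. lingua's DetectLanguageOf.
		fmt.Println(s)
	}
}
```

A real sentence splitter (trained on the target languages) would replace `splitSentences`; the per-sentence detection loop stays the same.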
Hi @SteVio89, thank you for your request and sorry for my late response.
Sentence tokenization is not as easy as it might seem. Punctuation marks are ambiguous: they do not only mark the end of a sentence but also appear in abbreviations and the like, and every language handles this a bit differently. That is why separate sentence-splitting libraries exist, as @diegomontoya has pointed out. My library concentrates on language detection only. If you need preprocessing steps before detecting the language, please use additional libraries that specialize in them. Thank you.
I ran a few tests with lingua, as I am interested in the “Detection of multiple languages in mixed-language texts” feature.
I checked the following texts:
Lingua returned the following to me:
First text:
Second text:
However, neither should “Hello” be classified as a German word, nor “Hallo” as an English one.
Perhaps a parameter could be added to DetectMultipleLanguagesOf that takes punctuation marks into account, so that only one language is returned per sentence.