pemistahl / lingua-rs

The most accurate natural language detection library for Rust, suitable for short text and mixed-language text

Bias towards non-English languages? #270

Open Mrodent opened 12 months ago

Mrodent commented 12 months ago

I won't bombard you with any more issues. I just think this crate is really excellent and am excited by it. It's going to make my Elasticsearch indices and my use of them much better.

So most of the strings I'm subjecting to analysis are in the range 100 chars to maybe 1000 chars.

I have quite a few bilingual documents in my corpus, almost all pairing English with some other language, usually with English in one column and the other language in the other. So parsing such a document tends to produce quite a bit of text with, say, Irish and English mixed.

In those cases Irish almost always seems to be chosen as the language with the highest confidence. So I thought I'd examine the confidence levels for all 6 languages for these bilingual Irish-English strings. To my surprise, it is usually Irish 1.0 and English 0.0! Or occasionally Irish 0.88 and English 0.09, something like that.

This tends to suggest that if a non-English language is detected it is given a higher "weighting" than English.

But the thing is, if you are offering multiple-language detection (which I realise is an experimental feature at this stage), a bias against any language in this way is a bit unfortunate: it makes it harder to identify strings that appear to contain runs of more than one language, which you could then pass to detect_multiple_languages_of for more detailed analysis.

I'd be interested to hear what you have to say about this. Meanwhile I may well clone your app and see if there are any obvious ways I might be able to tweak things a bit to address some of the issues I have currently.

pemistahl commented 12 months ago

Can you please give me some examples of the Irish-English strings? That would make it easier for me to examine what's going on.

Generally, there is no rule in the library's rule engine that explicitly underweights English. The rule engine looks for characters in the text that are unique to one or more languages and then adds more weight to those languages. The following characters are treated as potential indicators for Irish (but also for a few other languages): Áá Éé Íí Óó Úú

Another factor might be the ngram probabilities. If the sum of ngram probabilities for Irish is larger than the sum of ngram probabilities for English, then Irish will be returned. If you give me some examples, I can tell you whether the rule engine or the statistical model is decisive for them.
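If you want to dig into this yourself, the per-language confidence values are available through the public API. Here is a minimal sketch (the language set and sample string are just illustrations; adjust to whatever you build your detector with):

```rust
use lingua::Language::{English, French, German, Irish, Latin, Spanish};
use lingua::LanguageDetectorBuilder;

fn main() {
    // Restrict the detector to an illustrative set of candidate languages.
    let detector = LanguageDetectorBuilder::from_languages(&[
        English, French, German, Irish, Latin, Spanish,
    ])
    .build();

    let text = "Down, down, down. Síos! Síos! Síos!";

    // Values come back sorted in descending order; they reflect the
    // combined outcome of the rule engine and the ngram statistics.
    for (language, confidence) in detector.compute_language_confidence_values(text) {
        println!("{:?}: {:.2}", language, confidence);
    }
}
```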

> I won't bombard you with any more issues.

No worries, you are welcome. I'm always happy about feedback, especially if it is as friendly as yours. :)

> I just think this crate is really excellent and am excited by it.

Thank you very much. :) Feel free to open a pull request if you think that you have found useful optimizations.

Mrodent commented 11 months ago

Thanks. Here's an example of some print-outs. These are bilingual documents (in fact an attempt by me to translate Alice in Wonderland from English to Irish). So in one column there's the source (English) and in the other column the Irish translation.

You can see there are distinct runs of proper text in English and Irish (not just jottings). This would obviously be an ideal candidate for detect_multiple_languages_of. But it's giving me the highest possible confidence for Irish, which means the confidence values give no indication that these text segments should be examined for in-string language changes.

```
12:40:39.436 | INFO  | src\text_document.rs:217
found Irish, confidences [(Irish, 1.0), (English, 0.0), (French, 0.0), (German, 0.0), (Latin, 0.0), (Spanish, 0.0)]:
|Down, down, down. Would the fall _never_ come to an end? “I wonder how many miles I’ve fallen by this time?” she said aloud.
Síos! Síos! Síos! Nach mbeadh deireach leis an titim choíche? \N'fheadar cé mhéad míle agus atá mé tite anois?\", a dúirt sí os ard.
feadair: know. \"Used only with negative or interrogative\". n'fheadar: I wonder...
os prep.: over; above
“I must be getting somewhere near the centre of the earth. Let me see: that would be four thousand miles down, I think—”
\"Caithfidh mé a bheith ag fáil áit éigin in aice le lár an domhain. Anois: ceapaim go mbeadh sé ceithre mhíle míle síos...\
domhain f.: world
(for, you see, Alice had learnt several things of this sort in her lessons in the schoolroom, and though this was not a _very_ good opportunity for showing off her knowledge, as there was no one to listen to her, still it was good practice to say it over)
(mar, feiceann tú, bhí cúpla rud den saghas seo foghlamtha ag Alice ina ceachtanna sa seomra scoile, agus, cé nach raibh deis an-mhaith lena cuid eolais a thaispeáint, mar ní raibh aon duine chun éisteacht léi, cé go bfhuil, ba dhea-chleachtas é a rá arís...)
“—yes, that’s about the right distance—but then I wonder what Latitude or Longitude I’ve got to?”
|
12:40:39.447 | INFO  | src\text_document.rs:217
found Irish, confidences [(Irish, 1.0), (English, 0.0), (French, 0.0), (German, 0.0), (Latin, 0.0), (Spanish, 0.0)]:
|“I must be getting somewhere near the centre of the earth. Let me see: that would be four thousand miles down, I think—”
\"Caithfidh mé a bheith ag fáil áit éigin in aice le lár an domhain. Anois: ceapaim go mbeadh sé ceithre mhíle míle síos...\
domhain f.: world
(for, you see, Alice had learnt several things of this sort in her lessons in the schoolroom, and though this was not a _very_ good opportunity for showing off her knowledge, as there was no one to listen to her, still it was good practice to say it over)
(mar, feiceann tú, bhí cúpla rud den saghas seo foghlamtha ag Alice ina ceachtanna sa seomra scoile, agus, cé nach raibh deis an-mhaith lena cuid eolais a thaispeáint, mar ní raibh aon duine chun éisteacht léi, cé go bfhuil, ba dhea-chleachtas é a rá arís...)
“—yes, that’s about the right distance—but then I wonder what Latitude or Longitude I’ve got to?”
\"– sin é, sin theart ar an achar ceart – ach ansin n'fheadar cé hé an domhanleithead nó an domhanfhad atá agam anois?\
 (Alice had no idea what Latitude was, or Longitude either, but thought they were nice grand words to say.)
(Ní raibh aon barúil ag Alice cén domhanleithead a bhí ann, nó cén domhanfhad a bhí ann, ach cheap sí gur iad focla deasa iontacha a rá.)
Presently she began again.
|
```

Haven't had a chance to look at your source code yet (or to try the new version which you said handles exotic Unicode better)...

By the way (I haven't had a chance to examine things, so I don't know whether you've already factored this in), maybe when a string is subjected to multiple-language analysis you should take account of 1) newlines, 2) full stops and 3) semicolons, as signals likely to mark a language fragment boundary?
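Something like this caller-side split is what I have in mind; it is only a rough sketch of the idea, not a claim about how lingua should implement it, and the sample text is adapted from the log above:

```rust
use lingua::Language::{English, Irish};
use lingua::LanguageDetectorBuilder;

fn main() {
    let detector = LanguageDetectorBuilder::from_languages(&[English, Irish]).build();

    let text = "Down, down, down. Would the fall never come to an end?\n\
                Síos! Síos! Síos! Nach mbeadh deireadh leis an titim choíche?";

    // Split on the proposed boundary signals (newlines, full stops,
    // semicolons) and classify each fragment on its own.
    for fragment in text
        .split(|c: char| matches!(c, '\n' | '.' | ';'))
        .map(str::trim)
        .filter(|fragment| !fragment.is_empty())
    {
        println!("{:?}: {}", detector.detect_language_of(fragment), fragment);
    }
}
```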

willstott101 commented 7 months ago

I think this is quite a different example, but one of the failures I have seen involves words borrowed from other languages. "Just you and me in this digital tête-à-tête," is in my eyes an English sentence with some French words in it. But lingua detects it as French, hence why it feels a little like a bias.

```python
>>> from lingua import LanguageDetectorBuilder
>>> detector = LanguageDetectorBuilder.from_all_languages().with_preloaded_language_models().build()
>>> confidence_values = detector.compute_language_confidence_values("Just you and me in this digital tête-à-tête,")
>>> confidence_values
[ConfidenceValue(language=Language.FRENCH, value=0.0935808995725718), ConfidenceValue(language=Language.ENGLISH, value=0.083711183706532), ConfidenceValue(language=Language.DUTCH, value=0.049392064611581105), ConfidenceValue(language=Language.CATALAN, value=0.04836927544723105)...
```

Interestingly, the results are completely off for multi-language detection of this snippet:

```python
>>> detector.detect_multiple_languages_of("Just you and me in this digital tête-à-tête,")
[DetectionResult(start_index=0, end_index=19, word_count=5, language=Language.FRENCH), DetectionResult(start_index=19, end_index=44, word_count=3, language=Language.ESPERANTO)]
>>> detector.detect_multiple_languages_of("Just you and me in this digital,")
[DetectionResult(start_index=0, end_index=32, word_count=7, language=Language.ENGLISH)]
>>> detector.detect_multiple_languages_of("tête-à-tête,")
[DetectionResult(start_index=0, end_index=12, word_count=3, language=Language.FRENCH)]
```

It easily recognises the sentence without the French words as English, but the isolated words quickly throw lingua off.

I'm using the Python bindings here (2.0.2); I hope that doesn't affect the relevance to this discussion.
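For what it's worth, the same check against the core Rust crate would look roughly like this (an untested sketch; if it reproduces the same split, the bindings are not the cause):

```rust
use lingua::LanguageDetectorBuilder;

fn main() {
    // Mirror the Python setup: all languages, preloaded models.
    let detector = LanguageDetectorBuilder::from_all_languages()
        .with_preloaded_language_models()
        .build();

    let text = "Just you and me in this digital tête-à-tête,";

    // Print each detected single-language span with its indices.
    for result in detector.detect_multiple_languages_of(text) {
        println!(
            "{:?}: {}..{} ({} words)",
            result.language(),
            result.start_index(),
            result.end_index(),
            result.word_count()
        );
    }
}
```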