pemistahl / lingua-py

The most accurate natural language detection library for Python, suitable for short text and mixed-language text
Apache License 2.0

Some character-to-language mappings are incorrect #103

Closed: jordimas closed this issue 1 year ago

jordimas commented 1 year ago

Hello! My understanding is that this mapping:

https://github.com/pemistahl/lingua-py/blob/502bb9abef2a31b841c49e063f1a0bd7e47af86d/lingua/_constant.py#L34

is used by the rule system to identify languages based on their characters. Is my assumption correct?

Looking at this: https://github.com/pemistahl/lingua-py/blob/502bb9abef2a31b841c49e063f1a0bd7e47af86d/lingua/_constant.py#L191

Catalan, for example, does NOT have "Áá" as valid characters (see https://en.wikipedia.org/wiki/Catalan_orthography#Alphabet).

Looking at the data, I see other mappings that do not seem right.

Could it be the case that these mappings can be improved?
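For anyone who wants to spot-check entries like this, here is a minimal sketch. It assumes the dictionary from `_constant.py` is importable as `lingua._constant.CHARS_TO_LANGUAGES_MAPPING` and maps strings of characters to sets of `Language` members; the name and shape are guesses based on the linked file, and since it is a private module they may differ between versions.

```python
# Hypothetical spot-check of the character-to-language mapping.
# CHARS_TO_LANGUAGES_MAPPING is assumed to map character groups (str)
# to sets of Language members; this is a private constant, not an API.
from lingua._constant import CHARS_TO_LANGUAGES_MAPPING


def languages_for(char: str):
    """Collect every language whose character group contains `char`."""
    return {
        language
        for chars, languages in CHARS_TO_LANGUAGES_MAPPING.items()
        for language in languages
        if char in chars
    }


print(languages_for("á"))  # does CATALAN appear here?
print(languages_for("ç"))  # does BASQUE appear here?
```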

pemistahl commented 1 year ago

Thanks, you seem to be right about the characters "Áá". I will exclude Catalan from the filter for these characters, even though it won't make much of a difference for the overall detection accuracy of the library.

Which other mappings are wrong in your opinion?

jordimas commented 1 year ago

I was surprised to see "ç" listed for the Basque language.

See: https://en.wikipedia.org/wiki/Basque_alphabet

"Although ⟨c, ç, q, v, w, y⟩ are not used in traditional Basque language words, they were included in the Basque alphabet for writing loanwords"

Ç (c-cedilla) is only used in Basque for loanwords (especially from Catalan, such as surnames or place names). I do not think it is a strong signal that a word is in Basque.

I'm not a native German speaker, but the Ç in German also looks questionable to me: https://github.com/pemistahl/lingua-py/blob/502bb9abef2a31b841c49e063f1a0bd7e47af86d/lingua/_constant.py#L159

What is the source of the data used to build these rules?

pemistahl commented 1 year ago

As far as I remember, I added those extra characters to the languages because of loanwords. That's why I added the characters "Áá" to Catalan as well. This language filter only restricts the set of possible languages for a given text. It does not automatically classify a word containing "Áá" as Catalan, for instance. So leaving the characters in the mapping does no harm, as the language is eventually determined by the statistics engine.

What is the source of the data used to build these rules?

Wikipedia and the Wortschatz corpora from Leipzig University.
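To illustrate the earlier point that the filter only restricts the candidate set while the statistics engine makes the final decision, here is a minimal sketch against the public lingua API (the word and the language choices are illustrative assumptions, not taken from this thread):

```python
# Sketch: a word containing "á" is not automatically classified as Catalan;
# the n-gram statistics rank the remaining candidate languages.
from lingua import Language, LanguageDetectorBuilder

detector = LanguageDetectorBuilder.from_languages(
    Language.CATALAN, Language.SPANISH, Language.PORTUGUESE
).build()

# "rápido" contains "á", yet the decision comes from the statistics engine,
# whose ranking can be inspected via the confidence values.
for confidence in detector.compute_language_confidence_values("rápido"):
    print(confidence.language, confidence.value)
```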

jordimas commented 1 year ago

OK, thanks. Feel free to close the ticket.

I found the Unicode CLDR data at https://unicode-org.github.io/cldr-staging/charts/latest/summary/ca.html useful; it shows which characters each language uses.
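A small sketch of how such a cross-check could look, with the Catalan character set hand-copied from the CLDR/Wikipedia references linked in this thread (so treat that set as an assumption rather than authoritative data):

```python
# Sketch: flag characters from a mapping entry that are not part of a
# language's alphabet. The Catalan set below is hand-copied from the
# references above and may be incomplete.
CATALAN_CHARS = set("abcdefghijklmnopqrstuvwxyz" + "àçèéíïòóúü·")


def outside_alphabet(mapping_entry: str, alphabet: set) -> set:
    """Return the characters in `mapping_entry` missing from `alphabet`."""
    return {c for c in mapping_entry if c.lower() not in alphabet}


print(outside_alphabet("Áá", CATALAN_CHARS))  # flags both forms of "á"
print(outside_alphabet("Çç", CATALAN_CHARS))  # empty: "ç" is valid Catalan
```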

pemistahl commented 1 year ago

Thank you for the link to the Unicode data. The site looks useful. I've bookmarked it.