Closed jordimas closed 1 year ago
Thanks, you seem to be right about the characters "Áá". I will exclude Catalan from the filter for these characters, even though it won't make much of a difference for the overall detection accuracy of the library.
Which other mappings are wrong in your opinion?
I was surprised to see "ç" listed for the Basque language.
See: https://en.wikipedia.org/wiki/Basque_alphabet
"Although ⟨c, ç, q, v, w, y⟩ are not used in traditional Basque language words, they were included in the Basque alphabet for writing loanwords"
Ç (c-cedilla) is only used in Basque for loanwords (especially from Catalan, such as surnames or place names). I do not think that is a strong signal that a word is Basque.
I'm not a native German speaker, but the Ç listed for German also surprised me: https://github.com/pemistahl/lingua-py/blob/502bb9abef2a31b841c49e063f1a0bd7e47af86d/lingua/_constant.py#L159
What is the source of the data used to build these rules?
As far as I remember, I added those extra characters to the languages because of loanwords. That's why I added the characters "Áá" to Catalan as well. This language filter only restricts the set of possible languages for a given text. It does not automatically classify a word with "Áá" as Catalan, for instance. So leaving the characters in the mapping won't do any harm, as the language is eventually determined by the statistics engine.
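To make the distinction between filtering and classifying concrete, here is a minimal sketch of the idea. It is not lingua-py's actual implementation: the names `CHARS_TO_LANGUAGES`, `filter_candidates`, and `detect`, the mapping contents, and the `score` callback standing in for the statistics engine are all hypothetical.

```python
# Hypothetical, simplified excerpt of a character-to-language mapping.
# The language sets are illustrative only, not copied from lingua-py.
CHARS_TO_LANGUAGES = {
    "Áá": {"CATALAN", "SPANISH", "PORTUGUESE"},
    "Çç": {"BASQUE", "CATALAN", "FRENCH", "GERMAN", "PORTUGUESE"},
}


def filter_candidates(text, candidates):
    """Keep every candidate that lists at least one special character seen in text.

    If the text contains none of the mapped characters, all candidates are kept.
    """
    narrowed = set()
    for chars, languages in CHARS_TO_LANGUAGES.items():
        if any(ch in chars for ch in text):
            narrowed |= languages & set(candidates)
    return narrowed or set(candidates)


def detect(text, candidates, score):
    """Filter first, then let the scoring function (assumed) pick the winner."""
    remaining = filter_candidates(text, candidates)
    # sorted() only makes tie-breaking deterministic in this sketch.
    return max(sorted(remaining), key=lambda language: score(text, language))
```

In this sketch, a text containing "á" keeps both Catalan and Spanish as candidates, and the final choice depends entirely on what `score` returns. An overly generous mapping entry therefore widens the candidate set but does not decide the result by itself.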
What is the source of the data used to build these rules?
Wikipedia and the Wortschatz corpora from Leipzig University.
Ok. Thanks. Feel free to close the ticket.
I found the Unicode CLDR data at https://unicode-org.github.io/cldr-staging/charts/latest/summary/ca.html useful. It explains which characters each language uses.
Thank you for the link to the Unicode data. The site looks useful. I've bookmarked it.
Hello! My understanding is that this mapping:
https://github.com/pemistahl/lingua-py/blob/502bb9abef2a31b841c49e063f1a0bd7e47af86d/lingua/_constant.py#L34
is used by the rule system to identify languages based on characters. Is my assumption correct?
Looking at this: https://github.com/pemistahl/lingua-py/blob/502bb9abef2a31b841c49e063f1a0bd7e47af86d/lingua/_constant.py#L191
Catalan, for example, does NOT have "Áá" as valid characters (see https://en.wikipedia.org/wiki/Catalan_orthography#Alphabet).
Looking at the data I see other mappings that do not seem right.
Could these mappings be improved?
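One way to look for such entries systematically is to compare the mapping against a reference alphabet per language. The sketch below is only an illustration: `REFERENCE_EXTRA_LETTERS`, `MAPPING_UNDER_REVIEW`, and `find_suspect_entries` are hypothetical names, the mapping excerpt is made up for the example, and only the Catalan letters (accented vowels plus ç) come from the orthography reference above.

```python
# Letters beyond a-z used in Catalan orthography (accented vowels and ç).
REFERENCE_EXTRA_LETTERS = {
    "CATALAN": set("àçéèíïóòúü"),
}

# Hypothetical excerpt of a character-to-language mapping under review.
MAPPING_UNDER_REVIEW = {
    "Áá": {"CATALAN", "SPANISH"},
    "Çç": {"CATALAN", "FRENCH"},
}


def find_suspect_entries(mapping, reference):
    """Return (chars, language) pairs where none of the chars is in the reference."""
    suspects = []
    for chars, languages in mapping.items():
        for language in sorted(languages):
            if language not in reference:
                continue  # no reference data for this language, skip it
            allowed = reference[language]
            if not any(ch.lower() in allowed for ch in chars):
                suspects.append((chars, language))
    return suspects


print(find_suspect_entries(MAPPING_UNDER_REVIEW, REFERENCE_EXTRA_LETTERS))
```

With the sample data, only the "Áá" entry for Catalan is reported. Entries added on purpose for loanwords would also show up in such a report, so the output is a list of candidates to review, not a list of errors.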