shuyo / language-detection

This is a language detection library implemented in plain Java. (aliases: language identification, language guessing)
https://github.com/shuyo/language-detection/blob/wiki/ProjectHome.md

Portuguese detection problem #37

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
Input: NO PODEÍS PREPARAR A VUESTROS ALUMNOS PARA QUE CONSTRUYAN MAÑANA EL 
MUNDO DE SUS SUEÑOS SI VOSOTROS YA NO CREÉIS EN ESOS SUEÑOS NO PODEÍS 
PREPARARLOS PARA LA VIDA SINO CREÉIS EN ELLA NO PODRÉIS MOSTRAR EL CAMINO SI 
OS HABEÍS SENTADO CANSADOS Y DESALENTADOS EN LA ENCRUCIJADA CELESTIN FREINET 
FRANCIA 

output: [pt:0.5714263645442876, de:0.428569792470217]

I create the detector through the factory, append the text, and then detect. I don't set a seed.

What is the expected output? What do you see instead?

expected: Spanish
result: [pt:0.5714263645442876, de:0.428569792470217]

What version of the product are you using? On what operating system?

latest

I am a bit surprised it would show German. Is it the upper case that causes the 
problem? At times I even see German as the main language; I suppose it depends 
on the seed?

Thank you!

Original issue reported on code.google.com by thk.k...@gmail.com on 15 Jun 2012 at 10:55

GoogleCodeExporter commented 9 years ago
Hi!

I'm facing the same issue, and using upper-case characters for the whole text 
seems to be the cause of the problem.
A simple workaround is to convert the whole text to lower case before detection. 
We tested it on about 150 cases that returned wrong results on upper-case text, 
and it worked for all of them.
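
The workaround above can be sketched in a few lines of Java. This is a minimal sketch, not part of the library: the class and method names (`CaseNormalizer`, `toDetectorInput`) are hypothetical, and using `Locale.ROOT` is an assumption made so that the result does not depend on the JVM's default locale (under a Turkish locale, for example, `I` would be lower-cased to a dotless `ı`).

```java
import java.util.Locale;

public class CaseNormalizer {

    /**
     * Hypothetical helper: lower-cases the text with a fixed locale
     * before it is handed to the language detector.
     */
    public static String toDetectorInput(String text) {
        return text.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // Fragment of the all-caps input from the original report.
        String input = "PREPARARLOS PARA LA VIDA";
        System.out.println(toDetectorInput(input));
    }
}
```

The returned string would then be appended to the detector in place of the raw text.
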
As I understand it, the corpus used to create the profiles contains upper-case 
characters only at the start of sentences. As a result, the profiles rarely 
define upper-case n-grams longer than one character, and when they do, the 
weight is very low.
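
A toy illustration of this effect, assuming a "profile" is just the set of character trigrams seen in sentence-case training text. This is not the library's actual profile format or scoring; the class `NgramCaseDemo` and its methods are hypothetical and only show why an all-caps input finds almost nothing to match.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class NgramCaseDemo {

    /** Collects every distinct 3-character substring of the text. */
    public static Set<String> trigrams(String text) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + 3 <= text.length(); i++) {
            grams.add(text.substring(i, i + 3));
        }
        return grams;
    }

    /** Counts how many distinct trigrams of the text occur in the profile. */
    public static int countKnown(Set<String> profile, String text) {
        int hits = 0;
        for (String gram : trigrams(text)) {
            if (profile.contains(gram)) {
                hits++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Toy "profile" built from sentence-case text (assumption: the
        // training corpus only capitalizes the first letter of a sentence).
        Set<String> profile = trigrams("Preparar a los alumnos para la vida.");

        String upper = "PREPARAR A LOS ALUMNOS PARA LA VIDA.";
        String lower = upper.toLowerCase(Locale.ROOT);

        System.out.println("upper-case hits: " + countKnown(profile, upper));
        System.out.println("lower-case hits: " + countKnown(profile, lower));
    }
}
```

In this toy setup the all-caps text shares no trigram with the profile, while its lower-cased form matches nearly all of them.
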
The profiles could be regenerated from three versions of the corpus (the raw 
content, the content converted to lower case, and the content converted to 
upper case) and would thus cover all these cases.
Another idea would be to make the detection case-insensitive by regenerating 
the profiles from the content converted to lower case and automatically 
converting the submitted text to lower case.
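
The two proposals differ only in how the training text is prepared before n-grams are counted. A minimal sketch of that preparation step, assuming hypothetical helper names (`TrainingCaseVariants`, `allCaseVariants`, `lowerCaseOnly`); the library's own profile generation tooling is not shown here.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

public class TrainingCaseVariants {

    /** First proposal: train on the raw, lower-cased and upper-cased text. */
    public static List<String> allCaseVariants(String raw) {
        return Arrays.asList(
                raw,
                raw.toLowerCase(Locale.ROOT),
                raw.toUpperCase(Locale.ROOT));
    }

    /**
     * Second proposal: train on lower-cased text only. The submitted text
     * must then be lower-cased the same way before detection.
     */
    public static List<String> lowerCaseOnly(String raw) {
        return Collections.singletonList(raw.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        String raw = "La vida";
        System.out.println(allCaseVariants(raw));
        System.out.println(lowerCaseOnly(raw));
    }
}
```

The first option makes the training text three times larger, while the second keeps it the same size but depends on every caller lower-casing its input consistently.
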
Regards 
Jerome

Original comment by gro...@gmail.com on 6 Feb 2014 at 3:25

GoogleCodeExporter commented 9 years ago
By the way, this issue should be renamed.

Original comment by gro...@gmail.com on 6 Feb 2014 at 3:30