malcolmgreaves / language-detection

Automatically exported from code.google.com/p/language-detection . Some after-the-fact modifications to get this working within sbt.
Apache License 2.0

Profile generation problem #21

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Hello,

I have successfully generated profiles for MT, LV, SL, ET and LT, following the
steps described in the Wiki (cf. the Tools section). When I test language
detection with only the MT, LV, SL and ET profiles added, these 4 new profiles
are loaded correctly (no error appears). But when I add the LT profile, I get
the following error:

GRAVE: Error
java.lang.ArrayIndexOutOfBoundsException: -1
    at com.cybozu.labs.langdetect.DetectorFactory.addProfile(DetectorFactory.java:105)
    at com.cybozu.labs.langdetect.DetectorFactory.loadProfile(DetectorFactory.java:75)
    at Main.qualityCkeck(Main.java:215)
    at Main.main(Main.java:62)

Please find the files enclosed (profiles, and the Wiki abstract for LT only).
I am using language-detection API version 05-09-2011.

Regards,
Emmanuel

Original issue reported on code.google.com by zygolech...@gmail.com on 29 Aug 2011 at 9:13

GoogleCodeExporter commented 9 years ago
After several tests, it seems that the error also occurs with the LV profile.
The Wiki abstract for LV is enclosed.

Original comment by zygolech...@gmail.com on 29 Aug 2011 at 2:15

GoogleCodeExporter commented 9 years ago
DetectorFactory.loadProfile is meant to load all profiles at once, and only
once (I had intended to write a check for that, but there is no such code...).
So if your code calls loadProfile multiple times, try putting all profiles in
one directory and calling loadProfile just once.

Thanks.

Original comment by nakatani.shuyo on 30 Aug 2011 at 3:20

GoogleCodeExporter commented 9 years ago
That is what I already do. DetectorFactory.loadProfile(...) is called only
once, at the beginning of processing (if I call it more than once, I get the
LangDetectException("duplicate the same language profile")).

I think there is a problem with the LT or LV profiles, because when I remove
these profiles from the profiles directory, there is no problem at all.

After decompiling the JAR you provided (to find the exact line mentioned in
the exception), it seems that the error occurs on this line:
      double prob = ((Integer)profile.freq.get(word)).doubleValue() / profile.n_words[(word.length() - 1)];
(presumably because word.length() is 0, so the array index is -1).

After investigating, it seems that the entry ["":23] (an empty word with
frequency 23) was the problem. I removed it from the LV profile and it now
seems to work. Do you know where it came from? Perhaps it would be a good idea
to check the length of the word?
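
The length check suggested above could look like the following minimal sketch.
It is not the library's code: the class name ProfileGuardSketch, the method
probabilities and the parameter names are hypothetical, and the only part taken
from the decompiled line is the division of a word's frequency by
n_words[word.length() - 1]. Entries whose length cannot index the array (such
as the empty word "") are simply skipped.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the profile-loading step; not part of langdetect.
class ProfileGuardSketch {

    // Converts raw n-gram counts to probabilities, mirroring the decompiled
    // expression freq / n_words[word.length() - 1], but skipping any key
    // whose length would produce an out-of-range index.
    static Map<String, Double> probabilities(Map<String, Integer> freq, int[] nWords) {
        Map<String, Double> result = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> entry : freq.entrySet()) {
            String word = entry.getKey();
            int length = word.length();
            if (length < 1 || length > nWords.length) {
                continue; // "" would otherwise index nWords[-1]
            }
            result.put(word, entry.getValue().doubleValue() / nWords[length - 1]);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> freq = new LinkedHashMap<>();
        freq.put("", 23);   // the malformed entry found in the LV profile
        freq.put("a", 50);  // made-up counts for illustration
        freq.put("ab", 10);
        int[] nWords = {100, 40, 20}; // made-up totals for 1-, 2- and 3-grams
        System.out.println(probabilities(freq, nWords));
    }
}
```

With a check like this, a profile containing ["":23] would load without the
ArrayIndexOutOfBoundsException, although the cleaner fix is to stop the
generator from emitting the empty entry in the first place.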

Indeed, every EU language now has a working profile. Do you want me to send
the MT, SL, ET, LT and LV profiles so that you can include them in your next
release?

Regards,
Emmanuel

Original comment by zygolech...@gmail.com on 30 Aug 2011 at 7:51

GoogleCodeExporter commented 9 years ago
I see...
It is probably a bug in genprofile that generates features like ["":23].
I'll fix it in the next update.
Thank you very much!

> Do you want me to send the MT, SL, ET, LT and LV profiles in order to include
> these in your next release?

My policy is to provide only profiles that have been verified with test data,
and I didn't have test data for those languages, so I couldn't provide them...
But I am preparing test data for some languages (LT and so on).

Original comment by nakatani.shuyo on 30 Aug 2011 at 9:45

GoogleCodeExporter commented 9 years ago
Your policy is the right one :-)

Regards,
Emmanuel

Original comment by zygolech...@gmail.com on 30 Aug 2011 at 10:00

GoogleCodeExporter commented 9 years ago
This issue was closed by revision r100.

Original comment by nakatani.shuyo on 8 Sep 2011 at 10:27

GoogleCodeExporter commented 9 years ago
This issue was closed by revision 28880cd7672f.

Original comment by nakatani.shuyo on 12 Jan 2012 at 9:47