MakeKneserNeyArpaFromText throws ArrayIndexOutOfBoundsException

GoogleCodeExporter commented 9 years ago

I am running edu.berkeley.nlp.lm.io.MakeKneserNeyArpaFromText on some German 
text but keep running into an ArrayIndexOutOfBoundsException. If I try to 
build a model from very limited data, no such error arises. Is there a limit 
on the number of distinct characters the input text can contain? The 
out-of-bounds array index is 256, which is suspiciously the number of distinct 
values a byte can hold.

I have attached the input file (German Wikipedia data prepared for a 
character-level n-gram model).

Here is the output I am seeing:

Reading text files [de-test.txt] and writing to file en-test.model {
    Reading from files [de-test.txt] {
        On line 0
        Writing ARPA {
            On order 1
            Writing line 0
            On order 2
            Writing line 0
            On order 3
            Writing line 0
            Writing line 0
            On order 4
            Writing line 0
[WARNING] 
java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.codehaus.mojo.exec.ExecJavaMojo$1.run(ExecJavaMojo.java:297)
    at java.lang.Thread.run(Thread.java:619)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 256
    at java.lang.Long.valueOf(Long.java:548)
    at edu.berkeley.nlp.lm.map.ExplicitWordHashMap$KeyIterator.next(ExplicitWordHashMap.java:132)
    at edu.berkeley.nlp.lm.map.ExplicitWordHashMap$KeyIterator.next(ExplicitWordHashMap.java:113)
    at edu.berkeley.nlp.lm.collections.Iterators$Transform.next(Iterators.java:107)
    at edu.berkeley.nlp.lm.io.KneserNeyLmReaderCallback.writeToPrintWriter(KneserNeyLmReaderCallback.java:130)
    at edu.berkeley.nlp.lm.io.KneserNeyLmReaderCallback.cleanup(KneserNeyLmReaderCallback.java:111)
    at edu.berkeley.nlp.lm.io.TextReader.countNgrams(TextReader.java:85)
    at edu.berkeley.nlp.lm.io.TextReader.readFromFiles(TextReader.java:51)
    at edu.berkeley.nlp.lm.io.TextReader.parse(TextReader.java:44)
    at edu.berkeley.nlp.lm.io.LmReaders.createKneserNeyLmFromTextFiles(LmReaders.java:280)
    at edu.berkeley.nlp.lm.io.MakeKneserNeyArpaFromText.main(MakeKneserNeyArpaFromText.java:55)

Original issue reported on code.google.com by hhohw...@shutterstock.com on 9 Aug 2012 at 4:48

Attachments:

de-test.txt

GoogleCodeExporter commented 9 years ago

Hi,

Interesting. When I run on that file, there is an exception from a bug (which I 
have fixed), but it is not that exception. That stack trace looks an awful lot 
like the caching inside the built-in Java Long class is doing funny things -- 
might it have something to do with your ExecJavaMojo calling things through 
reflection?

In any case, I have fixed the bug and am running some tests before I release a 
fix. 1.1.1 should be out by tomorrow.

Original comment by adpa...@gmail.com on 9 Aug 2012 at 5:31

Changed state: Started
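
For context on the Long caching mentioned above: Long.valueOf in Sun's Java 6 serves small values from a preallocated array instead of allocating a new object each time. The sketch below is a simplified model of that scheme, not the JDK source. The class name SmallLongCache, the method slotFor, and the exact range are assumptions made for illustration. It shows why a cache covering -128..127 has exactly 256 slots.

```java
public class SmallLongCache {
    // Assumed cached range for this sketch (hypothetical constants, not JDK fields).
    static final int LOW = -128;
    static final int HIGH = 127;

    // One slot per cached value: 127 - (-128) + 1 = 256, so valid indices are 0..255.
    static final int SIZE = HIGH - LOW + 1;

    /** Returns the cache slot for value, or -1 if the value is outside the cached range. */
    static int slotFor(long value) {
        if (value >= LOW && value <= HIGH) {
            return (int) value - LOW; // maps -128 to slot 0 and 127 to slot 255
        }
        return -1; // not cached; a real implementation would allocate a new object here
    }

    public static void main(String[] args) {
        System.out.println("slots: " + SIZE);
        System.out.println("slot for -128: " + slotFor(-128));
        System.out.println("slot for 127: " + slotFor(127));
        System.out.println("slot for 128: " + slotFor(128));
    }
}
```

In this model the largest reachable index is 255, so an index of 256 would mean the range check was skipped or evaluated incorrectly. That is consistent with, though not proof of, a problem below the level of the berkeleylm code.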

GoogleCodeExporter commented 9 years ago

Hi,

Thanks for looking into the issue so quickly.

Interesting that you don't see the same exception. I assume that since
berkeleylm is written in Java, it should support input encoded in UTF-8. Is
that a fair assumption?

I have tried calling the program through Maven (I imported all the source)
and also without using Maven at all, and I see the same exception in both
cases, which is a bit odd if it is caused by reflection.

Original comment by hhohw...@shutterstock.com on 9 Aug 2012 at 5:43

GoogleCodeExporter commented 9 years ago

UTF-8 should be fine. Hopefully the fix I've committed will resolve your issue 
in any case.

Original comment by adpa...@gmail.com on 9 Aug 2012 at 7:33
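
On the earlier question about a limit on distinct characters: one way to see how many distinct characters an input really contains, independently of berkeleylm, is to decode it explicitly as UTF-8 and count code points. This is only a minimal sketch. The class and method names are hypothetical, nothing here is berkeleylm API, and StandardCharsets requires Java 7 or later.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;

public class DistinctCodePoints {
    /** Counts the distinct Unicode code points in text. */
    static int countDistinct(String text) {
        Set<Integer> seen = new HashSet<Integer>();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            seen.add(cp);
            i += Character.charCount(cp); // advance past surrogate pairs correctly
        }
        return seen.size();
    }

    public static void main(String[] args) {
        // Sample text with German letters and a euro sign, written with Unicode
        // escapes so the source file encoding does not matter.
        byte[] utf8 = "Gr\u00f6\u00dfe \u20ac".getBytes(StandardCharsets.UTF_8);
        // Decode with an explicit charset rather than the platform default.
        String decoded = new String(utf8, StandardCharsets.UTF_8);
        System.out.println(countDistinct(decoded) + " distinct code points");
    }
}
```

For a real corpus, the same counting method could be applied line by line to a reader opened with an explicit UTF-8 charset.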

GoogleCodeExporter commented 9 years ago

Apologies, I fell asleep on this fix. Version 1.1.1 has been uploaded. Let me 
know if this doesn't fix your issue.

Original comment by adpa...@gmail.com on 13 Aug 2012 at 2:02

Changed state: Fixed

GoogleCodeExporter commented 9 years ago

I unzipped the new 1.1.1 code but unfortunately am still seeing the same 
ArrayIndexOutOfBoundsException. I have tried on a different input data set in 
case that was the problem (en-test.txt, attached below) but I see the same 
problem on that input.

Here are the steps I took to produce the error:

1. Unzip the code
2. cd to the top level directory, berkeleylm-1.1.1
3. Run ant from the top level directory
4. From the top level directory, run:
java -cp jar/berkeleylm.jar edu.berkeley.nlp.lm.io.MakeKneserNeyArpaFromText 5 test-en.model en-test.txt
5. Output is:
Reading text files [en-test.txt] and writing to file test-en.model {
    Reading in ngrams from raw text {
        On line 0
    } [2s]
    Writing Kneser-Ney probabilities {
        Counting counts for order 0 {
        } [0s]
        Counting counts for order 1 {
        } [0s]
        Counting counts for order 2 {
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 256
    at java.lang.Long.valueOf(Long.java:548)
    at edu.berkeley.nlp.lm.map.ExplicitWordHashMap$KeyIterator.next(ExplicitWordHashMap.java:140)
    at edu.berkeley.nlp.lm.map.ExplicitWordHashMap$KeyIterator.next(ExplicitWordHashMap.java:121)
    at edu.berkeley.nlp.lm.collections.Iterators$Transform.next(Iterators.java:107)
    at edu.berkeley.nlp.lm.io.KneserNeyLmReaderCallback.parse(KneserNeyLmReaderCallback.java:284)
    at edu.berkeley.nlp.lm.io.LmReaders.createKneserNeyLmFromTextFiles(LmReaders.java:299)
    at edu.berkeley.nlp.lm.io.MakeKneserNeyArpaFromText.main(MakeKneserNeyArpaFromText.java:57)

Original comment by hhohw...@shutterstock.com on 15 Aug 2012 at 11:34

Attachments:

en-test.txt

GoogleCodeExporter commented 9 years ago

Followed your steps and did not encounter any exceptions. I'm guessing this is 
a bug in your JVM -- the exception is occurring while boxing a long! You can 
try using a different JVM, or even try using -server (which you should do 
anyway, for speed).

Original comment by adpa...@gmail.com on 15 Aug 2012 at 5:10
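
To check whether a given JVM misbehaves when boxing longs, independently of berkeleylm, a small standalone program can box a range of values many times and verify that each one round-trips. This is a rough sketch under stated assumptions: the class name BoxingCheck, the value range, and the round count are arbitrary choices, and the loop is not guaranteed to hit the same code path that failed in the stack trace above.

```java
public class BoxingCheck {
    /** Boxes every value in [from, to] the given number of rounds and counts round-trip mismatches. */
    static long countMismatches(long from, long to, int rounds) {
        long mismatches = 0;
        for (int r = 0; r < rounds; r++) {
            for (long v = from; v <= to; v++) {
                Long boxed = Long.valueOf(v); // the call that appears in the stack trace
                if (boxed.longValue() != v) {
                    mismatches++;
                }
            }
        }
        return mismatches;
    }

    public static void main(String[] args) {
        // Report which JVM is running, similar to what "java -version" shows.
        System.out.println("java.version = " + System.getProperty("java.version"));
        System.out.println("java.vm.name = " + System.getProperty("java.vm.name"));
        // Many rounds, so the JIT compiler has a chance to compile the loop.
        long mismatches = countMismatches(-1000, 1000, 20000);
        System.out.println("mismatches = " + mismatches);
    }
}
```

A clean run does not rule out a JVM problem, but an exception or a nonzero mismatch count here would point away from the library.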

GoogleCodeExporter commented 9 years ago

Thanks again for testing this out. It is quite odd that the error comes from 
boxing a long. I ran both with and without -server but saw the exception in 
both cases. I'm going to try a different JVM. Would you mind posting the output 
you get from running "java -version" so that I can start with that 
implementation? I'm using HotSpot 64 bit:

$ java -version
java version "1.6.0_10"
Java(TM) SE Runtime Environment (build 1.6.0_10-b33)
Java HotSpot(TM) 64-Bit Server VM (build 11.0-b15, mixed mode)

Thanks for the help.

Original comment by hhohw...@shutterstock.com on 15 Aug 2012 at 5:28

GoogleCodeExporter commented 9 years ago

$ java -version
java version "1.6.0_33"
Java(TM) SE Runtime Environment (build 1.6.0_33-b03-424-10M3720)
Java HotSpot(TM) 64-Bit Server VM (build 20.8-b03-424, mixed mode)

Original comment by adpa...@gmail.com on 15 Aug 2012 at 5:56

GoogleCodeExporter commented 9 years ago

I updated my java-6-sun JVM to 1.6.0_34; I had been using a version from 2008. 
I no longer see the exception. Looks like Oracle has been hard at work fixing 
autoboxing issues in the last few years. :)

Original comment by hhohw...@shutterstock.com on 15 Aug 2012 at 8:58
