smilli / berkeleylm

Automatically exported from code.google.com/p/berkeleylm

ArrayIndexOutOfBoundsException while calling getLogProb #10

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago

I have two n-gram language models, A and B. A is a 5-gram LM; B is a 3-gram LM 
trained on a superset of the data used to train A. When I use B to estimate 
the likelihood of some sequences, the following exception is raised very 
frequently:

java.lang.ArrayIndexOutOfBoundsException: 2
    at edu.berkeley.nlp.lm.map.HashNgramMap.getOffsetHelpFromMap(HashNgramMap.java:405)
    at edu.berkeley.nlp.lm.map.HashNgramMap.getOffsetForContextEncoding(HashNgramMap.java:396)
    at edu.berkeley.nlp.lm.map.HashNgramMap.getValueAndOffset(HashNgramMap.java:294)
    at edu.berkeley.nlp.lm.ArrayEncodedProbBackoffLm.getBackoffSum(ArrayEncodedProbBackoffLm.java:133)
    at edu.berkeley.nlp.lm.ArrayEncodedProbBackoffLm.getLogProb(ArrayEncodedProbBackoffLm.java:97)
    at edu.berkeley.nlp.lm.ArrayEncodedNgramLanguageModel$DefaultImplementations.getLogProb(ArrayEncodedNgramLanguageModel.java:65)
    at edu.berkeley.nlp.lm.ArrayEncodedProbBackoffLm.getLogProb(ArrayEncodedProbBackoffLm.java:163)

The exception is never raised when using A.
Interestingly, even with B the exception is not _always_ raised, and very 
similar strings can behave differently. For example, the string:

"till you drive over the telly ."

does not generate an exception, while

"till you drive over the failure ."

does.

Even though it should not be relevant, both "telly" and "failure" are observed 
unigrams.

I am using berkeleylm 1.1.2 on OSX 10.8.2.
java -version:
 java version "1.6.0_37"
 Java(TM) SE Runtime Environment (build 1.6.0_37-b06-434-11M3909)
 Java HotSpot(TM) 64-Bit Server VM (build 20.12-b01-434, mixed mode)

Both language models are estimated with make-kneserney-arpa-from-raw-text and 
subsequently converted to binary using make-binary-from-arpa. 

The problematic language model is quite large, so uploading it for testing 
could be complicated. I am wondering whether anyone has ever observed a similar 
error and has any clue about the cause of the problem.

Thanks!

Original issue reported on code.google.com by daniele....@gmail.com on 3 Feb 2013 at 2:40

GoogleCodeExporter commented 9 years ago
A small addition. I have used half of the data used to train B to train both a 
3-gram and a 4-gram model. The 3-gram model exhibits the same kind of 
problematic behaviour, whereas the 4-gram model works smoothly. The problem, 
then, seems to be somehow related to the order of the model.

Original comment by daniele....@gmail.com on 3 Feb 2013 at 3:23

GoogleCodeExporter commented 9 years ago
I think the problem is that you are calling getLogProb with an n-gram that is 
longer than the order of the LM. I failed to provide appropriate documentation 
or decent error messages about this, so apologies on my part. But it's actually 
not quite clear what the user wants in this case: do you want me to score the 
n-gram in a scrolling window (a la NgramLanguageModel.scoreSequence), or just 
ignore the unused words of context?
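For reference, the scrolling-window interpretation amounts to scoring each word with at most order-1 words of preceding context. Here is a minimal sketch of which windows such a scorer would pass to getLogProb; the helper is hypothetical and does not use the BerkeleyLM API:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ScrollingWindow {
    // Split a sentence into the n-gram windows a scrolling-window scorer
    // would evaluate: each word paired with at most (order - 1) words of
    // preceding context, so no window ever exceeds the model order.
    static List<List<String>> windows(List<String> sentence, int order) {
        List<List<String>> out = new ArrayList<>();
        for (int i = 0; i < sentence.size(); i++) {
            int start = Math.max(0, i - (order - 1));
            out.add(sentence.subList(start, i + 1));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> sent =
                Arrays.asList("till", "you", "drive", "over", "the", "telly", ".");
        // For a 3-gram model: [till], [till, you], [till, you, drive],
        // [you, drive, over], [drive, over, the], [over, the, telly],
        // [the, telly, .]
        for (List<String> w : windows(sent, 3)) {
            System.out.println(w);
        }
    }
}
```

Summing the model's log-probability for each window would then reproduce scoreSequence-style behaviour without ever querying an n-gram longer than the order.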

In any case, can you please confirm that this is the issue? In parallel, I will 
add documentation and some improved error messages.

Original comment by adpa...@gmail.com on 3 Feb 2013 at 6:21

GoogleCodeExporter commented 9 years ago

Thanks for the fast reply!

Yes, I am thinking about the scrolling-window behaviour. On the other hand, how 
come some sequences of the same length can be scored without problems, whereas 
others cannot? I would expect the exception to be generated whenever a sequence 
is longer than the order.

Original comment by daniele....@gmail.com on 3 Feb 2013 at 6:33

GoogleCodeExporter commented 9 years ago
Right. I think the reason it doesn't always fail is that the lookup first finds 
the longest matching suffix from right to left, then computes whatever backoffs 
are left over. It's possible to match a 3-gram suffix and have a 2-gram backoff 
left over, so that the code never looks up a 4-gram, even though it was called 
on a 5-gram.
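That mechanism can be made concrete with a toy sketch. This is not BerkeleyLM's actual lookup code, and the stored n-grams are invented; it only models "extend the matched suffix right to left until a probe asks for an order the per-order storage was never sized for", which is why one sentence crashes and a near-identical one does not:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SuffixMatchDemo {
    static final int ORDER = 3;                         // a 3-gram model
    static final List<Set<String>> byOrder = new ArrayList<>();
    static {
        for (int i = 0; i < ORDER; i++) byOrder.add(new HashSet<>());
        // Invented model contents, chosen only to mirror the two sentences:
        byOrder.get(0).addAll(Arrays.asList("telly", "failure"));
        byOrder.get(1).addAll(Arrays.asList("the telly", "the failure"));
        byOrder.get(2).add("over the failure");         // only one stored trigram
    }

    // Grow the matched suffix right to left, one word at a time. The probe
    // for a length-(len + 1) suffix reads byOrder.get(len); once len reaches
    // ORDER this indexes past the per-order list -- the analogue of the
    // ArrayIndexOutOfBoundsException in the stack trace above.
    static int longestMatchedSuffix(List<String> query) {
        int n = query.size(), len = 0;
        while (len < n) {
            String suffix = String.join(" ", query.subList(n - len - 1, n));
            if (!byOrder.get(len).contains(suffix)) break;
            len++;
        }
        return len;
    }

    public static void main(String[] args) {
        // "telly": the match stops at the bigram, so order 4 is never probed.
        System.out.println(longestMatchedSuffix(
                Arrays.asList("drive", "over", "the", "telly")));
        // "failure": the trigram matches, so the next probe asks for order 4
        // and throws.
        try {
            longestMatchedSuffix(Arrays.asList("drive", "over", "the", "failure"));
        } catch (IndexOutOfBoundsException e) {
            System.out.println("order-4 probe: " + e.getClass().getSimpleName());
        }
    }
}
```

Whether a given query crashes thus depends entirely on how far its suffix happens to match into the model, not on the query's length alone.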

Original comment by adpa...@gmail.com on 4 Feb 2013 at 5:04

GoogleCodeExporter commented 9 years ago
I see, thanks for the clarification. I implemented the moving-window behaviour 
and the failures are resolved.

 Daniele

Original comment by daniele....@gmail.com on 4 Feb 2013 at 5:07

GoogleCodeExporter commented 9 years ago
I have changed the behaviour to ignore extra words of context, and added some 
documentation to reflect this. 

Original comment by adpa...@gmail.com on 9 Feb 2013 at 5:29

GoogleCodeExporter commented 9 years ago
Thanks! :)

Original comment by daniele....@gmail.com on 9 Feb 2013 at 9:44