Trying to build a language model on higher-order n-grams.

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
1. Use an n-gram dataset in Google Web1T format that contains no unigrams or 
bigrams (I am only interested in higher-order n-grams).
2. To conform to the required format, place an empty vocab_cs.gz file under 
subdir "1gms", and create a subdir named "2gms" containing a single empty 
file called "2gm-0001".
3. Note that the file names under the subdirs for higher-order n-grams do not 
start at <n>gm-0001 (for example, the files under 3gms start with 3gm-0021).

What is the expected output? What do you see instead?
Expected output:
    the expected binary file.
What actually happens:
    after reading and adding the n-grams, the following error is thrown:
    <a really big number> missing suffixes or prefixes were found, doing another pass to add n-grams {
    Exception in thread "main" java.lang.NullPointerException
            at edu.berkeley.nlp.lm.io.LmReaders.buildMapCommon(LmReaders.java:473)
            at edu.berkeley.nlp.lm.io.LmReaders.secondPassGoogle(LmReaders.java:417)
            at edu.berkeley.nlp.lm.io.LmReaders.readLmFromGoogleNgramDir(LmReaders.java:228)
            at edu.berkeley.nlp.lm.io.LmReaders.readLmFromGoogleNgramDir(LmReaders.java:204)
            at edu.berkeley.nlp.lm.io.MakeLmBinaryFromGoogle.main(MakeLmBinaryFromGoogle.java:36)

From the source code, I can see that the null pointer exception is thrown at 
the line which says
    numNgramsForEachWord[ngramOrder].incrementCount(headWord, 1);

What version of the product are you using? On what operating system?
    Tried with 1.1.2 and 1.1.5, both on Ubuntu 12.04

Please provide any additional information below.
    I am unable to share the dataset here, but I did manage to reproduce the error by making changes in the folder "/test/edu/berkeley/nlp/lm/io/googledir". These changes are the ones I describe in steps 1, 2 and 3 above. It seems that the empty vocab_cs.gz is what is causing this.

So the core of my question is this:

    What should I do if I only want to build a language model on 3-, 4- and 5-grams?

Original issue reported on code.google.com by ritwik.b...@gmail.com on 21 Nov 2014 at 3:57

GoogleCodeExporter commented 9 years ago

Wanted to add that the NullPointerException persists even when I use the 
original vocab_cs.gz file (instead of the dummy empty file that I initially 
tried).

Original comment by ritwik.b...@gmail.com on 21 Nov 2014 at 4:25

GoogleCodeExporter commented 9 years ago

Okay, so I have tried debugging for a few hours now, but with no success yet. 
Here is a toy dataset I created for debugging, shared in case it helps. As far 
as I can see, it stays true to the Google n-gram format, but after the n-grams 
are added, the same NullPointerException is thrown:

120 missing suffixes or prefixes were found, doing another pass to add n-grams {
Exception in thread "main" java.lang.NullPointerException
    at edu.berkeley.nlp.lm.io.LmReaders.buildMapCommon(LmReaders.java:473)
    at edu.berkeley.nlp.lm.io.LmReaders.secondPassGoogle(LmReaders.java:417)
    at edu.berkeley.nlp.lm.io.LmReaders.readLmFromGoogleNgramDir(LmReaders.java:228)
    at edu.berkeley.nlp.lm.io.LmReaders.readLmFromGoogleNgramDir(LmReaders.java:204)

Original comment by ritwik.b...@gmail.com on 21 Nov 2014 at 7:57

Attachments:

testdata-ngrams.tar.gz
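
For readers trying to interpret the "missing suffixes or prefixes" message, here is a minimal Java sketch of what such a check could look like on toy data. It assumes the message refers to higher-order n-grams whose prefix (the n-gram without its last word) or suffix (without its first word) is absent from the next lower order. The class and method names (MissingContextCheck, missingContexts) are hypothetical and are not part of Berkeley LM.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not Berkeley LM code: lists the lower-order
// n-grams that a set of higher-order n-grams would need but does not find.
public class MissingContextCheck {

    /** Drops the last word of a space-separated n-gram. */
    static String prefixOf(String ngram) {
        return ngram.substring(0, ngram.lastIndexOf(' '));
    }

    /** Drops the first word of a space-separated n-gram. */
    static String suffixOf(String ngram) {
        return ngram.substring(ngram.indexOf(' ') + 1);
    }

    /** Returns every prefix or suffix of the given n-grams that is absent from lowerOrder. */
    static Set<String> missingContexts(Set<String> ngrams, Set<String> lowerOrder) {
        Set<String> missing = new TreeSet<>();
        for (String ngram : ngrams) {
            if (!ngram.contains(" ")) {
                continue; // a unigram has no shorter context to look up
            }
            String prefix = prefixOf(ngram);
            String suffix = suffixOf(ngram);
            if (!lowerOrder.contains(prefix)) {
                missing.add(prefix);
            }
            if (!lowerOrder.contains(suffix)) {
                missing.add(suffix);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        Set<String> bigrams = new HashSet<>(Arrays.asList("the cat", "cat sat"));
        Set<String> trigrams = new HashSet<>(Arrays.asList("the cat sat", "cat sat down"));
        // "sat down" is the suffix of "cat sat down" and is not among the bigrams.
        System.out.println(missingContexts(trigrams, bigrams));
    }
}
```

With an empty "2gms" directory, every prefix and suffix of every trigram would be reported by a check of this kind, which would be consistent with the large number in the message.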

GoogleCodeExporter commented 9 years ago

I don't intend to support this use case. The code assumes that lower order 
n-grams are available for each higher order n-gram. If you manage to get this 
working yourself, let me know and I'd be happy to patch things in!

Original comment by adpa...@gmail.com on 6 Dec 2014 at 11:51

Changed state: WontFix
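
Since the code assumes that lower-order n-grams exist for each higher-order n-gram, one possible workaround is to derive the missing orders from the data itself before building the model. The Java sketch below illustrates only the counting step, under stated assumptions: n-grams are space-separated strings, and the count of a derived lower-order n-gram is taken as the larger of its total as a prefix and its total as a suffix, which is a heuristic and not the true corpus count. The class and method names (BackfillLowerOrders, deriveLowerOrder) are hypothetical, writing the results out in Google n-gram format is left out, and nothing in this thread confirms that Berkeley LM accepts data generated this way.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not Berkeley LM code: builds order n-1 counts
// from order n counts so that every prefix and suffix is present.
public class BackfillLowerOrders {

    /** Derives (n-1)-gram counts from n-gram counts keyed by space-separated text. */
    static Map<String, Long> deriveLowerOrder(Map<String, Long> ngrams) {
        Map<String, Long> asPrefix = new TreeMap<>();
        Map<String, Long> asSuffix = new TreeMap<>();
        for (Map.Entry<String, Long> entry : ngrams.entrySet()) {
            String ngram = entry.getKey();
            int firstSpace = ngram.indexOf(' ');
            int lastSpace = ngram.lastIndexOf(' ');
            if (firstSpace < 0) {
                continue; // a unigram has nothing shorter to derive
            }
            // Sum counts over all n-grams sharing the same prefix or suffix.
            asPrefix.merge(ngram.substring(0, lastSpace), entry.getValue(), Long::sum);
            asSuffix.merge(ngram.substring(firstSpace + 1), entry.getValue(), Long::sum);
        }
        // Assumption: keep the larger of the two totals as the estimated count.
        Map<String, Long> lower = new TreeMap<>(asPrefix);
        asSuffix.forEach((key, value) -> lower.merge(key, value, Math::max));
        return lower;
    }

    public static void main(String[] args) {
        Map<String, Long> trigrams = new TreeMap<>();
        trigrams.put("the cat sat", 5L);
        trigrams.put("the cat ran", 3L);
        trigrams.put("a cat sat", 2L);

        Map<String, Long> bigrams = deriveLowerOrder(trigrams);
        Map<String, Long> unigrams = deriveLowerOrder(bigrams);
        System.out.println(bigrams);
        System.out.println(unigrams);
    }
}
```

Applying deriveLowerOrder repeatedly, starting from the lowest order actually present (here 3-grams), yields bigram and unigram tables in which every prefix and suffix of the higher orders exists. The unigram table could then also serve as the source for a non-empty vocabulary file.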
