Hi alumae,
I'll admit that I didn't test the C++ implementation with binary LMs and non-order-3 models enough (I previously had a hybrid Python implementation that saw more use). While I try to figure out what could go wrong from the stack trace, can you try the same experiment with order-3 LMs and ARPA LM inputs? Thanks.
Paul
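For reference, a sketch of what the order-3 / ARPA variant of the experiment might look like; the flags mirror the interpolate-ngram command quoted later in this thread, and the file names are placeholders:

interpolate-ngram -l model1.arpa model2.arpa -o 3 --optimize-perplexity dev.txt --write-lm out.arpa.gz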
Original comment by bojune...@gmail.com
on 10 Dec 2008 at 2:59
For long argument names, you need two dashes. One dash indicates a sequence of one-letter arguments, since I used the boost::program_options package to parse the input arguments. (I probably made the same mistake in my documentation.)

interpolate-ngram -l build/lm/tmp/model1.mitlm model2.mitlm -o 4 --optimize-perplexity dev.txt --write-lm out.arpa.gz
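To illustrate the dash behavior, here is a minimal boost::program_options sketch. The option names mirror the command above, but this is not MITLM's actual option table, just an illustrative example of long names versus one-letter aliases:

    // Illustrative sketch only -- not MITLM's real option definitions.
    #include <boost/program_options.hpp>
    #include <iostream>
    #include <string>

    namespace po = boost::program_options;

    int main(int argc, char* argv[]) {
        po::options_description desc("Options");
        desc.add_options()
            // "order,o" registers a long name (--order) and a one-letter alias (-o).
            ("order,o", po::value<int>()->default_value(3), "n-gram order")
            // Long-only names must be spelled with two dashes on the command line:
            // "--optimize-perplexity dev.txt" is read as the long option,
            // "-optimize-perplexity" is not.
            ("optimize-perplexity", po::value<std::string>(), "dev text for perplexity optimization")
            ("write-lm", po::value<std::string>(), "output LM file");

        po::variables_map vm;
        po::store(po::parse_command_line(argc, argv, desc), vm);
        po::notify(vm);

        std::cout << "order = " << vm["order"].as<int>() << "\n";
        if (vm.count("optimize-perplexity"))
            std::cout << "dev set = " << vm["optimize-perplexity"].as<std::string>() << "\n";
        return 0;
    }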
Original comment by bojune...@gmail.com
on 10 Dec 2008 at 3:04
- Need to create PerplexityOptimizer/WordErrorRateOptimizer with the same n-gram order as the model being optimized (see the sketch after this list).
- Current code is not robust enough to optimize using mismatched orders. (Issue 5)
- SVN Revision 18.
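A minimal, purely hypothetical sketch of the constraint described in the first item: the class names echo the comment, but the types and signatures below are invented for illustration and are not MITLM's API.

    // Hypothetical illustration only -- not MITLM's actual classes or signatures.
    #include <cassert>
    #include <cstddef>
    #include <iostream>

    struct NgramLM {
        std::size_t order;  // n-gram order of the (interpolated) model
    };

    struct PerplexityOptimizer {
        // Tie the optimizer's order to the model's order at construction time,
        // so a 4-gram model cannot be evaluated with a default order-3 optimizer.
        explicit PerplexityOptimizer(const NgramLM& lm) : order(lm.order) {}
        std::size_t order;
    };

    int main() {
        NgramLM lm{4};                // e.g. the 4-gram model from this issue
        PerplexityOptimizer opt(lm);  // optimizer inherits the model's order
        assert(opt.order == lm.order);
        std::cout << "optimizing at order " << opt.order << "\n";
    }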
Original comment by bojune...@gmail.com
on 10 Dec 2008 at 3:23
With trigrams and the ARPA format it works (LI and CM; I didn't test GLI).
Original comment by alu...@gmail.com
on 10 Dec 2008 at 3:28
I should also warn you that the current recipe for count merging is not completely correct, since I made the mistake of assuming that c(h) = sum_w c(h w), which is not true for Kneser-Ney smoothing, as it modifies the lower-order counts. The results should not be significantly different, though, and we still get a valid n-gram model.
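To spell out the mismatch (standard Kneser-Ney definitions, nothing toolkit-specific): with ordinary counts, c(h) = sum_w c(h w) holds by marginalization. Kneser-Ney, however, replaces each lower-order count c(h w) by the continuation count N_1+(* h w) = |{ v : c(v h w) > 0 }|, the number of distinct words seen preceding h w. Summing those modified counts over w gives N_1+(* h *), which in general differs from c(h), so the assumption above breaks for the lower orders.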
I'll provide an updated recipe hopefully in a week or two.
Original comment by bojune...@gmail.com
on 10 Dec 2008 at 3:36
OK, thanks! I confirm that LI and CM work now with ARPA 4-grams; I didn't test the binary format.
Original comment by alu...@gmail.com
on 10 Dec 2008 at 3:59
Original issue reported on code.google.com by alu...@gmail.com
on 10 Dec 2008 at 1:41