jiaxibei2008 / mitlm

Automatically exported from code.google.com/p/mitlm
BSD 3-Clause "New" or "Revised" License

Interpolation is broken #6

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago

After updating from SVN, both LI and CM interpolation seem to be broken: the
interpolated LM contains many "nan" values and most back-off weights are zero.

Sample interpolated LM:

\data\
ngram 1=199992
ngram 2=865062
ngram 3=2657490
ngram 4=4259246

\1-grams:
-1.564484       </s>
-99     <s>     nan

[...]

-5.898262       Abadan
-6.074985       Abadi
-6.242569       Abadia  0.000000
-6.105848       Abadie
-6.242569       Abadou  0.000000

[...]

\2-grams:
nan     </s> -t-il      -0.019559
-2.477506       </s> <UNK>      -0.299104
nan     </s> A  -0.020696
nan     </s> A.
nan     </s> A.B.
nan     </s> A.K.
nan     </s> ABM
nan     </s> ACF
nan     </s> AFP        -0.045175

The source LMs (estimated with estimate-ngram) seem to be OK.
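
To confirm which files are affected, a small script can count the "nan" fields per n-gram order. The following is only a sketch: it assumes the standard ARPA text layout shown above (log probability, then the words, then an optional back-off weight), and the function names `is_nan` and `count_nan_entries` are made up for illustration; they are not part of MITLM.

```python
import math
from collections import Counter


def is_nan(field):
    """Return True if the text field parses as a floating-point NaN."""
    try:
        return math.isnan(float(field))
    except ValueError:
        return False


def count_nan_entries(lines):
    """Count entries whose probability or back-off field is NaN, per n-gram order."""
    order = 0
    nan_counts = Counter()
    for line in lines:
        line = line.strip()
        # Section headers look like "\1-grams:", "\2-grams:", ...
        if line.startswith("\\") and line.endswith("-grams:"):
            order = int(line[1:line.index("-")])
            continue
        # Skip the \data\ header block, blank lines, and the \end\ marker.
        if order == 0 or not line or line.startswith("\\"):
            continue
        fields = line.split()
        numeric = [fields[0]]              # log10 probability
        if len(fields) == order + 2:       # optional trailing back-off weight
            numeric.append(fields[-1])
        if any(is_nan(f) for f in numeric):
            nan_counts[order] += 1
    return dict(nan_counts)
```

Passing it a text-mode handle, for example one opened with `gzip.open(path, "rt")`, should give a per-order count for both the component LMs and the interpolated LM.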

Original issue reported on code.google.com by alu...@gmail.com on 16 Dec 2008 at 10:24

GoogleCodeExporter commented 8 years ago
Hi alumae,

Can you please include the scripts you used to estimate and interpolate the models?

Paul

Original comment by bojune...@gmail.com on 16 Dec 2008 at 3:29

GoogleCodeExporter commented 8 years ago
To estimate models, I used:
estimate-ngram --read-text <train.i.txt> -v vocab.txt --use-unknown --smoothing ModKN -o 4 --write-count <model.i>.arpa.counts --write-lm <model.i>.arpa.gz

To interpolate, I used:
interpolate-ngram -l <model.1>.arpa.gz <model.2>.arpa.gz <model.3>.arpa.gz -o 4 --read-parameters interpolate.params -i CM -write-lm final.arpa.gz

The same thing happens when I use simple linear interpolation:
interpolate-ngram -l <model1>.arpa.gz <model2>.arpa.gz <model3>.arpa.gz -o 4 -write-lm final.arpa.gz

There are no such "nans" in the component LMs.

Original comment by alu...@gmail.com on 16 Dec 2008 at 3:46

GoogleCodeExporter commented 8 years ago
I just realized that there are 2-grams such as:

-5.363780       </s> Ababacar   -0.315952
-5.611012       </s> Abassi     -0.225136

in the component LMs, which of course do not make sense. Maybe you are mixing up
the begin-of-sentence and end-of-sentence markers somewhere?
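
Such entries can be listed with a quick check along these lines. This is a rough sketch that assumes the usual ARPA text layout; `misplaced_eos` is a hypothetical helper name, not an MITLM tool.

```python
def misplaced_eos(lines, eos="</s>"):
    """Return the n-grams in which the end-of-sentence token is not the last word."""
    order = 0
    hits = []
    for line in lines:
        line = line.strip()
        # Section headers look like "\2-grams:", "\3-grams:", ...
        if line.startswith("\\") and line.endswith("-grams:"):
            order = int(line[1:line.index("-")])
            continue
        # Unigrams cannot have a misplaced token; also skip headers and blanks.
        if order < 2 or not line or line.startswith("\\"):
            continue
        # Field 0 is the log probability; the next `order` fields are the words.
        words = line.split()[1:order + 1]
        if eos in words[:-1]:
            hits.append(" ".join(words))
    return hits
```

An empty result for a component LM would suggest its sentence boundaries are written the way SRILM-style tools expect.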

Original comment by alu...@gmail.com on 16 Dec 2008 at 4:05

GoogleCodeExporter commented 8 years ago
Verified that the problem only exists if --read-vocab and --use-unknown are specified.

Original comment by bojune...@gmail.com on 16 Dec 2008 at 4:07

GoogleCodeExporter commented 8 years ago
In the last change, I intentionally merged <s> and </s> together since it
simplifies the internal logic and removes a lot of special cases.  As you have
noticed, I have not made the LM output completely compatible with SRILM yet.  I
do not believe this is the issue though.  I will let you know once I figure out
what is going on, hopefully in an hour or so.

Original comment by bojune...@gmail.com on 16 Dec 2008 at 4:16

GoogleCodeExporter commented 8 years ago
Bug Fixes
=========
- Cleaned up usage of NaN such that it should no longer appear.  Unobserved
  backoff weights are assumed to be 1, not NaN.
- Only output backoff weight if log value is not 0.
- Cleaned up collapse of <s> and </s> such that ARPA LM loading/saving is
  unaffected.
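
For readers following along, the first two fixes amount to roughly the following. This is an illustrative Python sketch only, not the actual MITLM code (which is C++); `backoff_weight` and `format_entry` are names invented here, and the tab-separated layout is assumed from the sample output earlier in this thread.

```python
import math


def backoff_weight(observed):
    """Treat an unobserved (None or NaN) backoff weight as 1.0."""
    if observed is None or math.isnan(observed):
        return 1.0
    return observed


def format_entry(log_prob, words, bow=None):
    """Format one ARPA entry, writing the backoff field only if its log is non-zero."""
    fields = [f"{log_prob:.6f}", " ".join(words)]
    log_bow = math.log10(backoff_weight(bow))
    if log_bow != 0.0:
        fields.append(f"{log_bow:.6f}")
    return "\t".join(fields)
```

Since log10(1) is 0, an unobserved backoff weight is simply left off the line, which is why entries such as "Abadia  0.000000" should no longer carry a trailing zero.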

Original comment by bojune...@gmail.com on 16 Dec 2008 at 7:30

GoogleCodeExporter commented 8 years ago
Thanks, seems to work perfectly.

Original comment by alu...@gmail.com on 17 Dec 2008 at 11:19