Open GoogleCodeExporter opened 8 years ago
The main issue here is that when we read in an n-gram containing an OOV word, we need to change it to <unk> and accumulate (not replace) the probabilities. The backoff weights for the n-grams that contain <unk> will also need to be recomputed.
If you can't find a workaround, let me know. Otherwise, it might take me a while to get to this, as I am transitioning to my new job after grad school.
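To illustrate what "accumulate, not replace" means, here is a minimal sketch for the unigram case only. It is not mitlm code: the word list, the toy "log10prob word" lines, and the variable names are all made up for illustration, and the sketch does not recompute backoff weights.

```shell
#!/bin/bash
# Sketch only: unigram case, toy data, hypothetical in-vocabulary word list.
vocab="hello world"
# Toy "log10prob word" lines; foo and bar are OOV with respect to $vocab.
unigrams='-0.30103 hello
-0.60206 world
-0.90309 foo
-0.90309 bar'

merged=$(printf '%s\n' "$unigrams" | awk -v vocab="$vocab" '
BEGIN { n = split(vocab, v, " "); for (i = 1; i <= n; i++) in_vocab[v[i]] = 1 }
{
    # Map every OOV word to <unk>.
    word = ($2 in in_vocab) ? $2 : "<unk>"
    # Accumulate in the linear domain instead of overwriting the entry.
    prob[word] += exp($1 * log(10))
}
END { for (w in prob) printf "%.5f %s\n", log(prob[w]) / log(10), w }' | sort -k2)
echo "$merged"
```

With these toy numbers the two OOV words each carry probability 0.125, so <unk> ends up with their sum rather than with whichever entry was read last.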
Original comment by bojune...@gmail.com
on 26 Feb 2009 at 4:52
I also came across the same problem when interpolating open vocabulary language models.
At first, I created closed vocabulary models and interpolated them using mitlm, so that mitlm optimized the weights for the individual language model components:
~/projects/mitlm.0.4/interpolate-ngram \
-lm "corp1.lm, corp2.lm, corp3.lm, corp4.lm, corp5.lm" \
-vocab vocab.txt \
-interpolation LI \
-op corp_dev.txt \
-wl interpolated12345.lm
The output of this command contains:
...
OptParams = [ -3.712318 -5.325820 -1.963826 -1.999009 ]
...
Then I created open vocabulary models with the same parameters, with only "-unk" added. These models cannot be interpolated using mitlm, but srilm can do that, so you only need to convert the mitlm weights to srilm lambdas. Of course, you have to pass the models to mitlm and srilm in the same order.
-----------------------------
#!/bin/bash
mitlm_interpolation_weights="-3.712318 -5.325820 -1.963826 -1.999009"
all_weights=`octave -q --eval "a=[ $mitlm_interpolation_weights ];
disp(1-sum(exp(a))); disp(exp(a(2:end)));" | tr "\n" " "`
lambda1=`echo $all_weights | awk '{print $1}'`
lambda2=`echo $all_weights | awk '{print $2}'`
lambda3=`echo $all_weights | awk '{print $3}'`
lambda4=`echo $all_weights | awk '{print $4}'`
echo $lambda1 $lambda2 $lambda3 $lambda4
/usr/local/share/Srilm/bin/ngram \
-lm corp1.unk.lm \
-mix-lm corp2.unk.lm \
-mix-lm2 corp3.unk.lm \
-mix-lm3 corp4.unk.lm \
-mix-lm4 corp5.unk.lm \
-lambda $lambda1 \
-mix-lambda2 $lambda2 \
-mix-lambda3 $lambda3 \
-mix-lambda4 $lambda4 \
-unk \
-write-lm interpolated12345.unk.lm
-----------------------------
Please correct me if I am doing something wrong, but for me it seems to work
well.
Miso
Original comment by michal.f...@gmail.com
on 3 Mar 2010 at 1:21
Hmmm, it seems that my previous solution is not correct. Now I interpolate open vocabulary models in this way:
1. Build open vocab models with -unk
2. Change <unk> to fake_unk: sed 's/<unk>/fake_unk/g' < lm > lm.fake_unk
3. Interpolate models as closed vocab without -unk
4. Change fake_unk back to <unk>: sed 's/fake_unk/<unk>/g' < interp.lm.fake_unk > interp.lm
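The renaming in steps 2 and 4 could be wrapped as in the sketch below. The function names, the temporary directory, and the three-line stand-in for an ARPA file are assumptions for illustration. Step 3 is left as a comment because it depends on your mitlm setup. The sketch also assumes the placeholder fake_unk does not occur as a real word in any corpus.

```shell
#!/bin/bash
# Sketch of the <unk> round trip; file names here are hypothetical.
set -eu

# Step 2: hide <unk> behind an ordinary-looking placeholder token.
hide_unk()    { sed 's/<unk>/fake_unk/g' < "$1" > "$2"; }
# Step 4: restore <unk> in the interpolated model.
restore_unk() { sed 's/fake_unk/<unk>/g' < "$1" > "$2"; }

workdir=$(mktemp -d)
# Tiny stand-in for an ARPA file built with -unk (not a real model).
printf '%s\n' '\1-grams:' \
    '-1.853477 <unk> -0.301030' \
    '-0.602060 hello -0.301030' > "$workdir/corp1.unk.lm"

hide_unk "$workdir/corp1.unk.lm" "$workdir/corp1.lm.fake_unk"
# Step 3 would run here: interpolate the *.fake_unk models with mitlm
# as closed vocab models, without -unk.
restore_unk "$workdir/corp1.lm.fake_unk" "$workdir/roundtrip.lm"

# With no interpolation in between, the round trip must reproduce the input.
cmp "$workdir/corp1.unk.lm" "$workdir/roundtrip.lm" && echo "round trip OK"
```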
A few notes about the interpolation parameters:
* ARPA file probabilities are in log10
* N-1 parameters are trained during interpolation (N is the number of LMs)
* Weights for the individual LMs are: weights = [ 1 exp(param1) exp(param2) ... exp(paramN-1) ]
* The interpolated probability for a word W is:
  ( probLM1(W) * weights(1) + probLM2(W) * weights(2) + ... + probLMN(W) * weights(N) ) / sum(weights)
  ...where probLM1(W) is the probability (no longer in log10) of word W in LM number 1
% Matlab code for interpolation:
params = [ 0.798976 ]              % parameters from mitlm interpolation
weights = [ 1 exp(params) ]        % weights of individual LMs
probs_LM = [ -1.853477 -3.204265 ] % log10 probabilities of some word W in the individual LMs
probs_LM = 10.^probs_LM            % convert to normal probabilities
probs_LM*weights' / sum(weights)   % interpolated probability
log10(ans)                         % convert to log10 -> the number in the arpa file
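For anyone without Matlab or Octave at hand, the same arithmetic can be sketched in bash with awk. This assumes the weight formula given in the notes above. The variable names are illustrative, and the numbers are the ones from the Matlab example.

```shell
#!/bin/bash
# Sketch: interpolate one word's probability from mitlm LI parameters.
params="0.798976"                  # N-1 parameters from mitlm interpolation
log10_probs="-1.853477 -3.204265"  # log10 probabilities of W in the N LMs

interp_log10=$(awk -v params="$params" -v probs="$log10_probs" 'BEGIN {
    n = split(probs, lp, " ")
    m = split(params, p, " ")
    # weights = [ 1 exp(param1) ... exp(paramN-1) ]
    w[1] = 1
    for (i = 1; i <= m; i++) w[i + 1] = exp(p[i])
    total = 0; mix = 0
    for (i = 1; i <= n; i++) {
        total += w[i]
        mix   += w[i] * exp(lp[i] * log(10))   # 10^lp[i], awk log() is natural log
    }
    # Normalize by sum(weights) and convert back to log10.
    printf "%.6f\n", log(mix / total) / log(10)
}')
echo "$interp_log10"
```

The printed value should match what log10(ans) gives in the Matlab version, which is roughly -2.32 for these inputs.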
Original comment by michal.f...@gmail.com
on 5 May 2010 at 5:25
Original issue reported on code.google.com by
alu...@gmail.com
on 26 Feb 2009 at 4:42