Open MohamedElrefai opened 2 years ago
I have run into a different problem while doing a similar thing.
bin/lmplz -o 3 --intermediate set1.intermediate < model1.arpa --skip_symbols
bin/lmplz -o 3 --intermediate set2.intermediate < model2.arpa --skip_symbols --discount_fallback
bin/interpolate -m set{1,2}.intermediate -w 0.9 0.1 > mix_model.arpa
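For reference, linear interpolation with weights 0.9/0.1 mixes the models in probability space, while ARPA files store base-10 log probabilities. A toy sketch of the arithmetic (my own illustrative function, not KenLM's implementation):

```python
import math

# Toy sketch (not KenLM's code): linear interpolation operates in
# probability space, p = w1*p1 + w2*p2, but ARPA entries are base-10
# log probabilities, so convert, mix, and convert back.
def mix_logprob(lp1, lp2, w1=0.9, w2=0.1):
    """Mix two log10 probabilities with the given weights."""
    return math.log10(w1 * 10 ** lp1 + w2 * 10 ** lp2)

# e.g. mix_logprob(-2.0, -3.0) lies between -2.0 and -3.0,
# pulled toward the score of the higher-weighted model.
```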
and the resulting mix_model.arpa looks like this:
\data\
ngram 1=644829
ngram 2=12670052
ngram 3=25758022
\1-grams:
-54.518303 <unk> 0
-52.52658 -0.0171115 -13.50311
-52.52658 -0.00951482 -13.50311
-53.48088 -0.0324709 -5.2374773
-52.52658 -0.0749187 -13.50311
-53.48088 -0.144883 -5.2374773
-40.502262 PAUSED -16.844946
-53.48088 -0.0970763 -5.2374773
-52.52658 -0.0147354 -13.50311
-52.52658 -0.0316612 -13.50311
-52.52658 -0.0027085 -13.50311
-53.48088 -0.0751489 -5.2374773
-53.48088 -4.1396 -2.1953142e-9
-53.48088 -0.00477613 -5.2374773
-53.48088 -0.0946923 -5.2374773
-53.48088 -0.0940779 -5.2374773
-53.48088 -0.0175576 -5.2374773
-53.48088 -0.0906019 -5.2374773
-53.48088 -0.00636295 -5.2374773
-53.48088 -0.00901661 -5.2374773
-53.48088 -0.241633 -5.2374773
-52.52658 -0.0709945 -13.50311
-53.48088 -0.0280209 -5.2374773
-52.52658 -0.101508 -13.50311
-53.48088 -0.0970096 -5.2374773
-53.48088 -0.0207892 -5.2374773
-39.809536 LOANING -9.177942
-53.48088 -0.0223462 -5.2374773
-53.48088 -5.2487 -6.8420647e-9
-53.48088 -0.0525024 -5.2374773
-53.48088 -0.0678001 -5.2374773
-52.52658 -0.0822201 -13.50311
-52.52658 -0.0873803 -13.50311
-53.48088 -2.0888 -1.8957937e-9
-53.48088 -0.0926938 -5.2374773
-52.52658 -0.143922 -13.50311
-53.48088 -6.5022 -4.0860257e-7
-53.48088 -0.215156 -5.2374773
-51.61946 -0.104799 -12.890841
-53.48088 -2.7527 -2.0234303e-9
-53.48088 -0.0728444 -5.2374773
-52.52658 -0.0054465 -13.50311
-44.464344 BERNICE -15.216694
-53.48088 -0.0780277 -5.2374773
-51.61946 -0.0782491 -12.890841
...
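To quantify how many unigram entries lost their word token, here is a quick diagnostic sketch (my own helper, assuming the three-column `logprob WORD backoff` layout shown above; an entry whose middle field parses as a number has no word):

```python
# Sketch: count \1-grams entries whose middle field is numeric,
# i.e. the "word" column is missing.
def count_wordless_unigrams(lines):
    in_unigrams = False
    bad = 0
    for line in lines:
        line = line.strip()
        if line == r"\1-grams:":
            in_unigrams = True
            continue
        if in_unigrams:
            if line.startswith("\\"):  # next section, e.g. \2-grams:
                break
            fields = line.split()
            if len(fields) == 3:  # well-formed: logprob WORD backoff
                try:
                    float(fields[1])
                    bad += 1  # middle field is a number: no word token
                except ValueError:
                    pass
    return bad
```

Called as `count_wordless_unigrams(open("mix_model.arpa"))` (filename illustrative), it would count entries like `-52.52658 -0.0171115 -13.50311` but skip normal ones like `-40.502262 PAUSED -16.844946`.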
What are these lines that contain only numbers? When I do the same thing with SRILM using the command below, I get a meaningful ARPA model.
ngram -lm model1.arpa -mix-lm model2.arpa -lambda 0.1 -write-lm mix_model.arpa
Also worth mentioning: the mix_model.arpa produced with KenLM is significantly larger than both inputs: model1.arpa = 216.5 MB, model2.arpa = 78.8 MB, mix_model.arpa = 1.3 GB.
The model mixed using SRILM is only mix_model.arpa = 223.9 MB.
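One way to compare the models' actual n-gram counts (rather than file sizes) is to read each file's `\data\` header, as in this sketch (my own helper; filenames illustrative):

```python
# Sketch: parse the n-gram counts from an ARPA file's \data\ header,
# e.g. {1: 644829, 2: 12670052, 3: 25758022} for the model above.
def ngram_counts(path):
    counts = {}
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            line = line.strip()
            if line.startswith("ngram "):
                order, _, n = line[len("ngram "):].partition("=")
                counts[int(order)] = int(n)
            elif line.endswith("-grams:"):
                break  # past the \data\ section
    return counts
```

Running it on mix_model.arpa versus model1.arpa and model2.arpa would show whether the 1.3 GB file really contains far more n-grams or just a bloated vocabulary.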
Any idea @kpu about this issue?
I have tried those commands to convert both language models to the intermediate format so I could interpolate them.
But the result is an ARPA LM with hashed words, which seems to be due to the binary LM. Is there any way to do the interpolation between the two different formats?
Output looks like: