kpu / kenlm

KenLM: Faster and Smaller Language Model Queries
http://kheafield.com/code/kenlm/

interpolate LM created with Kenlm with binary format and arpa format with weight 0.5 and 0.5 for each LM #395

Open MohamedElrefai opened 2 years ago

MohamedElrefai commented 2 years ago

I have tried these commands to convert both language models to the intermediate format before interpolating them:

bin/lmplz -o 3  --intermediate set1.intermediate <lm.binary --skip_symbols
bin/lmplz -o 3  --intermediate set2.intermediate <data.arpa --skip_symbols
bin/interpolate -m set{1,2}.intermediate -w 0.5 0.5 >model.arpa

But it produces an ARPA LM with garbled words, which seems to be caused by the binary LM. Is there any way to do the interpolation between two different formats?

Output looks like:

-7.998156 0 -7.995031 5æf„kÕROc¬ÇJáЯ:Ž0mJWI•„B#N2Ú?/CÞš| pMFÖõš!uÃôq0„t‹hÜv7×fŒŸÔa+z¥Ãp[‰ÖD£3ò~i8Í℔_—JBO -0.0000030237939 -7.995031 =õô*M -0.0000030237939 -7.995031 æUüÿ‚Œ€gœño®óŒ4þY¿AÇ¿ùø[êø7Âx{ -0.0000030237939 -7.995031 ždp–(µ€Î’Žp„Zà낔Kü^|wI€ÁKö. -0.0000030237939 -7.995031 j0žmzâFÅ¢$ÊÈ!e0$œ2²A
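For reference, `interpolate -w 0.5 0.5` performs static linear interpolation of the two models. The combination happens in probability space, not log space; a minimal sketch of the arithmetic (an illustration only, not KenLM's actual implementation; ARPA files store log10 probabilities):

```python
import math

def interpolate_logprob(lp1, lp2, w1=0.5, w2=0.5):
    """Linearly interpolate two log10 probabilities of the same n-gram.

    P_mix = w1 * P1 + w2 * P2, computed in probability space,
    then converted back to log10 for the ARPA file.
    """
    p_mix = w1 * 10 ** lp1 + w2 * 10 ** lp2
    return math.log10(p_mix)

# Example: the same n-gram scored -2.0 by one model and -3.0 by the other.
# P_mix = 0.5 * 0.01 + 0.5 * 0.001 = 0.0055, log10(0.0055) ~= -2.2596
print(round(interpolate_logprob(-2.0, -3.0), 4))
```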

saeidmokaram commented 1 year ago

I ran into a different problem doing a similar thing.

bin/lmplz -o 3  --intermediate set1.intermediate < model1.arpa --skip_symbols
bin/lmplz -o 3  --intermediate set2.intermediate < model2.arpa --skip_symbols --discount_fallback
bin/interpolate -m set{1,2}.intermediate -w 0.9 0.1 > mix_model.arpa

and here is what mix_model.arpa looks like:

\data\
ngram 1=644829
ngram 2=12670052
ngram 3=25758022

\1-grams:
-54.518303      <unk>   0
-52.52658       -0.0171115      -13.50311
-52.52658       -0.00951482     -13.50311
-53.48088       -0.0324709      -5.2374773
-52.52658       -0.0749187      -13.50311
-53.48088       -0.144883       -5.2374773
-40.502262      PAUSED  -16.844946
-53.48088       -0.0970763      -5.2374773
-52.52658       -0.0147354      -13.50311
-52.52658       -0.0316612      -13.50311
-52.52658       -0.0027085      -13.50311
-53.48088       -0.0751489      -5.2374773
-53.48088       -4.1396 -2.1953142e-9
-53.48088       -0.00477613     -5.2374773
-53.48088       -0.0946923      -5.2374773
-53.48088       -0.0940779      -5.2374773
-53.48088       -0.0175576      -5.2374773
-53.48088       -0.0906019      -5.2374773
-53.48088       -0.00636295     -5.2374773
-53.48088       -0.00901661     -5.2374773
-53.48088       -0.241633       -5.2374773
-52.52658       -0.0709945      -13.50311
-53.48088       -0.0280209      -5.2374773
-52.52658       -0.101508       -13.50311
-53.48088       -0.0970096      -5.2374773
-53.48088       -0.0207892      -5.2374773
-39.809536      LOANING -9.177942
-53.48088       -0.0223462      -5.2374773
-53.48088       -5.2487 -6.8420647e-9
-53.48088       -0.0525024      -5.2374773
-53.48088       -0.0678001      -5.2374773
-52.52658       -0.0822201      -13.50311
-52.52658       -0.0873803      -13.50311
-53.48088       -2.0888 -1.8957937e-9
-53.48088       -0.0926938      -5.2374773
-52.52658       -0.143922       -13.50311
-53.48088       -6.5022 -4.0860257e-7
-53.48088       -0.215156       -5.2374773
-51.61946       -0.104799       -12.890841
-53.48088       -2.7527 -2.0234303e-9
-53.48088       -0.0728444      -5.2374773
-52.52658       -0.0054465      -13.50311
-44.464344      BERNICE -15.216694
-53.48088       -0.0780277      -5.2374773
-51.61946       -0.0782491      -12.890841
...

What are these lines with numbers only? I do the same thing with SRILM using the command below and I get a meaningful ARPA model:

ngram -lm model1.arpa -mix-lm model2.arpa -lambda 0.1 -write-lm mix_model.arpa
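Those suspicious entries are unigram lines whose "word" field is itself a number, i.e. the vocabulary token is missing. A rough way to count them is to scan the \1-grams section and flag lines whose middle field parses as a float (a sketch assuming the usual ARPA layout logprob, word, optional backoff):

```python
def find_wordless_unigrams(lines):
    """Yield unigram lines whose 'word' field looks like a number,
    i.e. entries that apparently lost their vocabulary token."""
    in_unigrams = False
    for line in lines:
        line = line.rstrip("\n")
        if line == "\\1-grams:":
            in_unigrams = True
            continue
        if in_unigrams:
            if not line or line.startswith("\\"):
                break  # blank line or next section ends the unigrams
            fields = line.split()
            if len(fields) >= 2:
                try:
                    float(fields[1])  # a healthy word field is NOT numeric
                except ValueError:
                    continue  # looks like a real token
                yield line

sample = [
    "\\1-grams:",
    "-54.518303\t<unk>\t0",
    "-52.52658\t-0.0171115\t-13.50311",  # numeric word field -> suspicious
    "-40.502262\tPAUSED\t-16.844946",
    "",
]
print(list(find_wordless_unigrams(sample)))
```

Running this over the file (e.g. `find_wordless_unigrams(open("mix_model.arpa"))`) shows how many of the 644829 unigrams are affected.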
saeidmokaram commented 1 year ago

Also worth mentioning: the mix_model.arpa produced by KenLM is significantly larger than both input models:

model1.arpa = 216.5 MB
model2.arpa = 78.8 MB
mix_model.arpa = 1.3 GB

The model mixed using SRILM is mix_model.arpa = 223.9 MB.

Any idea @kpu about this issue?