kpu / kenlm

KenLM: Faster and Smaller Language Model Queries
http://kheafield.com/code/kenlm/

interpolate LM created with Kenlm with binary format and arpa format with weight 0.5 and 0.5 for each LM #395

Open MohamedElrefai opened 2 years ago

MohamedElrefai commented 2 years ago

I have tried these commands to convert both language models to the intermediate format before interpolating them:

bin/lmplz -o 3  --intermediate set1.intermediate <lm.binary --skip_symbols
bin/lmplz -o 3  --intermediate set2.intermediate <data.arpa --skip_symbols
bin/interpolate -m set{1,2}.intermediate -w 0.5 0.5 >model.arpa

But it produces an ARPA LM with garbled words, which seems to be caused by the binary LM. Is there any way to do the interpolation between two different formats?

Output looks like:

-7.998156 0 -7.995031 5æf„kÕROc¬ÇJáЯ:Ž0mJWI•„B#N2Ú?/CÞš| pMFÖõš!uÃôq0„t‹hÜv7×fŒŸÔa+z¥Ãp[‰ÖD£3ò~i8Í℔_—JBO -0.0000030237939 -7.995031 =õô*M -0.0000030237939 -7.995031 æUüÿ‚Œ€gœño®óŒ4þY¿AÇ¿ùø[êø7Âx{ -0.0000030237939 -7.995031 ždp–(µ€Î’Žp„Zà낔Kü^|wI€ÁKö. -0.0000030237939 -7.995031 j0žmzâFÅ¢$ÊÈ!e0$œ2²A
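For reference, `interpolate -w 0.5 0.5` performs static linear interpolation of the two models. The combination happens in probability space, not log space; a minimal sketch of the arithmetic (an illustration only, not KenLM's actual implementation; ARPA files store log10 probabilities):

```python
import math

def interpolate_logprob(lp1, lp2, w1=0.5, w2=0.5):
    """Linearly interpolate two log10 probabilities of the same n-gram.

    P_mix = w1 * P1 + w2 * P2, computed in probability space,
    then converted back to log10 for the ARPA file.
    """
    p_mix = w1 * 10 ** lp1 + w2 * 10 ** lp2
    return math.log10(p_mix)

# Example: the same n-gram scored -2.0 by one model and -3.0 by the other.
# P_mix = 0.5 * 0.01 + 0.5 * 0.001 = 0.0055, log10(0.0055) ~= -2.2596
print(round(interpolate_logprob(-2.0, -3.0), 4))
```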

saeidmokaram commented 1 year ago

I ran into a different problem doing a similar thing.

bin/lmplz -o 3  --intermediate set1.intermediate < model1.arpa --skip_symbols
bin/lmplz -o 3  --intermediate set2.intermediate < model2.arpa --skip_symbols --discount_fallback
bin/interpolate -m set{1,2}.intermediate -w 0.9 0.1 > mix_model.arpa

and here is what mix_model.arpa looks like:

\data\
ngram 1=644829
ngram 2=12670052
ngram 3=25758022

\1-grams:
-54.518303      <unk>   0
-52.52658       -0.0171115      -13.50311
-52.52658       -0.00951482     -13.50311
-53.48088       -0.0324709      -5.2374773
-52.52658       -0.0749187      -13.50311
-53.48088       -0.144883       -5.2374773
-40.502262      PAUSED  -16.844946
-53.48088       -0.0970763      -5.2374773
-52.52658       -0.0147354      -13.50311
-52.52658       -0.0316612      -13.50311
-52.52658       -0.0027085      -13.50311
-53.48088       -0.0751489      -5.2374773
-53.48088       -4.1396 -2.1953142e-9
-53.48088       -0.00477613     -5.2374773
-53.48088       -0.0946923      -5.2374773
-53.48088       -0.0940779      -5.2374773
-53.48088       -0.0175576      -5.2374773
-53.48088       -0.0906019      -5.2374773
-53.48088       -0.00636295     -5.2374773
-53.48088       -0.00901661     -5.2374773
-53.48088       -0.241633       -5.2374773
-52.52658       -0.0709945      -13.50311
-53.48088       -0.0280209      -5.2374773
-52.52658       -0.101508       -13.50311
-53.48088       -0.0970096      -5.2374773
-53.48088       -0.0207892      -5.2374773
-39.809536      LOANING -9.177942
-53.48088       -0.0223462      -5.2374773
-53.48088       -5.2487 -6.8420647e-9
-53.48088       -0.0525024      -5.2374773
-53.48088       -0.0678001      -5.2374773
-52.52658       -0.0822201      -13.50311
-52.52658       -0.0873803      -13.50311
-53.48088       -2.0888 -1.8957937e-9
-53.48088       -0.0926938      -5.2374773
-52.52658       -0.143922       -13.50311
-53.48088       -6.5022 -4.0860257e-7
-53.48088       -0.215156       -5.2374773
-51.61946       -0.104799       -12.890841
-53.48088       -2.7527 -2.0234303e-9
-53.48088       -0.0728444      -5.2374773
-52.52658       -0.0054465      -13.50311
-44.464344      BERNICE -15.216694
-53.48088       -0.0780277      -5.2374773
-51.61946       -0.0782491      -12.890841
...

What are these lines with numbers only? I do the same thing with SRILM using the command below and I get a meaningful ARPA model:

ngram -lm model1.arpa -mix-lm model2.arpa -lambda 0.1 -write-lm mix_model.arpa
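Those suspicious entries are unigram lines whose "word" field is itself a number, i.e. the vocabulary token is missing. A rough way to count them is to scan the \1-grams section and flag lines whose middle field parses as a float (a sketch assuming the usual ARPA layout logprob, word, optional backoff):

```python
def find_wordless_unigrams(lines):
    """Yield unigram lines whose 'word' field looks like a number,
    i.e. entries that apparently lost their vocabulary token."""
    in_unigrams = False
    for line in lines:
        line = line.rstrip("\n")
        if line == "\\1-grams:":
            in_unigrams = True
            continue
        if in_unigrams:
            if not line or line.startswith("\\"):
                break  # blank line or next section ends the unigrams
            fields = line.split()
            if len(fields) >= 2:
                try:
                    float(fields[1])  # a healthy word field is NOT numeric
                except ValueError:
                    continue  # looks like a real token
                yield line

sample = [
    "\\1-grams:",
    "-54.518303\t<unk>\t0",
    "-52.52658\t-0.0171115\t-13.50311",  # numeric word field -> suspicious
    "-40.502262\tPAUSED\t-16.844946",
    "",
]
print(list(find_wordless_unigrams(sample)))
```

Running this over the file (e.g. `find_wordless_unigrams(open("mix_model.arpa"))`) shows how many of the 644829 unigrams are affected.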
saeidmokaram commented 1 year ago

Also worth mentioning: the mix_model.arpa produced by KenLM is significantly larger than both input models:

model1.arpa = 216.5 MB
model2.arpa = 78.8 MB
mix_model.arpa = 1.3 GB

The model mixed using SRILM is mix_model.arpa = 223.9 MB.

Any idea @kpu about this issue?