kaldi-asr / kaldi

kaldi-asr/kaldi is the official location of the Kaldi project.
http://kaldi-asr.org

Decoding graph compilation issue #1216

Closed alexnanchen closed 7 years ago

alexnanchen commented 7 years ago

Hello,

I am having some trouble doing LG composition with a lexicon of 150,000 words and an LM of around 80 MB: it gets stuck in the "determinization" operation (I waited more than two hours).

It seems to be related to the following commit: a8de21fd76f2736d91aab30763a12646cb7c378b

When I revert the "arpa-lm-compiler*" files to the previous commit and rebuild G.fst, the decoding graph can be built.

There is also a significant decrease in the size of G.fst between the previous and latest commits: 230 MB down to 130 MB.

Alexandre Nanchen

danpovey commented 7 years ago

Run the "broken" determinization, wait for a while (e.g. a minute or two, or about as long as the old one took to complete), and then do

kill -SIGUSR1 [process id of determinization]

(this only works on Linux right now, not Mac). It will print out some info that, IIRC, is of the form:

word-symbol1 (phone-symbol1a phone-symbol1b) word-symbol2 (...)

You can correlate those with words.txt and phones.txt; also look at lexicon.txt to work out the pronunciations concerned.

There should be some kind of loop or frequently repeated symbol in what you see... let us know what that is.

Hard to know right now whether this is a problem in your lexicon, or a deeper problem that needs to be fixed.

Dan
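A minimal sketch of the signal step (assumptions: Linux, and that the determinization is running as a binary named fstdeterminizestar; substitute whatever binary your graph-building script actually invokes):

```shell
# Sketch: signal the running determinization so it dumps its current
# traversal. Assumes Linux and a binary named "fstdeterminizestar";
# substitute whatever your graph-building script actually invokes.
pid=$(pgrep -f fstdeterminizestar | head -n 1)
if [ -n "$pid" ]; then
  kill -SIGUSR1 "$pid"   # the process prints the symbol trace and keeps running
  status=signalled
else
  status=no-process
fi
echo "$status"
```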


alexnanchen commented 7 years ago

Most of the "word-symbols" do not seem to be associated with "phone-symbols":

153 ( ) 81 ( ) 153 ( ) 73 ( ) 153 ( ) 101 ( ) 189 ( 63729 ) 121 ( ) 181 ( ) 81 ( ) 117 ( ) 13 ( ) 152 ( ) 115 ( 138274 ) 41 ( ) 152 ( ) 47 ( ) 205 ( ) 181 ( ) 73 ( 131878 ) 145 ( ) 193 ( ) 13 ( ) 152 ( ) 55 ( 111816 ) 65 ( ) 193 ( ) 185 ( ) 13 ( ) 153 ( ) 121 ( ) 181 ( ) 25 ( ) 97 ( ) 13 ( ) 152 ( ) 115 ( ) 21 ( ) 189 ( ) 57 ( 135396 ) 73 ( ) 153 ( ) 193 ( ) 185 ( ) 192 ( ) 227 ( ) 33 ( 113647 ) 16 ( ) 232 ( ) 123 ( ) 33 ( ) 153 ( ) 193 ( ) 189 ( ) 184 ( 60295 ) 143 ( ) 21 ( 77095 ) 37 ( ) 141 ( ) 181 ( ) 77 ( ) 192 ( ) 55 ( 108727 ) 73 ( ) 145 ( ) 193 ( ) 125 ( ) 13 ( ) 97 ( ) 13 ( ) 144 ( ) 99 ( ) 109 ( 28832 ) 188 ( ) 233 ( ) 1 ( )

Does this make sense?

Here is the command used to build G.fst:

cat $LM | \
  grep -v '<s> <s>' | \
  grep -v '<s> </s>' | \
  grep -v '</s> </s>' | \
  arpa2fst - | fstprint | \
  utils/remove_oovs.pl oovs.txt | \
  utils/eps2disambig.pl | utils/s2eps.pl | \
  fstcompile --isymbols=words.txt --osymbols=words.txt \
    --keep_isymbols=false --keep_osymbols=false | \
  fstrmepsilon | fstarcsort --sort_type=ilabel > G.fst

Alex.

danpovey commented 7 years ago

Sorry, I got it the wrong way round: the ones in parentheses are the word symbols. But you may not have waited long enough before sending the signal; if it is blowing up, you would normally see repeats of the same word in that list. It would also be helpful if you could figure out the words corresponding to any commonly repeated word in that list, and list their pronunciations from lexicon.txt.

Dan


alexnanchen commented 7 years ago

Ok, that makes sense. So the format is:

input phoneme 1, input phoneme 2, ... (output word).

This time I waited 50 minutes. It does not seem to blow up:

( 52073 ) 55 ( ) 193 ( ) 77 ( ) 193 ( ) 185 ( ) 97 ( ) 65 ( ) 189 ( ) 121 ( ) 189 ( ) 145 ( ) 33 ( ) 17 ( ) 13 ( 118103 ) 152 ( ) 179 ( ) 21 ( ) 17 ( ) 193 ( ) 185 ( ) 177 ( ) 161 ( ) 177 ( ) 205 ( ) 145 ( ) 33 ( ) 185 ( ) 193 ( ) 33 ( ) 57 ( ) 13 ( ) 184 ( 102561 ) 143 ( ) 73 ( ) 153 ( ) 132 ( 75611 ) 95 ( ) 181 ( ) 61 ( ) 153 ( ) 12 ( 21329 ) 95 ( ) 133 ( ) 161 ( ) 177 ( ) 13 ( 18916 ) 193 ( ) 181 ( ) 165 ( ) 144 ( ) 31 ( ) 153 ( ) 193 ( ) 21 ( ) 189 ( ) 153 ( ) 13 ( ) 193 ( ) 181 ( ) 105 ( 65010 ) 177 ( ) 205 ( ) 193 ( ) 73 ( ) 193 ( ) 129 ( ) 165 ( ) 152 ( ) 211 ( ) 33 ( ) 193 ( ) 189 ( ) 153 ( 143534 ) 192 ( ) 47 ( ) 61 ( ) 153 ( ) 117 ( ) 21 ( ) 189 ( ) 97 ( ) 181 ( 131582 ) 85 ( ) 217 ( ) 193 ( ) 13 ( ) 152 ( ) 139 ( ) 181 ( ) 73 ( ) 193 ( ) 41 ( 26743 )

Here is some information for the last eleven phonemes:

181 ( 131582 ) 85 ( ) 217 ( ) 193 ( ) 13 ( ) 152 ( ) 139 ( ) 181 ( ) 73 ( ) 193 ( ) 41 ( 26743 )

unverbrauchten (131582) --> Q U n f E six b r aU x t @ n

r_I 181 aU_I 85 x_I 217 t_I 193 @_I 13 n_E 152 k_B 139 r_I 181 a_I 73 t_I 193 O_I 41

craton (26743) --> k r a t O n

Alex.

alexnanchen commented 7 years ago

Hello,

Because it did not seem to "blow up", I let it run longer, and LG did complete.

Looking at memory consumption, it turns out that compiling LG.fst needs about 9.6 GB of RAM.

With the previous version of the files, the "determinization" process quickly reaches 9.6 GB and finishes; it takes around 5 minutes.

With the latest version of the files, the "determinization" process takes a long time (about 4 hours) to reach 9.6 GB of memory usage, and then finishes.

Stochasticity values are also different:

  1. Previous version of the files
    • G.fst : 1.80521 -0.515762
    • LG.fst : -0.014908 -0.0158242
  2. Latest version of the files
    • G.fst : 3.3605 -3.02008 (with message: Reduced num-states from 4653287 to 955887)
    • LG.fst : -0.0154019 -0.0161307

Alex.
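(For reference, pairs like the ones above are what Kaldi's fstisstochastic prints: the minimum and maximum deviation from stochasticity over the FST's states. Assuming the G.fst/LG.fst from the graph build are at hand, they can be reproduced with:)

```shell
# Sketch: how the value pairs above are typically obtained.
# Requires a Kaldi installation and the FSTs from the graph build.
fstisstochastic G.fst
fstisstochastic LG.fst
```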

danpovey commented 7 years ago

It's great that you noticed this. I'm trying to figure out the most likely reason. Is this a system that has 'silprobs'? I.e. your source lexicon dir contained lexiconp_silprob.txt?

Dan
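(A quick way to answer this; hypothetical sketch, with the dict dir path taken from this thread's setup:)

```shell
# Hypothetical check: a "silprob" system has lexiconp_silprob.txt in the
# source dictionary directory (path from this thread's setup).
dict_dir=data/dict-test
if [ -f "$dict_dir/lexiconp_silprob.txt" ]; then
  silprobs=yes
else
  silprobs=no
fi
echo "silprobs: $silprobs"
```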


alexnanchen commented 7 years ago

Don't think so.

The data/dict-test directory has:

lexiconp.txt
lexicon.txt
nonsilence_phones.txt
optional_silence.txt --> sil
silence_phones.txt --> sil and spn

Alex.

danpovey commented 7 years ago

Hm. I need to look into this. Is there any chance you could send me by email (dpovey@gmail.com), an archive containing the files and commands necessary to reproduce this? Preferably starting from the ARPA and the lexicon directory. This is just for debugging purposes-- I'll delete them after. What I think is happening is some subtle thing deep inside the determinization algorithm, and I'm not 100% sure what it is right now.

Dan


danpovey commented 7 years ago

Actually, wait a little while... I'll try to reproduce it locally first, and will let you know if I need your specific setup.


alexnanchen commented 7 years ago

Good idea!

Let me know if you have some trouble reproducing the problem.

In that case I can generate for you some fake LM and dictionary (German) that you can download.

The L.fst is generated with:

utils/prepare_lang.sh $testDict '' $dataTempLangTest $testLang

Alex.

alexnanchen commented 7 years ago

The lexicon contains 150K words, and the language model is a 3-gram model pruned down to a size of 200 MB (mitlm for estimation and irstlm for pruning).

Alex.

danpovey commented 7 years ago

I was not able to reproduce the problem. I suspect there may be something weird about the ARPA LM that those toolkits are producing. You'll have to send me some kind of archive.


danpovey commented 7 years ago

Just realized I did something wrong trying to reproduce it... I may still be able to reproduce it locally.


danpovey commented 7 years ago

OK, I have figured out the problem. You are using an "older-style" script for formatting the LM. Instead of that long-ish pipe of commands, you should be using something like this (from local/format_lm.sh):

gunzip -c $lm \
  | arpa2fst --disambig-symbol=#0 \
             --read-symbol-table=$out_dir/words.txt - $out_dir/G.fst

However, the script was still supposed to work with the old commands. I think I'll check in a change to arpa2fst so that it does not apply the new optimization (removing the redundant states) if the user did not supply the --disambig-symbol flag (i.e. if it is an older script). This should be harmless.

Dan


alexnanchen commented 7 years ago

Cool!

I am going to try that.

You may want to modify the documentation: http://www.danielpovey.com/kaldi-docs/graph_recipe_test.html#graph_grammar

Alex.

danpovey commented 7 years ago

That is not the official location of the Kaldi documentation; you should look at kaldi-asr.org/doc/. I just removed that location. The documentation in the right location is up to date.


alexnanchen commented 7 years ago

Ah, I shouldn't rely on Google blindly.

Is this command line right?

cat $LM | \
  grep -v '<s> <s>' | \
  grep -v '<s> </s>' | \
  grep -v '</s> </s>' | \
  arpa2fst --disambig-symbol=#0 --read-symbol-table=words.txt - | fstprint | \
  utils/remove_oovs.pl oovs.txt | \
  utils/s2eps.pl | \
  fstcompile --isymbols=words.txt --osymbols=words.txt \
    --keep_isymbols=false --keep_osymbols=false | \
  fstrmepsilon | fstarcsort --sort_type=ilabel > G.fst

Alex.

danpovey commented 7 years ago

No, it's literally just the lines I showed you; you remove everything else. We put all of that into arpa2fst.
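(Concretely, under the setup in this thread, the whole old pipe collapses to the single arpa2fst invocation quoted earlier; file names here are placeholders for your own paths:)

```shell
# The entire old pipeline reduces to one arpa2fst call; the flags are the
# ones from local/format_lm.sh quoted above, file names are placeholders.
cat $LM | arpa2fst --disambig-symbol='#0' \
                   --read-symbol-table=words.txt - G.fst
```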


alexnanchen commented 7 years ago

Ok, everything is working with the latest version of the files!

Just for information, stochasticity values are now

Many thanks!

Have a nice week-end!

Alex.