kaldi-asr / kaldi

kaldi-asr/kaldi is the official location of the Kaldi project.
http://kaldi-asr.org

make_biased_lm_graphs.sh seems to be ignoring the "top-n-words" property #1403

Closed by nizmagu 7 years ago

nizmagu commented 7 years ago

I wanted to use clean_and_segment_data.sh to clean some transcriptions decoded by Kaldi. It didn't work so well on the first run, so I set '--graph-opts "--top-n-words 300"' in my call to the script. That didn't seem to make any difference, so I tried "--top-n-words 1000 --top-n-words-weight 2", which didn't produce any different results either.

Upon executing make_biased_lm_graphs.sh with "--top-n-words 300", the "fsts" folder was identical, byte-for-byte, to the folder created by using top-n-words of 100 or 1000, and top-n-words-weight of 2.
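A byte-for-byte comparison like the one described above can be reproduced with a recursive diff. The following is only a minimal sketch: the directory names graphs_top100 and graphs_top300 and the file utt1.fst are hypothetical placeholders created for the demo, not the actual output layout of make_biased_lm_graphs.sh.

```shell
#!/usr/bin/env bash
# Sketch: report whether two directory trees are byte-for-byte identical.
# All directory and file names below are hypothetical placeholders.
dirs_identical() {
  # diff -r recurses into subdirectories; -q only reports whether files differ.
  # The exit status is 0 when no differences are found.
  diff -rq "$1" "$2" >/dev/null 2>&1
}

# Self-contained demo using throwaway directories.
tmp=$(mktemp -d)
mkdir -p "$tmp/graphs_top100/fsts" "$tmp/graphs_top300/fsts"
printf 'same bytes\n' > "$tmp/graphs_top100/fsts/utt1.fst"
printf 'same bytes\n' > "$tmp/graphs_top300/fsts/utt1.fst"

if dirs_identical "$tmp/graphs_top100/fsts" "$tmp/graphs_top300/fsts"; then
  result="identical"
else
  result="different"
fi
echo "$result"
rm -rf "$tmp"
```

In a real run, the two arguments would be the "fsts" folders produced with the different option values.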

The only thing --top-n-words seems to change is the "top_words.txt" and "top_words.int" files, which apparently have no bearing on the creation of the biased LM graphs, or on the process in general.

After LM graphs, the per_utt_details.txt file made using top-n-words of 300 was identical to the ones made using --top-n-words of 100 and 1000, and top-n-words-weight of 2.

Also, in find_bad_utts.sh, top_words.int seemed to be dealt with, but it's absent in clean_and_segment_data.sh, or at least doesn't make a noticeable difference in the result.
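One quick way to check which scripts refer to top_words.int at all is a fixed-string grep. This is a sketch under stated assumptions: script_a.sh and script_b.sh are stand-in files with made-up contents, created only for the demo, and are not copies of the Kaldi scripts.

```shell
#!/usr/bin/env bash
# Sketch: list which files mention a given file name.
# script_a.sh and script_b.sh are hypothetical stand-ins with invented contents.
tmp=$(mktemp -d)
printf 'cat $dir/top_words.int\n' > "$tmp/script_a.sh"
printf 'echo no reference here\n' > "$tmp/script_b.sh"

# -l prints only the names of matching files; -F treats the pattern as a
# literal string, so the dot in "top_words.int" is not a regex wildcard.
matches=$(cd "$tmp" && grep -l -F 'top_words.int' script_a.sh script_b.sh)
echo "$matches"
rm -rf "$tmp"
```

Against a real checkout, the same grep would be pointed at the cleanup scripts in question.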

Could it be because we have done something wrong in our code, or because there is a bug in the software? Unfortunately, I don't understand the program well enough to determine that myself.

danpovey commented 7 years ago

> I wanted to use clean_and_segment_data.sh to clean some transcriptions decoded by Kaldi.

clean_and_segment_data.sh is not intended to operate on computer-generated transcriptions, especially not those decoded by Kaldi itself (if they came from a totally different system, there might be a small benefit).

@vimalmanohar, could you please add a comment at the top of the script clarifying that it's intended to operate on top of human transcriptions?

If you want to segment the output of Kaldi decoding (e.g. get smaller segments), you can look at the steps in egs/wsj/s5/local/run_segmentation.sh. The first couple of steps just create a suitable test setup for the script by appending utterances together; they are not really part of it.

I won't respond regarding the details of how clean_and_segment_data.sh works, because they are probably not relevant to you.

@vimalmanohar, do we have anything that is suitable for semi-supervised training, where we filter by confidence? Did we ever merge any of that stuff?

Dan
