kaldi-asr / kaldi

kaldi-asr/kaldi is the official location of the Kaldi project.
http://kaldi-asr.org

kaldi/egs/wsj/s5/steps/cleanup/make_segmentation_graph.sh & make_utterance_graph.sh #837

Closed · vince62s closed 8 years ago

vince62s commented 8 years ago

In these 2 scripts, the section below does not work (at least for me, on Ubuntu); this block needs to be removed.

```bash
if [ $ngram_order -gt 1 ]; then
  ngram_count=`which ngram-count`;
  if [ -z $ngram_count ]; then
    if uname -a | grep 64 >/dev/null; then # some kind of 64 bit...
      sdir=`pwd`/../../../tools/srilm/bin/i686-m64
    else
      sdir=`pwd`/../../../tools/srilm/bin/i686
    fi
    if [ -f $sdir/ngram-count ]; then
      echo Using SRILM tools from $sdir
      export PATH=$PATH:$sdir
    else
      echo You appear to not have SRILM tools installed, either on your path,
      echo or installed in $sdir. See tools/install_srilm.sh for installation
      echo instructions.
      exit 1
    fi
  fi
fi
```

jtrmal commented 8 years ago

Removing it does not really make any sense if you understand what it's supposed to do. When you say "does not work", what does that mean?

vince62s commented 8 years ago

I know what this is supposed to do, but the line ``ngram_count=`which ngram-count`;`` terminates and exits the script. It might be caused by the `set -e` at the beginning of the script, which makes the script exit when `which` returns a non-zero status.

jtrmal commented 8 years ago

Yes, that would be the cause of the termination. Changing the line to

```bash
ngram_count=`which ngram-count` || true
```

should fix that -- can you please test? It would be great if you could create a PR (if it works). Thanks! y.
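For illustration, a minimal standalone sketch of why `set -e` kills the script here and how the `|| true` guard changes that (this snippet is not part of the Kaldi scripts):

```bash
#!/bin/bash
# Standalone sketch, assuming ngram-count is NOT on PATH.
set -e
# Without '|| true', the assignment inherits the non-zero exit status of
# `which`, and 'set -e' aborts the whole script right here:
ngram_count=`which ngram-count` || true
# With the guard, execution continues and $ngram_count is simply empty:
if [ -z "$ngram_count" ]; then
  echo "ngram-count not on PATH; falling back to tools/srilm/bin"
fi
```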


vince62s commented 8 years ago

fixed

vince62s commented 8 years ago

actually I got another issue, with make_utterance_graph.sh. The log gives me this:

```
steps/cleanup/make_utterance_graph.sh --cleanup true --tscale 1.0 --loopscale 0.1 --ngram-order 2 --srilm-options -wbdiscount exp/tri3_mmi_b0.1/graph_source_split/split1/1/text data/lang exp/tri3_mmi_b0.1 exp/tri3_mmi_b0.1/graph_source_split/split1/1
steps/cleanup/make_utterance_graph.sh: processing utterance utt1.
awk: program limit exceeded: maximum number of fields
    size=32767
    FILENAME="-" FNR=1 NR=1
Accounting: time=0 threads=1
Ended (code 2) at Tue Jun 14 21:20:04 CEST 2016, elapsed time 0 seconds
```

suggesting an issue with the awk command here:

```bash
# Compiles G.fst
if [ $ngram_order -eq 1 ]; then
  echo $words > $wdir/text
  cat $wdir/text | utils/sym2int.pl --map-oov $oov -f 1- $lang/words.txt | \
    utils/make_unigram_grammar.pl | fstcompile |\
    fstarcsort --sort_type=ilabel > $wdir/G.fst || exit 1;
else
  echo $words | awk -v voc=$lang/words.txt -v oov="$oov_txt" '
    BEGIN { while ((getline<voc) > 0) { invoc[$1]=1; } }
    {
      for (x=1; x<=NF; x++) {
        if (invoc[$x]) { printf("%s ", $x); } else { printf("%s ", oov); }
      }
      printf("\n");
    }' > $wdir/text
  ngram-count -text $wdir/text -order $ngram_order "$srilm_options" -lm - |\
    arpa2fst --disambig-symbol=#0 \
      --read-symbol-table=$lang/words.txt - $wdir/G.fst || exit 1;
fi
```

is there an easy workaround to get awk to accept this, or to use something else? (This occurs because I am processing a long utterance with long text, more than 1 hour.)

jtrmal commented 8 years ago

The error suggests the line has more than 32k words. I'm not sure it's even reasonable to have such long utterances? Anyway, it seems from a Google search that gawk might not have this limitation -- so perhaps install it and make a symlink to awk (if the package manager won't do it for you). Not sure if it will work, though. Another option would be to rewrite it to read the text from a file in perl... I'm worried that such a long command line might be hitting some internal bash/Linux limits as well, and that you will have problems even after using gawk.
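For context, Ubuntu's default awk is mawk, whose compiled-in limit of 32767 fields per record matches the error above. A rough sketch of the gawk workaround, assuming Debian/Ubuntu paths (untested):

```bash
# Install gawk, which does not have mawk's 32767-field limit.
sudo apt-get install gawk
# On Debian/Ubuntu, /usr/bin/awk is normally managed by the alternatives
# system, so installing gawk may repoint it automatically; if not:
sudo update-alternatives --set awk /usr/bin/gawk
# Last resort, bypassing the package manager:
# sudo ln -sf /usr/bin/gawk /usr/bin/awk
awk --version | head -n 1   # should now print "GNU Awk ..."
```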

danpovey commented 8 years ago

It looks to me like make_utterance_graph.sh is relying on 'echo'-ing the entire utterance text on a single line, and reading it in awk as a single line. None of these utilities were designed to read or write so much data on a single line. I think the script should be redesigned. If you can do this, it would be great. Dan


vince62s commented 8 years ago

OK, gawk seems to work. But yes, these few lines could be rewritten, since ngram-count can directly handle the vocab file and the OOV mapping, so this awk step is not needed here. But I also read in https://github.com/kaldi-asr/kaldi/pull/777 that SRILM should not be used in these processes, so maybe we'll wait until all of this data-cleanup code has been rewritten properly (hoping that order >= 2 will be taken into account).
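For illustration, a rough untested sketch of that simplification, reusing the variables the script already defines ($words, $wdir, $lang, $ngram_order, $srilm_options, $oov_txt); note that $lang/words.txt is a Kaldi symbol table, so only its first column is the word list:

```bash
# Extract the plain word list from the symbol table ("word id" per line).
awk '{print $1}' $lang/words.txt > $wdir/vocab
echo $words > $wdir/text
# Let SRILM restrict to the vocabulary and map OOVs itself
# (-vocab / -unk / -map-unk), instead of the awk preprocessing step.
ngram-count -text $wdir/text -order $ngram_order "$srilm_options" \
  -vocab $wdir/vocab -unk -map-unk "$oov_txt" -lm - |\
  arpa2fst --disambig-symbol=#0 \
    --read-symbol-table=$lang/words.txt - $wdir/G.fst || exit 1;
```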

vince62s commented 8 years ago

I spoke too fast; yenda was right about the subsequent issues. Still testing.

vince62s commented 8 years ago

So, putting the awk issue aside, the ngram-count command has some sort of C++ limit, I guess. A one-line text with 256542 characters / 44969 words is fine; a one-line text with 283251 characters / 51418 words is not. The limit is somewhere in between.

danpovey commented 8 years ago

Having a dependency on SRILM is not great anyway. Vimal has been experimenting with various ways to make these graphs; he has a WIP pull request #777 about this. In there, there is a script make_biased_lm_graph.sh which does the job -- probably better than our current SRILM-based script, as he did a bunch of experiments.

I don't know if the interface is the same or not. My plan was to check that stuff in in a couple of weeks, after reworking certain aspects of it myself. We may be able to re-use parts of that for the resegmentation stuff.

BTW, as a quick fix for the SRILM problem, you can probably just split the line arbitrarily after every, say, 1000 words. If that works, please check in the fix. Maybe best to do that in perl, to get around any awk limitation.

Dan
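A rough sketch of that quick fix (hypothetical, untested): chunk the text at 1000 words per line in perl before it reaches awk or ngram-count. Since ngram-count treats each line as a sentence, this inserts artificial sentence boundaries, which perturbs the counts slightly but should be tolerable for a biased LM:

```bash
# Hypothetical pre-chunking step: write $words as lines of <= 1000 words
# instead of one huge line. (Note: 'echo $words' itself can still hit the
# shell's argument-length limit, as noted above.)
echo $words | perl -ane 'while (@F) { print join(" ", splice(@F, 0, 1000)), "\n"; }' \
  > $wdir/text
```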


vince62s commented 8 years ago

Checked in a fix until WIP #777 is finished. Cheers.