kaldi-asr / kaldi

kaldi-asr/kaldi is the official location of the Kaldi project.
http://kaldi-asr.org

Error parsing somewhat malformed SRILM generated ARPA file #643

Closed kkm000 closed 8 years ago

kkm000 commented 8 years ago

See #639.

@kkm000: What I can do is ignore the highest-order n-gram sections, working downward until there is an order of non-zero cardinality. I think this is what actually happens here. This could be useful, because an empty high-order section can interfere with the selection of packed vs. regular history keys. Packed keys support up to 4-grams, inclusive. If there is a zero-count 5-gram section, that will pessimize the compile.

@danpovey: Sure, if it's easy enough... but probably still print a warning. It's a feature, not a bug.
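
A minimal sketch of the trimming idea quoted above, in Python: given the per-order counts from the \data\ header, find the highest order that actually has entries and compare it with the 4-gram limit for packed keys. The function name `effective_order`, the constant `MAX_PACKED_ORDER`, and the sample counts are illustrative assumptions, not Kaldi code.

```python
def effective_order(counts):
    """Highest n-gram order with a non-zero count, or 0 if every order is empty."""
    nonzero = [order for order, count in counts.items() if count > 0]
    return max(nonzero, default=0)


# Taken from the comment above: packed history keys support up to 4-grams.
MAX_PACKED_ORDER = 4

# Illustrative header counts with an empty trailing 3-gram order.
counts = {1: 61, 2: 1290, 3: 0}
order = effective_order(counts)
print(order, order <= MAX_PACKED_ORDER)
```

With these counts the declared order is 3, but the order that matters for key selection is 2.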

mmmaat commented 8 years ago

A failing ARPA file is G.arpa.gz (https://github.com/kaldi-asr/kaldi/files/201311/G.arpa.gz). Here is its metadata:

$ zcat G.arpa.gz | grep '^\\\|^ngram'
\data\
ngram  1=        61
ngram  2=      1290
\1-grams:
\2-grams:
\end\

I hope this helps!
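
For readers who want to inspect a header like this without grep, here is a small Python sketch that collects the `ngram N=count` lines of the \data\ section. The function name `arpa_header_counts` is made up for illustration, and the sample is copied from the output above; nothing here is part of Kaldi.

```python
import re


def arpa_header_counts(lines):
    """Return {order: count} parsed from the header of an ARPA file."""
    counts = {}
    in_data = False
    for raw in lines:
        line = raw.strip()
        if line == "\\data\\":
            in_data = True
            continue
        if in_data:
            m = re.match(r"ngram\s+(\d+)\s*=\s*(\d+)$", line)
            if m:
                counts[int(m.group(1))] = int(m.group(2))
            elif line.startswith("\\"):
                break  # the first section header ends the header block
    return counts


SAMPLE = r"""\data\
ngram  1=        61
ngram  2=      1290
\1-grams:
\2-grams:
\end\
""".splitlines()

print(arpa_header_counts(SAMPLE))
```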

danpovey commented 8 years ago

I notice that there is no blank line between the \2-grams section and the \end\ marker; that might be the reason.

Dealing with ARPA-format files is very frustrating because there is no formal definition of what a valid ARPA-format is, so we can never point to 'broken' ARPA inputs and say with certainty that they are buggy.

Dan

danpovey commented 8 years ago

Mathieu, where exactly did you get this ARPA file? Is it produced by one of the standard Kaldi scripts? It does seem more broken than usual, and I'm wondering whether we should tolerate it.

Dan

mmmaat commented 8 years ago

I'm writing abkhazia (https://github.com/bootphon/abkhazia/), a WIP Python program/library that defines a speech corpus format and uses Kaldi to generate language models, acoustic models, forced alignments and decodings on different corpora. For example, you can import the Buckeye corpus and compute a word-level bigram LM on it in only 2 steps:

$ abkhazia prepare buckeye -i /path/to/raw/buckeye
$ abkhazia language buckeye --model-order 2 --model-level word

The ARPA file comes from a test case that computes a phone-level bigram from 1% of Buckeye. This worked until a recent Kaldi commit (it works with the version of 10 Feb. 2016); maybe, as you mentioned, it is due to some change in arpa2fst?

danpovey commented 8 years ago

Yes, we did change arpa2fst. But what I'm wondering is, what program generated that ARPA file? Because it is broken. There needs to be a blank line before the \end\ marker. I'm just not sure that we want to support that particular kind of brokenness.

Dan

mmmaat commented 8 years ago

The missing blank line is introduced by build-lm.sh from IRSTLM.

mmmaat commented 8 years ago

With text_se.txt, build-lm.sh -i text_se.txt -n 2 -o text_lm.gz -k 1 -s kneser-ney produces text_lm.gz

danpovey commented 8 years ago

OK. We probably do want to support this type of broken ARPA file without error, then, because some of the recipes use IRSTLM.

(cc'ing kaldi-developers because some of this might be of interest to others).

When we started Kaldi, we were using IRSTLM because it has a freer license than SRILM. However, the IRSTLM project has died, while SRILM is still being updated with bug-fixes; and SRILM was always more feature complete, better documented and more widely used than IRSTLM. So lately we rely on SRILM more and more.

An open question right now is where we should go in terms of support for more advanced (non-ARPA) language models. One option is to add code-level support for SRILM, but that's not so attractive because (a) the license is a hassle and we wouldn't want to make it mandatory to have SRILM in order to compile Kaldi, and (b) while SRILM is being updated with bug-fixes, it's not really keeping pace with the state of the art in language modeling (neural nets, etc.). What would be best is if there were a toolkit like Kaldi itself, but in the space of language modeling.

We may actually decide to work on our own language-modeling tools. For instance, it wouldn't be that much more work to add support for RNNLMs, building on top of existing neural net capabilities of Kaldi.

Dan

kkm000 commented 8 years ago

Let me check what's wrong here. Just by eyeballing, the file looks OK. The absence of a blank line should not (in theory) be a problem.

danpovey commented 8 years ago

@mmmaat, can you show us how exactly arpa2fst is failing for this file? I notice your original PR had the details of a different failure, for a different file.

kkm000 commented 8 years ago

Looks like I do not reproduce the problem using this file!

$ wget -qO- https://github.com/kaldi-asr/kaldi/files/201311/G.arpa.gz | gunzip -c | ../lmbin/arpa2fst - /dev/null
../lmbin/arpa2fst - /dev/null
LOG (arpa2fst:Read():arpa-file-parser.cc:90) Reading \data\ section.
LOG (arpa2fst:Read():arpa-file-parser.cc:147) Reading \1-grams: section.
LOG (arpa2fst:Read():arpa-file-parser.cc:147) Reading \2-grams: section.

What happens when you simply do arpa2fst G.arpa /dev/null?

mmmaat commented 8 years ago

I get the same result with arpa2fst; this works. But the original problem I fixed in my PR is a bug introduced by change-lm-vocab from SRILM. It introduces an empty 3-gram section and is unrelated to the missing blank line. So the bug appears when running arpa2fst after change-lm-vocab, as in wsj/utils/format_lm_sri.sh.

Here is a minimal script that reproduces the bug from G.arpa.gz and words.txt:

srilm_opts="-subset -prune-lowprobs -unk -tolower"
dir=$(pwd)/crash_files
mkdir -p $dir
wget https://github.com/kaldi-asr/kaldi/files/201311/G.arpa.gz -O $dir/lm.gz
wget https://github.com/kaldi-asr/kaldi/files/202196/words.txt -O $dir/words.txt

# NOTE you need to change this path to your local setup
cd ~/dev/kaldi/egs/wsj/s5/
. path.sh

awk '{print $1}' $dir/words.txt > $dir/voc
change-lm-vocab -vocab $dir/voc -lm $dir/lm.gz -write-lm $dir/out_lm $srilm_opts
arpa2fst $dir/out_lm /dev/null
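
If such a file has to be fed to a parser that rejects it, one possible workaround is to strip the empty order before conversion. The sketch below is a hypothetical Python helper (not part of Kaldi or SRILM): it drops every `ngram N=0` header line and the matching `\N-grams:` section header. It assumes the zero-count section really contains no entries, as with the empty 3-gram section described above; the sample model is invented for illustration.

```python
import re

HEADER = re.compile(r"ngram\s+(\d+)\s*=\s*(\d+)\s*$")
SECTION = re.compile(r"\\(\d+)-grams:\s*$")


def drop_empty_orders(lines):
    """Remove header and section lines of n-gram orders declared with count 0."""
    empty = set()
    for line in lines:
        m = HEADER.match(line)
        if m and int(m.group(2)) == 0:
            empty.add(int(m.group(1)))

    kept = []
    for line in lines:
        m = HEADER.match(line) or SECTION.match(line)
        if m and int(m.group(1)) in empty:
            continue  # skip the zero-count header line and its empty section header
        kept.append(line)
    return kept


SAMPLE = r"""\data\
ngram 1=2
ngram 2=1
ngram 3=0

\1-grams:
-0.3 a -0.1
-0.3 b

\2-grams:
-0.2 a b

\3-grams:

\end\
""".splitlines()

cleaned = drop_empty_orders(SAMPLE)
print("\n".join(cleaned))
```

For a gzipped model, the lines could be read with `gzip.open(path, "rt")` before calling the helper.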

mmmaat commented 8 years ago

In the previous example script, I left out the following step (although it is present in wsj/utils/format_lm_sri.sh), as it doesn't affect the bug.

# Removing all "illegal" combinations of <s> and </s>, which are supposed to
# occur only at beginning/end of utt.  These can cause determinization failures
# of CLG [ends up being epsilon cycles].
gunzip -c $dir/in_lm.gz \
  | egrep -v '<s> <s>|</s> <s>|</s> </s>' \
  | gzip -c > $dir/lm.gz
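
The same filter can be sketched in Python, which makes it easy to see which lines the pattern removes. The function name `drop_illegal_boundaries` and the sample entries are made up for illustration; like the egrep command above, it only deletes matching lines and does not touch the counts in the \data\ header.

```python
import re

# Same alternation as the egrep pattern above.
ILLEGAL = re.compile(r"<s> <s>|</s> <s>|</s> </s>")


def drop_illegal_boundaries(lines):
    """Keep only lines without an illegal <s> / </s> combination."""
    return [line for line in lines if not ILLEGAL.search(line)]


entries = [
    "-1.2 <s> a",
    "-2.0 <s> <s>",
    "-0.5 </s> <s>",
    "-0.7 a </s>",
    "-3.1 </s> </s>",
]
print(drop_illegal_boundaries(entries))
```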

danpovey commented 8 years ago

In my experience, OpenGRM is the best open tool right now, plus it is OpenFst-friendly. http://www.openfst.org/twiki/bin/view/GRM/NGramLibrary

You can do pruning with.

danpovey commented 8 years ago

OK, but we still want to support the IRSTLM setup.

@mmmaat, it would be better if you attached $dir/out_lm and also showed us the error message that arpa2fst produces. That way we won't have to install IRSTLM in order to reproduce the problem. "Minimal reproduction".

Dan

On Mon, Apr 4, 2016 at 2:17 PM, Ilya Platonov realill@gmail.com wrote:

In my experience, OpenGRM is the best open tool right now, plus it is OpenFst-friendly. http://www.openfst.org/twiki/bin/view/GRM/NGramLibrary

You can do pruning with.

mmmaat commented 8 years ago

Here is out_lm (https://github.com/kaldi-asr/kaldi/files/203111/out_lm.txt). The error message is:

arpa2fst /home/mbernard/fichiers_entrants/pr_kaldi/crash_files/out_lm /dev/null 
LOG (arpa2fst:Read():arpa-file-parser.cc:90) Reading \data\ section.
LOG (arpa2fst:Read():arpa-file-parser.cc:147) Reading \1-grams: section.
LOG (arpa2fst:Read():arpa-file-parser.cc:147) Reading \2-grams: section.
WARNING (arpa2fst:Read():arpa-file-parser.cc:134) Zero ngram count in ngram order 3(look for 'ngram 3=0' in the \data\  section). There is possibly a problem with the file.
ERROR (arpa2fst:Read():arpa-file-parser.cc:228) in line 1362: Invalid or unexpected directive line '\3-grams:', expected \end\.
ERROR (arpa2fst:Read():arpa-file-parser.cc:228) in line 1362: Invalid or unexpected directive line '\3-grams:', expected \end\.

[stack trace: ]
kaldi::KaldiGetStackTrace()
kaldi::KaldiErrorMessage::~KaldiErrorMessage()
kaldi::ArpaFileParser::Read(std::istream&, bool)
arpa2fst(main+0x678) [0x40f298]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7f2aa8318b45]
arpa2fst() [0x410282]

danpovey commented 8 years ago

OK. I have a vague memory that a long time ago we had a discussion about what to do with zero n-gram counts. I said to just print a warning and ignore them, and someone (Kirill or Guoguo, maybe) wanted to add code to do something about them. I suspect that some code was added to do something about them and that this code is causing the problem; I think the right thing to do would be to take that code out.

Dan

danpovey commented 8 years ago

OK, I just pushed a fix for this; I had to remove a statement in arpa-file-parser.cc.

mmmaat commented 8 years ago

Great, thank you!

kkm000 commented 8 years ago

@mmmaat, thanks for the sample.

@danpovey, I do not see your fix; where is it?

danpovey commented 8 years ago

Commit d03abac; I pushed directly.

Dan

kkm000 commented 8 years ago

Weird. I still can't find it.

$ git fetch golden
remote: Counting objects: 5, done.
remote: Compressing objects: 100% (5/5), done.
remote: Total 5 (delta 0), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (5/5), done.
From git://github.com/kaldi-asr/kaldi
   c972c79..0565cce  master     -> golden/master
$ git show d03abac
fatal: ambiguous argument 'd03abac': unknown revision or path not in the working tree.

danpovey commented 8 years ago

Sorry! Forgot to push. Just did so.

kkm000 commented 8 years ago

Thanks!

mmmaat commented 8 years ago

After testing, I confirm all is good for me!

danpovey commented 8 years ago

OK closing, thanks.