kaldi-asr / kaldi

kaldi-asr/kaldi is the official location of the Kaldi project.
http://kaldi-asr.org

utils/prepare_lang.sh --phone-symbol-table crashes if the symbol file has no #0 #4344

Open kkm000 opened 4 years ago

kkm000 commented 4 years ago

The error message from the FST binary doesn't clearly point to the cause. It's an easy fix; I'll do it.

The tool also trusts that the index of #0 is 1 larger than that of the last phone symbol, which had better be checked. We generally try to validate everything user-supplied as much as possible.

I'm thinking of adding a utility script to validate FST symbol tables in general, to make sure that a file does not contain duplicate strings or duplicate indexes, that 0 is <eps>, and so on. There are a couple of places where a more thorough check is done, a couple of others where it's half-done, and this one does not do much checking at all.
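For illustration, the checks listed above could look roughly like the following Python sketch. It is not an existing Kaldi utility: the function name `check_symbol_table` and the message wording are made up here, and it assumes the usual two-column `<symbol> <integer-id>` text format.

```python
from collections import Counter


def check_symbol_table(lines):
    """Return a list of human-readable problems found in a symbol table.

    Hypothetical helper, not part of Kaldi. Each non-blank line is
    expected to look like "<symbol> <non-negative integer id>".
    """
    problems = []
    entries = []
    for lineno, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        try:
            if len(fields) != 2:
                raise ValueError
            idx = int(fields[1])
            if idx < 0:
                raise ValueError
        except ValueError:
            problems.append(f"line {lineno}: expected '<symbol> <id>', got {line.strip()!r}")
            continue
        entries.append((fields[0], idx))

    # Duplicate strings and duplicate indexes are reported separately.
    for sym, n in Counter(sym for sym, _ in entries).items():
        if n > 1:
            problems.append(f"duplicate symbol {sym!r} ({n} times)")
    for idx, n in Counter(idx for _, idx in entries).items():
        if n > 1:
            problems.append(f"duplicate id {idx} ({n} times)")

    # Index 0 must be present and must belong to <eps>.
    zero_syms = [sym for sym, idx in entries if idx == 0]
    if zero_syms != ["<eps>"]:
        problems.append(f"id 0 should map to <eps> only, found {zero_syms}")
    return problems
```

The function returns the problems instead of exiting, so a caller can decide which of them to treat as fatal.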

jtrmal commented 4 years ago

Are you sure you want to filter out or check for duplicate strings? I'm not sure it would be a problem in general, as most of Kaldi cares only about the indices... Not sure, just asking. y.

danpovey commented 4 years ago

Yeah, I think duplicate strings should probably be a warning rather than an error. We already check for duplicate ids; search for "duplicates" in validate_lang.pl. Where do we require that #0 is 1 larger than the last phone symbol? I don't believe that is a requirement.

kkm000 commented 4 years ago

Yeah, a warning would be fine by me, too.

@danpovey

Where do we require that #0 is 1 larger than the last phone symbol? I don't believe that is a requirement.

In fact we do not; it's not a requirement. It's just how the tool happens to work: it greps for '#0' and uses its index as the base for the additional disambiguators, adding 1 for each subsequent one if it does not already exist. So, as written, '#0' had better follow all the other symbols. This applies only when the script is invoked with the --phone-symbol-table switch.
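To make the failure mode concrete, here is a small Python model of that behaviour. It is a sketch of the logic as described in this comment, not a translation of the actual shell code; the function name `add_disambig_symbols` and the error messages are invented for the example, and the exact indexing in the script may differ.

```python
def add_disambig_symbols(table, num_disambig):
    """Return a copy of `table` with #1..#num_disambig added where missing.

    Hypothetical model: the index of '#0' is the base, and a missing
    '#n' gets index base + n. Raises ValueError with an explicit
    message instead of failing later inside an FST binary.
    """
    if "#0" not in table:
        raise ValueError("phone symbol table has no '#0' entry")
    base = table["#0"]
    used = {idx: sym for sym, idx in table.items()}
    result = dict(table)
    for n in range(1, num_disambig + 1):
        sym = f"#{n}"
        if sym in result:
            continue  # already supplied by the user, keep its index
        idx = base + n
        if idx in used:
            # Happens when '#0' is not after all the other symbols.
            raise ValueError(
                f"id {idx} needed for {sym} is already used by {used[idx]!r}"
            )
        result[sym] = idx
        used[idx] = sym
    return result
```

With '#0' placed last, for example `{"<eps>": 0, "a": 1, "b": 2, "#0": 3}`, the new symbols take the free indexes after it. With '#0' placed before a phone, the first generated index collides with that phone, which is the case the script currently does not check.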

https://github.com/kaldi-asr/kaldi/blob/0c6a3dcf0ca2cbd2b7a180183ca7665465d5d042/egs/wsj/s5/utils/prepare_lang.sh#L317-L323