Open kkm000 opened 4 years ago
Are you sure you want to filter out/check for duplicate strings? I'm not sure there would be a problem in general, as most of Kaldi cares about the indices only... Not sure -- just asking. y.
Yeah, I think duplicate strings should probably be a warning rather than an error. We already check for duplicate ids; search for "duplicates" in validate_lang.pl. Where do we require that #0 is 1 larger than the last phone symbol? I don't believe that is a requirement.
Yeah, a warning would be fine by me, too.
@danpovey
Where do we require that #0 is 1 larger than the last phone symbol? I don't believe that is a requirement.
We in fact do not; it's not a requirement. It's just how the tool happens to work: it greps for '#0' and uses its index as the base for additional disambiguators, adding 1 for each subsequent one that does not already exist. So, as written, '#0' had better follow all the other symbols. This applies only when the tool is invoked with the --phone-symbol-table switch.
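To make the hazard concrete, here is a minimal Python sketch of the numbering behaviour described above. It is not the tool's actual code: the function name `add_disambig_symbols`, the dict-based table, and the collision check are all assumptions made for illustration.

```python
def add_disambig_symbols(symtab, num_extra):
    """Hypothetical helper (not the real tool): return a copy of `symtab`
    (symbol -> index) with #1..#num_extra added, numbered relative to #0."""
    base = symtab["#0"]  # index of '#0' serves as the base
    used = {idx: sym for sym, idx in symtab.items()}
    result = dict(symtab)
    for k in range(1, num_extra + 1):
        name = f"#{k}"
        if name in result:
            continue  # symbol already present: leave its index alone
        idx = base + k
        # The check discussed in this thread: if '#0' is not the last
        # symbol, base + k can land on an index that is already taken.
        if idx in used:
            raise ValueError(
                f"index {idx} for {name} is already used by {used[idx]!r}"
            )
        result[name] = idx
        used[idx] = name
    return result


table = {"<eps>": 0, "a": 1, "b": 2, "#0": 3}
extended = add_disambig_symbols(table, 2)  # adds '#1' -> 4 and '#2' -> 5
```

With a table such as `{"<eps>": 0, "#0": 1, "a": 2}`, the same call raises `ValueError`, because `#1` would be assigned index 2, which already belongs to `a`.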
The message from an FST binary doesn't clearly point to the cause. An easy fix, I'll do.
The tool also trusts the index value of #0 being 1 larger than the last phone symbol, which had better be checked. We generally try to validate everything user-supplied as much as possible.
I'm thinking of adding a utility script to validate FST symbol tables in general, to make sure a file does not contain duplicate strings or duplicate indexes, that 0 is <eps>, and so on. There are a couple of places where a more thorough check is done, a couple of others where it's half-done, and this one does not do much checking at all.
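A rough sketch of what such a validator could look like, written in Python purely for illustration. The function name `validate_symbol_table` and the whitespace-separated "symbol index" line format are assumptions, and treating duplicate strings as warnings rather than errors follows the suggestion made in this thread, not any existing Kaldi script.

```python
def validate_symbol_table(lines):
    """Hypothetical validator: check lines of the form '<symbol> <index>'.
    Returns (errors, warnings) as two lists of messages."""
    errors, warnings = [], []
    by_symbol = {}  # symbol -> line number of first occurrence
    by_index = {}   # index  -> symbol of first occurrence
    for lineno, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        try:
            if len(fields) != 2:
                raise ValueError
            symbol, index = fields[0], int(fields[1])
            if index < 0:
                raise ValueError
        except ValueError:
            errors.append(f"line {lineno}: expected '<symbol> <index>'")
            continue
        if symbol in by_symbol:
            # Duplicate strings: reported as a warning only.
            warnings.append(
                f"line {lineno}: symbol {symbol!r} already defined "
                f"on line {by_symbol[symbol]}"
            )
        else:
            by_symbol[symbol] = lineno
        if index in by_index:
            # Duplicate indexes: reported as an error.
            errors.append(
                f"line {lineno}: index {index} already used "
                f"by {by_index[index]!r}"
            )
        else:
            by_index[index] = symbol
    if by_index.get(0) != "<eps>":
        errors.append("index 0 must be assigned to <eps>")
    return errors, warnings


errors, warnings = validate_symbol_table(["<eps> 0", "a 1", "a 2", "b 2"])
# One warning (symbol 'a' repeated) and one error (index 2 reused).
```

Returning messages instead of exiting lets a caller decide whether warnings should be fatal, which leaves room for the warning-versus-error choice discussed above.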