kaldi-asr / kaldi

kaldi-asr/kaldi is the official location of the Kaldi project.
http://kaldi-asr.org

Make word alignment optional #4802

Closed galv closed 1 year ago

galv commented 1 year ago

For CTC models using word pieces or graphemes, there is not enough positional information to use the word alignment.

I tried marking every unit as "singleton" in word_boundary.txt, but this very often blows up the state space. See:

https://github.com/nvidia-riva/riva-asrlib-decoder/issues/3
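For reference, the "everything is a singleton" workaround amounts to a word_boundary.txt with one `<unit> singleton` line per modeling unit. Below is a minimal sketch of generating such a file; the unit names and the `nonword_units` parameter are hypothetical, for illustration only, and not taken from this PR.

```python
def singleton_word_boundary(units, nonword_units=()):
    """Build word_boundary.txt lines that label every unit as 'singleton'.

    `units` is an iterable of modeling-unit symbols (word pieces or graphemes).
    `nonword_units` is a hypothetical escape hatch for units that should be
    labeled 'nonword' instead (for example, a silence symbol).
    """
    lines = []
    for unit in units:
        kind = "nonword" if unit in nonword_units else "singleton"
        lines.append(f"{unit} {kind}")
    return lines


if __name__ == "__main__":
    # Made-up word-piece inventory, including a CTC blank symbol.
    example_units = ["<blk>", "_the", "_c", "at", "s"]
    for line in singleton_word_boundary(example_units):
        print(line)
```

Because every unit then counts as a complete word by itself, the aligner has no word-internal structure to constrain it, which is consistent with the state-space blow-up described above.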

With the "_" character in CTC models that predict word pieces, we at least know which word pieces begin a word and which ones fall in the middle or at the end of a word. The algorithm would still need to be rewritten, though, especially since "blank" is not a silence phoneme (it can appear between word pieces within a word).
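To illustrate what the "_" marker does and does not give us, here is a minimal sketch that collapses a per-frame CTC best path and groups word pieces into words with frame spans. It is not the Kaldi word-alignment algorithm: it works on a single 1-best path rather than a lattice, and the blank symbol `<blk>`, the marker string, and the function name are assumptions made for this example.

```python
def words_from_ctc_path(frame_tokens, blank="<blk>", word_start="_"):
    """Collapse a per-frame CTC path and group word pieces into words.

    Returns a list of (word, first_frame, last_frame) tuples. A piece that
    starts with `word_start` opens a new word; any other piece is appended
    to the current word. Blank frames are skipped and never end a word.
    """
    words = []
    prev = blank
    for frame, tok in enumerate(frame_tokens):
        if tok == blank:
            # A blank resets repeat detection but carries no word information.
            prev = blank
            continue
        if tok == prev:
            # Repeated emission of the same piece: only extend the span.
            words[-1][2] = frame
            continue
        prev = tok
        if tok.startswith(word_start) or not words:
            text = tok[len(word_start):] if tok.startswith(word_start) else tok
            words.append([text, frame, frame])
        else:
            words[-1][0] += tok
            words[-1][2] = frame
    return [tuple(w) for w in words]


if __name__ == "__main__":
    path = ["<blk>", "_c", "_c", "<blk>", "at", "<blk>", "<blk>", "_sa", "t", "t"]
    print(words_from_ctc_path(path))
```

Note that in the example path a blank sits between "_c" and "at" inside a single word, so blank cannot be treated as a word separator or as silence. The spans also only cover frames where a piece was emitted, so word end times remain a guess.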

I also looked into using the lexicon-based word alignment. I don't have a specific complaint about it, but I hit a weird error where it couldn't create any final state in the output lattice, which caused Connect() to output an empty lattice. This may be because I wasn't sure how to handle the blank token. I treat it as its own phoneme because of limitations in TransitionInformation, but that doesn't really make sense.

The consequence is that the CTM outputs of the CUDA decoder will still be correct from a WER point of view, but their timestamps won't be. For CTC models, though, the timestamps were probably never correct in the first place.
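To make the WER versus timestamp distinction concrete, here is a small sketch that turns word-level frame spans into CTM lines (`<utt> <channel> <start> <duration> <word>`). The function name, the utterance ID, and the 0.04 s frame shift are assumptions for illustration; the real frame shift depends on the model's subsampling.

```python
def to_ctm(utt_id, word_spans, frame_shift=0.04, channel="1"):
    """Format (word, first_frame, last_frame) spans as CTM lines.

    `frame_shift` is the assumed duration of one output frame in seconds.
    """
    lines = []
    for word, first, last in word_spans:
        start = first * frame_shift
        duration = (last - first + 1) * frame_shift
        lines.append(f"{utt_id} {channel} {start:.2f} {duration:.2f} {word}")
    return lines


if __name__ == "__main__":
    for line in to_ctm("utt1", [("cat", 1, 4), ("sat", 7, 9)]):
        print(line)
```

WER scoring only looks at the last column, so it is unaffected by how the frame spans were derived. The start and duration columns are only as good as the alignment behind them.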

galv commented 1 year ago

Friendly ping on this, @jtrmal. This is a backward-compatible change, and I have several more coming that will drastically speed up this code.

jtrmal commented 1 year ago

LGTM