Closed bethant9 closed 2 years ago
The constraint to have a unique lexicon is a major drawback for phone level training.
I think such a constraint exists only for BPE based lexicon. For the phone-based lexicon, it indeed supports multiple pronunciations.
Please see
Thank you for your response. I am referring to the Lexicon class, which throws an error if a non-unique lexicon is provided. It is called by the MMI graph compiler:
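To illustrate the constraint being discussed (this is a hedged toy sketch, not icefall's actual `Lexicon` implementation): a phone-level lexicon often maps one word to several pronunciations, so any check requiring a one-to-one word-to-pronunciation mapping will reject it.

```python
# Toy multi-pronunciation lexicon; the words and phone sets are
# illustrative only, not taken from any icefall recipe.
lexicon = [
    ("read", ["R", "IY", "D"]),   # present tense
    ("read", ["R", "EH", "D"]),   # past tense
    ("data", ["D", "EY", "T", "AH"]),
    ("data", ["D", "AE", "T", "AH"]),
]

def check_unique(entries):
    """Mimic a uniqueness constraint: at most one pronunciation per word."""
    seen = set()
    for word, _ in entries:
        if word in seen:
            raise ValueError(f"duplicate entry for word: {word}")
        seen.add(word)

try:
    check_unique(lexicon)
except ValueError as e:
    print(e)  # a multi-pronunciation lexicon trips the check
```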
https://github.com/k2-fsa/icefall/blob/master/icefall/mmi_graph_compiler.py#L34
Ah, I see. The MMI training in icefall is using BPE based lexicon.
Please refer to the MMI training in snowfall (https://github.com/k2-fsa/snowfall/blob/master/egs/librispeech/asr/simple_v1/mmi_att_transformer_train.py), which uses a phone-based lexicon and supports multiple pronunciations. See https://github.com/k2-fsa/snowfall/blob/911198817edc7b306265f32447ef8a7dc5cfa8f2/snowfall/training/mmi_graph.py#L174
Thank you so much, I will look into Snowfall
Although, what about the aishell MMI example? That seems to run on phones, or am I missing something there?
> Although, what about the aishell MMI example? That seems to run on phones,
Yes, you are right. That recipe uses only the first pronunciation of a word in the lexicon.
It is based on the Librispeech MMI recipe, which uses a BPE-based lexicon, and that's why it uses only one pronunciation.
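A minimal sketch of the first-pronunciation-only behaviour described above (the code is illustrative, not copied from the recipe): when entries are stored keyed by word, only the first pronunciation seen for a word survives.

```python
# Two pronunciations for the same word; only the first is kept.
entries = [
    ("read", ["R", "IY", "D"]),
    ("read", ["R", "EH", "D"]),
]

word2pron = {}
for word, pron in entries:
    word2pron.setdefault(word, pron)  # first entry wins; later ones dropped

print(word2pron["read"])  # ['R', 'IY', 'D']
```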
Great, thanks. I'm not trying to run the aishell example; I just wanted to check my understanding.
We will have to come up with some kind of solution for this. One possibility is to run a simpler kind of system, like CTC, as a kind of alignment pass so we can get the phone-level transcripts, and then train (say) an RNN-T phone-based system, which we could decode with a graph built from a word-based language model.
What's the reason for MMI in k2 not supporting multiple pronunciations? Was it to have a more optimized implementation of the loss?
That's a recipe level limitation, not a fundamental limitation in k2. I believe we have other recipes, in Snowfall at least, that use phones e.g. with CTC, and I'm pretty sure we've tried MMI on such systems. But I'm not sure that we have an example of MMI actually giving an improvement over CTC alone.
Ah I see.
> But I'm not sure that we have an example of MMI actually giving an improvement over CTC alone.
Does this statement also apply to TDNN-F models? I remember that in Kaldi you didn't manage to get results with CTC as good as with MMI.
I think the key difference is that in k2 we are not using context-dependent phones (because we aim to simplify these things and not inherit all the complexity of traditional systems). That may be the reason why in Kaldi, LF-MMI was better than CTC. But I suspect the model topology (transformer vs. TDNN) may play a role too.
I tried monophones with CTC vs. LF-MMI when it was being developed in Kaldi in 2016, and LF-MMI was better. I can't remember the details now so there might have been some caveats, but anyway thanks for the input, I'll see what I get now in k2.
There may have been an issue with our experimental setup, as I believe others have published with k2 and got an improvement (perhaps the STC guys, I'm not sure). But anyway we are mostly focusing on RNN-T now.
The constraint to have a unique lexicon is a major drawback for phone-level training. Frequently a word will have multiple allowed pronunciations, but the current setup only uses the first possible pronunciation. Is there a reason for this constraint? And is there a way to work around this if we want to keep multiple pronunciations? Many thanks!