k2-fsa / icefall

https://k2-fsa.github.io/icefall/
Apache License 2.0

Unique lexicon constraint #306

Closed bethant9 closed 2 years ago

bethant9 commented 2 years ago

The constraint to have a unique lexicon is a major drawback for phone-level training. Frequently a word will have multiple allowed pronunciations, but the current setup only uses the first possible pronunciation. Is there a reason for this constraint? And is there a way to work around this if we want to keep multiple pronunciations? Many thanks!
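For illustration, a phone-level lexicon.txt normally lists one pronunciation per line, so a word with several allowed pronunciations simply appears on several lines (the entries below are made-up examples):

```
HELLO  HH AH L OW
HELLO  HH EH L OW
WORLD  W ER L D
```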

csukuangfj commented 2 years ago

The constraint to have a unique lexicon is a major drawback for phone-level training.

I think such a constraint exists only for the BPE-based lexicon. The phone-based lexicon does support multiple pronunciations.

Please see

https://github.com/k2-fsa/icefall/blob/8cb727e24a349538b2e43fbc63cedd05c6f8f2da/egs/librispeech/ASR/prepare.sh#L141-L143

https://github.com/k2-fsa/icefall/blob/8cb727e24a349538b2e43fbc63cedd05c6f8f2da/egs/librispeech/ASR/tdnn_lstm_ctc/train.py#L320

https://github.com/k2-fsa/icefall/blob/8cb727e24a349538b2e43fbc63cedd05c6f8f2da/icefall/graph_compiler.py#L74

https://github.com/k2-fsa/icefall/blob/8cb727e24a349538b2e43fbc63cedd05c6f8f2da/icefall/graph_compiler.py#L139-L145
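A minimal sketch of driving the phone-based CTC graph compiler (assuming the Lexicon and CtcTrainingGraphCompiler interfaces at the commit linked above; the lang dir and transcripts are placeholders):

```python
from pathlib import Path

import torch

from icefall.lexicon import Lexicon
from icefall.graph_compiler import CtcTrainingGraphCompiler

# data/lang_phone is produced by prepare.sh and contains lexicon.txt, L.pt, etc.
lexicon = Lexicon(Path("data/lang_phone"))
compiler = CtcTrainingGraphCompiler(lexicon, device=torch.device("cpu"))

# Each transcript is compiled into a CTC training graph; a word with several
# pronunciations contributes alternative paths in the resulting FSA.
decoding_graph = compiler.compile(["HELLO WORLD", "ANOTHER UTTERANCE"])
```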

bethant9 commented 2 years ago

Thank you for your response. I am referring to the Lexicon class, which throws an error if a non-unique lexicon is provided. It is called by the MMI graph compiler:

https://github.com/k2-fsa/icefall/blob/78b8792d1d3b15008378b0e38d533a77b456bbbd/icefall/lexicon.py#L94

https://github.com/k2-fsa/icefall/blob/master/icefall/mmi_graph_compiler.py#L34
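For context, the failing check is essentially a uniqueness assertion over the word column of lexicon.txt. A rough, hypothetical re-implementation of that kind of check (illustration only, not the actual icefall code):

```python
from collections import Counter

def assert_unique_lexicon(lexicon_path: str) -> None:
    """Raise if any word in lexicon.txt has more than one pronunciation line."""
    words = []
    with open(lexicon_path, encoding="utf-8") as f:
        for line in f:
            fields = line.split()
            if fields:
                words.append(fields[0])
    duplicated = [w for w, n in Counter(words).items() if n > 1]
    if duplicated:
        raise RuntimeError(f"Lexicon is not unique; repeated words: {duplicated[:10]}")

# A lexicon that lists two pronunciations for the same word trips this check,
# which is exactly what happens with a multi-pronunciation phone lexicon.
```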

csukuangfj commented 2 years ago

Ah, I see. The MMI training in icefall uses a BPE-based lexicon.

Please refer to the MMI training in snowfall (https://github.com/k2-fsa/snowfall/blob/master/egs/librispeech/asr/simple_v1/mmi_att_transformer_train.py), which uses a phone-based lexicon and supports multiple pronunciations. See https://github.com/k2-fsa/snowfall/blob/911198817edc7b306265f32447ef8a7dc5cfa8f2/snowfall/training/mmi_graph.py#L174

bethant9 commented 2 years ago

Thank you so much, I will look into Snowfall

Although, what about the aishell MMI example? That seems to run on phones, or am I missing something there?

csukuangfj commented 2 years ago

Although, what about the aishell MMI example? That seems to run on phones,

Yes, you are right. That recipe uses only the first pronunciation of a word in the lexicon.

https://github.com/k2-fsa/icefall/blob/118e195004ef41f07a26018fff87ee79acea9d31/egs/aishell/ASR/conformer_mmi/train.py#L576

https://github.com/k2-fsa/icefall/blob/118e195004ef41f07a26018fff87ee79acea9d31/icefall/mmi_graph_compiler.py#L12-L20

It is based on the MMI recipe for LibriSpeech, which uses a BPE-based lexicon; that is why it uses only one pronunciation.
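In other words, the lexicon is effectively reduced to the first pronunciation seen for each word. A hypothetical sketch of that filtering (illustration only, not the recipe's code):

```python
def keep_first_pronunciation(lexicon_in: str, lexicon_out: str) -> None:
    """Copy lexicon.txt, keeping only the first pronunciation of each word."""
    seen = set()
    with open(lexicon_in, encoding="utf-8") as fin, \
            open(lexicon_out, "w", encoding="utf-8") as fout:
        for line in fin:
            fields = line.split()
            if not fields or fields[0] in seen:
                continue
            seen.add(fields[0])
            fout.write(line)
```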

bethant9 commented 2 years ago

Great, thanks. I'm not trying to run the aishell example; I just wanted to check my understanding.

danpovey commented 2 years ago

We will have to come up with some kind of solution for this. One possibility is to run a simpler kind of system, like CTC, as a kind of alignment pass so we can get the phone-level transcripts, and then train (say) an RNN-T phone-based system, which we could decode with a graph built from a word-based language model.
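One way to sketch that alignment pass, using plain torch.nn.CTCLoss rather than the k2 tooling: score each candidate pronunciation of an utterance under a trained CTC model and keep the best-scoring one (all names below are illustrative):

```python
from typing import List, Optional

import torch

def pick_best_pronunciation(
    log_probs: torch.Tensor,       # (T, C) log-softmax CTC outputs for one utterance
    candidates: List[List[int]],   # candidate phone-ID sequences; blank assumed to be 0
) -> Optional[List[int]]:
    """Return the candidate pronunciation sequence with the lowest CTC loss."""
    ctc = torch.nn.CTCLoss(blank=0, reduction="sum")
    num_frames = log_probs.size(0)
    best, best_loss = None, float("inf")
    for cand in candidates:
        loss = ctc(
            log_probs.unsqueeze(1),                  # (T, N=1, C)
            torch.tensor([cand], dtype=torch.long),  # (N=1, S)
            torch.tensor([num_frames]),
            torch.tensor([len(cand)]),
        )
        if loss.item() < best_loss:
            best, best_loss = cand, loss.item()
    return best
```

The winning phone sequences could then serve as the phone-level transcripts for training the phone-based RNN-T system.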

francisr commented 2 years ago

What's the reason for MMI in k2 not supporting multiple pronunciations? Was it to have a more optimized implementation of the loss?

danpovey commented 2 years ago

That's a recipe-level limitation, not a fundamental limitation in k2. I believe we have other recipes, in Snowfall at least, that use phones, e.g. with CTC, and I'm pretty sure we've tried MMI on such systems. But I'm not sure that we have an example of MMI actually giving an improvement over CTC alone.

francisr commented 2 years ago

Ah I see.

But I'm not sure that we have an example of MMI actually giving an improvement over CTC alone.

Does this statement apply as well to TDNNF models? I remember that in Kaldi you didn't manage to get results with CTC as good as with MMI.

danpovey commented 2 years ago

I think the key difference is that in k2 we are not using context-dependent phones (because we aim to simplify these things and not inherit all the complexity of traditional systems). That may be the reason why in Kaldi, LF-MMI was better than CTC. But I suspect the model topology (transformer vs. TDNN) may play a role too.

francisr commented 2 years ago

I tried monophones with CTC vs. LF-MMI when it was being developed in Kaldi in 2016, and LF-MMI was better. I can't remember the details now so there might have been some caveats, but anyway thanks for the input, I'll see what I get now in k2.

danpovey commented 2 years ago

There may have been an issue with our experimental setup, as I believe others have published with k2 and got an improvement (perhaps the STC guys, I'm not sure). But anyway we are mostly focusing on RNN-T now.