k2-fsa / icefall

https://k2-fsa.github.io/icefall/
Apache License 2.0

Integrating a Phone-Based Lexicon (lang) into the Zipformer Model #1606

Open kerolos opened 1 week ago

kerolos commented 1 week ago

I'm seeking guidance on how to incorporate a phone-based lexicon (prepared in icefall/egs/librispeech/ASR/prepare.sh, Stage 6) into the latest Zipformer model, a state-of-the-art architecture for speech recognition.

I'm unsure which parameters of the Zipformer model architecture need adjustment to optimize performance for phone-level recognition, rather than the sub-word (sentence-piece) level that is typical of Byte Pair Encoding (BPE) models.

Description: I understand the benefit of open-vocabulary systems like BPE, which eliminate the need for prior knowledge of word pronunciations. However, I'm unsure how BPE handles variations in word pronunciation found in the training material, or words in the training text that have not been normalized to all lower- or upper-case characters. Additionally, during decoding there is a possibility of encountering words with multiple pronunciation variants or specialized terminology (such as legal or medical terms, or foreign words) whose pieces may not be covered well by the BPE model's token list (tokens.txt).

- How does BPE handle variations in word pronunciation during training and decoding?
- What strategies can I use to address the limitations of BPE models when encountering specialized terminology or words with multiple variants during decoding?

These might be the drawbacks of a BPE-based lexicon system.
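To make the open-vocabulary point concrete, here is a toy illustration (plain Python, not the actual BPE merge algorithm used by sentencepiece) of why a subword model never hits a hard out-of-vocabulary error: an unseen word such as a medical term is simply split into smaller known pieces, down to single characters if necessary. The piece inventory below is invented for the example.

```python
# Toy greedy longest-match subword tokenizer. This is NOT how sentencepiece
# really segments text (it uses learned merges / unigram scores), but it
# shows the open-vocabulary property: any string is covered by ever-smaller
# pieces, so "unknown" words still get tokenized.
PIECES = {"myo", "card", "itis", "in", "farc", "tion",
          "m", "y", "o", "c", "a", "r", "d", "i", "t", "s", "f", "n"}

def tokenize(word, pieces=PIECES):
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest matching piece first, then shorter ones.
        for j in range(len(word), i, -1):
            if word[i:j] in pieces:
                tokens.append(word[i:j])
                i = j
                break
        else:
            # Character not in the vocabulary at all; a real system would
            # emit <unk> or use byte fallback here.
            tokens.append("<unk>")
            i += 1
    return tokens

print(tokenize("myocarditis"))  # -> ['myo', 'card', 'itis']
```

The flip side, as noted above, is that the resulting pieces carry no pronunciation information, so pronunciation variants of the same spelling cannot be distinguished the way a phone lexicon distinguishes them.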

I have a few questions:

1. How can I effectively use a phone-based lexicon with the Zipformer model, and which Zipformer model or recipe should be used?
2. Which parameters of the Zipformer model architecture (whose layers run at different frame rates) should be adjusted or tuned to work well at the phone level, rather than the sub-word / sentence-piece (BPE) level this model was designed for?

It would also be useful to compare the old technology, a TDNN model in original Kaldi, against the Zipformer model in Next-gen Kaldi (icefall) with a phone-based lexicon on the same dataset, and across different languages.

Any advice on these questions would be greatly appreciated. Thanks in advance.

wangtiance commented 1 week ago

You may refer to egs/librispeech/ASR/tiny_transducer_ctc for how to incorporate the phone lexicon. Basically, you use a UniqLexicon object to convert texts to phone tokens. Note that it doesn't handle multiple pronunciations.
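For orientation, the mapping that a lexicon-based setup performs can be sketched as follows. This is a simplified, self-contained stand-in, not icefall's actual API: the real UniqLexicon reads the files under a lang directory and returns k2 ragged tensors of token IDs. The tiny lexicon and token table below are invented for the example.

```python
# Simplified sketch of phone-lexicon lookup: transcript -> phone token IDs.
# The real icefall UniqLexicon keeps exactly one pronunciation per word
# (hence "Uniq"); supporting multiple pronunciations would require a
# word -> list-of-pronunciations structure plus a selection rule.
LEXICON = {                       # word -> single pronunciation
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}
TOKEN2ID = {"<blk>": 0, "HH": 1, "AH": 2, "L": 3, "OW": 4,
            "W": 5, "ER": 6, "D": 7}   # analogue of tokens.txt

def text_to_token_ids(text):
    """Map a transcript to a flat list of phone token IDs via the lexicon."""
    ids = []
    for word in text.lower().split():
        phones = LEXICON[word]  # KeyError here == OOV word: unlike BPE,
                                # a phone system needs every word listed
        ids.extend(TOKEN2ID[p] for p in phones)
    return ids

print(text_to_token_ids("hello world"))  # -> [1, 2, 3, 4, 5, 6, 3, 7]
```

This also makes the trade-off explicit: the lexicon gives you exact control over pronunciations (including adding domain terms by hand), but any word missing from it is a hard failure, whereas BPE degrades gracefully.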

Based on my experience, BPE models have better WER than phone models. I'm looking forward to your results.