jishengpeng / WavTokenizer

SOTA discrete acoustic codec models with 40 tokens per second for audio language modeling
MIT License

About ASR #6

Open wntg opened 2 months ago

wntg commented 2 months ago

Thanks for your excellent work! I want to ask how discrete tokenizers perform on ASR. Can you share your understanding? Thanks!

jishengpeng commented 2 months ago

> Thanks for your excellent work! I want to ask how discrete tokenizers perform on ASR. Can you share your understanding? Thanks!

The aspect of discrete tokenizers most relevant to Automatic Speech Recognition (ASR) is demonstrated in the WavTokenizer experiments on the ARCH benchmark, as presented in the paper.

  1. In ASR tasks, the primary focus is on the textual content of the speech modality, whereas the inherent acoustic information is not emphasized. Therefore, semantic tokenizers like Whisper are sufficient for ASR. If discretization is required, directly discretizing semantic tokens is the better fit for ASR; employing acoustic tokenizers solely for ASR is not ideal (a minimal sketch contrasting the two routes follows this list). Notably, in end-to-end speech dialogue systems like GPT-4o, the input is not merely the text output of an ASR model. It also requires information inherent to the speech modality, such as the speaker's emotion, tone, and style. Consequently, acoustic tokens have a broader application scope in such multi-task large models than semantic tokens. Furthermore, in future multi-modal unified large models, acoustic tokenizers can better distinguish the speech modality from other modalities, representing the speech modality itself.

  2. Regarding semantic tokens in ASR tasks: although numerous efforts have been made to enhance the semantic information in acoustic tokenizers, even approaches that sacrifice audio and music modeling capability for semantic modeling still carry less semantic information than the best semantic tokenizers. The gap is even clearer for more elegant semantic-enhancement methods such as WavTokenizer, which remains weaker than encoder-based models like HuBERT in this respect.

  3. However, we believe that acoustic tokenizers have potential and may eventually match the semantic information content of encoder-based models like HuBERT. The key lies in elegantly enhancing the encoder's capabilities. Notably, in WavTokenizer, we significantly strengthened the decoder, but the encoder is also crucial. One of our objectives is to substantially enhance the encoder's capabilities in WavTokenizer 2, thereby further improving semantic modeling capabilities.
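For concreteness, here is a minimal sketch of the two discretization routes mentioned above: k-means-quantized semantic features versus acoustic codec tokens. The `semantic_encoder` and `acoustic_codec` objects are hypothetical placeholders (stand-ins for a HuBERT/Whisper-style encoder and a codec such as WavTokenizer), not APIs from this repository:

```python
import torch
from sklearn.cluster import KMeans


def discretize_semantic(features: torch.Tensor, n_clusters: int = 500) -> torch.Tensor:
    """Quantize continuous encoder features of shape (T, D) into discrete
    unit IDs of shape (T,) via k-means -- the usual recipe for building
    "semantic tokens" for ASR (in practice the k-means model is fit on a
    large corpus of features, not a single utterance)."""
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(features.numpy())
    return torch.from_numpy(km.predict(features.numpy()))


# Route 1: semantic tokens -- frame-level features from a speech encoder,
# then k-means units; these carry mostly linguistic content, which is what ASR needs.
# features = semantic_encoder(wav)            # hypothetical, e.g. HuBERT hidden states (T, D)
# semantic_units = discretize_semantic(features)

# Route 2: acoustic tokens -- a neural codec compresses the waveform itself,
# preserving speaker identity, prosody, and style on top of the content.
# acoustic_codes = acoustic_codec.encode(wav)  # hypothetical codec call, e.g. WavTokenizer
```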

wntg commented 2 months ago

Thank you for your detailed answer. I feel the same way. I hope that one day tokenizers will have good expressiveness both acoustically and semantically. I also have some other questions. In the current work, discrete encoding can generate realistic audio, and these speech outputs have textual content, so why can't they express semantics well? I suspect this is biased learning, but that is not necessarily a bad thing. For example, a child may learn to speak first but not yet be able to write. Regarding the deep learning task, can we use discrete coding to conduct voice dialogues directly, skipping the step of converting to text?

jishengpeng commented 2 months ago

> Thank you for your detailed answer. I feel the same way. I hope that one day tokenizers will have good expressiveness both acoustically and semantically. I also have some other questions. In the current work, discrete encoding can generate realistic audio, and these speech outputs have textual content, so why can't they express semantics well? I suspect this is biased learning, but that is not necessarily a bad thing. For example, a child may learn to speak first but not yet be able to write. Regarding the deep learning task, can we use discrete coding to conduct voice dialogues directly, skipping the step of converting to text?

Regarding the two new questions raised, our perspectives are as follows:

  1. On the question of why the reconstruction quality is good but the semantic content is not particularly rich, there are a few possible explanations. One is that the encoder of the acoustic codec may have multiple training objectives (semantic and acoustic), so even if it contains semantic information, there could be issues with information fusion and interference. Another factor is that Whisper uses a single encoder structure, whereas codec models derive their strong reconstruction capability from the combined encoder-decoder architecture. The current strong reconstruction performance is thus most likely attributable to the powerful decoder, so future efforts should focus on further strengthening the encoder.
  2. For the GPT-4o dialogue system and subsequent multimodal large language models, my personal view is that the developmental trajectory will progress from a text-only cascade of ASR + LLM + TTS, to an implicit cascade using latent embeddings, and finally to a direct end-to-end approach that takes codec representations extracted by WavTokenizer as input and generates target codec outputs (a rough sketch of this last stage is given after this list). The ultimate goal would be to leverage the inherent tokenizers of the various modalities to enable truly end-to-end training across arbitrary modalities.
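As a rough illustration of that last stage, here is a hedged sketch of a direct codec-to-codec dialogue turn with no intermediate text. `codec.encode`, `codec.decode`, and `token_lm.generate` are hypothetical placeholders for a WavTokenizer-style codec and a language model trained on its token space, not APIs from this repository:

```python
import torch


def speech_to_speech_turn(codec, token_lm, user_wav: torch.Tensor) -> torch.Tensor:
    """One dialogue turn without any intermediate text:
    waveform -> discrete acoustic tokens -> token LM -> tokens -> waveform."""
    # 1. Tokenize the user's speech with the acoustic codec (e.g. ~40 tokens/s).
    user_tokens = codec.encode(user_wav)           # hypothetical: (T,) discrete token IDs

    # 2. The language model operates purely in token space, so emotion, tone,
    #    and style can survive the round trip instead of being dropped by ASR.
    reply_tokens = token_lm.generate(user_tokens)  # hypothetical autoregressive decoding

    # 3. Decode the reply tokens back into audio.
    return codec.decode(reply_tokens)              # hypothetical: waveform tensor
```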

wntg commented 2 months ago

Thanks for your answer; I learned a lot, and I agree with you very much. Regarding the second point, I would like to try end-to-end research using tokenizers such as WavTokenizer. I also look forward to your follow-up work!