wntg opened this issue 2 months ago
Thanks for your excellent work! I want to ask how discrete tokenizers perform on ASR. Could you share your understanding? Thanks!
The most relevant evidence on how discrete tokenizers perform in Automatic Speech Recognition (ASR) is the set of experiments on the ARCH benchmark with WavTokenizer, as presented in the paper.
In ASR tasks, the primary focus is the textual content of speech, while the inherent acoustic information is not emphasized. Semantic tokenizers (or semantic encoders such as Whisper) are therefore sufficient for ASR, and if discretization is required, directly discretizing semantic tokens is the better fit; using acoustic tokenizers solely for ASR is not ideal. Notably, in end-to-end speech dialogue systems like GPT-4o, the input is not merely the text output of an ASR model: the model also needs information inherent to the speech modality, such as the speaker's emotion, tone, and style. Consequently, acoustic tokens have a broader application scope in such multi-task large models than semantic tokens. Furthermore, in future unified multi-modal large models, acoustic tokenizers can better distinguish speech from other modalities, representing the speech modality itself.
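To make "directly discretizing semantic tokens" concrete, here is a minimal sketch of the common recipe: extract continuous features from a semantic encoder (HuBERT here), then quantize each frame with k-means so it becomes one discrete unit. The checkpoint name, layer index, and k=500 are illustrative choices on my part, not values from the paper.

```python
import torch
from sklearn.cluster import MiniBatchKMeans
from transformers import HubertModel, Wav2Vec2FeatureExtractor

model_name = "facebook/hubert-base-ls960"  # assumed checkpoint
extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_name)
hubert = HubertModel.from_pretrained(model_name).eval()

def semantic_features(waveform_16k: torch.Tensor) -> torch.Tensor:
    """Return (frames, dim) hidden states for a 16 kHz mono waveform."""
    inputs = extractor(waveform_16k.numpy(), sampling_rate=16000,
                       return_tensors="pt")
    with torch.no_grad():
        out = hubert(**inputs, output_hidden_states=True)
    # An intermediate layer is commonly used for semantic units;
    # layer 6 is one frequent choice for the base model.
    return out.hidden_states[6].squeeze(0)

# Fit k-means on features pooled over a corpus; afterwards each frame's
# nearest centroid index is its discrete semantic token.
kmeans = MiniBatchKMeans(n_clusters=500)
# feats = torch.cat([semantic_features(w) for w in corpus])  # corpus: your data
# kmeans.fit(feats.numpy())
# tokens = kmeans.predict(semantic_features(utterance).numpy())
```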
Regarding semantic tokens in ASR tasks: although many works try to inject semantic information into acoustic tokenizers, some even sacrificing audio and music modeling ability to do so, their semantic content still falls short of the best semantic tokenizers. Even with more elegant semantic-enhancement methods, such as the one in WavTokenizer, the result remains weaker than encoder-based models like HuBERT.
However, we believe acoustic tokenizers have the potential to eventually match the semantic content of encoder-based models like HuBERT. The key lies in elegantly enhancing the encoder's capabilities. In WavTokenizer we significantly strengthened the decoder, but the encoder is just as crucial. One of our goals for WavTokenizer 2 is to substantially enhance the encoder and thereby further improve semantic modeling.
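To illustrate why encoder capacity matters here, below is a toy encoder–VQ–decoder codec: the quantizer can only keep information the encoder has already put into its latents, so a stronger encoder (not just a stronger decoder) is what lets the discrete tokens carry more semantics. All sizes are illustrative; this is not the WavTokenizer architecture.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    def __init__(self, codebook_size=1024, dim=256):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, z):                        # z: (batch, frames, dim)
        d = torch.cdist(z, self.codebook.weight)  # distances to all codes
        idx = d.argmin(-1)                         # discrete tokens
        q = self.codebook(idx)
        q = z + (q - z).detach()                   # straight-through gradient
        return q, idx

class ToyCodec(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        # Encoder: downsample raw audio into frame-level latents.
        self.encoder = nn.Sequential(
            nn.Conv1d(1, dim, kernel_size=320, stride=320), nn.GELU(),
            nn.Conv1d(dim, dim, kernel_size=3, padding=1),
        )
        self.vq = VectorQuantizer(dim=dim)
        # Decoder: upsample quantized latents back to a waveform.
        self.decoder = nn.ConvTranspose1d(dim, 1, kernel_size=320, stride=320)

    def forward(self, wav):                      # wav: (batch, 1, samples)
        z = self.encoder(wav).transpose(1, 2)    # (batch, frames, dim)
        q, tokens = self.vq(z)
        recon = self.decoder(q.transpose(1, 2))
        return recon, tokens

wav = torch.randn(1, 1, 16000)                   # 1 s of fake 16 kHz audio
recon, tokens = ToyCodec()(wav)
print(tokens.shape)                              # 50 tokens for 1 s of audio
```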
Thank you for your detailed answer; I feel the same way. I hope that one day tokenizers will express both acoustics and semantics well. I also have further questions. In current work, discrete codes can generate realistic audio, and these utterances carry textual content, so why can't they express semantics well? I suspect this is biased learning, though that is not necessarily a bad thing: for example, a child may learn to speak before learning to write. For deep learning tasks, can we use discrete codes to conduct voice dialogue directly, skipping the conversion to text?
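For context, here is a conceptual sketch of what "skipping text" could mean in practice: a dialogue model that reads and writes codec tokens directly, with no ASR or TTS stage. The `tokenizer` and `lm` objects below are hypothetical placeholders, not an existing API.

```python
def voice_dialogue_turn(user_wav, tokenizer, lm, max_new_tokens=500):
    """User speech -> discrete tokens -> LM continuation -> speech reply."""
    in_tokens = tokenizer.encode(user_wav)       # (1, T) discrete codes
    # The LM is assumed to be trained on dialogue turns expressed as codec
    # tokens, so generation directly yields the reply's acoustic tokens,
    # preserving prosody/emotion cues that a text bottleneck would drop.
    out_tokens = lm.generate(in_tokens, max_new_tokens=max_new_tokens)
    reply_tokens = out_tokens[:, in_tokens.shape[1]:]
    return tokenizer.decode(reply_tokens)        # waveform of the reply
```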
Regarding the two new questions raised, our perspectives are as follows:
Thanks for your answer, I learned a lot and very much agree. Regarding the second point, I would like to try end-to-end research using tokenizers such as WavTokenizer. I look forward to your follow-up work!