varfolomeeff opened this issue 2 months ago (status: Open)
Hello, thanks for the great work, it was interesting! Can you please tell me more about how you align the speech space with the text vocabulary? As far as I understand, you use 200 centroids from the speech space and then align the codebook with the language vocabulary.
During the training of the WavTokenizer, no explicit alignment to the text space was performed. Instead, when training the end-to-end Mini-GPT4O model, we leveraged a pre-trained language model to forcibly align the speech modality with the text space.
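To make the idea of aligning through a pre-trained language model more concrete, here is a minimal sketch of one common approach: treating each speech codebook index as an extra token ID appended after the text vocabulary, so the language model can be trained on mixed text and speech sequences. This is an assumption for illustration, not something confirmed in this thread as the actual Mini-GPT4O implementation. The vocabulary size, codebook size, special tokens, and function names below are all hypothetical.

```python
# Illustrative sketch only: sizes, special tokens, and helper names are
# hypothetical and are not taken from the WavTokenizer or Mini-GPT4O code.

TEXT_VOCAB_SIZE = 32000       # assumed size of the pre-trained LM's text vocabulary
SPEECH_CODEBOOK_SIZE = 4096   # assumed number of entries in the speech codebook

# Two assumed special tokens that mark the start and end of a speech segment.
BOS_SPEECH = TEXT_VOCAB_SIZE + SPEECH_CODEBOOK_SIZE
EOS_SPEECH = BOS_SPEECH + 1
TOTAL_VOCAB_SIZE = EOS_SPEECH + 1


def speech_code_to_token_id(code: int) -> int:
    """Map a speech codebook index to an ID placed after the text vocabulary."""
    if not 0 <= code < SPEECH_CODEBOOK_SIZE:
        raise ValueError(f"speech code out of range: {code}")
    return TEXT_VOCAB_SIZE + code


def token_id_to_speech_code(token_id: int) -> int:
    """Inverse mapping, used when decoding generated speech tokens."""
    code = token_id - TEXT_VOCAB_SIZE
    if not 0 <= code < SPEECH_CODEBOOK_SIZE:
        raise ValueError(f"not a speech token ID: {token_id}")
    return code


def build_sequence(text_ids: list[int], speech_codes: list[int]) -> list[int]:
    """Concatenate text token IDs with a delimited run of speech token IDs."""
    speech_ids = [speech_code_to_token_id(c) for c in speech_codes]
    return list(text_ids) + [BOS_SPEECH] + speech_ids + [EOS_SPEECH]


if __name__ == "__main__":
    sequence = build_sequence(text_ids=[1, 2], speech_codes=[0, 5])
    print(sequence)
    print(TOTAL_VOCAB_SIZE)
```

Under this scheme the language model's embedding table would be extended to `TOTAL_VOCAB_SIZE` rows, and ordinary next-token training on such mixed sequences is what pulls the speech tokens toward the text space; no separate alignment step is applied to the tokenizer itself.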
@jishengpeng What did the training data look like for the text-speech alignment?
Thank you for your interest. The amount of aligned training data depends on various factors, such as the training methods and architectures used. We will provide a detailed analysis of this in the WavChat survey.