jishengpeng / WavTokenizer

SOTA discrete acoustic codec models with 40 tokens per second for audio language modeling
MIT License

Alignment of language vocabulary and speech space #22

Open · varfolomeeff opened this issue 2 months ago

varfolomeeff commented 2 months ago

Hello, and thanks for the great work; it was very interesting! Can you tell me more about how you align the speech space with the text vocabulary? As far as I understand, you use 200 centroids from the speech space and then align the codebook with the language vocabulary.

jishengpeng commented 2 months ago

> Hello, and thanks for the great work; it was very interesting! Can you tell me more about how you align the speech space with the text vocabulary? As far as I understand, you use 200 centroids from the speech space and then align the codebook with the language vocabulary.

During WavTokenizer training, no explicit alignment to the text space is performed. Instead, when training the end-to-end Mini-GPT4O model, we leverage a pre-trained language model to forcibly align the speech modality with the text space.
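
For concreteness, here is a minimal sketch (not the authors' actual code) of this kind of forced alignment: the codec's discrete indices are added to a pre-trained language model's vocabulary as new tokens, and the model is then fine-tuned on interleaved text/speech sequences. The base model (`gpt2`), the `<speech_i>` token format, and the codebook size are all illustrative assumptions.

```python
# Sketch of folding discrete codec tokens into a pre-trained LM's vocabulary
# so that speech and text are modeled in one shared token space.
from transformers import AutoModelForCausalLM, AutoTokenizer

CODEBOOK_SIZE = 4096  # assumed codec codebook size; match your WavTokenizer config

# Any pre-trained causal LM works for the sketch; gpt2 is just small and public.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# One new token per codec index, e.g. "<speech_0>" ... "<speech_4095>".
speech_tokens = [f"<speech_{i}>" for i in range(CODEBOOK_SIZE)]
tokenizer.add_tokens(speech_tokens, special_tokens=True)

# Grow the embedding matrix so each new speech token gets a trainable row.
model.resize_token_embeddings(len(tokenizer))

# An interleaved training example: a text prompt followed by codec indices
# (placeholder values standing in for WavTokenizer's encoder output).
codec_indices = [17, 902, 3051]
example = "Say hello." + "".join(f"<speech_{i}>" for i in codec_indices)

batch = tokenizer(example, return_tensors="pt")
loss = model(**batch, labels=batch["input_ids"]).loss  # ordinary next-token loss
loss.backward()  # fine-tuning pulls the new speech rows toward the text space
```

Because the pre-trained text embeddings live in the same matrix as the new speech rows, gradient updates during fine-tuning pull the speech tokens toward the text space, which is the sense in which the alignment is "forced".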

tanmaylaud commented 2 weeks ago

@jishengpeng What did the training data look like for the text-speech alignment?

jishengpeng commented 1 week ago

> @jishengpeng What did the training data look like for the text-speech alignment?

Thank you for your interest. The amount of aligned training data depends on various factors, such as the training methods and architectures used. We will provide a detailed analysis of this in the WavChat survey.