janhq / WhisperSpeech

An Open Source text-to-speech system built by inverting Whisper.
https://collabora.github.io/WhisperSpeech/
MIT License

task: Train and test text2semantic under a decoder-only framework for ichigo v0.5 #3

Open tikikun opened 4 days ago

tikikun commented 4 days ago

Motivation

Since ichigo v0.5 will support additional languages, the traditional t2s pipeline will become obsolete. This is a good chance to introduce a t2s framework that we have full control over.

Goal

Be able to handle any arbitrary language

Methodology

What needs to be done:

Experiments

| Run ID | Date | Model Config | Dataset | Learning Rate | Batch Size | Steps | Loss | Hardware |
|---|---|---|---|---|---|---|---|---|
| exp-t2s-0.5B | 2024-11-28 | Full-Finetune | Instruction text to sound semantic tokens | 1e-3 | 96 | 28810 | 1.6-1.7 | ~4 hours on 2xH100 |
| exp-t2s-1.5B-1 | 2024-11-29 | Full-Finetune | Instruction text to sound semantic tokens | 1e-3 | 84 | 28810 | 2.64 | ~10 hours on 6xA6000 |
| exp-t2s-1.5B-2 | 2024-11-30 | Full-Finetune | Instruction text to sound semantic tokens | 1e-4 | 84 | 28810 | 1.84 | ~10 hours on 6xA6000 |
| exp-t2s-llama3.2-1B | 2024-12-01 | Full-Finetune | Instruction text to sound semantic tokens | 1e-4 | 96 | 25208 | 1.73 | ~9 hours on 6xA6000 |

bachvudinh commented 3 days ago

(image attached)

dan-homebrew commented 3 days ago

Need to add more details to this issue:

Please help me align the nomenclature, etc. @tikikun's diagram above is very helpful.

hahuyhoang411 commented 1 day ago

I moved the table to the top for better visualization. cc @bachvudinh

PodsAreAllYouNeed commented 14 minutes ago

This task is a hybrid between text-to-speech and speech-to-speech translation. It is quite hard because there is a one-to-many mapping between the input text and the possible output token combinations.

Here are two papers that use the same AR setting, but for slightly different tasks. I think their approach can be adapted.

AudioPaLM: https://arxiv.org/pdf/2306.12925
VALL-E: https://arxiv.org/pdf/2301.02111

Specifically, I think we can use VALL-E's idea of applying a phoneme conversion layer before sending the text into the AR model; this might bridge the gap to the semantic embeddings a bit and make the AR model's job easier. We also need to somehow provide auxiliary information about the expected acoustic ground truth. Otherwise, if we provide only text to the AR model, there are too many possible correct answers, so across multiple samples the loss may conflict.
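As an illustration of the phoneme front-end idea (my own sketch, not something from the papers or this repo), the text could be run through a grapheme-to-phoneme step before the AR model, e.g. with the third-party g2p_en package; the phoneme vocabulary here is a made-up placeholder.

```python
# Hedged sketch: convert input text to phonemes before the AR model, in the
# spirit of VALL-E's phoneme front-end. Uses the third-party g2p_en package
# as one possible grapheme-to-phoneme converter; the vocabulary handling is
# a placeholder, not part of this repo.
from g2p_en import G2p

g2p = G2p()

def text_to_phoneme_ids(text, phoneme_vocab):
    """Map text -> phoneme symbols -> integer IDs for the AR model."""
    phonemes = [p for p in g2p(text) if p != " "]  # e.g. ['HH', 'AH0', 'L', 'OW1', ...]
    return [phoneme_vocab.setdefault(p, len(phoneme_vocab)) for p in phonemes]

vocab = {}
print(text_to_phoneme_ids("Hello world", vocab))
```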

However, I think it will be hard to make this work. The AR model needs a better constraint.

My proposal

In the WhisperSpeech framework, the text-to-semantic model is the inverse of the Whisper decoder. We need to involve the Whisper decoder in the training.

1) Keep the same AR model structure.
2) However, instead of trying to get the model to predict the WhisperVQ codes, send continuous embeddings into the frozen Whisper decoder. What we are trying to do is get the AR decoder model to trick the Whisper decoder into thinking it is seeing output from the Whisper encoder.
3) Compute the loss of the Whisper decoder output against the original text.

You will hit a practical challenge: while training this AR decoder model, it is effectively acting as a NAR encoder model from the Whisper decoder's point of view. There might be a smart way to solve this, but I can't think of one at the moment; alternatively, you can just use a NAR model.
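For concreteness, here is a rough sketch of how steps 2-3 could be wired up with Hugging Face's WhisperForConditionalGeneration; `ar_model` is a hypothetical module and the tokenization details are glossed over, so treat it as an illustration of the idea rather than a working recipe.

```python
# Rough sketch of steps 2-3 above (my interpretation, not an existing
# implementation). `ar_model` is a hypothetical module that maps text tokens
# to continuous embeddings shaped like Whisper encoder states.
from transformers import WhisperForConditionalGeneration

whisper = WhisperForConditionalGeneration.from_pretrained("openai/whisper-base")
for p in whisper.parameters():
    p.requires_grad = False  # the Whisper decoder stays frozen; its encoder is bypassed

def training_step(ar_model, whisper_text_ids):
    # Step 2: the AR model emits "fake" encoder states, shape (batch, seq, d_model),
    # meant to fool the frozen Whisper decoder via cross-attention.
    fake_encoder_states = ar_model(whisper_text_ids)

    # Step 3: feed them in place of real encoder outputs and score the decoder's
    # reconstruction of the original transcript (labels are Whisper-tokenized text).
    out = whisper(
        encoder_outputs=(fake_encoder_states,),
        labels=whisper_text_ids,
    )
    return out.loss  # gradients flow only into ar_model
```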

Another (Simpler) Idea

If we really want an AR model trained with next-token prediction, must keep the WhisperVQ tokens in their current format, and don't want to add auxiliary information, we can try a simple intervention: group identical consecutive WhisperVQ tokens together. This way, the model is not penalized for getting the output length wrong.

i.e. this original example: <|sound_start|><|sound_0209|><|sound_0134|><|sound_0134|><|sound_0134|><|sound_0241|><|sound_0222|><|sound_0239|><|sound_0329|><|sound_0197|><|sound_0115|><|sound_0409|><|sound_0196|><|sound_0196|><|sound_0196|><|sound_0196|><|sound_0196|><|sound_0235|><|sound_0487|><|sound_0487|><|sound_0459|><|sound_0459|><|sound_0405|><|sound_end|>

gets mapped to this: <|sound_start|><|sound_0209|><|sound_0134|><|sound_0241|><|sound_0222|><|sound_0239|><|sound_0197|><|sound_0115|><|sound_0409|><|sound_0196|><|sound_0235|><|sound_0487|><|sound_0459|><|sound_0405|><|sound_end|>

This way, the order of the output tokens matters, but the number of consecutively repeated tokens does not. We can worry about upsampling the number of tokens as a separate problem. It might not matter to the decoder, because the whole token sequence gets cross-attention anyway, and the repeated tokens might not be adding that much information. During fine-tuning we can apply a similar filtering to the WhisperVQ token stream to see whether performance changes. If repeated tokens don't impact performance, this also makes inference faster.
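A small sketch of the proposed collapsing, just to make the intervention concrete (the helper name is mine, not project code):

```python
# Collapse consecutive repeated WhisperVQ tokens while keeping their order.
import itertools

def collapse_repeats(tokens):
    """Keep one token per run of consecutive duplicates."""
    return [tok for tok, _ in itertools.groupby(tokens)]

seq = ["<|sound_start|>", "<|sound_0209|>", "<|sound_0134|>", "<|sound_0134|>",
       "<|sound_0134|>", "<|sound_0241|>", "<|sound_end|>"]
print(collapse_repeats(seq))
# ['<|sound_start|>', '<|sound_0209|>', '<|sound_0134|>', '<|sound_0241|>', '<|sound_end|>']
```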