tikikun opened this issue 4 days ago
Be able to handle any arbitrary language
[ { "content": "<|text_to_sementic|>he telegraphed to general pemberton that he had learned sherman was between them with four divisions at clinton saying that it was important to reestablish communications that pemberton might be reenforced and directing him to come up in sherman's rear at once", "role": "user" }, { "content": "<|sound_start|><|sound_0209|><|sound_0134|><|sound_0134|><|sound_0134|><|sound_0241|><|sound_0222|><|sound_0239|><|sound_0329|><|sound_0197|><|sound_0115|><|sound_0409|><|sound_0196|><|sound_0196|><|sound_0196|><|sound_0196|><|sound_0196|><|sound_0235|><|sound_0487|><|sound_0487|><|sound_0459|><|sound_0459|><|sound_0405|><|sound_end|>", "role": "assistant" } ]
Need to add more details to this issue:
Please help me align the nomenclature, etc. @tikikun's diagram above is very helpful.
I moved the table to the top for better visualization. cc @bachvudinh
This task is a hybrid between text-to-speech and speech-to-speech translation. It is quite hard because there is a one-to-many mapping between the input text and the possible output token combinations.
Here are two papers that use the same AR setting for slightly different tasks; I think their approach can be adapted:
- AudioPaLM: https://arxiv.org/pdf/2306.12925
- VALL-E: https://arxiv.org/pdf/2301.02111
Specifically, I think we can use VALL-E's idea of adding a phoneme-conversion layer before sending the text into the AR model; this might bridge the gap to the semantic embeddings a bit and make the AR model's job easier. We also need to provide some auxiliary information about the expected acoustic ground truth: if we give the AR model text only, there are too many possible correct answers, so the loss may conflict across samples.
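For reference, here is a rough sketch of what such a phoneme front end could look like, assuming the open-source `phonemizer` package with an espeak-ng backend (a multilingual option; the choice of phonemizer and phone set is an assumption here, not a decision):

```python
# Rough sketch of a phoneme front end, assuming the `phonemizer` package
# (espeak-ng backend, which covers many languages).
from phonemizer import phonemize

def text_to_phonemes(text: str, language: str = "en-us") -> str:
    # Convert graphemes to IPA phones; the AR model would then consume
    # phone tokens instead of raw text tokens.
    return phonemize(
        text,
        language=language,
        backend="espeak",
        strip=True,
        preserve_punctuation=True,
    )

# text_to_phonemes("he telegraphed to general pemberton")
# -> roughly "hiː tˈɛlɪɡɹæft tə dʒˈɛnəɹəl pˈɛmbətən"
```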
However, I think it will be hard to make this work. The AR model needs a better constraint.
In the WhisperSpeech framework, the text-to-semantic model is the inverse of the Whisper decoder, so we need to involve the Whisper decoder in the training:
1. Keep the same AR model structure.
2. Instead of training the model to predict the WhisperVQ codes, send its continuous embeddings into the frozen Whisper decoder. What we are trying to do is get the AR decoder to trick the Whisper decoder into thinking it is seeing output from the Whisper encoder.
3. Compute the loss between the Whisper decoder's output and the original text.
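A minimal sketch of steps 2-3, assuming a Hugging Face Whisper checkpoint and a hypothetical `ar_model` that maps the text prompt to continuous embeddings with Whisper's encoder hidden size (this illustrates the idea, it is not the training code):

```python
# Sketch only: `ar_model` is hypothetical; it must output embeddings of shape
# (batch, seq_len, d_model) matching Whisper's encoder hidden size
# (768 for whisper-small).
import torch
from transformers import WhisperForConditionalGeneration, WhisperTokenizer

whisper = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-small")
whisper.requires_grad_(False)  # frozen; only the AR model receives gradients

def whisper_decoder_loss(fake_encoder_states: torch.Tensor, text: str) -> torch.Tensor:
    """Feed the AR model's continuous embeddings to the frozen Whisper decoder
    as if they were encoder output, and score how well the decoder recovers
    the original text from them."""
    labels = tokenizer(text, return_tensors="pt").input_ids
    out = whisper(
        encoder_outputs=(fake_encoder_states,),  # pretend this came from the encoder
        labels=labels,                           # Whisper shifts the labels internally
    )
    return out.loss  # cross-entropy of decoder output vs. the original text

# Hypothetical training step:
# fake_states = ar_model(text_prompt)                   # (1, T, 768)
# whisper_decoder_loss(fake_states, text_prompt).backward()
```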
You will hit a practical challenge: while training this AR decoder model, it is effectively acting as a NAR encoder for the Whisper decoder. There might be a smart way to solve this, but I can't think of one at the moment; alternatively, you can just use a NAR model.
If we really want an AR model trained with next-token prediction on WhisperVQ tokens in the current format, and we don't want to add auxiliary information, we can try a simple intervention: group consecutive identical WhisperVQ tokens together. This way, the model is not penalized for getting the output length wrong.
i.e., this original example: <|sound_start|><|sound_0209|><|sound_0134|><|sound_0134|><|sound_0134|><|sound_0241|><|sound_0222|><|sound_0239|><|sound_0329|><|sound_0197|><|sound_0115|><|sound_0409|><|sound_0196|><|sound_0196|><|sound_0196|><|sound_0196|><|sound_0196|><|sound_0235|><|sound_0487|><|sound_0487|><|sound_0459|><|sound_0459|><|sound_0405|><|sound_end|>
gets mapped to this: <|sound_start|><|sound_0209|><|sound_0134|><|sound_0241|><|sound_0222|><|sound_0239|><|sound_0329|><|sound_0197|><|sound_0115|><|sound_0409|><|sound_0196|><|sound_0235|><|sound_0487|><|sound_0459|><|sound_0405|><|sound_end|>
This way the order of the output tokens matters, but the number of consecutively repeated tokens does not. We can treat upsampling the number of tokens as a separate problem. It might not matter to the decoder, because the whole token sequence gets cross-attention anyway and the repeated tokens might not add much information. During fine-tuning we can apply the same filtering to the WhisperVQ token stream and check whether performance changes. If repeated tokens don't impact performance, this also makes inference faster.
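For illustration, a tiny standard-library sketch of this collapsing step (the token strings mirror the example above):

```python
# Collapse runs of consecutive duplicate sound tokens while preserving order.
from itertools import groupby
import re

def dedupe_sound_tokens(seq: str) -> str:
    # Split "<|sound_0209|><|sound_0134|>..." into individual tokens,
    # then keep one token per run of consecutive duplicates.
    tokens = re.findall(r"<\|[^|]+\|>", seq)
    return "".join(tok for tok, _ in groupby(tokens))

original = (
    "<|sound_start|><|sound_0209|><|sound_0134|><|sound_0134|><|sound_0134|>"
    "<|sound_0241|><|sound_0222|><|sound_0239|><|sound_0329|><|sound_0197|>"
    "<|sound_0115|><|sound_0409|><|sound_0196|><|sound_0196|><|sound_0196|>"
    "<|sound_0196|><|sound_0196|><|sound_0235|><|sound_0487|><|sound_0487|>"
    "<|sound_0459|><|sound_0459|><|sound_0405|><|sound_end|>"
)
print(dedupe_sound_tokens(original))
# <|sound_start|><|sound_0209|><|sound_0134|><|sound_0241|><|sound_0222|>
# <|sound_0239|><|sound_0329|><|sound_0197|><|sound_0115|><|sound_0409|>
# <|sound_0196|><|sound_0235|><|sound_0487|><|sound_0459|><|sound_0405|><|sound_end|>
```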
Motivation
Ichigo v0.5 will support additional languages, which will make the traditional t2s pipeline obsolete. This is a good chance to introduce a t2s framework that we have full control over.
Goal
Be able to handle any arbitrary language
Methodology
The base model gets a <|text_to_semantic|> task token, and we add 512 sound tokens + 3 special tokens (start, end, mask) to its vocabulary, which is padded to [152,192](https://github.com/QwenLM/Qwen/issues/419) tokens for training speed optimization.
What needs to be done:
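For illustration, a sketch of this vocabulary extension with the Hugging Face API; the base checkpoint name and the exact spelling of the mask token are placeholders, not necessarily what Ichigo v0.5 uses:

```python
# Illustrative only: base checkpoint and the mask token's spelling are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "Qwen/Qwen2.5-0.5B"  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

new_tokens = (
    ["<|text_to_semantic|>"]                                   # task token
    + [f"<|sound_{i:04d}|>" for i in range(512)]               # 512 sound tokens
    + ["<|sound_start|>", "<|sound_end|>", "<|sound_mask|>"]   # start / end / mask
)
tokenizer.add_special_tokens({"additional_special_tokens": new_tokens})

# Pad the embedding matrix to a multiple of 128 so the final vocabulary size
# lands on a hardware-friendly number (the training speed optimization above).
model.resize_token_embeddings(len(tokenizer), pad_to_multiple_of=128)
```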
Experiments