janhq / WhisperSpeech

An Open Source text-to-speech system built by inverting Whisper.
https://collabora.github.io/WhisperSpeech/
MIT License

task: Train and test text2semantic under a decoder-only framework for ichigo v0.5 #3

Open tikikun opened 4 days ago

tikikun commented 4 days ago

Motivation

Since ichigo v0.5 will support additional languages, the traditional t2s pipeline will become obsolete. This is a good chance to introduce a t2s framework that we have full control over.

Goal

Be able to handle any arbitrary language

Methodology

What needs to be done:

Experiments

| Run ID | Date | Model Config | Dataset | Learning Rate | Batch Size | Steps | Loss | Hardware |
|---|---|---|---|---|---|---|---|---|
| exp-t2s-0.5B | 2024-11-28 | Full-Finetune | Instruction text to sound semantic tokens | 1e-3 | 96 | 28810 | 1.6-1.7 | ~4 hours on 2xH100 |
| exp-t2s-1.5B-1 | 2024-11-29 | Full-Finetune | Instruction text to sound semantic tokens | 1e-3 | 84 | 28810 | 2.64 | ~10 hours on 6xA6000 |
| exp-t2s-1.5B-2 | 2024-11-30 | Full-Finetune | Instruction text to sound semantic tokens | 1e-4 | 84 | 28810 | 1.84 | ~10 hours on 6xA6000 |
| exp-t2s-llama3.2-1B | 2024-12-01 | Full-Finetune | Instruction text to sound semantic tokens | 1e-4 | 96 | 25208 | 1.73 | ~9 hours on 6xA6000 |

bachvudinh commented 3 days ago

(image attached)

dan-homebrew commented 3 days ago

Need to add more details to this issue:

Please help me align the nomenclature, etc. @tikikun's diagram above is very helpful.

hahuyhoang411 commented 1 day ago

I moved the table to the top for better visualization. cc @bachvudinh

PodsAreAllYouNeed commented 14 minutes ago

This task is a hybrid between text-to-speech and speech-to-speech translation. It is quite hard because there is a one-to-many mapping between the input text and the possible output token combinations.

Here are two papers that use the same AR setting, but for slightly different tasks. I think their approach can be adapted.

AudioPaLM: https://arxiv.org/pdf/2306.12925
VALL-E: https://arxiv.org/pdf/2301.02111

Specifically, I think we can use VALL-E's idea of applying a phoneme conversion layer before sending the text into the AR model; this might bridge the gap to the semantic embeddings a bit and make the AR model's job easier. We also need to somehow provide auxiliary information about the expected acoustic ground truth. Otherwise, if we provide only text to the AR model, there are too many possible correct answers, so across multiple samples the loss may conflict.
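As an illustration of the phoneme front-end idea (my own sketch, not something from the papers or this repo), the text could be run through a grapheme-to-phoneme step before the AR model, e.g. with the third-party g2p_en package; the phoneme vocabulary here is a made-up placeholder.

```python
# Hedged sketch: convert input text to phonemes before the AR model, in the
# spirit of VALL-E's phoneme front-end. Uses the third-party g2p_en package
# as one possible grapheme-to-phoneme converter; the vocabulary handling is
# a placeholder, not part of this repo.
from g2p_en import G2p

g2p = G2p()

def text_to_phoneme_ids(text, phoneme_vocab):
    """Map text -> phoneme symbols -> integer IDs for the AR model."""
    phonemes = [p for p in g2p(text) if p != " "]  # e.g. ['HH', 'AH0', 'L', 'OW1', ...]
    return [phoneme_vocab.setdefault(p, len(phoneme_vocab)) for p in phonemes]

vocab = {}
print(text_to_phoneme_ids("Hello world", vocab))
```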

However, I think it will be hard to make this work. The AR model needs a better constraint.

My proposal

In the WhisperSpeech framework, the text-to-semantic model is the inverse of the Whisper decoder. We need to involve the Whisper decoder in the training.

1) Keep the same AR model structure.
2) However, instead of trying to get the model to predict the WhisperVQ codes, send continuous embeddings into the frozen Whisper decoder. What we are trying to do is get the AR decoder model to trick the Whisper decoder into thinking it is seeing output from the Whisper encoder.
3) Compute the loss of the Whisper decoder output against the original text.

You will hit a practical challenge: while training this AR decoder model, it is effectively acting as a NAR encoder model from the Whisper decoder's point of view. There might be a smart way to solve this, but I can't think of one at the moment; alternatively, you can just use a NAR model.
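For concreteness, here is a rough sketch of how steps 2-3 could be wired up with Hugging Face's WhisperForConditionalGeneration; `ar_model` is a hypothetical module and the tokenization details are glossed over, so treat it as an illustration of the idea rather than a working recipe.

```python
# Rough sketch of steps 2-3 above (my interpretation, not an existing
# implementation). `ar_model` is a hypothetical module that maps text tokens
# to continuous embeddings shaped like Whisper encoder states.
from transformers import WhisperForConditionalGeneration

whisper = WhisperForConditionalGeneration.from_pretrained("openai/whisper-base")
for p in whisper.parameters():
    p.requires_grad = False  # the Whisper decoder stays frozen; its encoder is bypassed

def training_step(ar_model, whisper_text_ids):
    # Step 2: the AR model emits "fake" encoder states, shape (batch, seq, d_model),
    # meant to fool the frozen Whisper decoder via cross-attention.
    fake_encoder_states = ar_model(whisper_text_ids)

    # Step 3: feed them in place of real encoder outputs and score the decoder's
    # reconstruction of the original transcript (labels are Whisper-tokenized text).
    out = whisper(
        encoder_outputs=(fake_encoder_states,),
        labels=whisper_text_ids,
    )
    return out.loss  # gradients flow only into ar_model
```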

Another (Simpler) Idea

If we really want an AR model trained with next-token prediction, must keep the WhisperVQ tokens in their current format, and don't want to add auxiliary information, we can try a simple intervention: group identical consecutive WhisperVQ tokens together. This way, the model is not penalized for getting the output length wrong.

i.e. this original example: <|sound_start|><|sound_0209|><|sound_0134|><|sound_0134|><|sound_0134|><|sound_0241|><|sound_0222|><|sound_0239|><|sound_0329|><|sound_0197|><|sound_0115|><|sound_0409|><|sound_0196|><|sound_0196|><|sound_0196|><|sound_0196|><|sound_0196|><|sound_0235|><|sound_0487|><|sound_0487|><|sound_0459|><|sound_0459|><|sound_0405|><|sound_end|>

gets mapped to this: <|sound_start|><|sound_0209|><|sound_0134|><|sound_0241|><|sound_0222|><|sound_0239|><|sound_0197|><|sound_0115|><|sound_0409|><|sound_0196|><|sound_0235|><|sound_0487|><|sound_0459|><|sound_0405|><|sound_end|>

This way, the order of the output tokens matters, but the number of consecutively repeated tokens does not. We can worry about upsampling the number of tokens as a separate problem. It might not matter to the decoder, because the whole token sequence gets cross-attention anyway, and the repeated tokens might not be adding that much information. During fine-tuning we can apply a similar filtering to the WhisperVQ token stream to see whether performance changes. If repeated tokens don't impact performance, this also makes inference faster.
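A small sketch of the proposed collapsing, just to make the intervention concrete (the helper name is mine, not project code):

```python
# Collapse consecutive repeated WhisperVQ tokens while keeping their order.
import itertools

def collapse_repeats(tokens):
    """Keep one token per run of consecutive duplicates."""
    return [tok for tok, _ in itertools.groupby(tokens)]

seq = ["<|sound_start|>", "<|sound_0209|>", "<|sound_0134|>", "<|sound_0134|>",
       "<|sound_0134|>", "<|sound_0241|>", "<|sound_end|>"]
print(collapse_repeats(seq))
# ['<|sound_start|>', '<|sound_0209|>', '<|sound_0134|>', '<|sound_0241|>', '<|sound_end|>']
```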