janhq / ichigo

Local realtime voice AI
Apache License 2.0

research: Possibility of a breakthrough in synthetic data generation using Flow Matching #140

tikikun opened this issue 10 hours ago

tikikun commented 10 hours ago

Overall

We can significantly improve the quality of synthetic multi-modal datasets by using Flow Matching with Optimal Transport.

Context

Currently we use the autoregressive model from WhisperSpeech, specifically the t2s model, to generate synthetic datasets.

Theoretical Details

The T2S model (Text-to-Semantics) predicts sound tokens from text tokens to generate synthetic data. This problem can be framed as:

"Transforming a distribution of text embeddings into synthetic sound token embeddings."

Alternatively, it can be stated as:

We address a sequence-to-sequence embedding generation task. Given a source sequence:

$$ w_x = \{w_x^1, \ldots, w_x^M\}, \quad \text{of length } M, $$

we aim to develop a generative model that produces a target sequence:

$$ w_y = \{w_y^1, \ldots, w_y^N\}, \quad \text{of length } N, $$

conditioned on the source sequence.
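For concreteness, the conditional flow matching objective this framing leads to can be written as follows, with $x_1$ the target sound-token embeddings, $c$ an encoding of the source sequence $w_x$, and $v_\theta$ the learned velocity field (the notation here is a sketch of mine, not fixed by any existing model):

$$ \mathcal{L}_{\text{CFM}}(\theta) = \mathbb{E}_{t,\, x_0,\, (x_1, c)} \left\| v_\theta(x_t, t, c) - (x_1 - x_0) \right\|^2, \quad x_t = (1 - t)\, x_0 + t\, x_1, $$

where $t \sim \mathcal{U}[0,1]$ and $x_0 \sim \mathcal{N}(0, I)$; the straight-line interpolation path is the conditional optimal-transport choice used by F5-TTS-style models.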

Empirical results, such as those in the F5-TTS paper, demonstrate that flow matching models solve this problem efficiently, with high accuracy and low resource requirements. The approach also sidesteps issues inherent to autoregressive generation of synthetic data, such as error accumulation over long sequences and slow token-by-token sampling.
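As a concrete illustration, here is a minimal OT-CFM training step in PyTorch. `v_theta` is a hypothetical velocity-field network (any Transformer over the interpolated embeddings would do); nothing here is existing ichigo or WhisperSpeech code:

```python
# Minimal sketch of an optimal-transport conditional flow matching (OT-CFM)
# training step. `v_theta` is a hypothetical placeholder model.
import torch
import torch.nn as nn

def cfm_loss(v_theta: nn.Module, x1: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
    """One OT-CFM training step.

    x1:   target sound-token embeddings, shape (B, N, D)
    cond: text conditioning, e.g. encoded w_x, shape (B, M, D)
    """
    x0 = torch.randn_like(x1)                           # noise sample from the prior
    t = torch.rand(x1.size(0), 1, 1, device=x1.device)  # random time in [0, 1]
    xt = (1 - t) * x0 + t * x1                          # straight-line (OT) interpolation path
    u = x1 - x0                                         # target velocity along that path
    v = v_theta(xt, t.squeeze(), cond)                  # model predicts the velocity field
    return ((v - u) ** 2).mean()                        # regress predicted onto target velocity
```

At inference time, one samples $x_0$ from the prior and integrates the learned velocity field from $t=0$ to $t=1$ with an ODE solver (e.g. a few Euler steps), which costs a fixed, small number of network calls instead of one call per generated token.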

With this approach we may be able to produce novel results and significantly improve Ichigo's performance.

Next Steps

hahuyhoang411 commented 6 hours ago

This could be related: https://github.com/dongzhuoyao/flowseq/tree/main

PodsAreAllYouNeed commented 5 hours ago

You can use a continuous flow matching model to train what is essentially a text-based autoencoder. The specific architecture should probably be conditional flow matching, with the text as the condition. The length of the generation can be set with something as simple as a words-per-second heuristic. The decoder will be the frozen Whisper decoder. The goal is a self-supervised text-to-text roundtrip through the CFM model and the decoder (see the sketch below). No guarantee that this will work at all, but it would be damn interesting if it does. I think it has a chance of working because we're distilling information from the Whisper decoder, which is a strong model.
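A rough sketch of that roundtrip, assuming PyTorch and Hugging Face transformers for the frozen Whisper decoder. `cfm` and its `.sample()` method are hypothetical stand-ins for a trained conditional flow matching model, and both constants are illustrative guesses, not values from this thread:

```python
# Hedged sketch of the self-supervised text-to-text roundtrip.
# `cfm.sample()` and the projection layer are illustrative assumptions,
# not existing ichigo/WhisperSpeech components.
import torch
import torch.nn as nn
from transformers import WhisperForConditionalGeneration, WhisperTokenizer

WORDS_PER_SECOND = 2.5    # crude speech-rate heuristic for generation length
FRAMES_PER_SECOND = 50    # assumed frame rate of the generated embeddings

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-small")
whisper = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
whisper.requires_grad_(False)  # the Whisper decoder stays frozen

def roundtrip_loss(cfm: nn.Module, proj: nn.Linear, text: str) -> torch.Tensor:
    """Text -> CFM pseudo-audio embeddings -> frozen Whisper decoder -> text."""
    cond = tokenizer(text, return_tensors="pt").input_ids
    # words-per-second heuristic fixes the length of the generated sequence
    n_frames = int(len(text.split()) / WORDS_PER_SECOND * FRAMES_PER_SECOND)
    pseudo_audio = cfm.sample(cond, length=n_frames)   # (1, n_frames, D), hypothetical API
    # project into the space the frozen decoder cross-attends to
    enc_states = proj(pseudo_audio)                    # (1, n_frames, d_model)
    out = whisper(encoder_outputs=(enc_states,), labels=cond)
    return out.loss  # reconstructing the input text distills the decoder
```

Here `proj` would be `nn.Linear(D, whisper.config.d_model)`. The open question is backpropagating through `cfm.sample`, i.e. through the ODE solve; keeping it to a few Euler steps would make that tractable.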


If it works, it means we will be able to train a T2S model on all the languages supported by Whisper, without the need for any audio data. All we need is some multilingual text data.

PodsAreAllYouNeed commented 5 hours ago

Also, check out this repo

https://github.com/lucidrains/voicebox-pytorch

It has a good implementation of the CFM model that is relatively easy to read. I've used it before in my work.

It also links to Spear-TTS, the precursor to WhisperSpeech. E2-TTS and later F5-TTS may have built on top of it as well.

hahuyhoang411 commented 5 hours ago

Oh he also has: https://github.com/lucidrains/e2-tts-pytorch

thanks lucidrains