thu-ml / Bridge-TTS

Official codebase for "Schrodinger Bridges Beat Diffusion Models on Text-to-Speech Synthesis" (https://arxiv.org/abs/2312.03491).
MIT License

Interesting Work. Hope for releasing of code/model. #4

Open ChangeFWorld opened 5 months ago

ChangeFWorld commented 5 months ago

I'm curious whether this method can generate different samples. Since the text distribution is fixed, it seems likely to generate the same sample whenever the text input doesn't change.

ChangeFWorld commented 4 months ago

Any plans for a code release?

rubbybbs commented 2 months ago

Hi, thank you for your interest in our work, and apologies for the late reply. In practice, we observe that sampling with the Bridge SDE generally produces the same level of diversity as Grad-TTS. However, the diversity of both methods is subtle and shows up mainly in voice quality (e.g., whether there is an artifact) rather than in variations of tempo or prosody. In my opinion, this phenomenon is largely due to the one-to-one mapping in the training data used, and unfixing the text distribution as in Grad-TTS cannot fundamentally address the diversity problem.

On the other hand, fixing the text distribution is not a requirement of Bridge-TTS. It is a flexible practical choice that can be changed across different tasks/datasets.

Regarding the code, we plan to release it upon acceptance. Sorry for the inconvenience.
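
For readers following the diversity question, here is a toy sketch of why a fixed starting point does not by itself force identical outputs. It is not the Bridge-TTS sampler and not the authors' code: the function `sample_from_fixed_start`, the simple drift `target - x`, and the values of `noise_scale` are all assumptions chosen for illustration. It integrates a generic SDE with Euler-Maruyama from the same start point under two seeds, then repeats with the noise term switched off (a deterministic, ODE-like sampler).

```python
import math
import random


def sample_from_fixed_start(x_start, target, noise_scale, n_steps=100, seed=0):
    """Euler-Maruyama integration of dx = (target - x) dt + noise_scale dW on [0, 1].

    Hypothetical toy dynamics: `target` stands in for whatever a trained
    network would steer toward; it is not the Bridge-TTS drift.
    """
    rng = random.Random(seed)
    dt = 1.0 / n_steps
    x = list(x_start)
    for _ in range(n_steps):
        x = [
            xi + (ti - xi) * dt + noise_scale * math.sqrt(dt) * rng.gauss(0.0, 1.0)
            for xi, ti in zip(x, target)
        ]
    return x


x_start = [0.0] * 4  # stand-in for the fixed, text-derived starting point
target = [1.0] * 4   # stand-in for the data the sampler moves toward

# Stochastic sampling: same start, different seeds.
sde_a = sample_from_fixed_start(x_start, target, noise_scale=0.5, seed=1)
sde_b = sample_from_fixed_start(x_start, target, noise_scale=0.5, seed=2)

# Deterministic sampling: the noise term is zero, so the seed has no effect.
ode_a = sample_from_fixed_start(x_start, target, noise_scale=0.0, seed=1)
ode_b = sample_from_fixed_start(x_start, target, noise_scale=0.0, seed=2)

print("SDE runs differ:", sde_a != sde_b)
print("ODE runs match:", ode_a == ode_b)
```

In this toy setting the variation comes entirely from the noise injected during sampling, not from the starting point. It says nothing about how large or perceptually meaningful that variation is in a real TTS model, which is the point made in the reply above.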