shivammehta25 / Matcha-TTS

[ICASSP 2024] 🍵 Matcha-TTS: A fast TTS architecture with conditional flow matching
https://shivammehta25.github.io/Matcha-TTS/
MIT License

Information regarding Comparison with Grad-TTS #1

Closed. WelkinYang closed this issue 1 year ago.

WelkinYang commented 1 year ago

The comparison of Grad-TTS in the paper is unfair. The authors of Grad-TTS published a follow-up paper that includes a maximum-likelihood-based SDE solver (from "Diffusion-Based Voice Conversion with Fast Maximum Likelihood Sampling Scheme", https://arxiv.org/abs/2109.13821). This SDE solver improves performance when the number of decoding steps is small. It is worth noting that the code for this SDE solver is open source in the same repository as Grad-TTS (it is implemented in https://github.com/huawei-noah/Speech-Backbones/blob/main/DiffVC/model/diffusion.py). The pre-trained Grad-TTS model was used in the paper, and it is unlikely that the authors did not notice this fact. Nevertheless, the paper ended up using Euler's method to discretise the Grad-TTS reverse process.
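
For reference, here is a minimal sketch of the kind of first-order Euler scheme under discussion (stand-in names and constants; see the official repository for the real implementation). It steps the Grad-TTS probability-flow ODE backwards from noise to data:

```python
import torch

def euler_reverse_diffusion(xt, mu, score_fn, n_steps, beta_min=0.05, beta_max=20.0):
    """Step the Grad-TTS reverse process from t=1 (noise) to t=0 (data) with Euler."""
    h = 1.0 / n_steps
    for i in range(n_steps):
        t = 1.0 - (i + 0.5) * h                        # midpoint of the current step
        beta_t = beta_min + t * (beta_max - beta_min)  # linear noise schedule
        # probability-flow drift: 0.5 * (mu - x - score) * beta_t
        dxt = 0.5 * (mu - xt - score_fn(xt, t)) * beta_t * h
        xt = xt - dxt
    return xt

# Toy usage with a stand-in score function that returns zeros:
z = torch.randn(1, 80, 100)    # noise shaped like an 80-bin mel-spectrogram
mu = torch.zeros_like(z)       # text-conditioned prior mean from the encoder
mel = euler_reverse_diffusion(z, mu, lambda x, t: torch.zeros_like(x), n_steps=10)
```

As I understand it, the maximum-likelihood solver instead derives the step coefficients to maximise the likelihood of the discretised reverse trajectory, which is what helps when the number of steps is small.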

Here is a tweet from the Grad-TTS authors urging people not to use Euler's method for the Grad-TTS reverse process: [screenshot of the tweet]

WelkinYang commented 1 year ago

Another questionable point is that the authors of Grad-TTS followed up with a paper with the same motivation as Matcha-TTS: Fast Grad-TTS. That paper did not receive a single citation in Matcha-TTS, even though Fast Grad-TTS was released a year earlier. Looking at the MOS table of Fast Grad-TTS, Fast Grad-TTS with the above-mentioned maximum-likelihood-based SDE solver can reach a MOS of 4.0 in real time on CPU. However, the authors ignore this and only compare against Grad-TTS with Euler's method. The result is obvious: Matcha-TTS outperforms Grad-TTS, but this is only part of the truth.

shivammehta25 commented 1 year ago

Hello! Thank you very much for your questions. However, there are a few misunderstandings here.

So finally, it is the architecture + OT-CFM (similar to rectified flow matching) that brings the improvements in both prosody and synthesis speed.
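
For concreteness, here is a minimal sketch of the OT-CFM training objective (illustrative names such as `vector_field` and `SIGMA_MIN`, not the actual Matcha-TTS source): the decoder regresses a vector field onto the constant velocity of a straight path between a noise sample and the data.

```python
import torch

SIGMA_MIN = 1e-4  # small variance floor, as in the flow-matching literature

def ot_cfm_loss(vector_field, x1):
    """x1: batch of target mels; vector_field(x_t, t) predicts a velocity."""
    x0 = torch.randn_like(x1)                             # sample from the prior
    t = torch.rand(x1.shape[0], *([1] * (x1.dim() - 1)))  # one t per example
    xt = (1 - (1 - SIGMA_MIN) * t) * x0 + t * x1          # point on the straight path
    ut = x1 - (1 - SIGMA_MIN) * x0                        # constant target velocity
    return torch.mean((vector_field(xt, t) - ut) ** 2)
```

Because the target velocity is constant along each path, the learned ODE can be integrated accurately in very few steps, which is where the synthesis-speed gain comes from.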

So, overall, we would argue that the comparison was fair.
We hope this answers your questions, and thank you very much for taking the time to read about our work.

shivammehta25 commented 1 year ago

Additionally, I tried using the ML-based fast sampler to assess its efficacy. Could you confirm that this is the approach you are referring to? The results, focusing on n_timesteps values of 1 and 2, are attached. These experiments used the sentences available on the Matcha-TTS demo webpage: https://shivammehta25.github.io/Matcha-TTS/#effect-of-the-number-of-ode-solver-steps
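
To make the n_timesteps setting concrete, here is a sketch (assumed names, not the repository's API) of the Euler ODE solve used at synthesis time; with n_timesteps=1 it reduces to a single decoder call:

```python
import torch

def euler_ode_synthesis(vector_field, x0, n_steps):
    """Integrate dx/dt = v(x, t) from t=0 (noise) to t=1 (mel) with Euler steps."""
    h = 1.0 / n_steps
    x = x0
    for i in range(n_steps):
        x = x + h * vector_field(x, i * h)  # one Euler step of length h
    return x

# n_steps=2 corresponds to the two-step condition in the attached results
x0 = torch.randn(1, 80, 100)
mel = euler_ode_synthesis(lambda x, t: torch.zeros_like(x), x0, n_steps=2)
```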

Based on these findings, we contend that Matcha-TTS consistently produces better results with fewer iterations, underscoring its effectiveness and efficiency. We hope this clears up any remaining doubts. Finally, we would like to thank you for pointing us in the direction of the maximum-likelihood-based sampler.

Raw waveforms are here: https://github.com/shivammehta25/Grad-TTS_Repo/tree/fast_ml_sampler/Grad-TTS/github_comment

To upload them to GitHub comments, I had to convert them to .mov. So here they are:

https://github.com/shivammehta25/Matcha-TTS/assets/9089131/88bf3e2a-d449-42b7-ab5b-dc4e8445366f

https://github.com/shivammehta25/Matcha-TTS/assets/9089131/bca85ac6-3978-4dc5-8fce-10bee3600b59

https://github.com/shivammehta25/Matcha-TTS/assets/9089131/2484fac0-099f-4181-bca7-4162f1befa06

https://github.com/shivammehta25/Matcha-TTS/assets/9089131/05a194d1-4a25-49da-befa-449a98ec0485

https://github.com/shivammehta25/Matcha-TTS/assets/9089131/ec91f350-0729-400c-93cc-12aefbf64023

https://github.com/shivammehta25/Matcha-TTS/assets/9089131/0989e247-6197-4924-b286-e1f0a1789cac

ghenter commented 1 year ago

The comparison of Grad-TTS in the paper is unfair.

I disagree with this assertion. We are comparing to the latest official version of Grad-TTS, which is the one described in the paper we cite.

It is worth noting that the code for this SDE solver is open source in the same repository as Grad-TTS (it is implemented in https://github.com/huawei-noah/Speech-Backbones/blob/main/DiffVC/model/diffusion.py). The pre-trained Grad-TTS model was used in the paper

The official Grad-TTS source code in the repository in question does not include the maximum-likelihood SDE solver. That solver is implemented for the DiffVC voice-conversion system, but not for Grad-TTS. What we are using is thus the latest version of Grad-TTS.

it is unlikely that the authors did not notice this fact.

Since we are doing TTS, we did not look at the voice-conversion parts of the repository in question, so we were, in fact, unaware.

the authors of Grad-TTS followed up with a paper with the same motivation as Matcha-TTS: Fast Grad-TTS. That paper did not receive a single citation in Matcha-TTS, even though Fast Grad-TTS was released a year earlier.

Fast Grad-TTS appears to focus on fast synthesis on CPU, which is related to, but not identical to, speed on GPU. Unfortunately, it does not appear to have code or pre-trained models available. The official Fast Grad-TTS paper links to https://fast-grad-tts.github.io/, which as of today states that "code will be made publicly available shortly".

Looking at the MOS table of Fast Grad-TTS, Fast Grad-TTS with the above-mentioned maximum-likelihood-based SDE solver can reach a MOS of 4.0 in real time on CPU.

It is not possible to treat MOS values as absolute quantities like this. We give several references in our preprint that explain why not (and would have cited more had space not been so tight).

For a demonstration of the hazards of direct comparison, compare the 4.11±0.07 MOS of the Glow-TTS system in the original Grad-TTS paper to the 3.35±0.10 MOS achieved by the same system in the Fast Grad-TTS listening test. The confidence intervals (4.04 to 4.18 versus 3.25 to 3.45) are non-overlapping and far apart, despite being for the exact same system.

shivammehta25 commented 1 year ago

Hopefully, this tells you the "full truth". I am closing this issue for now; feel free to reopen it if you want to continue the discussion. Thank you very much for such an interesting discussion. :)

ghenter commented 10 months ago

the authors of Grad-TTS followed up with a paper with the same motivation as Matcha-TTS: Fast Grad-TTS. That paper did not receive a single citation in Matcha-TTS

Our recently updated arXiv preprint (which is also the camera-ready version of the Matcha-TTS paper, due to appear at IEEE ICASSP) now cites the Fast Grad-TTS Interspeech paper. Thank you for letting us know about that work.