Another questionable point is that the authors of Grad-TTS followed up with a paper with the same motivation as Matcha-TTS: Fast Grad-TTS. This paper did not receive any citation in Matcha-TTS, even though Fast Grad-TTS was released a year earlier. Looking at the MOS scoring table of Fast Grad-TTS, Fast Grad-TTS with the above-mentioned maximum-likelihood-based SDE solver can reach a MOS of 4.0 in real time on the CPU. However, the authors ignore this and only compare against Grad-TTS with Euler's method, and the results are obvious: Matcha-TTS outperforms Grad-TTS. But this is only part of the truth.
Hello! Thank you very much for your questions. However, there are a few misunderstandings here.
First, regarding the code implementation: we utilized the official source code, as mentioned in our paper, and followed the recommended settings (https://github.com/huawei-noah/Speech-Backbones/blob/7782c7a223dd25df47b4538698ee9756774e2924/Grad-TTS/model/diffusion.py#L272). We specifically made use of the solver available in the source code, with the stoc=False setting and a temperature value of 1.5 (0.667), as per the Grad-TTS paper's defaults and recommendations. It is worth noting that we empirically found synthesis from solving the SDEs to be less optimal, a point also acknowledged in the Grad-TTS paper. Additionally, we used the provided checkpoint, which was trained for 1.7 million iterations, whereas our model was trained for 2 x 500,000 = 1 million iterations.

We referred to both as 'Euler' solvers to denote their shared characteristic of being order-one solvers, implying one neural function evaluation (NFE) per timestep. We appreciate your suggestion about possibly using 'NFEs' for clarity. Maybe we should phrase it as NFEs, but since it is a relatively new term, we were not sure how widely it would be understood. We will think about it :)

Interestingly, Diff-VC pertains to voice conversion, whereas our primary focus for comparison was with the text-to-speech engine Grad-TTS and its official source code. Consequently, we did not explore this section of the repository. However, now that you have brought it to our attention, it does appear intriguing. It is worth mentioning that Lipman et al. (https://arxiv.org/pdf/2210.02747.pdf) have shown that OT-CFM follows simpler paths, given its continuous-normalising-flow nature, than stochastic diffusion paths, thereby arguing that it might still be better to use OT-CFM (update: it is better to use OT-CFM; see the comment below). However, comparing these for TTS was out of the scope of this research and could be an avenue for future investigations.
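For concreteness, synthesis was run roughly as follows (a minimal sketch; the call follows the GradTTS.forward interface in the official Speech-Backbones repository, and model, x, and x_lengths are assumed to be set up as in its inference.py; n_timesteps=10 is just an example budget):

```python
import torch

# Sketch of sampling from the official Grad-TTS model with the settings
# described above. One estimator call (NFE) is made per timestep.
with torch.no_grad():
    y_enc, y_dec, attn = model.forward(
        x, x_lengths,
        n_timesteps=10,    # first-order solver: 10 steps = 10 NFEs
        temperature=1.5,   # default recommended by the Grad-TTS paper
        stoc=False,        # deterministic ODE sampling rather than the SDE
    )
```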
Secondly, I think you have emphasised only half the motivation. It is important to highlight that Grad-TTS requires substantial GPU resources to train without hacks like spectrogram chopping. We cannot train Grad-TTS on a 24 GiB GPU (we used an NVIDIA RTX 3090, as mentioned in our paper) with a batch size of 32 without chopping the spectrogram (out_size=172 in the official code), while that is not the case with Matcha-TTS, which needs less than 10 GiB for full-spectrogram training. This difference, combined with relative positional embeddings in the text encoder, can affect prosody quality. Additionally, as depicted in Figure 2 of our paper, the 2D-convolution-based decoder in Grad-TTS scales badly with longer utterances, which influenced our design choices for Matcha-TTS so that we minimise this trade-off between quality and speed. (A sketch of the chopping trick follows below.)
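For reference, the chopping hack amounts to randomly cropping each target mel spectrogram to out_size frames before computing the decoder loss. A minimal sketch of the idea (illustrative names, not the repository's exact code):

```python
import torch

def chop_spectrograms(y: torch.Tensor, y_lengths: torch.Tensor,
                      out_size: int = 172) -> torch.Tensor:
    """Randomly crop a batch of mels (B, n_mels, T) to out_size frames.

    out_size=172 is the official Grad-TTS default, roughly 2 s of audio
    at 22.05 kHz with a 256-sample hop.
    """
    cropped = torch.zeros(y.size(0), y.size(1), out_size, device=y.device)
    for i, length in enumerate(y_lengths.tolist()):
        max_start = max(int(length) - out_size, 0)
        start = int(torch.randint(0, max_start + 1, (1,)))
        segment = y[i, :, start:start + out_size]
        cropped[i, :, :segment.size(-1)] = segment  # zero-pad short utterances
    return cropped
```

Matcha-TTS trains on full spectrograms, so no such cropping is needed.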
Third, I do not feel comfortable treating MOS as a single absolute metric to optimise for; it is subjective and depends largely on the experimental setup, participant demographics, the rating scale and choices used, etc. This is explained in great detail in (https://arxiv.org/pdf/2306.02044.pdf) and (https://openreview.net/pdf?id=bnVBXj-mOc). As stated in our paper, our evaluation centered on naturalness, where spectrogram chopping and relative positional embeddings impacted prosody quality. This effect was particularly noticeable because we evaluated with self-reported native English speakers, who are more sensitive to these differences than samples from the pool of all English speakers, as noted in Chiang et al.'s work (the first link).
Fourth, regarding "Fast Grad-TTS: Towards Efficient Diffusion-Based Speech Generation on CPU": as the title suggests, it argues for speed on CPU, which is a different research motivation altogether, and it does not have official source code available. As stated in our paper, all experiments were performed on GPUs; we did not formulate our experiments around synthesis on CPU. That said, empirically speaking, if you run the model on CPU (which the Hugging Face space does), it is still quite fast, because it is inherently fast thanks to both the mathematical formulation and the practical architecture. However, we agree this citation could be relevant, and we may consider it during the peer-review process.
So, finally, it is the architecture + OT-CFM (similar to rectified flow matching) that together bring the improvements in both prosody and synthesis speed (a summary of the OT-CFM construction is sketched below).
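For completeness, the OT-CFM construction from Lipman et al. that Matcha-TTS builds on can be summarised as follows (our paraphrase of their equations, with noise sample x_0 ~ N(0, I), data sample x_1, and a small constant sigma_min):

```latex
% Conditional OT path: a straight line from the noise x_0 to the data x_1
\phi_t(x_0) = \bigl(1 - (1 - \sigma_{\min})\,t\bigr)\, x_0 + t\, x_1,
\qquad t \in [0, 1]

% The target vector field is constant in t, which is what makes the learned
% ODE cheap to integrate accurately with very few solver steps:
u_t\bigl(\phi_t(x_0) \mid x_1\bigr) = x_1 - (1 - \sigma_{\min})\, x_0

% Training regresses the network v_\theta onto this target:
\mathcal{L}_{\mathrm{CFM}} = \mathbb{E}_{t,\,x_0,\,x_1}
  \bigl\lVert v_\theta\bigl(\phi_t(x_0), t\bigr)
  - \bigl(x_1 - (1 - \sigma_{\min})\, x_0\bigr) \bigr\rVert^2
```

Straight, non-crossing paths are the reason so few synthesis steps suffice, independently of the architecture.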
So, overall we would argue that the comparison was fair.
Hope this answers your questions and thank you so much for taking the time to read about our work.
Additionally, I tried using the maximum-likelihood (ML) based fast sampler to assess its efficacy. Could you confirm that this is the approach you are referring to? The results, focusing on 1 and 2 n_timesteps, are attached. These experiments were conducted using the sentences available on Matcha-TTS's demo webpage: https://shivammehta25.github.io/Matcha-TTS/#effect-of-the-number-of-ode-solver-steps
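Concretely, the experiment amounted to swapping in the maximum-likelihood reverse-SDE scheme when sampling. A hypothetical sketch of the call, assuming a DiffVC-style mode switch has been ported into the Grad-TTS decoder (argument names are illustrative; see the linked fork for the actual code):

```python
import torch

# Hypothetical invocation: `decoder`, `z`, `mask`, and `mu` follow Grad-TTS
# naming; `mode` selects the sampling scheme as in DiffVC's diffusion.py.
with torch.no_grad():
    y_ml = decoder.reverse_diffusion(z, mask, mu, n_timesteps=2, mode='ml')  # maximum likelihood
    y_em = decoder.reverse_diffusion(z, mask, mu, n_timesteps=2, mode='em')  # Euler-Maruyama baseline
```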
Based on these findings, we contend that Matcha-TTS consistently produces improved results with fewer iterations, underscoring its effectiveness and efficiency. We hope this clears up any further doubts. Further, we would like to thank you for pointing us in the direction of the maximum-likelihood-based sampler.
Raw waveforms are here: https://github.com/shivammehta25/Grad-TTS_Repo/tree/fast_ml_sampler/Grad-TTS/github_comment
To upload them to GitHub comments I had to convert them to .mov. So here they are:
https://github.com/shivammehta25/Matcha-TTS/assets/9089131/88bf3e2a-d449-42b7-ab5b-dc4e8445366f
https://github.com/shivammehta25/Matcha-TTS/assets/9089131/bca85ac6-3978-4dc5-8fce-10bee3600b59
https://github.com/shivammehta25/Matcha-TTS/assets/9089131/2484fac0-099f-4181-bca7-4162f1befa06
https://github.com/shivammehta25/Matcha-TTS/assets/9089131/05a194d1-4a25-49da-befa-449a98ec0485
https://github.com/shivammehta25/Matcha-TTS/assets/9089131/ec91f350-0729-400c-93cc-12aefbf64023
https://github.com/shivammehta25/Matcha-TTS/assets/9089131/0989e247-6197-4924-b286-e1f0a1789cac
> The comparison of Grad-TTS in the paper is unfair.
I disagree with this assertion. We are comparing to the latest official version of Grad-TTS, which is the one described in the paper we cite.
> It is worth noting that the code for this SDE Solver is open source under the same repository as Grad-TTS (It is implemented in https://github.com/huawei-noah/Speech-Backbones/blob/main/DiffVC/model/diffusion.py). The pre-trained model of Grad-TTS was used in the paper
The official Grad-TTS source code in the repository in question does not include the maximum-likelihood SDE solver. That solver is implemented for the DiffVC voice-conversion system, but not for Grad-TTS. What we are using is thus the latest version of Grad-TTS.
> it is unlikely that the authors did not notice this fact.
Since we are doing TTS, we did not look at the voice-conversion parts of the repository in question, so we were, in fact, unaware.
> the authors of Grad-TTS followed up with a paper with the same motivation as Matcha-TTS: Fast Grad-TTS. This paper did not receive any citation in Matcha-TTS, even though Fast Grad-TTS was released a year earlier.
Fast Grad-TTS appears to focus on fast synthesis on CPU, which is related but not identical to speed on GPU. Unfortunately, it does not appear to have code or pre-trained models available. The official Fast Grad-TTS paper links to https://fast-grad-tts.github.io/, which as of today states "code will be made publicly available shortly".
> Looking at the MOS scoring table of Fast Grad-TTS, Fast Grad-TTS with the above-mentioned maximum-likelihood-based SDE solver can reach a MOS of 4.0 in real time on the CPU.
It is not possible to talk about MOS values as absolute quantities like this. We give several references in our preprint that explain why not (and would have cited more if space had not been so tight).
For a demonstration of the hazards of direct comparison, compare the 4.11±0.07 MOS of the Glow-TTS system in the original Grad-TTS paper to the 3.35±0.10 MOS achieved by the same system in the Fast Grad-TTS paper listening test. Note that the confidence intervals are non-overlapping and far apart, despite being for the exact same system.
Hopefully, this tells you the "full truth". I am closing this issue for now, feel free to reopen it in case you want to continue the discussion. Thank you very much for such an interesting discussion. :)
> the authors of Grad-TTS followed up with a paper with the same motivation as Matcha-TTS: Fast Grad-TTS. This paper did not receive any citation in Matcha-TTS
Our recently updated arXiv preprint (which also is the camera-ready version of the Matcha-TTS paper due to appear at IEEE ICASSP) now cites the Fast Grad-TTS Interspeech paper. Thank you for letting us know about that work.
The comparison of Grad-TTS in the paper is unfair. The authors of Grad-TTS have published a follow-up paper which includes a maximum-likelihood-based SDE solver (from "Diffusion-Based Voice Conversion with Fast Maximum Likelihood Sampling Scheme", https://arxiv.org/abs/2109.13821). This SDE solver can improve performance with a small number of decoding steps. It is worth noting that the code for this SDE solver is open source in the same repository as Grad-TTS (it is implemented in https://github.com/huawei-noah/Speech-Backbones/blob/main/DiffVC/model/diffusion.py). The pre-trained model of Grad-TTS was used in the paper, and it is unlikely that the authors did not notice this fact. However, the paper ended up using Euler's method for the discretization of the Grad-TTS inverse process.
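For context, "Euler's method" here is the first-order integration of the reverse (probability-flow) process in the official diffusion.py. A rough paraphrase of the stoc=False branch (not the exact repository code; beta_min/beta_max are the Grad-TTS defaults):

```python
import torch

def euler_reverse(estimator, z, mask, mu, n_timesteps,
                  beta_min=0.05, beta_max=20.0):
    """First-order (Euler) integration of Grad-TTS's probability-flow ODE.

    One estimator call (NFE) per timestep; `z` is the terminal noise drawn
    around the encoder output `mu`.
    """
    h = 1.0 / n_timesteps
    xt = z * mask
    for i in range(n_timesteps):
        # Integrate backwards in time, evaluating at the step midpoint.
        t = (1.0 - (i + 0.5) * h) * torch.ones(z.size(0), device=z.device)
        beta_t = beta_min + (beta_max - beta_min) * t[:, None, None]  # linear schedule
        dxt = 0.5 * (mu - xt - estimator(xt, mask, mu, t)) * beta_t * h
        xt = (xt - dxt) * mask
    return xt
```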
Here's a tweet from the Grad-TTS authors urging people not to use Euler's method for the Grad-TTS inverse process: