shivammehta25 / Matcha-TTS

[ICASSP 2024] 🍵 Matcha-TTS: A fast TTS architecture with conditional flow matching
https://shivammehta25.github.io/Matcha-TTS/
MIT License

Matcha synthesised audio prosody does not seem reflective of the paper #23

Closed: shreyasinghal-17 closed this issue 10 months ago

shreyasinghal-17 commented 11 months ago

Output from TransformerTTS (FastPitch/FastSpeech 2 based):

text: "नमस्ते, मैं बजाज आलियांज़ जनरल इंश्योरेंस की ओर से स्वाति बोल रही हूँ, क्या आप से बात करने के लिए ये समय सही है?"

https://github.com/shivammehta25/Matcha-TTS/assets/70097551/af4b66aa-4069-411e-aec9-5e299360fb56

Output from Matcha-TTS (speech rate 0.90):

text: "नमस्ते, मैं बजाज आलियांज़ जनरल इंश्योरेंस की ओर से स्वाति बोल रही हूँ, क्या आप से बात करने के लिए ये समय सही है."

https://github.com/shivammehta25/Matcha-TTS/assets/70097551/7fb7d48a-1755-4837-8cff-8b7510579c39

Output from Matcha-TTS (speech rate 0.90):

text: "नमस्ते, मैं बजाज आलियांज़ जनरल इंश्योरेंस की ओर से स्वाति बोल रही हूँ, क्या आप से बात करने के लिए ये समय सही है?"

https://github.com/shivammehta25/Matcha-TTS/assets/70097551/ed441008-26a8-4fec-ac67-5cb36388cf3a

Please share your opinion.

ghenter commented 11 months ago

Matcha synthesised audio prosody does not seem reflective of the paper

When you say "of the paper", are you referring to one or more specific written formulations in the paper on arXiv, or to particular audio examples of Matcha-TTS online?

Please share your opinion.

I do not speak any Indic languages, so I am not well placed to assess the prosody of the examples you shared.

In general, it seems like there are two questions you might be asking, and I am not sure which one in particular you are looking for an answer to:

shreyasinghal-17 commented 10 months ago

Thanks for your detailed response.

1.) The text front end (espeak) supports Indic languages well; also, the Transformer-based models leverage just that and no additional features.

2.) I reckon the deterministic nature of duration prediction is a viable explanation for this. May I ask why a stochastic duration model was not used for DDPMs like Matcha?

By "prosody" I meant: since prosody seems to have the largest effect(?) on synthesis quality, how is it that such a large difference in MOS could have been observed between FastSpeech 2 and Matcha trained on LJ Speech?

Could it be that the hyperparameters are not that suitable for my dataset? My dataset is 10 hours of speech plus another 8 hours of augmented data derived from the original.

ghenter commented 10 months ago

I am now back from a trip abroad and can respond to your points here.

1.) The text front end (espeak) supports Indic languages well; also, the Transformer-based models leverage just that and no additional features.

When you say "espeak", do you mean espeak or espeak-ng? My understanding is that they are different systems, with Matcha-TTS using espeak-ng for its text processing. If there is a front end out there that uses espeak, I am not aware of it, but if the Transformer TTS system in question is using that, it could lead to different results.

2.) I reckon the deterministic nature of duration prediction is a viable explanation for this. May I ask why a stochastic duration model was not used

There are several reasons for this:

for DDPMs like Matcha?

To be very clear, Matcha-TTS uses OT-CFM, not DDPMs. For one thing, OT-CFM uses ODEs (continuous time), whereas my understanding is that DDPMs are discrete-time stochastic processes.
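To make the contrast concrete, here is a minimal (untested) sketch of the kind of fixed-step Euler sampling used to solve such an ODE; `vector_field` is a placeholder for the trained network:

```python
import torch

def euler_ode_sample(vector_field, x0: torch.Tensor, n_steps: int = 10) -> torch.Tensor:
    """Integrate dx/dt = v(x, t) from t = 0 (noise) to t = 1 (data).

    Unlike DDPM ancestral sampling, each update is deterministic given x0:
    no fresh noise is injected at any step.
    """
    x, dt = x0, 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((x.shape[0],), i * dt)  # current time, one value per batch item
        x = x + dt * vector_field(x, t)        # one Euler step along the learned field
    return x

# Toy usage (a real model would supply a trained network as the vector field):
# x1 = euler_ode_sample(lambda x, t: -x, torch.randn(4, 80, 100))
```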

By "prosody" I meant: since prosody seems to have the largest effect(?) on synthesis quality

Here I must ask what you mean by "quality"? People assign many different meanings to that word, but the answer to your question hinges on what definition you are using in this case.

Do you mean:

a. segmental signal quality (e.g., CD quality vs. AM radio)?
b. "naturalness" (e.g., human-like intonation vs. robotic intonation)?
c. a more applied definition of "quality", such as "the mean opinion score I get if I ask people to 'please rate the quality of these sentences on a scale from 1 to 5'"?

Measure a is in principle not affected by prosody at all, whereas measure b is strongly affected by it. Conversely, measure a can be strongly affected by the vocoder/signal generator used, whereas b is virtually unaffected by it. However, it is difficult to design listening tests that measure only a in isolation, ignoring the effect of b, or vice versa.

If you mean option c, I would say that the picture is more complex. I think (without much empirical evidence to back it up) that mispronunciations, if/when present, probably have the biggest impact on MOS ratings, at least on individual stimuli. Beyond that it gets complicated. The relative effects of segmental quality (a above) and prosody (b above) will depend on the quality and prosodic richness of the database (e.g., read versus spontaneous speech), on what question listeners are asked, and much more.

In the Matcha-TTS paper, we asked listeners "How natural does the synthesised speech sound?" This is quite open to interpretation by each individual listener, but I would expect this question formulation to give more weight to measure b (prosody) relative to a (segmental quality) than if we had asked about the "quality" of the speech instead. However, I cannot cite research off the top of my head to support this belief. Recent research of ours has found that numerical scores on MOS tests differ depending on which of these two questions is asked, but disentangling the relative effects of different speech properties was not a focus of that work.

how is it that such a large difference in MOS could have been observed between FastSpeech 2 and Matcha trained on LJ Speech?

I am not sure what you are asking here. We try to argue in our paper that the stochastic nature of Matcha-TTS is a contributing factor to the improvement, and there are theoretical arguments (and experimental evidence in an early paper I co-authored) showing that there are fundamental limitations to treating TTS as a (deterministic) regression problem.
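As a toy illustration of that limitation (my hypothetical one-dimensional example, not one from the paper):

```python
import numpy as np

# Suppose a speaker realises the same text in one of two distinct ways,
# with equal probability (each rendition reduced to a single number here).
renditions = np.array([-1.0, 1.0])

# A deterministic regressor trained with MSE converges to the conditional mean...
print(renditions.mean())  # 0.0
# ...a value that never occurs in the data. For speech, averaging over modes
# like this manifests as oversmoothed, flattened prosody.
```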

If, instead, you are expressing disbelief in the magnitude of the numerical difference we found, you can listen to our example audio stimuli from the listening test to see whether or not you personally agree with our listeners that a large difference in mean opinion scores is warranted.

Could it be that the hyperparameters are not that suitable for my dataset? My dataset is 10 hours of speech plus another 8 hours of augmented data derived from the original.

This is difficult for me to answer. Hyperparameter tuning (to the extent it is needed here) has always been a dark art of machine learning, and one that I have little insight into. I think it depends on much more than just the size of the dataset.

p0p4k commented 10 months ago

@shreyasinghal-17 By editing the code a little, you can provide your own character-level durations, which is useful for debugging the model. E.g., get character-level durations from the alignment between the real mel and the text (MAS). Then, instead of matmul-ing mu_x with the predicted frame durations, use the MAS durations. Push that through the decoder and check the quality. If the quality is good enough, we then know for sure that the duration predictor is the issue (a sketch of this is below).

I have also been thinking of making an autoregressive duration predictor as a seq2seq model, with the text tokens as input and the expanded text tokens as output. To keep decoding fast (since the text-token vocabulary size affects the linear-layer and softmax computations), the decoder in the duration predictor could predict just two tokens, "continue" or "change": on "continue" we stay on the current token, and on "change" we jump to the next one, traversing the input tokens one at a time (taking advantage of the fact that MAS alignments are monotonically increasing and never attend to past tokens). I think this kind of duration predictor might still be fast enough (see the second sketch below). I have realized that the duration predictor is what causes issues in other models, such as VITS, as well.
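A rough sketch of the duration-override idea (untested; the tensor names follow Glow-TTS-style code such as Matcha's, but treat the exact shapes and names as assumptions):

```python
import torch

def expand_with_durations(mu_x: torch.Tensor, durations: torch.Tensor) -> torch.Tensor:
    """Expand text-level features to frame level using given (e.g. MAS) durations.

    mu_x:      (B, C, T_text) encoder outputs
    durations: (B, T_text) integer frame counts per text token
    """
    expanded = [
        torch.repeat_interleave(m, d.long(), dim=-1)  # (C, T_text) -> (C, T_mel)
        for m, d in zip(mu_x, durations)
    ]
    T_max = max(e.shape[-1] for e in expanded)
    mu_y = mu_x.new_zeros(len(expanded), mu_x.shape[1], T_max)
    for i, e in enumerate(expanded):
        mu_y[i, :, : e.shape[-1]] = e  # zero-pad shorter items in the batch
    return mu_y

# mu_y = expand_with_durations(mu_x, mas_durations)  # bypasses the duration predictor
# ...then push mu_y through the flow-matching decoder as usual and listen.
```

And a toy version of the proposed "continue"/"change" decoding loop; `p_change` is a hypothetical callable returning the probability of advancing to the next text token:

```python
def decode_durations(p_change, n_tokens: int, max_frames: int = 2000) -> list:
    """Greedy monotonic decoding: each step emits one frame; "change" advances
    to the next token, "continue" stays on the current one (never going back)."""
    durations = [0] * n_tokens
    idx = 0
    for _ in range(max_frames):
        durations[idx] += 1
        if p_change(idx) > 0.5:  # a binary choice instead of a full-vocab softmax
            idx += 1
            if idx == n_tokens:
                break
    return durations
```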

shivammehta25 commented 10 months ago

Hello, sorry, I was away due to personal reasons. I suppose you have a lot of answers to your problem already, but I cannot point to exactly why this could be the case without much troubleshooting. Perhaps your dataset has such odd variations for some utterances that mean prediction (which FastPitch etc. are doing) somehow works better than stochastic sampling (which 🍵 Matcha-TTS and many other stochastic TTS systems do). Also, listening to the audio examples (native Hindi speaker here), I believe the 🍵 Matcha-TTS examples have a slower speaking rate; could you try fine-tuning that? (I say this because of the unnatural breaks in the second part of the sentence, क्या आप से बात करने के लिए ये समय सही है?)
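For reference, a hedged sketch of adjusting this via the model's synthesise() call (the argument names follow the Glow-/Grad-TTS lineage that 🍵 Matcha-TTS builds on; double-check them against your checkout):

```python
# `model` is a loaded MatchaTTS instance and `x, x_lengths` the processed text,
# exactly as in the normal synthesis path; only `length_scale` changes here.
output = model.synthesise(
    x,
    x_lengths,
    n_timesteps=10,     # ODE solver steps
    temperature=0.667,  # sampling temperature
    length_scale=0.85,  # durations are multiplied by this: < 1.0 speeds speech up
)
```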

Side note: For me, the most confusing part of this question was the use of TransformerTTS. To my understanding, Transformer TTS is a different (and autoregressive) architecture, which is very different from the Fast*-based architectures.

I am closing this issue for now, but if you have any more questions please feel free to reopen this and continue the discussion.