shivammehta25 / Matcha-TTS

[ICASSP 2024] 🍵 Matcha-TTS: A fast TTS architecture with conditional flow matching
https://shivammehta25.github.io/Matcha-TTS/
MIT License

Matcha synthesised audio prosody does not seem reflective of the paper #23

Closed: shreyasinghal-17 closed this issue 10 months ago

shreyasinghal-17 commented 11 months ago

Output from TransformerTTS (FastPitch/FastSpeech 2 based):

text: "नमस्ते, मैं बजाज आलियांज़ जनरल इंश्योरेंस की ओर से स्वाति बोल रही हूँ, क्या आप से बात करने के लिए ये समय सही है?"

https://github.com/shivammehta25/Matcha-TTS/assets/70097551/af4b66aa-4069-411e-aec9-5e299360fb56

Output from Matcha-TTS (speech rate 0.90):

text: "नमस्ते, मैं बजाज आलियांज़ जनरल इंश्योरेंस की ओर से स्वाति बोल रही हूँ, क्या आप से बात करने के लिए ये समय सही है."

https://github.com/shivammehta25/Matcha-TTS/assets/70097551/7fb7d48a-1755-4837-8cff-8b7510579c39

Output from Matcha-TTS (speech rate 0.90):

text: "नमस्ते, मैं बजाज आलियांज़ जनरल इंश्योरेंस की ओर से स्वाति बोल रही हूँ, क्या आप से बात करने के लिए ये समय सही है?"

https://github.com/shivammehta25/Matcha-TTS/assets/70097551/ed441008-26a8-4fec-ac67-5cb36388cf3a

Please share your opinion.

ghenter commented 11 months ago

Matcha synthesised audio prosody does not seem reflective of the paper

When you say "of the paper", are you referring to one or more specific written formulations in the paper on arXiv, or to particular audio examples of Matcha-TTS online?

Please share your opinion.

I do not speak any Indic languages, so I am not well placed to assess the prosody of the examples you shared.

In general, it seems like there are two questions you might be asking, and I am not sure which one in particular you are looking for an answer to:

shreyasinghal-17 commented 10 months ago

Thanks for your detailed response.

1.) The text front end (espeak) supports Indic languages well; also, the Transformer-based models leverage just that and no additional features.

2.) I reckon the deterministic nature of duration prediction is a viable explanation for this. May I ask why a stochastic duration model was not used for DDPMs like Matcha?

By "prosody" I meant: since prosody seems to have the largest effect(?) on synthesis quality, how is it that such a large difference in MOS could have been observed between FastSpeech 2 and Matcha trained on LJ Speech?

Could it be that the hyperparameters are not that suitable for my dataset? My dataset is 10 hours of speech plus another 8 hours of augmented data derived from the original.

ghenter commented 10 months ago

I am now back from a trip abroad and can respond to your points here.

1.) The text front end (espeak) supports Indic languages well; also, the Transformer-based models leverage just that and no additional features.

When you say "espeak", do you mean espeak or espeak-ng? My understanding is that they are different systems, with Matcha-TTS using espeak-ng for its text processing. If there is a front end out there that uses espeak, I am not aware of it, but if the Transformer TTS system in question is using that, it could lead to different results.

2.) I reckon the deterministic nature of duration prediction is a viable explanation for this. May I ask why a stochastic duration model was not used

There are several reasons for this:

for DDPMs like Matcha?

To be very clear, Matcha-TTS uses OT-CFM, not DDPMs. For one thing, OT-CFM uses ODEs (continuous time), whereas my understanding is that DDPMs are discrete-time stochastic processes.
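To make the contrast concrete, here is a minimal (untested) sketch of the kind of fixed-step Euler sampling used to solve such an ODE; `vector_field` is a placeholder for the trained network:

```python
import torch

def euler_ode_sample(vector_field, x0: torch.Tensor, n_steps: int = 10) -> torch.Tensor:
    """Integrate dx/dt = v(x, t) from t = 0 (noise) to t = 1 (data).

    Unlike DDPM ancestral sampling, each update is deterministic given x0:
    no fresh noise is injected at any step.
    """
    x, dt = x0, 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((x.shape[0],), i * dt)  # current time, one value per batch item
        x = x + dt * vector_field(x, t)        # one Euler step along the learned field
    return x

# Toy usage (a real model would supply a trained network as the vector field):
# x1 = euler_ode_sample(lambda x, t: -x, torch.randn(4, 80, 100))
```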

By "prosody" I meant: since prosody seems to have the largest effect(?) on synthesis quality

Here I must ask what you mean by "quality"? People assign many different meanings to that word, but the answer to your question hinges on what definition you are using in this case.

Do you mean:

a. segmental signal quality (e.g., CD quality vs. AM radio)?
b. "naturalness" (e.g., human-like intonation vs. robotic intonation)?
c. a more applied definition of "quality", such as "the mean opinion score I get if I ask people to 'please rate the quality of these sentences on a scale from 1 to 5'"?

Measure a is in principle not affected by prosody at all, whereas measure b is strongly affected by it. Conversely, measure a can be strongly affected by the vocoder/signal generator used, whereas b is virtually unaffected by it. However, it is difficult to design listening tests that measure only a in isolation, ignoring the effect of b, or vice versa.

If you mean option c, I would say that the picture is more complex. I think (without much empirical evidence to back it up) that mispronunciations, if/when present, probably have the biggest impact on MOS ratings, at least on individual stimuli. Beyond that it gets complicated. The relative effects of segmental quality (a above) and prosody (b above) will depend on the quality and prosodic richness of the database (e.g., read versus spontaneous speech), on what question listeners are asked, and much more.

In the Matcha-TTS paper, we asked listeners "How natural does the synthesised speech sound?" This is quite open to interpretation by each individual listener, but I would expect this question formulation to give more weight to measure b (prosody) relative to a (segmental quality) than if we had asked about the "quality" of the speech instead. However, I cannot cite research off the top of my head to support this belief. Recent research of ours has found that numerical scores on MOS tests differ depending on which of these two questions is asked, but disentangling the relative effects of different speech properties was not a focus of that work.

how is it that such a large difference in MOS could have been observed between FastSpeech 2 and Matcha trained on LJ Speech?

I am not sure what you are asking here. We try to argue in our paper that the stochastic nature of Matcha-TTS is a contributing factor to the improvement, and there are theoretical arguments (and experimental evidence in an early paper I co-authored) showing that there are fundamental limitations to treating TTS as a (deterministic) regression problem.
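As a toy illustration of that limitation (my hypothetical one-dimensional example, not one from the paper):

```python
import numpy as np

# Suppose a speaker realises the same text in one of two distinct ways,
# with equal probability (each rendition reduced to a single number here).
renditions = np.array([-1.0, 1.0])

# A deterministic regressor trained with MSE converges to the conditional mean...
print(renditions.mean())  # 0.0
# ...a value that never occurs in the data. For speech, averaging over modes
# like this manifests as oversmoothed, flattened prosody.
```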

If, instead, you are expressing disbelief in the magnitude of the numerical difference we found, you can listen to our example audio stimuli from the listening test to see whether or not you personally agree with our listeners that a large difference in mean opinion scores is warranted.

Could it be that the hyperparameters are not that suitable for my dataset? My dataset is 10 hours of speech plus another 8 hours of augmented data derived from the original.

This is difficult for me to answer. Hyperparameter tuning (to the extent it is needed here) has always been a dark art of machine learning, and one that I have little insight into. I think it depends on much more than just the size of the dataset.

p0p4k commented 10 months ago

@shreyasinghal-17 By editing the code a little, you can provide your own character-level durations, which is useful for debugging the model. E.g., get character-level durations from the alignment between the real mel and the text (MAS). Then, instead of matmul-ing mu_x with the predicted frame durations, use the MAS durations. Push that through the decoder and check the quality. If the quality is good enough, we then know for sure that the duration predictor is the issue (a sketch of this is below).

I have also been thinking of making an autoregressive duration predictor as a seq2seq model, with the text tokens as input and the expanded text tokens as output. To keep decoding fast (since the text-token vocabulary size affects the linear-layer and softmax computations), the decoder in the duration predictor could predict just two tokens, "continue" or "change": on "continue" we stay on the current token, and on "change" we jump to the next one, traversing the input tokens one at a time (taking advantage of the fact that MAS alignments are monotonically increasing and never attend to past tokens). I think this kind of duration predictor might still be fast enough (see the second sketch below). I have realized that the duration predictor is what causes issues in other models, such as VITS, as well.
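A rough sketch of the duration-override idea (untested; the tensor names follow Glow-TTS-style code such as Matcha's, but treat the exact shapes and names as assumptions):

```python
import torch

def expand_with_durations(mu_x: torch.Tensor, durations: torch.Tensor) -> torch.Tensor:
    """Expand text-level features to frame level using given (e.g. MAS) durations.

    mu_x:      (B, C, T_text) encoder outputs
    durations: (B, T_text) integer frame counts per text token
    """
    expanded = [
        torch.repeat_interleave(m, d.long(), dim=-1)  # (C, T_text) -> (C, T_mel)
        for m, d in zip(mu_x, durations)
    ]
    T_max = max(e.shape[-1] for e in expanded)
    mu_y = mu_x.new_zeros(len(expanded), mu_x.shape[1], T_max)
    for i, e in enumerate(expanded):
        mu_y[i, :, : e.shape[-1]] = e  # zero-pad shorter items in the batch
    return mu_y

# mu_y = expand_with_durations(mu_x, mas_durations)  # bypasses the duration predictor
# ...then push mu_y through the flow-matching decoder as usual and listen.
```

And a toy version of the proposed "continue"/"change" decoding loop; `p_change` is a hypothetical callable returning the probability of advancing to the next text token:

```python
def decode_durations(p_change, n_tokens: int, max_frames: int = 2000) -> list:
    """Greedy monotonic decoding: each step emits one frame; "change" advances
    to the next token, "continue" stays on the current one (never going back)."""
    durations = [0] * n_tokens
    idx = 0
    for _ in range(max_frames):
        durations[idx] += 1
        if p_change(idx) > 0.5:  # a binary choice instead of a full-vocab softmax
            idx += 1
            if idx == n_tokens:
                break
    return durations
```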

shivammehta25 commented 10 months ago

Hello, sorry, I was away due to personal reasons. I suppose you have a lot of answers to your problem already, but I cannot point to exactly why this could be the case without much troubleshooting. Perhaps your dataset has such odd variations for some utterances that mean prediction (which FastPitch etc. are doing) somehow works better than stochastic sampling (which 🍵 Matcha-TTS and many other stochastic TTS systems do). Also, listening to the audio examples (native Hindi speaker here), I believe the 🍵 Matcha-TTS examples have a slower speaking rate; could you try fine-tuning that? (I say this because of the unnatural breaks in the second part of the sentence, क्या आप से बात करने के लिए ये समय सही है?)
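For reference, a hedged sketch of adjusting this via the model's synthesise() call (the argument names follow the Glow-/Grad-TTS lineage that 🍵 Matcha-TTS builds on; double-check them against your checkout):

```python
# `model` is a loaded MatchaTTS instance and `x, x_lengths` the processed text,
# exactly as in the normal synthesis path; only `length_scale` changes here.
output = model.synthesise(
    x,
    x_lengths,
    n_timesteps=10,     # ODE solver steps
    temperature=0.667,  # sampling temperature
    length_scale=0.85,  # durations are multiplied by this: < 1.0 speeds speech up
)
```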

Side note: For me, the most confusing part of this question was the use of TransformerTTS. To my understanding, Transformer TTS is a different (and autoregressive) architecture, which is very different from the Fast*-based architectures.

I am closing this issue for now, but if you have any more questions please feel free to reopen this and continue the discussion.