shivammehta25 / Matcha-TTS

[ICASSP 2024] 🍵 Matcha-TTS: A fast TTS architecture with conditional flow matching
https://shivammehta25.github.io/Matcha-TTS/
MIT License

Matcha compared to VITS #97

Open · yygg678 opened this issue 6 days ago

yygg678 commented 6 days ago

I replicated the results of VITS and Matcha-TTS on a single-speaker Chinese dataset and found that the timbre similarity of Matcha-TTS is lower than that of VITS, especially in the high-frequency details of the spectrum. Below are the spectrograms of VITS and Matcha-TTS. Is there any way to improve the timbre similarity of Matcha-TTS?

[spectrogram images: v (VITS), m (Matcha-TTS)]

shivammehta25 commented 2 days ago

Hi! That is a cool experiment.

Did you fine-tune the vocoder too? The reason I am asking: VITS has a built-in vocoder, since it is an end-to-end TTS system. Matcha, on the other hand, is an acoustic model that learns to generate a log-mel-spectrogram from text. Currently, we have been using an off-the-shelf neural vocoder, namely HiFi-GAN, without fine-tuning it on Matcha's log-mel-spectrogram outputs.

I think that to fix this, you will have to fine-tune the vocoder. One way to do this would be to extract the alignments, use these extracted alignments (instead of the duration predictor's outputs) to generate and save the log-mel-spectrograms, and then fine-tune the vocoder on them.
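A minimal sketch of that data-preparation step, assuming a `synthesise(...)` call that can be driven by precomputed durations (the `durations` keyword and the batch layout here are hypothetical; check Matcha's inference code for the actual interface):

```python
import torch

@torch.inference_mode()
def dump_finetuning_mels(model, dataloader, out_dir):
    """Save log-mel-spectrograms generated with ground-truth alignments,
    to be paired with the original waveforms for vocoder fine-tuning."""
    model.eval()
    for batch in dataloader:
        # `durations` are the alignments extracted from the trained model
        # (e.g. via monotonic alignment search), NOT the duration
        # predictor's outputs -- hypothetical keyword argument.
        out = model.synthesise(
            batch["x"], batch["x_lengths"],
            n_timesteps=10,
            durations=batch["durations"],
        )
        for i, utt_id in enumerate(batch["utt_ids"]):
            torch.save(out["mel"][i].cpu(), f"{out_dir}/{utt_id}.pt")
```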

An easier experiment might be to try switching the vocoder: you can swap HiFi-GAN for BigVGAN off the shelf. They use the same STFT parameters, so you don't need to retrain Matcha with different mel-spectrogram settings.
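As a sketch of why the swap is drop-in (the loader below is a placeholder, not Matcha's actual API): both vocoders consume a log-mel-spectrogram computed with the same STFT settings, so the same Matcha output tensor can be fed to either one.

```python
import torch

def load_vocoder(name: str) -> torch.nn.Module:
    """Placeholder: load a pretrained HiFi-GAN or BigVGAN checkpoint as an
    nn.Module mapping log-mels [B, n_mels, T] to waveforms [B, 1, T * hop]."""
    raise NotImplementedError  # substitute your own checkpoint loading

@torch.inference_mode()
def vocode(mel: torch.Tensor, vocoder: torch.nn.Module) -> torch.Tensor:
    # Both vocoders expect the same mel input, so this works unchanged
    # for either checkpoint.
    return vocoder(mel).clamp(-1.0, 1.0).squeeze(1)

# Same mel, two vocoders -- no retraining of Matcha required:
# wav_hifigan = vocode(mel, load_vocoder("hifigan"))
# wav_bigvgan = vocode(mel, load_vocoder("bigvgan"))
```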

Hope this helps, let me know if you have more questions :)

One side note: Matcha also has a temperature parameter. The higher the temperature, the more variance there will be in the generated output. It is used only during inference/generation, so you can easily play with it. However, I still feel this is a vocoder artefact, as end-to-end models have a waveform-generation objective to optimise while acoustic models do not.
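For reference, a small temperature sweep, assuming the model exposes `synthesise(..., temperature=...)` as in Matcha's inference code (argument and output names may differ in your checkout):

```python
import torch

@torch.inference_mode()
def temperature_sweep(model, x, x_lengths, temperatures=(0.2, 0.5, 0.667, 1.0)):
    """Synthesise the same utterance at several temperatures.

    Lower temperatures give flatter, more averaged mels; higher
    temperatures add variance. Temperature only affects sampling at
    inference time, so no retraining is needed to try different values."""
    mels = {}
    for t in temperatures:
        out = model.synthesise(x, x_lengths, n_timesteps=10, temperature=t)
        mels[t] = out["mel"].cpu()  # assumed output key
    return mels
```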