shivammehta25 / Matcha-TTS

[ICASSP 2024] 🍵 Matcha-TTS: A fast TTS architecture with conditional flow matching
https://shivammehta25.github.io/Matcha-TTS/
MIT License

Matcha compared to VITS #97

Open · yygg678 opened this issue 6 days ago

yygg678 commented 6 days ago

I replicated the results of VITS and Matcha-TTS on a single-speaker Chinese dataset and found that the timbre similarity of Matcha-TTS is lower than that of VITS, especially in the high-frequency details of the spectrum. Below are the spectrograms of VITS and Matcha-TTS. Is there any way to improve the timbre similarity of Matcha-TTS?

[spectrogram images: v (VITS), m (Matcha-TTS)]

shivammehta25 commented 2 days ago

Hi! That is a cool experiment.

Did you fine-tune the vocoder too? The reason I am asking: VITS has a built-in vocoder, since it is an end-to-end TTS system. Matcha, on the other hand, is an acoustic model that learns to generate a log-mel-spectrogram from text. Currently, we have been using an off-the-shelf neural vocoder, namely HiFi-GAN, without fine-tuning it on Matcha's log-mel-spectrogram outputs.

I think that to fix this, you will have to fine-tune the vocoder. One way to do this would be to extract the alignments, use these extracted alignments (instead of the duration predictor's outputs) to generate and save the log-mel-spectrograms, and then fine-tune the vocoder on them.
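A minimal sketch of that data-preparation step, assuming a `synthesise(...)` call that can be driven by precomputed durations (the `durations` keyword and the batch layout here are hypothetical; check Matcha's inference code for the actual interface):

```python
import torch

@torch.inference_mode()
def dump_finetuning_mels(model, dataloader, out_dir):
    """Save log-mel-spectrograms generated with ground-truth alignments,
    to be paired with the original waveforms for vocoder fine-tuning."""
    model.eval()
    for batch in dataloader:
        # `durations` are the alignments extracted from the trained model
        # (e.g. via monotonic alignment search), NOT the duration
        # predictor's outputs -- hypothetical keyword argument.
        out = model.synthesise(
            batch["x"], batch["x_lengths"],
            n_timesteps=10,
            durations=batch["durations"],
        )
        for i, utt_id in enumerate(batch["utt_ids"]):
            torch.save(out["mel"][i].cpu(), f"{out_dir}/{utt_id}.pt")
```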

An easier experiment might be to try switching the vocoder: you can swap HiFi-GAN for BigVGAN off the shelf. They use the same STFT parameters, so you don't need to retrain Matcha with different mel-spectrogram settings.
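As a sketch of why the swap is drop-in (the loader below is a placeholder, not Matcha's actual API): both vocoders consume a log-mel-spectrogram computed with the same STFT settings, so the same Matcha output tensor can be fed to either one.

```python
import torch

def load_vocoder(name: str) -> torch.nn.Module:
    """Placeholder: load a pretrained HiFi-GAN or BigVGAN checkpoint as an
    nn.Module mapping log-mels [B, n_mels, T] to waveforms [B, 1, T * hop]."""
    raise NotImplementedError  # substitute your own checkpoint loading

@torch.inference_mode()
def vocode(mel: torch.Tensor, vocoder: torch.nn.Module) -> torch.Tensor:
    # Both vocoders expect the same mel input, so this works unchanged
    # for either checkpoint.
    return vocoder(mel).clamp(-1.0, 1.0).squeeze(1)

# Same mel, two vocoders -- no retraining of Matcha required:
# wav_hifigan = vocode(mel, load_vocoder("hifigan"))
# wav_bigvgan = vocode(mel, load_vocoder("bigvgan"))
```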

Hope this helps, let me know if you have more questions :)

One side note: Matcha also has a temperature parameter. The higher the temperature, the more variance there will be in the generated output. It is used only during inference/generation, so you can easily play with it. However, I still feel this is a vocoder artefact, as end-to-end models have a waveform-generation objective to optimise while acoustic models do not.
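For reference, a small temperature sweep, assuming the model exposes `synthesise(..., temperature=...)` as in Matcha's inference code (argument and output names may differ in your checkout):

```python
import torch

@torch.inference_mode()
def temperature_sweep(model, x, x_lengths, temperatures=(0.2, 0.5, 0.667, 1.0)):
    """Synthesise the same utterance at several temperatures.

    Lower temperatures give flatter, more averaged mels; higher
    temperatures add variance. Temperature only affects sampling at
    inference time, so no retraining is needed to try different values."""
    mels = {}
    for t in temperatures:
        out = model.synthesise(x, x_lengths, n_timesteps=10, temperature=t)
        mels[t] = out["mel"].cpu()  # assumed output key
    return mels
```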