shivammehta25 / Matcha-TTS

[ICASSP 2024] 🍵 Matcha-TTS: A fast TTS architecture with conditional flow matching
https://shivammehta25.github.io/Matcha-TTS/

How are short utterances performing? #45

Closed lumpidu closed 9 months ago

lumpidu commented 10 months ago

Hi, thanks for your interesting model! I read in an issue here that you recommend training on audio samples longer than X seconds, where X is 2 seconds or even more.

Aside from the concrete reasons why the model needs such minimum audio lengths, I would be interested to know whether you have already tested short utterances. I am especially interested in how e.g. spelling, single letters, and short numbers perform.

Background: for people with reading disabilities, the on-screen keyboard is an essential writing tool, and its users rely on screen readers to speak the keyboard letters aloud. In my experience, many models have problems with such short utterances, but training on enough suitable short audio samples helps - which would be problematic with Matcha-TTS?

shivammehta25 commented 10 months ago

Hello,

Thank you for your interest in our work.

You don't need to retrain Matcha for this. You can synthesise single words and even single letters, as long as the model was trained on a well-behaved dataset; I haven't noticed the model breaking down in such cases. To verify, use the single-speaker (LJ Speech) checkpoint. It is also available on our HuggingFace Space (select the radio button Single Speaker (LJ Speech)).
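For a quick local check, something along these lines should synthesise a single letter. This is a rough sketch based on the synthesis notebook in this repo; the checkpoint filename is just a placeholder, and helper locations or return types may differ slightly between versions.

```python
import torch

from matcha.models.matcha_tts import MatchaTTS
from matcha.text import text_to_sequence
from matcha.utils.utils import intersperse  # helper location may vary by version

device = "cpu"
# "matcha_ljspeech.ckpt" is a placeholder for the single-speaker LJ Speech checkpoint.
model = MatchaTTS.load_from_checkpoint("matcha_ljspeech.ckpt", map_location=device).eval()

text = "B"  # a single letter, as a screen reader would request
seq = text_to_sequence(text, ["english_cleaners2"])
seq = seq[0] if isinstance(seq, tuple) else seq  # some versions also return the cleaned text
x = torch.tensor(intersperse(seq, 0), dtype=torch.long, device=device)[None]
x_lengths = torch.tensor([x.shape[-1]], dtype=torch.long, device=device)

with torch.inference_mode():
    out = model.synthesise(x, x_lengths, n_timesteps=10, temperature=0.667, length_scale=1.0)

mel = out["mel"]  # feed this mel spectrogram to a vocoder (e.g. HiFi-GAN) to get a waveform
```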

> I was reading in an issue here, that you recommend to train on audio samples > X seconds, where X is a number larger than 2 seconds or even more.

It is not a must but rather a recommendation. The main concern is that if the audio is too short and wrongly annotated, the length of the text can exceed the length of the mel spectrogram, which breaks the Monotonic Alignment Search (MAS) used to learn the alignment between text and audio. Again, this is a problem with ill-behaved datasets, and synthesis is unaffected because MAS is only a training-time heuristic.
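To make the constraint concrete, a monotonic alignment needs at least one mel frame per text token. A quick back-of-the-envelope check, assuming typical LJ Speech-style settings (22.05 kHz sample rate, hop length 256, both assumptions for illustration only):

```python
# Rough sanity check for the MAS constraint described above.
SAMPLE_RATE = 22050   # assumed sample rate
HOP_LENGTH = 256      # assumed mel hop length

def mas_is_feasible(audio_seconds: float, n_text_tokens: int) -> bool:
    """MAS can only align the pair if there are at least as many mel frames as text tokens."""
    n_mel_frames = int(audio_seconds * SAMPLE_RATE / HOP_LENGTH)
    return n_mel_frames >= n_text_tokens

# A 0.4 s clip gives about 34 frames; if a mis-annotated transcript expands to
# more tokens than that, MAS cannot produce a valid alignment.
print(mas_is_feasible(0.4, 20))   # True  (~34 frames >= 20 tokens)
print(mas_is_feasible(0.4, 50))   # False (~34 frames <  50 tokens)
```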

One reason why it doesn't work with the multispeaker checkpoint is that I didn't trim the silence present in each file of the VCTK corpus. If you trim the silence and then train, this solves the problem. It works fine for LJ Speech, which shows that the model is capable of synthesising such short utterances.
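If it helps, trimming leading and trailing silence before training can be done with standard audio tooling. A minimal sketch using librosa and soundfile (not part of this repo; the filenames are hypothetical VCTK-style examples and the threshold will need tuning per corpus):

```python
import librosa
import soundfile as sf

def trim_silence(in_path: str, out_path: str, top_db: float = 30.0) -> None:
    """Trim leading/trailing silence below `top_db` dB relative to the signal peak."""
    y, sr = librosa.load(in_path, sr=None)                 # keep the original sample rate
    y_trimmed, _ = librosa.effects.trim(y, top_db=top_db)  # returns (trimmed audio, indices)
    sf.write(out_path, y_trimmed, sr)

# Hypothetical VCTK-style filenames, for illustration only.
trim_silence("p225_001.wav", "p225_001_trimmed.wav")
```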

Regards, Shivam

shivammehta25 commented 9 months ago

I am closing this for now; feel free to reopen if you have any further questions.