shivammehta25 / Matcha-TTS

[ICASSP 2024] 🍵 Matcha-TTS: A fast TTS architecture with conditional flow matching
https://shivammehta25.github.io/Matcha-TTS/
MIT License

What parameters could you suggest to tweak to improve RTF #78

Closed · mush42 closed this issue 3 months ago

mush42 commented 3 months ago

Hi,

Thanks again for the awesome work.

This issue is complementary to #52

I'm deploying Matcha-TTS on CPU for use with screen reader software for the visually impaired. Screen readers require very low response times, even if that comes at the cost of degraded output quality.

What parameters (other than n_timesteps) would you suggest tweaking to improve RTF with little to no loss in output quality?

Best, Musharraf

shivammehta25 commented 3 months ago

Hi @mush42, I hope you are doing okay! It is always nice to hear from you.

If you are okay with some degraded quality, I suggest starting by reducing the encoder's blocks and parameters: set https://github.com/shivammehta25/Matcha-TTS/blob/d31cd92a6122fb99987715248941c96744bf0a36/configs/model/encoder/default.yaml#L8 to 1.

Then weaken the encoder by reducing https://github.com/shivammehta25/Matcha-TTS/blob/d31cd92a6122fb99987715248941c96744bf0a36/configs/model/encoder/default.yaml#L5 to 256 or so. Hopefully, the decoder is strong enough to handle generation without it. I remember that back in the https://github.com/shivammehta25/Neural-HMM days, I removed the decoder and everything still sounded good enough (a bit degraded, of course, but not that bad).
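For reference, a minimal sketch of those two encoder changes in configs/model/encoder/default.yaml. The key names are assumed from the linked lines (n_layers for the block count, filter_channels for the width); double-check them against your checkout.

```yaml
# configs/model/encoder/default.yaml (sketch; key names assumed, not verified)
encoder_params:
  filter_channels: 256  # linked L5: narrower encoder (default is larger)
  n_layers: 1           # linked L8: a single encoder block
```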

Next, try reducing https://github.com/shivammehta25/Matcha-TTS/blob/d31cd92a6122fb99987715248941c96744bf0a36/configs/model/decoder/default.yaml#L1 to [64, 64], and https://github.com/shivammehta25/Matcha-TTS/blob/d31cd92a6122fb99987715248941c96744bf0a36/configs/model/decoder/default.yaml#L3 to 32.

However, these two might have a significant effect depending on the complexity of your dataset. These are, I feel, the most important parameters you can reduce while everything should still work.
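Concretely, a sketch of the suggested decoder config in configs/model/decoder/default.yaml, again assuming the linked lines correspond to the channels and attention_head_dim keys:

```yaml
# configs/model/decoder/default.yaml (sketch; key names assumed, not verified)
channels: [64, 64]      # linked L1: smaller U-Net channel widths
attention_head_dim: 32  # linked L3: smaller attention heads
```

Note that changing any of these requires retraining, since the config must match the checkpoint.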

Let me know if you need any help training things.

Kind Regards, Shivam

mush42 commented 3 months ago

@shivammehta25 always nice to hear from you too!

Thanks for the comprehensive answer.

I feel like weakening the encoder is the least destructive option here. I can run encoder inference in one pass and run decoder inference in chunks based on the decoder's receptive field, which by itself degrades quality, but YOLO. Something like the sketch below.
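A rough, purely illustrative sketch of that chunking idea; the two stage callables and the chunk/overlap sizes are hypothetical placeholders, not the Matcha-TTS API:

```python
import torch

CHUNK = 200   # decoder frames per chunk (hypothetical value)
OVERLAP = 20  # extra context frames on each side, discarded after decoding

def synthesise_chunked(run_encoder, run_decoder, x, x_lengths):
    """run_encoder/run_decoder are placeholder callables wrapping the two stages."""
    mu, mask = run_encoder(x, x_lengths)  # one pass over the full input
    n_frames = mu.shape[-1]
    mels = []
    for start in range(0, n_frames, CHUNK):
        lo = max(0, start - OVERLAP)
        hi = min(n_frames, start + CHUNK + OVERLAP)
        mel = run_decoder(mu[..., lo:hi], mask[..., lo:hi])
        # keep only the central frames; the overlap exists only to give the
        # decoder's receptive field some context and is thrown away
        left = start - lo
        mels.append(mel[..., left:left + min(CHUNK, n_frames - start)])
    return torch.cat(mels, dim=-1)
```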

Best, Musharraf

shivammehta25 commented 3 months ago

Also, @ghenter rightfully reminded me: you can also reduce the number of times the decoder is called (that is, the number of times we query the neural network to push the random noise towards speech).

https://github.com/shivammehta25/Matcha-TTS/blob/d31cd92a6122fb99987715248941c96744bf0a36/matcha/cli.py#L249

If the model is well trained, even 1 step is good enough. The lower the value, the lower the quality, but we evaluated 2, 4, and 10 steps, and I feel any of those settings would work well enough for your use case.
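For reference, a minimal sketch of setting this at synthesis time, assuming a loaded MatchaTTS model and text tensors prepared as in matcha/cli.py:

```python
# Fewer ODE solver steps = fewer decoder calls = lower RTF.
# `model` is assumed to be a loaded MatchaTTS instance; `x` and `x_lengths`
# are the processed text tensors (see matcha/cli.py).
output = model.synthesise(
    x,
    x_lengths,
    n_timesteps=2,   # try 2, 4, or 10; lower is faster but lower quality
    temperature=0.667,
    length_scale=1.0,
)
mel = output["mel"]  # mel spectrogram to hand to the vocoder as usual
```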

Kind Regards, Shivam

mush42 commented 3 months ago

@shivammehta25

Thanks for the tip. I already knew about that option from when I worked on the ONNX export, but I want more 🙂

I appreciate you giving me concrete values for n_timesteps; it is something that is hard to set correctly.

Best, Musharraf