shivammehta25 / Matcha-TTS

[ICASSP 2024] 🍵 Matcha-TTS: A fast TTS architecture with conditional flow matching
https://shivammehta25.github.io/Matcha-TTS/
MIT License

Need help regarding multi-speaker longform speech generation #38

Closed blueprintparadise closed 10 months ago

blueprintparadise commented 11 months ago

Could you provide some resources or code for long-form speech generation that allows switching between multiple speakers within the same text, similar to what you did in the YouTube video?

shivammehta25 commented 11 months ago

Hello! Thank you for your interest in 🍵 Matcha-TTS.

This is a great idea; however, I do not currently have a straightforward interface for it. What you can do is pass two additional lists: one of speaker IDs and one of tuples giving each speaker's phone boundaries. Then, once you have passed the speaker IDs through the speaker encoder and obtained the speaker representations, instead of broadcasting a single representation across all of x, broadcast each representation only over the positions inside its own boundaries, here

https://github.com/shivammehta25/Matcha-TTS/blob/c8d0d60f87147fe340f4627b84588e812e5fbb00/matcha/models/components/text_encoder.py#L403

and in the decoder.

https://github.com/shivammehta25/Matcha-TTS/blob/c8d0d60f87147fe340f4627b84588e812e5fbb00/matcha/models/components/decoder.py#L387
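A minimal sketch of the per-segment broadcasting idea, not code from the repository. It uses NumPy so it runs on its own; the same slice assignment works on PyTorch tensors. The function name `build_speaker_conditioning`, the random `speaker_table` standing in for the speaker encoder's output, and the `(time, channels)` layout are all assumptions made for illustration. The real modules likely use a different tensor layout, and since the decoder works on mel frames rather than phones, the boundaries would presumably have to be converted to frame indices using the durations first.

```python
import numpy as np


def build_speaker_conditioning(speaker_table, speaker_ids, boundaries, seq_len):
    """Return a (seq_len, emb_dim) array with one speaker embedding per position.

    speaker_table: (n_speakers, emb_dim) array, a stand-in for speaker encoder output.
    speaker_ids:   one speaker ID per segment.
    boundaries:    one (start, end) index pair per segment, end exclusive.
    """
    if len(speaker_ids) != len(boundaries):
        raise ValueError("need exactly one boundary tuple per speaker ID")

    cond = np.zeros((seq_len, speaker_table.shape[1]), dtype=speaker_table.dtype)
    covered = np.zeros(seq_len, dtype=bool)

    for spk, (start, end) in zip(speaker_ids, boundaries):
        if not 0 <= start < end <= seq_len:
            raise ValueError(f"invalid boundary ({start}, {end})")
        # Broadcast this speaker's embedding over its own segment only,
        # instead of over the whole sequence.
        cond[start:end] = speaker_table[spk]
        covered[start:end] = True

    if not covered.all():
        raise ValueError("every position must belong to some speaker segment")
    return cond


# Hypothetical example: 8 phones, speaker 0 on phones 0-4, speaker 2 on phones 5-7.
rng = np.random.default_rng(0)
speaker_table = rng.standard_normal((3, 4))
cond = build_speaker_conditioning(speaker_table, [0, 2], [(0, 5), (5, 8)], seq_len=8)
```

The resulting `cond` would take the place of the single speaker vector that is otherwise repeated along the whole time axis before being concatenated with the encoder or decoder input.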

Hope this helps :)

shivammehta25 commented 10 months ago

Hello, I am assuming you are happy with the answer for now. Please feel free to reopen the issue and continue the discussion if you have any further points.