lucidrains / e2-tts-pytorch

Implementation of E2-TTS, "Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS", in PyTorch
MIT License

Multilingual training intuition? #29

Closed · zidsi closed 1 week ago

zidsi commented 1 week ago

Inspired by the X2 setup described in the paper, I assume that with sufficient "text depth" the model might be able to capture multiple languages. Did anyone experiment with the X2 setup (replacing a fraction of graphemes with phoneme transcriptions), or even with a multilingual setup?

The challenge I'd like to address is the lack of pronunciation samples for (foreign) names in the custom dataset I'm using.

The X2 setup looks promising; however, it would still require "lookup" preprocessing/replacement at inference time, plus considerable manual annotation of the training dataset due to poor G2P support for the target language.

My intuition is that mixing in English, French, and Spanish samples might lead to better zero-shot pronunciation of "foreign" words. A sketch of the X2-style replacement I have in mind is below.
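To make that concrete, here is a minimal sketch of the X2-style augmentation as I understand it from the paper (randomly replacing words with their phoneme transcription during training). The `g2p_lookup` dictionary, the replacement probability, and the parenthesis convention are illustrative placeholders, not anything from this repo:

```python
import random

# Hypothetical grapheme-to-phoneme lookup. In practice this would come from
# a G2P tool or a hand-annotated dictionary for the target language.
g2p_lookup = {
    "tokyo": "t oʊ k j oʊ",
    "paris": "p æ r ɪ s",
}

def x2_augment(text: str, replace_prob: float = 0.15) -> str:
    """Randomly replace a fraction of known words with their phoneme
    transcription wrapped in parentheses, in the spirit of the paper's
    X2 extension (probability and markup are assumptions here)."""
    out = []
    for word in text.split():
        phonemes = g2p_lookup.get(word.lower())
        if phonemes is not None and random.random() < replace_prob:
            out.append(f"({phonemes})")
        else:
            out.append(word)
    return " ".join(out)

# e.g. "a trip to (t oʊ k j oʊ) and paris" on some draws
print(x2_augment("a trip to tokyo and paris", replace_prob=0.5))
```

At inference, the same parenthesized phoneme markup could then be used to pin down the pronunciation of names the G2P lookup covers.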

Any suggestions welcome.

lucidrains commented 1 week ago

it'll work, all the multilingual LLMs out there attest to that