lucidrains / e2-tts-pytorch

Implementation of E2-TTS, "Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS", in PyTorch
MIT License

Multilingual training intuition? #29

Closed · zidsi closed 1 week ago

zidsi commented 1 week ago

Inspired by the X2 setup described in the paper, I assume that with sufficient "text depth" the model might be able to capture multiple languages. Did anyone experiment with the X2 setup (replacing a fraction of graphemes with phoneme transcriptions), or even with a multilingual setup?

The challenge I'd like to address is the lack of pronunciation samples for (foreign) names in the custom dataset I'm using.

The X2 setup looks promising; however, it would still require "lookup" preprocessing/replacement at inference time, plus considerable manual annotation of the training dataset due to poor G2P support for the target language.

My intuition is that mixing in English, French, and Spanish samples might lead to better zero-shot pronunciation of "foreign" words. A sketch of the X2-style replacement I have in mind is below.
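To make that concrete, here is a minimal sketch of the X2-style augmentation as I understand it from the paper (randomly replacing words with their phoneme transcription during training). The `g2p_lookup` dictionary, the replacement probability, and the parenthesis convention are illustrative placeholders, not anything from this repo:

```python
import random

# Hypothetical grapheme-to-phoneme lookup. In practice this would come from
# a G2P tool or a hand-annotated dictionary for the target language.
g2p_lookup = {
    "tokyo": "t oʊ k j oʊ",
    "paris": "p æ r ɪ s",
}

def x2_augment(text: str, replace_prob: float = 0.15) -> str:
    """Randomly replace a fraction of known words with their phoneme
    transcription wrapped in parentheses, in the spirit of the paper's
    X2 extension (probability and markup are assumptions here)."""
    out = []
    for word in text.split():
        phonemes = g2p_lookup.get(word.lower())
        if phonemes is not None and random.random() < replace_prob:
            out.append(f"({phonemes})")
        else:
            out.append(word)
    return " ".join(out)

# e.g. "a trip to (t oʊ k j oʊ) and paris" on some draws
print(x2_augment("a trip to tokyo and paris", replace_prob=0.5))
```

At inference, the same parenthesized phoneme markup could then be used to pin down the pronunciation of names the G2P lookup covers.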

Any suggestions welcome.

lucidrains commented 1 week ago

it'll work, all the multilingual LLMs out there attest to that