Closed JohnHerry closed 1 month ago
Since the authors are using a character-level text encoder, I see no problem in using the encoder of ByT5, which should be able to handle all languages irrespective of vocabulary.
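For reference, a minimal sketch of feeding mixed-language text through the ByT5 encoder via Hugging Face `transformers` (the checkpoint name `google/byt5-small` is an illustrative assumption, not necessarily what this repo would use):

```python
# Sketch: encoding mixed-language text with ByT5's byte-level encoder.
# Assumes the Hugging Face `transformers` library; `google/byt5-small`
# is an illustrative checkpoint choice.
import torch
from transformers import AutoTokenizer, T5EncoderModel

tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")
encoder = T5EncoderModel.from_pretrained("google/byt5-small")

text = "你好 world"
inputs = tokenizer(text, return_tensors="pt")  # byte-level ids, so no OOV tokens
with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state  # shape: (1, seq_len, d_model)
print(hidden.shape)
```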
Thank you for the help, I will give it a try.
Hi, I see the text tokenizer is just a simple `bytes(text, "UTF-8")`. Will it support multiple languages? E.g., if my input is Chinese-English mixed sentences, each Chinese character will be converted into three tokens valued in [0, 255], while each English (ASCII) character maps to a single token valued in [0, 127]. Can this input be trained, or should we choose another text tokenizer in this case?
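For concreteness, a minimal sketch of what this byte-level scheme produces for a mixed sentence (plain Python, no project-specific code):

```python
# Sketch: byte-level "tokenization" of a Chinese-English mixed sentence.
# Each ASCII character becomes one token in [0, 127]; each common Chinese
# character becomes three UTF-8 bytes, each in [0, 255].
text = "你好 world"
tokens = list(bytes(text, "UTF-8"))  # same as list(text.encode("utf-8"))

print(tokens)
# [228, 189, 160, 229, 165, 189, 32, 119, 111, 114, 108, 100]
#  |--- 你 ---|  |--- 好 ---|  ' '  w    o    r    l    d

# The vocabulary is fixed at 256 byte values, so mixed-language input
# never produces out-of-vocabulary tokens; non-ASCII scripts simply
# yield longer token sequences.
```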