Closed: reveperdu closed this 1 year ago
‘I know this is possibly a problem with the tokenizer or encoders, but I don't know how to adjust or replace them’ --> Since VITS is an end-to-end text-to-speech model, this is not a problem with the tokenizer or encoders.
Actually, I just fine-tuned the 'jsut' base model on 32 sentences from the Steins;Gate games.
Synthesizing English or Chinese is not an easy problem, because you would first have to train a base model on English and Chinese data and then fine-tune it again.
But I'm busy and cannot spend much time on it. If you urgently need this, I'd be glad to help.
Thanks for the reply! I will try to figure it out.
I downloaded the model from Hugging Face and loaded it by calling espnet2.bin.tts_inference. It can synthesize fluent Japanese, but it seems able to read only individual English letters, not words or sentences, e.g. "hello world". I know this is possibly a problem with the tokenizer or encoders, but I don't know how to adjust or replace them.
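For anyone trying to reproduce this, the loading step described above might look roughly like the sketch below. It assumes ESPnet and espnet_model_zoo are installed and that the model can be fetched with `Text2Speech.from_pretrained`. The helper names `write_wav` and `synthesize` are hypothetical, and the model tag is left as a parameter because the thread does not name the repository.

```python
import array
import sys
import wave


def write_wav(path, samples, fs):
    """Write mono float samples in [-1, 1] to a 16-bit PCM WAV file."""
    # Clamp each sample, then scale to the signed 16-bit range.
    pcm = array.array(
        "h", (int(max(-1.0, min(1.0, s)) * 32767) for s in samples)
    )
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV data is little-endian
    with wave.open(path, "wb") as f:
        f.setnchannels(1)
        f.setsampwidth(2)
        f.setframerate(fs)
        f.writeframes(pcm.tobytes())


def synthesize(model_tag, text, out_path):
    """Load a pretrained ESPnet TTS model and save one utterance.

    model_tag is whatever repository id the model was published under;
    it is an assumption here, since the thread does not give it.
    """
    # Imported lazily so write_wav works without ESPnet installed.
    from espnet2.bin.tts_inference import Text2Speech

    tts = Text2Speech.from_pretrained(model_tag=model_tag)
    out = tts(text)  # dict containing the generated waveform under "wav"
    write_wav(out_path, out["wav"].view(-1).cpu().tolist(), tts.fs)
```

Calling `synthesize(<model tag>, "hello world", "out.wav")` with a Japanese-only model should reproduce the behaviour reported here: the audio is generated without error, but the text is spelled out letter by letter.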