nii-yamagishilab / ZMM-TTS

ZMM-TTS: Zero-shot Multilingual and Multispeaker Speech Synthesis Conditioned on Self-supervised Discrete Speech Representations
BSD 3-Clause "New" or "Revised" License

Zero shot inference with pretrained model #4

Closed ZhongJiafeng-16 closed 6 months ago

ZhongJiafeng-16 commented 6 months ago

I want to run some tests in my environment, and I downloaded the pretrained model files from https://drive.google.com/drive/folders/1lx_D-8tmEFExjAS54XpG3denGneBdvr0?usp=drive_link. When I run the shell script:

python3 txt2vec/synthesize.py --text "tts_test" --restore_step 1200000 --mode quick_test --dataset MM6 --config MM6_XphoneBERT --input_textlist Dataset/MM6/test.txt --output_dir test_result/pred_vecfromxp

I get an error: KeyError: 'use_lang'. It seems the model config file does not contain the key 'use_lang', which is used in synthesize.py.

Could you give a simple demo of zero-shot TTS inference with the pretrained model? Thank you, I am looking forward to your reply.

gongchenghhu commented 6 months ago

@ZhongJiafeng-16 Thanks for trying. We plan to make a comprehensive upgrade to the code and release more pre-trained models after the paper is accepted. In the meantime, here's a quick modification to support inference: add use_lang=True at the end of Config/txt2vec/MM6_XphoneBERT/model.yaml, and add use_lang=False at the end of Config/txt2vec/MM6_XphoneBERT_wo/model.yaml. Note: for zero-shot inference, you should use the **_wo.yaml. If you still have any problems, please let me know.
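
For reference, the appended lines would look roughly like this (a minimal sketch, assuming the repo's model.yaml files use plain key: value YAML syntax):

    # Config/txt2vec/MM6_XphoneBERT/model.yaml (multilingual model with language layer)
    use_lang: True

    # Config/txt2vec/MM6_XphoneBERT_wo/model.yaml (zero-shot model without language layer)
    use_lang: False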

ZhongJiafeng-16 commented 6 months ago

@gongchenghhu Thanks for the rapid reply. I ran the inference script quick_test.sh successfully and got clear, intelligible output speech. Then I tried to extract speaker embeddings from my own audio and synthesize text conditioned on a specific custom speaker embedding, following the same procedure as quick_test.sh. The output speech has good quality, but the voice tone of the reference speaker does not transfer to the output. The results are the same with both the .yaml and _wo.yaml configs. Could you give me some clues? Thanks for your help.

gongchenghhu commented 6 months ago

@ZhongJiafeng-16 Thanks for trying. Could you tell me more details? For example, what is the language of your reference audio?

ZhongJiafeng-16 commented 6 months ago

The language of the speech is English, and the content of the input_textlist file is as follows: 1_Grace|custom|English|0001|Good evening, have you eaten today? What to eat? Have you ever encountered anything happy? Here 1_Grace is the filename of my reference speech.

Here are the steps I did:

  1. Run prepare_data/extrack_spk_emb.py to extract a speaker embedding from the reference speech; I get a .npy file in the output folder (a quick sanity check on this file is sketched right after this list).
  2. Run txt2vec/synthesize.py to get the vec file.
  3. Run test_scripts/prepare_for_vec2wav.py.
  4. Run vec2wav/infer.py.
  5. I then get one speech file at test_result/zmm_tts2_xp/0000_generated.wav. The reference audio is a female voice, but the output sounds like a male voice.

And another question:

Note: for zero-shot inference, you should use the **_wo.yaml.

Could you explain more about this? Are there any other differences between the _wo model and the original model besides the language layer?

gongchenghhu commented 6 months ago

@ZhongJiafeng-16 1) For the voice tone problem: I took a look at your workflow, and there isn't much of a problem there. Could you test it with some other speakers? Because the number of speakers in our training set is limited, you could also try training the model yourself on data with more speakers, such as LibriTTS. In our latest experiments, we found that similarity to unseen speakers improves greatly with large-scale training data. Also, maybe 1_Grace should be normalized by sv56? Is Grace a common timbre or a special one?
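
If the sv56 tool is not at hand, a rough stand-in is to rescale the reference audio to about -26 dB RMS before extracting the speaker embedding. This is only an approximation of sv56 (which measures the active speech level rather than plain RMS), and the filenames and target level below are illustrative:

    import numpy as np
    import soundfile as sf

    TARGET_DB = -26.0  # typical sv56 target level; value assumed here

    def normalize_level(in_path, out_path, target_db=TARGET_DB):
        # Scale the waveform so its overall RMS sits near target_db (relative to full scale).
        audio, sr = sf.read(in_path)
        rms_db = 20 * np.log10(np.sqrt(np.mean(audio ** 2)) + 1e-12)
        gain = 10 ** ((target_db - rms_db) / 20)
        sf.write(out_path, audio * gain, sr)

    normalize_level("1_Grace.wav", "1_Grace_norm.wav")  # filenames illustrative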

2) For the _wo model: yes, the only difference is the language layer.

ZhongJiafeng-16 commented 6 months ago

Thanks again, the suggestion really makes sense. I missed the normalization of the reference speech; I will try it later.