v-nhandt21 / ViSV2TTS

Vietnamese Voice Cloning System using Speaker Verification training on multispeaker VITS

Question about: punctuation in script and voice mix data. #9

Open drlor2k opened 1 month ago

drlor2k commented 1 month ago

Thank you @v-nhandt21 for sharing the repo. I have two questions; if you have time, please help me.

  1. Before the script is converted to phonemes by the function vi2IPA_split, do punctuation marks need to be removed? I ask because the VIVOS dataset has no punctuation. If we keep the punctuation, will it affect the output? For example, is the silence longer where there is a comma?

  2. I see that the VIVOS data mixes male and female voices. If my dataset contains only one gender and one voice, will that make the final output better?
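
To make the first question concrete, here is a minimal sketch of the two preprocessing options: dropping punctuation entirely, or keeping pause marks as separate tokens. The function name `normalize_script` and the `PAUSE_MARKS` set are hypothetical and not part of the repo, and whether vi2IPA_split accepts punctuation tokens is an assumption that would need to be checked.

```python
import unicodedata

# Hypothetical choice: punctuation that usually marks a pause in speech.
PAUSE_MARKS = {",", ".", ";", ":", "?", "!"}


def normalize_script(text: str, keep_pauses: bool = False) -> str:
    """Lowercase `text` and either drop punctuation or keep pause marks as tokens."""
    out = []
    for ch in unicodedata.normalize("NFC", text.lower()):
        if unicodedata.category(ch).startswith("P"):
            # Keep a pause mark as its own token, otherwise replace it with a space.
            out.append(f" {ch} " if keep_pauses and ch in PAUSE_MARKS else " ")
        else:
            out.append(ch)
    # Collapse the runs of whitespace left behind by removed marks.
    return " ".join("".join(out).split())


print(normalize_script("Xin chào, thế giới!"))
print(normalize_script("Xin chào, thế giới!", keep_pauses=True))
```

With `keep_pauses=False` the text matches the punctuation-free style of VIVOS transcripts; with `keep_pauses=True` the marks survive as separate tokens, which only helps if the phonemizer and the model's symbol set include them.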

v-nhandt21 commented 1 month ago

Hi @drlor2k:

You can check out this repo: https://github.com/thinhlpg/vixtts-demo. It provides a pretrained model that you can fine-tune.

drlor2k commented 1 month ago

Thanks for your response @v-nhandt21, I tried https://github.com/thinhlpg/vixtts-demo, it's a great attempt but it lacks the necessary stability. I actually forgot that I could fine-tune it :v

drlor2k commented 1 month ago

Hello @v-nhandt21, I have some takeaways about VITS2 and XTTS. Could you give your opinion?

  1. In terms of sound output quality, VITS may be better than XTTS.

  2. XTTS is based on text-to-token via tokenizer, so it covers almost all words, including words outside the training language, which gives it the ability to pronounce some common foreign words, as long as these words appear in the training data. In contrast, VITS depends on text-to-phoneme, and thus foreign words almost always have no corresponding phoneme.

  3. Based on point 2, intuitively we should remove audio that contains foreign pronunciation, because:

  1. How do we deal with out-of-phoneme input in VITS, i.e. words that have no corresponding phoneme?
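
One generic way to handle symbols outside the phoneme inventory is to skip them (or map them to a fallback symbol) while counting how often that happens, so that utterances with many unknown symbols can be filtered out of the training data. The sketch below assumes a toy `SYMBOLS` list and a hypothetical `encode_phonemes` helper; it is not how this repo or VITS itself does it.

```python
from collections import Counter

# Hypothetical symbol inventory; a real one would come from the model's symbol table.
SYMBOLS = ["_", " ", "a", "b", "k", "m", "n", "t", "ɔ", "ŋ"]
SYMBOL_TO_ID = {s: i for i, s in enumerate(SYMBOLS)}


def encode_phonemes(phonemes, unk=None):
    """Map phoneme symbols to ids; skip unknown symbols (or map them to `unk`)."""
    ids, unknown = [], Counter()
    for p in phonemes:
        if p in SYMBOL_TO_ID:
            ids.append(SYMBOL_TO_ID[p])
        else:
            # Record the unknown symbol so noisy utterances can be inspected later.
            unknown[p] += 1
            if unk is not None:
                ids.append(SYMBOL_TO_ID[unk])
    return ids, unknown


ids, unknown = encode_phonemes(["b", "a", "n", "k", " ", "θ"])
print(ids, dict(unknown))
```

The `unknown` counter gives a simple per-utterance criterion, for example dropping any training sample where it is non-empty.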

Thank you if you take the time to respond!

v-nhandt21 commented 1 month ago

Hi @drlor2k,

drlor2k commented 1 month ago

Thank you for your response @v-nhandt21. I have another question, can you help me?

I see that some speech-to-speech repos use a very small validation dataset (2 recordings per voice). As I understand it, they want to overfit as much as possible to the voice being cloned.

With your repo, does the validation dataset affect the training process?

  • If yes: suppose my training data is 200 hours, what is an appropriate size for the validation dataset?
  • If no: maybe I should keep the validation dataset quite small to make training faster. What do you think?

v-nhandt21 commented 1 month ago

My repo is a variant/adaptation of VITS for Vietnamese, so it is text-to-speech and the validation set has no effect on training!

Some repos, such as https://github.com/svc-develop-team/so-vits-svc?tab=readme-ov-file, are speech-to-speech: they try to learn speech characteristics from a small set and then inject these features to control the style of the output voice.
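
If the validation set is only used for monitoring, a small per-speaker hold-out is one way to keep evaluation cheap. The sketch below assumes filelist lines in the form `wav_path|speaker_id|text`, which may not match this repo's actual format; `split_filelist` and `val_per_speaker` are hypothetical names.

```python
import random
from collections import defaultdict


def split_filelist(lines, val_per_speaker=2, seed=0):
    """Hold out `val_per_speaker` utterances per speaker; the rest goes to training."""
    by_speaker = defaultdict(list)
    for line in lines:
        # Assumed line format: wav_path|speaker_id|text
        speaker = line.split("|")[1]
        by_speaker[speaker].append(line)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, val = [], []
    for speaker in sorted(by_speaker):
        items = by_speaker[speaker][:]
        rng.shuffle(items)
        val.extend(items[:val_per_speaker])
        train.extend(items[val_per_speaker:])
    return train, val


lines = [f"wavs/{s}_{i}.wav|{s}|text {i}" for s in ("0", "1") for i in range(5)]
train, val = split_filelist(lines)
print(len(train), len(val))
```

Holding out a fixed count per speaker rather than a percentage keeps the validation set small even when the training data is large.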