Question about: punctuation in script and voice mix data.

drlor2k commented 1 month ago

Thank you @v-nhandt21 for sharing the repo. I have two questions; if you have time, please help me.

1. Before the script is converted to phonemes by the function vi2IPA_split, do punctuation marks need to be removed? I ask because the vivos dataset has no punctuation. If we don't remove the punctuation, will it affect the output? For example, is the silence longer where there is a , mark?
2. The vivos data mixes male and female voices. If my dataset contains only one voice of one gender, will the final output be better?

v-nhandt21 commented 1 month ago

Hi @drlor2k:

1. In my experiments, there are two main pause durations in speech synthesis, short and long, so I convert all punctuation to "," or ".".
2. VIVOS is only used as an example for the training script. To train a voice cloning model from scratch, I think you should collect more than 200 hours of clean audio.

You can also check out this repo: https://github.com/thinhlpg/vixtts-demo . They provide a pretrained model that you can fine-tune.
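To illustrate the idea of reducing all punctuation to a short pause (",") or a long pause ("."), here is a minimal sketch. The grouping of marks into short and long pauses, the set of dropped characters, and the function name `normalize_punctuation` are assumptions for illustration, not the mapping used in the repo.

```python
import re

# Assumption: which marks count as a short or long pause is a guess for
# illustration, not the repository's actual mapping.
SHORT_PAUSE = ",;:"
LONG_PAUSE = ".!?"
DROPPED = "\"'()[]{}"


def normalize_punctuation(text: str) -> str:
    """Map punctuation to ',' (short pause) or '.' (long pause)."""
    out = []
    for ch in text:
        if ch in SHORT_PAUSE:
            out.append(",")
        elif ch in LONG_PAUSE:
            out.append(".")
        elif ch in DROPPED:
            out.append(" ")
        else:
            out.append(ch)
    text = "".join(out)
    # Collapse any run of pause marks containing a '.' into one long pause.
    text = re.sub(r"[\s,.]*\.[\s,.]*", ". ", text)
    # Collapse the remaining runs of commas into one short pause.
    text = re.sub(r"[\s,]*,[\s,]*", ", ", text)
    # Squeeze whitespace and trim the ends.
    return re.sub(r"\s+", " ", text).strip()
```

Numbers such as "3.5" should probably be expanded to words before this step, because the sketch would treat the dot as a long pause.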

drlor2k commented 1 month ago

Thanks for your response @v-nhandt21, I tried https://github.com/thinhlpg/vixtts-demo, it's a great attempt but it lacks the necessary stability. I actually forgot that I could fine-tune it :v

drlor2k commented 1 month ago

Hello @v-nhandt21, I have some takeaways from VITS2 and XTTS. Could you give your opinion?

1. In terms of output sound quality, VITS may be better than XTTS.
2. XTTS is based on text-to-token conversion via a tokenizer, so it covers almost all words, including words from outside the training language. This lets it pronounce some common foreign words, as long as those words appear in the training data. In contrast, VITS depends on text-to-phoneme conversion, so foreign words almost always have no corresponding phonemes.
3. Based on number 2, intuitively we should remove audio that contains foreign pronunciation, because:

   - The audio contains the sounds of foreign words.
   - The phoneme sequence, on the other hand, is missing them, because the foreign word has been converted to /
   - This leads to inconsistencies between audio and text.

4. How do we deal with out-of-phoneme words in VITS?

   - A quick way is to convert the word into parts that VITS can pronounce, for example hello to hé lô. However, this is a superficial fix because it does not solve the root of the problem.
   - If you have experience with this, can you suggest a solution from the training side, that is, by making adjustments in the viphoneme package?
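The quick respelling workaround mentioned above could look roughly like this. It is only a sketch: the table `RESPELLINGS` and the function `respell_foreign_words` are hypothetical names, and only the "hello" entry comes from this thread. It would run on the text before phoneme conversion.

```python
import re
import unicodedata
from typing import Dict

# Hypothetical respelling table; only the "hello" entry comes from this thread.
RESPELLINGS: Dict[str, str] = {
    "hello": "hé lô",
}


def respell_foreign_words(text: str, table: Dict[str, str] = RESPELLINGS) -> str:
    """Replace known foreign words with a Vietnamese-syllable respelling."""
    # Use precomposed (NFC) characters so \w+ keeps accented words whole.
    text = unicodedata.normalize("NFC", text)

    def replace(match):
        word = match.group(0)
        # Words without an entry are left unchanged.
        return table.get(word.lower(), word)

    return re.sub(r"\w+", replace, text)
```

For example, `respell_foreign_words("Hello world")` gives `"hé lô world"`; "world" is left as-is because it has no entry, which is exactly the coverage problem described above.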

Thank you if you take the time to respond!

v-nhandt21 commented 1 month ago

Hi @drlor2k ,

VITS is a single-language model and XTTS is multilingual, but I think a multilingual model cannot cover all phonemic norms in a practical product. Therefore, we still need a traditional method to handle out-of-vocabulary cases.
Besides the dictionary lookup method you mentioned in (4), you can try forced alignment, which uses a model to predict phonemes. Training it requires only a dictionary, and at inference time the model can predict phonemes for any out-of-vocabulary word: https://github.com/v-nhandt21/ViMFA
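Whichever method handles them, it helps to first list the words in a transcript that the dictionary does not cover. The sketch below assumes a pronunciation dictionary with one entry per line in the form "word phone phone ...", which is an assumption about your file layout; the function names and the phones in the usage example are hypothetical.

```python
import re
import unicodedata
from typing import Iterable, List, Set


def load_lexicon_words(lines: Iterable[str]) -> Set[str]:
    """Collect the words from 'word phone phone ...' dictionary lines."""
    words = set()
    for line in lines:
        parts = line.split()
        if parts:  # skip blank lines
            words.add(unicodedata.normalize("NFC", parts[0]).lower())
    return words


def find_oov_words(transcript: str, lexicon: Set[str]) -> List[str]:
    """Return words missing from the lexicon, in order of first appearance."""
    text = unicodedata.normalize("NFC", transcript).lower()
    oov = []
    for word in re.findall(r"\w+", text):
        if word not in lexicon and word not in oov:
            oov.append(word)
    return oov
```

With a lexicon built from `["xin s i n", "chao c a w"]`, calling `find_oov_words("Xin chao hello hello", lexicon)` returns `["hello"]`. Those words can then be respelled by hand or passed to a phoneme prediction model.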

drlor2k commented 1 month ago

Thank you for your response @v-nhandt21, I have another question, can you help me?

I see that some speech-to-speech repos use a very small validation set (2 recordings per voice). As I understand it, they want to overfit as much as possible to the voice that needs to be cloned.

With your repo, does the val dataset affect the training process?

If yes: suppose my training data is 200 hours, what is an appropriate size for the val dataset?
If no: maybe I should keep the val dataset quite small to make training faster. What do you think?

v-nhandt21 commented 1 month ago

My repo is a variant/adaptation of VITS for Vietnamese, so it is text-to-speech, and the validation set has no effect on training!
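Since the validation set is only used for monitoring here, a small fixed hold-out is probably enough. Below is a minimal sketch of such a split. It assumes a filelist with one utterance per line (for example "path|text"); that layout, the function name `split_filelist`, and the default of 100 validation lines are assumptions for illustration, not values taken from the repo.

```python
import random
from typing import List, Tuple


def split_filelist(
    lines: List[str], n_val: int = 100, seed: int = 1234
) -> Tuple[List[str], List[str]]:
    """Hold out a small, fixed number of utterances for validation."""
    if not 0 <= n_val < len(lines):
        raise ValueError("n_val must be >= 0 and smaller than the filelist")
    shuffled = list(lines)  # copy so the caller's list is not modified
    # A seeded generator makes the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    return shuffled[n_val:], shuffled[:n_val]
```

Usage would be `train, val = split_filelist(all_lines, n_val=100)`, after which each list is written to its own filelist file.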

Repos like https://github.com/svc-develop-team/so-vits-svc?tab=readme-ov-file are speech-to-speech: they learn speech characteristics from a small set, then inject these features to control the style of the output voice.

v-nhandt21 / ViSV2TTS

Question about: punctuation in script and voice mix data. #9