jaywalnut310 / vits

VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech
https://jaywalnut310.github.io/vits-demo/index.html
MIT License
6.91k stars · 1.27k forks

Problems with the pronunciation of one word. #139

Open · LanglyAdrian opened this issue 1 year ago

LanglyAdrian commented 1 year ago

I downloaded the dataset and started training with all the default parameters; I changed only the batch size, to 32. I reached 700k steps. The model pronounces long phrases well, but if the input is a single word, the result is terrible. I don't think it makes sense to continue training.

LanglyAdrian commented 1 year ago

@jaywalnut310, hi! Can you help me? I'm ready to pay! You created a great project and got great results, but I can't reproduce them. After reading this, I suspected that the problem is that I changed the batch size. Could you tell me exactly how to change the rest of the parameters if I change the batch size from 64 to 32?

NikitaKononov commented 1 year ago

> @jaywalnut310, hi! Can you help me? I'm ready to pay! […] Could you tell me exactly how to change the rest of the parameters if I change the batch size from 64 to 32?

Hi, what dataset do you use? And what problem exactly do you want to solve? Short-phrase pronunciation?

NikitaKononov commented 1 year ago

> @jaywalnut310, hi! Can you help me? I'm ready to pay! […] Could you tell me exactly how to change the rest of the parameters if I change the batch size from 64 to 32?

We can connect elsewhere to solve your issue faster, if it's still relevant.

LanglyAdrian commented 1 year ago

@NikitaKononov, hi!

I trained the model twice.

1) I downloaded this dataset, resampled the wav files to 22050 Hz, then deleted 80 entries (both the wav files and their filelist lines) that were larger than 500 KB, since no .spec.pt files were created for them. Because I did all this in order to train on my own dataset (about 35 minutes), I replaced one of the voices (ID 78) with mine; I chose 78 because it contained the same number of files as my dataset. I ran "python preprocess.py --text_index 1..." and noticed that some phrases (about 500) differ from those the author published. I decided to use the new ones (the ones I generated), because I didn't want any difference between the phoneme pipeline for my dataset and for everyone else's. Because I couldn't train with a batch size of 64 (I have a GeForce RTX 3090), I reduced it to 32. I got the following results: Phrase 1: "These wards were all fitted with barrack-beds, but no bedding was supplied." (my sample, the author's). Phrase 2: "capital" (my sample, the author's).

As you can see, everything is fine with a long phrase, but with a short one…

I thought the problem might be that I used my own dataset instead of voice 78, but when I found this, I realized the problem is far from unique to me, so my dataset has nothing to do with it. Then I thought the problem might be the silence present in many of the files, and I started looking for datasets elsewhere.

2) I found this dataset and noticed that there seem to be no problems with silence. I dropped ID 315 and ID 362 because their wav files and filelists didn't match, and put in my own dataset and s5 (whose voice was in the dataset from the new source) instead. I converted the files from flac to wav and resampled them to 22050 Hz. Then I again deleted the files larger than 500 KB, created new filelists leaving 500 files for test and 100 for val, as in the original, and started training (again with a batch size of 32). In the end I got results similar to the first run.

If you're more comfortable elsewhere, we could continue on Facebook, Twitter, or email. In the end, though, I'd like to post the solution here to help others.
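For reference, the preprocessing described in these two runs boils down to roughly the sketch below. Directory and filelist names are invented for illustration, and librosa/soundfile stand in for whatever tools were actually used:

```python
import os
import random
import librosa
import soundfile as sf

# Convert/resample everything to 22050 Hz wav and drop files over 500 KB,
# as described above. Paths here are assumptions.
SRC_DIR = "datasets/my_vctk/wavs"
TARGET_SR = 22050
MAX_BYTES = 500 * 1024

for name in sorted(os.listdir(SRC_DIR)):
    if not name.endswith((".wav", ".flac")):
        continue
    path = os.path.join(SRC_DIR, name)
    if os.path.getsize(path) > MAX_BYTES:
        os.remove(path)          # drop oversized files entirely
        continue                 # (remember to drop their filelist lines too)
    audio, _ = librosa.load(path, sr=TARGET_SR)  # resamples on load; reads flac too
    wav_path = os.path.splitext(path)[0] + ".wav"
    sf.write(wav_path, audio, TARGET_SR)
    if path != wav_path:
        os.remove(path)          # remove the original flac

# Rebuild filelists with the same 500-test / 100-val split as the original,
# assuming a combined path|speaker_id|text filelist already exists.
with open("filelists/all_sid_text.txt", encoding="utf-8") as f:
    lines = f.read().splitlines()
random.seed(1234)                # fixed seed keeps the split reproducible
random.shuffle(lines)
for split, part in [("test", lines[:500]), ("val", lines[500:600]),
                    ("train", lines[600:])]:
    with open(f"filelists/my_sid_text_{split}_filelist.txt", "w",
              encoding="utf-8") as f:
        f.write("\n".join(part) + "\n")
```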

LanglyAdrian commented 1 year ago

@NikitaKononov, by the way, if you are from Russia, we could communicate in Russian. My English is very bad.

NikitaKononov commented 1 year ago

> @NikitaKononov, by the way, if you are from Russia, we could communicate in Russian. My English is very bad.

Telegram: @drakononov, e-mail: kononoff.174@yandex.kz

NikitaKononov commented 1 year ago

> @NikitaKononov, hi! I trained the model twice. 1) I downloaded this dataset, resampled the wav files to 22050 Hz […] 2) I found this dataset […] In the end I got results similar to the first run.

I use an RTX 3090 right now; it can handle batch size 64 with AMP ("fp16_run": true in the config). If you decrease the batch size from 64 to 32, you should halve the learning rate, from 2e-4 to 1e-4, and scale it up the same way when increasing. For GANs this is important.
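For anyone following along, a minimal sketch of that change, assuming the stock configs/vctk_base.json layout (batch_size, learning_rate, and fp16_run live under the "train" section; the output file name is made up):

```python
import json

# Scale batch size and learning rate together, as suggested above.
with open("configs/vctk_base.json") as f:
    cfg = json.load(f)

cfg["train"]["batch_size"] = 32        # was 64
cfg["train"]["learning_rate"] = 1e-4   # was 2e-4; halved with the batch size
cfg["train"]["fp16_run"] = True        # AMP; this is how a 3090 fits batch 64

with open("configs/vctk_base_bs32.json", "w") as f:
    json.dump(cfg, f, indent=2)
```

Training would then be launched as usual, e.g. `python train_ms.py -c configs/vctk_base_bs32.json -m vctk_bs32`.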

VCTK uses a "magic" sentence sequence that tries to maximize the phonetic coverage of each speaker, so 35 minutes of ordinary data may not be enough. Could you share a couple of examples from your dataset, so I can evaluate your data quality (slicing, transcription)?
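A rough way to check phonetic coverage before training is to phonemize the transcripts and count distinct symbols per speaker. A minimal sketch, assuming the repo's text.cleaners module (english_cleaners2 is its phonemizer-based cleaner) and a path|speaker_id|text filelist:

```python
from collections import Counter

# Phonemize every transcript with the cleaner the repo itself uses and
# count distinct phoneme symbols per speaker. The filelist path and its
# path|speaker_id|text layout are assumptions.
from text import cleaners

coverage = {}
with open("filelists/vctk_audio_sid_text_train_filelist.txt",
          encoding="utf-8") as f:
    for line in f:
        _, sid, text = line.rstrip("\n").split("|", maxsplit=2)
        phonemes = cleaners.english_cleaners2(text)
        coverage.setdefault(sid, Counter()).update(phonemes)

for sid, counts in sorted(coverage.items()):
    print(f"speaker {sid}: {len(counts)} distinct phoneme symbols")
```

A speaker whose count is far below the others is a likely candidate for the coverage problem described above.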

NikitaKononov commented 1 year ago

In your samples I can clearly hear the data hunger typical of VITS; or it could be a symptom of poor data markup quality, or both.

The learning rate matters too, of course.

JohnHerry commented 1 year ago

If it is caused by data hunger, how much data is needed for each speaker if I build a multi-speaker instance?

LanglyAdrian commented 1 year ago

@JohnHerry If your question concerns the problem I described, then it is not caused by a lack of data.

inventor617 commented 1 year ago

Hi @NikitaKononov! Is it possible to use a Russian dataset in the VCTK format?

nikich340 commented 1 year ago

Try adding short phrases to your dataset. If the model is trained to pronounce some phonemes only in connection with others, it can't produce a single word well.
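One quick sanity check (a sketch; the filelist path and its path|speaker_id|text layout are assumptions) is to bucket training utterances by word count:

```python
from collections import Counter

# If the "1-2 words" bucket is empty, the model never sees isolated
# words during training, matching the failure described in this issue.
buckets = Counter()
with open("filelists/vctk_audio_sid_text_train_filelist.txt",
          encoding="utf-8") as f:
    for line in f:
        text = line.rstrip("\n").split("|")[-1]
        n = len(text.split())
        if n <= 2:
            buckets["1-2 words"] += 1
        elif n <= 5:
            buckets["3-5 words"] += 1
        else:
            buckets["6+ words"] += 1

print(buckets)
```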

nikich340 commented 1 year ago

> If it is caused by data hunger, how much data is needed for each speaker if I build a multi-speaker instance?

About 2 hours is the minimum for a good result.