jaywalnut310 / vits

VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech
https://jaywalnut310.github.io/vits-demo/index.html
MIT License

Questions about 48k audio file train #81

Open H4ppyB1rd opened 2 years ago

H4ppyB1rd commented 2 years ago

My terminal showed strange output when I started single-speaker training on audio with a sampling rate of 48000 Hz. The previous round of single-speaker training on the same audio, resampled to the default 22050 Hz, finished with good results.

After I run train.py, the terminal throws this message:

(I guess this wasn't the crucial problem?)

Then this message:

... for dozens of rows.

Then this:

Then the training shows weird losses like:

Each of the first four loss terms is NaN. None of this happened with my previous 22050 Hz training, so I'm wondering why, and what I can do. (I've already changed the JSON file in /configs to the 48 kHz sampling rate.) My apologies in advance if my questions are too basic.
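For anyone else hitting NaN losses after raising the sampling rate: a plausible cause (an assumption on my part; the full config isn't shown in this thread) is that only `sampling_rate` was changed while the STFT parameters were left at their 22050 Hz defaults. A hypothetical `data` section for a 48 kHz config, with `filter_length`, `hop_length`, and `win_length` scaled up accordingly:

```json
{
  "data": {
    "sampling_rate": 48000,
    "filter_length": 2048,
    "hop_length": 512,
    "win_length": 2048,
    "n_mel_channels": 80,
    "mel_fmin": 0.0,
    "mel_fmax": null
  }
}
```

Note that if `hop_length` changes, the decoder's `upsample_rates` in the `model` section must be adjusted so their product equals the new `hop_length`; otherwise the generated waveform length won't match the target.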

nikich340 commented 2 years ago

You may try 44.1 kHz; it worked for me (set `sampling_rate = 44100` in config.json). Also make sure your audio is 1-channel, 16-bit WAV.
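The 1-channel, 16-bit requirement can be checked in bulk with the Python standard library; a small sketch (not part of the VITS repo):

```python
import wave

def check_wav(path):
    """Return (sample_rate, channels, sample_width_bytes) of a PCM WAV file."""
    with wave.open(path, "rb") as w:
        return w.getframerate(), w.getnchannels(), w.getsampwidth()

# A file matching the advice above should report e.g. (44100, 1, 2):
# 44.1 kHz, mono, 2 bytes per sample (16-bit).
```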

H4ppyB1rd commented 2 years ago

> You may try 44.1 kHz; it worked for me (set `sampling_rate = 44100` in config.json). Also make sure your audio is 1-channel, 16-bit WAV.

Works for me. Thx!

tuannvhust commented 2 years ago

@nikich340 does your speech synthesis give good results? My results are OK, but the speech quality is not great: there is still noise in it and some mispronunciation. Do you get the same problem?

nikich340 commented 2 years ago

> @nikich340 does your speech synthesis give good results? My results are OK, but the speech quality is not great: there is still noise in it and some mispronunciation. Do you get the same problem?

Rarely. I use a good dataset (16 hours); if you have less than 2 hours of speech lines, don't expect stable, good results.

Also, I edited the processing scripts so they accept straight IPA phoneme input (I used ng-espeak IPA preprocessing), in case you want the model to generate a specific word. Make sure your input is unified (I used the punctuation signs .,?! and ..) and free of other-language words and quote marks. Preprocessing should handle this, but check manually anyway.
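As an illustration of the "unified input" advice (a hypothetical cleaner, not the scripts nikich340 actually edited), one might restrict transcripts to the punctuation set .,?! and strip quote marks like this:

```python
import re

def unify_text(text: str) -> str:
    """Normalize a transcript to the punctuation set .,?! only."""
    text = re.sub(r'["\'«»\u201c\u201d]', "", text)  # drop quote marks
    text = re.sub(r"[;:\u2014\u2013]+", ",", text)   # other pauses -> comma
    text = re.sub(r"\s+", " ", text)                 # collapse whitespace
    text = re.sub(r"\s+([.,?!])", r"\1", text)       # no space before punctuation
    return text.strip()
```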

codexq123 commented 1 year ago

> @nikich340 does your speech synthesis give good results? My results are OK, but the speech quality is not great: there is still noise in it and some mispronunciation. Do you get the same problem?
>
> Rarely. I use a good dataset (16 hours); if you have less than 2 hours of speech lines, don't expect stable, good results.
>
> Also, I edited the processing scripts so they accept straight IPA phoneme input (I used ng-espeak IPA preprocessing), in case you want the model to generate a specific word. Make sure your input is unified (I used the punctuation signs .,?! and ..) and free of other-language words and quote marks. Preprocessing should handle this, but check manually anyway.

The 22050 Hz model produces band-limited speech (frequency content only up to about 11 kHz), which can be checked in Adobe Audition or on a mel spectrogram.

I wonder whether a 44100 Hz model can produce a wider frequency range, up to about 22 kHz? Thanks in advance.
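The bandwidth ceiling observed above is the Nyquist limit: a sampled signal cannot contain frequencies above half its sampling rate. A one-liner to make the arithmetic explicit:

```python
def nyquist_hz(sampling_rate: int) -> float:
    """Highest frequency representable at a given sampling rate."""
    return sampling_rate / 2

# 22050 Hz audio tops out at 11025 Hz (the "under 11k" ceiling above),
# and 44100 Hz audio can reach 22050 Hz.
```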

[attached spectrogram: low-frequency]

athenasaurav commented 1 year ago

Hello @nikich340

I'm trying to train on 8000 Hz audio with 2 hours of data, and I changed the sampling rate in the config file before training, but the generated audio sounds like mumbling, not proper speech.

Here is a sample of the original recording:

And the generated audio sounds like this:

Can you suggest what is wrong with it?
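One thing worth ruling out for the 8000 Hz run (an assumption; the config isn't shown in the thread): in VITS the HiFi-GAN-style decoder upsamples each spectrogram frame into `hop_length` waveform samples, so the product of `upsample_rates` must equal `hop_length`. A quick sanity check:

```python
import math

def upsampling_matches_hop(hop_length: int, upsample_rates) -> bool:
    """True if the decoder's total upsampling factor equals hop_length."""
    return math.prod(upsample_rates) == hop_length

# The stock 22050 Hz config uses hop_length=256 with rates [8, 8, 2, 2];
# if hop_length is changed for a new sampling rate, the rates must change too.
```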