Questions about 48k audio file train

H4ppyB1rd commented 2 years ago

My terminal showed strange output when I started single-speaker training on audio with a sampling rate of 48000 Hz. My previous single-speaker run, on the same audio resampled to the default 22050 Hz, finished with fine results.

After I run train.py, the terminal throws this message:

warning: audio amplitude out of range, auto clipped.

(I guess this wasn't the crucial problem?)

Then this message:

max value is tensor(33528.1016)
min value is tensor(-17584.6523)
max value is tensor(25380.4434)
min value is tensor(-38273.9297)
max value is tensor(50959.3125)
min value is tensor(-37103.1211)
max value is tensor(37702.8320)
min value is tensor(-33512.7734)

... for dozens of rows.

Then this:

[INFO] ====> Epoch: 1
/root/.local/lib/python3.7/site-packages/torch/optim/lr_scheduler.py:123: UserWarning: Detected call of lr_scheduler.step() before optimizer.step(). In PyTorch 1.1.0 and later, you should call them in the opposite order: optimizer.step() before lr_scheduler.step(). Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
"https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)

Then the training shows weird losses like:

[INFO] Train Epoch: 21 [0%]
[INFO] [nan, nan, nan, nan, 2.0102107524871826, 208.6475067138672, 1600, 0.00019950059330492385]

Each of the first four elements is nan. None of the above happened with my previous 22050 Hz training, so I'm wondering why and what I can do. (I've already changed the sampling rate to 48k in the JSON file under /configs.) My apologies in advance if my questions are too basic.
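One possible reading of the "max value" / "min value" lines above, offered as an assumption and not confirmed anywhere in this thread: if the loader divides raw integer samples by a 16-bit constant (here a hypothetical MAX_WAV_VALUE of 32768.0), then 16-bit audio lands in [-1, 1], while wider samples end up far outside that range. The sketch below only works through that arithmetic.

```python
# Assumed scaling constant for 16-bit audio; the name is illustrative,
# not taken from the thread.
MAX_WAV_VALUE = 32768.0


def normalized_peak(bits_per_sample):
    """Largest magnitude a full-scale integer sample reaches after scaling."""
    full_scale = 2 ** (bits_per_sample - 1)
    return full_scale / MAX_WAV_VALUE


print(normalized_peak(16))  # 1.0
print(normalized_peak(32))  # 65536.0
```

The logged peaks (for example 50959.3125) sit between these two values, which would be consistent with 32-bit samples, but checking the actual files is the only way to know.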

nikich340 commented 2 years ago

You may try 44.1 kHz; it worked for me (set sampling_rate = 44100 in config.json). Also make sure your audio is 1-channel, 16-bit WAV.
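A minimal sketch of how the format advice above could be checked before training, using only Python's standard wave module. The helper names describe_wav and is_training_ready are made up for illustration, and the 44100 default simply mirrors the value suggested in this comment.

```python
import wave


def describe_wav(path):
    """Return (channels, bits_per_sample, sample_rate) for a PCM WAV file.

    wave.open raises wave.Error for formats it cannot read (for example
    32-bit float WAV), which is itself a hint that the file needs converting.
    """
    with wave.open(path, "rb") as wf:
        return wf.getnchannels(), wf.getsampwidth() * 8, wf.getframerate()


def is_training_ready(path, expected_rate=44100):
    """True if the file is mono, 16-bit, and at the expected sampling rate."""
    channels, bits, rate = describe_wav(path)
    return channels == 1 and bits == 16 and rate == expected_rate
```

Running is_training_ready over every file in the dataset and listing the ones that fail is usually enough to spot stereo or non-16-bit recordings.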

H4ppyB1rd commented 2 years ago

Works for me. Thx!

tuannvhust commented 2 years ago

@nikich340 does your speech synthesis give good results? My result is OK, but the speech quality is not great: there is still noise in it and some mispronunciation. Do you get the same problem?

nikich340 commented 2 years ago

Rarely. I use a good dataset (16 hours). If you have less than 2 hours of speech lines, don't expect stable, good results.

Also, I edited the processing scripts so that they accept IPA phonemes directly as input (I used espeak-ng IPA preprocessing), in case you want the model to generate some specific word. Make sure your input is unified (I used the punctuation signs .,?! and ..) and free of foreign-language words and quotes. Preprocessing should do this, but check manually anyway.
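The text clean-up described above might look roughly like the sketch below. This is not the edited processing script from this comment, which is not shown in the thread; normalize_line and its rules (dropping double quotes, mapping ellipses to "..", turning ; and : into commas) are assumptions chosen to end up with the punctuation set .,?! and "..".

```python
import re

# Double-quote style characters to drop; apostrophes are kept so that
# contractions such as "don't" survive.
QUOTE_CHARS = '"“”„«»'


def normalize_line(text):
    """Hypothetical transcript normalizer: unify punctuation, strip quotes."""
    text = text.translate({ord(c): None for c in QUOTE_CHARS})
    text = text.replace("…", "..")        # single-character ellipsis
    text = re.sub(r"\.{3,}", "..", text)  # three or more dots
    text = re.sub(r"[;:]", ",", text)     # fold ; and : into commas
    return re.sub(r"\s+", " ", text).strip()


print(normalize_line("He said: “wait…”  now!"))  # He said, wait.. now!
```

Foreign-language words cannot be caught by rules like these, so the manual check mentioned above still applies.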

codexq123 commented 1 year ago

The 22050 Hz model produces low-quality speech (frequency range under 11 kHz), which can be checked in Adobe Audition or on a mel spectrogram.

I wonder whether the 44100 Hz model can produce a wider frequency range, up to about 22 kHz? Thanks in advance.
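The two figures in this question follow from the Nyquist limit: audio sampled at a given rate can only represent frequencies up to half that rate. A small sketch of the arithmetic (nyquist_hz is an illustrative name):

```python
def nyquist_hz(sampling_rate_hz):
    """Highest frequency representable at the given sampling rate."""
    return sampling_rate_hz / 2


print(nyquist_hz(22050))  # 11025.0
print(nyquist_hz(44100))  # 22050.0
```

So a 44100 Hz model has room for content up to about 22 kHz. Whether it actually generates energy that high is a separate question that depends on the training audio and the spectrogram settings, which this thread does not cover.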

athenasaurav commented 1 year ago

Hello @nikich340

I'm trying to train at 8000 Hz with 2 hours of data. I changed the sampling rate in the config file before training, but the generated audio sounds like mumbling rather than proper speech.

Here is a sample of the original recording.

And the generated audio sounds like this.

Can you suggest what is wrong with it?

jaywalnut310 / vits

Questions about 48k audio file train #81