jxzhanggg / nonparaSeq2seqVC_code

Implementation code of non-parallel sequence-to-sequence VC
MIT License

Pre-train model results #35

Closed · ivancarapinha closed this issue 4 years ago

ivancarapinha commented 4 years ago

Hello,

I trained the pre-train model with the following specs:

I obtained intelligible but poor results in terms of voice conversion and overall quality. I also noticed that the generated VC speech seems slower than the original source utterances. In addition, many of the generated samples (typically 2-4 seconds of speech) contain large sections of silence, sometimes more than 20 seconds. I have attached some of the samples (200k training steps) and the source utterances below. What could explain these problems? samples_checkpoint_200000.zip

Additionally, I would like to ask if the following issues could be some of the reasons for these bad results:

Thank you very much

jxzhanggg commented 4 years ago

Hi,

Regarding the mean_std file: theoretically it should be estimated using only the training data. However, I don't think using all of the data will lead to bad results. As for the speaker embedding, it seems good enough. I suppose the reason is that the learning rate decays too fast, so the model doesn't get well trained. You can try keeping the learning rate at 0.001 for the first 70 epochs and then decaying it. For the starting pause, you can trim the beginning/ending silence when preparing the training data.
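A rough sketch of the schedule I mean (the model, optimizer, and decay factor below are placeholders for illustration, not the actual hparams in this repo):

```python
import torch

# Placeholder model/optimizer, just to illustrate the schedule.
model = torch.nn.Linear(80, 80)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def lr_lambda(epoch):
    # Hold the learning rate at 1e-3 for the first 70 epochs,
    # then decay it exponentially (0.95 per epoch is an assumed factor).
    return 1.0 if epoch < 70 else 0.95 ** (epoch - 70)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for epoch in range(200):
    # ... run one training epoch here ...
    scheduler.step()
```

And for trimming leading/trailing silence when preparing the data, something like the following (the file name and the top_db threshold are assumptions you may need to adjust):

```python
import librosa
import soundfile as sf

wav, sr = librosa.load("p225_001.wav", sr=16000)    # hypothetical input file
trimmed, _ = librosa.effects.trim(wav, top_db=25)   # drop leading/trailing silence
sf.write("p225_001_trimmed.wav", trimmed, sr)
```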

ivancarapinha commented 4 years ago

Hello again @jxzhanggg, Thank you for your reply. I followed your suggestions, and although I noticed a subtle improvement in intelligibility and alignment (there are fewer silences now and the speaking rate seems a bit more natural), the voice quality did not seem to change at all, as you can verify in these samples from checkpoints 56k and 98k. VC_samples.zip

Do you think the learning rate variation is the issue here? By the way, I am getting warnings when I run the program, but all of them are deprecation warnings related to the versions of PyTorch, TensorFlow, and NumPy, so I don't think they are problematic. I also checked the mel-spectrograms generated at the inference stage and they seem fine, so I really don't know why the voice conversion task performs poorly with the code I run.

Could you please specify exactly what steps you took during pre-training to achieve your results? Thank you

ivancarapinha commented 4 years ago

Hello (once again) :)

I think I discovered the problem with the generated .wav files. It turns out that librosa.load automatically resamples the .wav file to 22.05 kHz (its default), while in inference.py, at these lines: https://github.com/jxzhanggg/nonparaSeq2seqVC_code/blob/e2fe19592b8c3a8189b609f890f1c8870b1ca0ed/pre-train/inference.py#L88

https://github.com/jxzhanggg/nonparaSeq2seqVC_code/blob/e2fe19592b8c3a8189b609f890f1c8870b1ca0ed/pre-train/inference.py#L105

the sampling rate is defined as 16 kHz. This was causing severe distortion in the generated audio files, so we should use sr=22050 in this case. I suggest updating this piece of code, as it could save time and stress for other users who might run into the same problem.
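To illustrate the mismatch with a minimal, self-contained sketch (the file names and the soundfile calls are just for illustration, not the exact code in inference.py):

```python
import librosa
import soundfile as sf

path = "generated_sample.wav"    # hypothetical file name

# librosa.load resamples to 22.05 kHz by default.
wav, sr = librosa.load(path)     # sr is now 22050

# Writing those samples out with a hard-coded 16 kHz rate plays them back
# ~1.38x slower (22050 / 16000), which matches the "slower than the source"
# symptom described above.
sf.write("mismatched.wav", wav, 16000)

# Fix: keep the load and write rates consistent, e.g. write at the rate
# librosa actually returned...
sf.write("consistent_22k.wav", wav, sr)

# ...or load at the rate the rest of the pipeline expects.
wav_16k, sr_16k = librosa.load(path, sr=16000)
sf.write("consistent_16k.wav", wav_16k, sr_16k)
```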

Cheers