980202006 opened this issue 2 years ago
Can the batch size be set relatively large, e.g. 64?
Hello, I have successfully run your code and trained it with my own data (288 singers/speakers). At what loss does the model converge? I also noticed that the style loss is increasing, from 0.006 to 0.029 so far. Is this normal? Also, is there an absolute way to represent human timbre (not pitch-related)? Is it the ratio of formants (I found that other phonemes also affect formant heights), or is it bandwidth?
Hello, you applied singing data to our model. We trained the model up to 100K steps and judged quality by directly listening to the generated voice rather than by the loss. (Since adversarial training is used, loss values may differ between training runs.)
Finding an absolute way to express human timbre seems to be a very difficult task. There appear to be few studies that individually extract specific voice elements (timbre, tone, pitch); as I understand it, current research mainly focuses on extracting overall voice characteristics. I therefore think that research on representing timbre (and other individual elements) will be very important in the future.
We did not use many GPU resources because we were experimenting in an academic environment, so the effect of a large batch size still needs to be tested. We would appreciate it if you let us know if you get good results.
Thank you.
Thank you. I have completed 10,000 steps here, but the results are not very good: the voice's gender can be converted, but the loss of timbre is severe. The results are as follows; I think it may be an effect of the batch size. I am retraining now and expect results next week. https://drive.google.com/drive/folders/1npMHnkqfbfm13zqas9uc1RWQvW8fY8rm
@980202006 since your particular use case involves singing, how did you handle the text labels? My case doesn't involve singing, only speaking, and my dataset doesn't have text labels; they could be created, but that's a tedious process.
So I'm wondering whether it's possible to skip text label creation here.
Thank you. I did not introduce any text labels and used the original wav2vec2 model directly, which may also be one of the reasons for the quality loss. In addition, in my other research I found that a wav2vec2 model trained on speech data can indeed be applied to singing (e.g. for song language recognition), although again with some loss of accuracy.
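For context, using wav2vec2 directly here means extracting frame-level content features from raw audio with a pretrained checkpoint instead of text labels. A minimal sketch with the Hugging Face transformers API (the checkpoint name and file path are assumptions for illustration; the thread does not say which model was used):

import torch
import torchaudio
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# Pretrained checkpoint assumed for illustration only.
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h").eval()

wav, sr = torchaudio.load("sample.wav")  # mono waveform, shape (1, T)
if sr != 16000:
    # wav2vec2 expects 16 kHz input
    wav = torchaudio.functional.resample(wav, sr, 16000)

inputs = extractor(wav.squeeze(0).numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    features = model(**inputs).last_hidden_state  # shape: (1, n_frames, 768)

These frame-level features can then stand in for text-derived content labels, which is presumably why no transcription step was needed.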
@980202006 how did you manage to train on multiple GPUs? It doesn't work for me out of the box.
File "/home/sk/anaconda3/envs/vc/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1177, in __getattr__
raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'DistributedDataParallel' object has no attribute 'inference'
Also, what batch size per GPU was the most effective for you?
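For anyone hitting the same AttributeError: DistributedDataParallel only proxies forward(), so custom methods such as inference have to be called on the wrapped network through .module. A minimal sketch of that workaround (not the repo's actual training code; only the inference method name is taken from the traceback above):

from torch.nn.parallel import DistributedDataParallel as DDP

def call_inference(model, *args, **kwargs):
    # DDP forwards only __call__/forward; custom methods like `inference`
    # live on the underlying network, reachable via `.module`.
    net = model.module if isinstance(model, DDP) else model
    return net.inference(*args, **kwargs)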
Has anyone run into this situation? I cannot find a solution to it.
python prepare_dataset.py --in_dir data/VCTK/original/ --out_dir_name VCTK_16K --dataset VCTK
log directory! -----> StyleVC_VCTK
seen speakers! 87
unseen speakers! 20
start preprocessing
Traceback (most recent call last):
  File "/root/StyleVC/prepare_dataset.py", line 237, in <module>
    main()
  File "/root/StyleVC/prepare_dataset.py", line 202, in main
    from text.text_English.cleaners import english_cleaners
ModuleNotFoundError: No module named 'text'
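For reference, this error usually means the repo's text package is not on Python's import path, typically because the script was not launched from the repo root. A minimal workaround sketch (the /root/StyleVC path is taken from the traceback; whether the text/ package actually sits there is an assumption):

import sys

# Make the repo root importable so that
# `from text.text_English.cleaners import english_cleaners` resolves;
# add near the top of prepare_dataset.py.
sys.path.insert(0, "/root/StyleVC")

Alternatively, running the script from the repo root, or setting PYTHONPATH to it, should achieve the same thing.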