980202006 opened this issue 2 years ago
Can the batch size be set relatively large, e.g. 64?
Hello, I have successfully run your code and trained it with my own data (288 singers/speakers). At what loss does the model converge? I also noticed that the style loss is increasing, from 0.006 to 0.029 so far. Is this normal? Also, is there an absolute way to represent human timbre (not pitch-related)? Is it the ratio of formants (I found that other phonemes also affect formant heights), or is it bandwidth?
Hello, you applied singing data to our model. We trained the model up to 100K steps and judged quality by directly listening to the generated voice rather than by the loss. (Since adversarial training is used, loss values may differ between training runs.)
Finding an absolute way to express human timbre seems to be a very difficult task. There appear to be few studies that individually extract specific voice elements (timbre, tone, pitch); as I understand it, current research mainly focuses on extracting overall voice characteristics. I therefore think that research on representing timbre (and other individual elements) will be very important in the future.
We did not use many GPU resources because we were experimenting in an academic environment, so the effect of a large batch size still needs to be tested. We would appreciate it if you let us know if you get good results.
Thank you.
Thank you. I have completed 10,000 steps here, but the results are not very good: the voice's gender can be converted, but the loss of timbre is severe. The results are as follows; I think it may be an effect of the batch size. I am retraining now and expect results next week. https://drive.google.com/drive/folders/1npMHnkqfbfm13zqas9uc1RWQvW8fY8rm
@980202006 since your particular use case involves singing, how did you handle the text labels? My case doesn't involve singing, only speaking, and my dataset doesn't have text labels; they could be created, but that's a tedious process.
So I'm wondering whether it's possible to skip text label creation here.
Thank you. I did not introduce any text labels and used the original wav2vec2 model directly, which may also be one of the reasons for the quality loss. In addition, in my other research I found that a wav2vec2 model trained on speech data can indeed be applied to singing (e.g. for song language recognition), although again with some loss of accuracy.
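For context, using wav2vec2 directly here means extracting frame-level content features from raw audio with a pretrained checkpoint instead of text labels. A minimal sketch with the Hugging Face transformers API (the checkpoint name and file path are assumptions for illustration; the thread does not say which model was used):

import torch
import torchaudio
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# Pretrained checkpoint assumed for illustration only.
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h").eval()

wav, sr = torchaudio.load("sample.wav")  # mono waveform, shape (1, T)
if sr != 16000:
    # wav2vec2 expects 16 kHz input
    wav = torchaudio.functional.resample(wav, sr, 16000)

inputs = extractor(wav.squeeze(0).numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    features = model(**inputs).last_hidden_state  # shape: (1, n_frames, 768)

These frame-level features can then stand in for text-derived content labels, which is presumably why no transcription step was needed.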
@980202006 how did you manage to train on multiple GPUs? It doesn't work for me out of the box.
File "/home/sk/anaconda3/envs/vc/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1177, in __getattr__
raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'DistributedDataParallel' object has no attribute 'inference'
Also, what batch size per GPU was the most effective for you?
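For anyone hitting the same AttributeError: DistributedDataParallel only proxies forward(), so custom methods such as inference have to be called on the wrapped network through .module. A minimal sketch of that workaround (not the repo's actual training code; only the inference method name is taken from the traceback above):

from torch.nn.parallel import DistributedDataParallel as DDP

def call_inference(model, *args, **kwargs):
    # DDP forwards only __call__/forward; custom methods like `inference`
    # live on the underlying network, reachable via `.module`.
    net = model.module if isinstance(model, DDP) else model
    return net.inference(*args, **kwargs)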
Has anyone run into this situation? I cannot find a solution to it.
python prepare_dataset.py --in_dir data/VCTK/original/ --out_dir_name VCTK_16K --dataset VCTK
log directory! -----> StyleVC_VCTK
seen speakers! 87
unseen speakers! 20
start preprocessing
Traceback (most recent call last):
  File "/root/StyleVC/prepare_dataset.py", line 237, in <module>
    main()
  File "/root/StyleVC/prepare_dataset.py", line 202, in main
    from text.text_English.cleaners import english_cleaners
ModuleNotFoundError: No module named 'text'
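For reference, this error usually means the repo's text package is not on Python's import path, typically because the script was not launched from the repo root. A minimal workaround sketch (the /root/StyleVC path is taken from the traceback; whether the text/ package actually sits there is an assumption):

import sys

# Make the repo root importable so that
# `from text.text_English.cleaners import english_cleaners` resolves;
# add near the top of prepare_dataset.py.
sys.path.insert(0, "/root/StyleVC")

Alternatively, running the script from the repo root, or setting PYTHONPATH to it, should achieve the same thing.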