Hi @c469591 !
"finetune"
feature means finetuning from another checkpoint by another experiment, not the pretrained vocoder/contentvec model file. For example if you/other people have trained a VitsSVC model well, and you want to use this model as a base to train another model on different singers, then you can set --resume_from_ckpt_path
to something like "[AnotherExperiment]/checkpoint/[SpecificCheckpoint]"
You also need to ensure that the `model` field in `exp_config.json` is the same as that of the checkpoint you want to use.
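For instance, a finetune launch might look like the sketch below. This is a hypothetical example: the `--stage`/`--gpu` flags mirror the inference command later in this thread, only `--resume_from_ckpt_path` is quoted from the explanation above, and the bracketed paths are placeholders; check your `run.sh` for the exact flag set.

```bash
# Hypothetical sketch of finetuning from another experiment's checkpoint.
# [AnotherExperiment] and [SpecificCheckpoint] are placeholders.
sh egs/svc/VitsSVC/run.sh --stage 2 --gpu "0" \
    --resume_from_ckpt_path "[AnotherExperiment]/checkpoint/[SpecificCheckpoint]"
```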
The `Monotonic align not found` problem is not related to the VitsSVC task: VITS is originally a text-to-speech model, so it uses monotonic alignment to align text-distribution features with audio-distribution features. I'll clarify this message along with the resume-training feature development.

Hello, thank you for your reply and explanation, but I still have some questions.
So what you're saying is that there are currently only three training methods available: starting from scratch, continuing training, and fine-tuning? We can't train using a pre-trained model like we did with svc4.1, is that correct? I'm not very clear on the difference between fine-tuning a model and training from a pre-trained model; I thought they were the same thing. I have also used svc5, training from scratch without a pre-trained model, and even though I spent several times longer, the model I trained still performed poorly. So I've always felt that training from scratch would require a significant amount of time to achieve the same results as training with a pre-trained model. Is the current project like this as well? Additionally, how much audio data would be required for good results if I want to train my own voice?
Thank you for your question!
**Difference between fine-tuning / using a pre-trained model.** Yes, they are actually the same thing.
**Can Amphion use a pre-trained model like svc4.1?** For VitsSVC we haven't trained a pre-trained model yet, and since the codebase and architecture are totally different, we cannot convert a pre-trained svc4.1/svc5 model to ours. We have provided a pretrained checkpoint for our MultipleContentsSVC model. You can use it as an experiment directory and run direct inference for the famous singers listed on the website, or finetune a new model.
**Will training from scratch make performance poor?** I personally don't think it hurts much as long as you have high-quality data in your dataset. Here are some audio samples I collected by training VitsSVC from scratch, compared to finetuning the pretrained so-vits-4.1 model:
| Model | Description | test 1 | test 2 |
|---|---|---|---|
| Original Sample | / | 1_gt | 2_gt |
| so-vits-4.1 | whisper feature, nsf-hifigan, pretrained on the VCTK and M4Singer datasets, shallow diffusion, 330k pretrain steps, 110k finetune steps | 1_sovits4.1 | 2_sovits4.1 |
| vitssvc | contentVec feature, hifigan, trained from scratch for 200k steps | 1_vitssvc | 2_vitssvc |
**How much audio data is sufficient?** This is a very good research problem. Many factors, such as the diversity of the data distribution, the recording quality, and the speaker characteristics, can all have an impact. In our opinion, 15-20 minutes under good recording conditions may produce a decent result. As evidence, you can play with our huggingface demo, whose training data can be seen in this table.
Hello, the demo you provided sounds very good, but I think most users probably don't have access to several hours of high-quality data. The RVC, SVC 4.1, and SVC 5 datasets I've used were roughly only 15 to 30 minutes long, and the results I trained couldn't achieve the naturalness of your demo.

Additionally, I just tried the instructions you provided at https://huggingface.co/amphion/singing_voice_conversion. I have already followed the preliminary steps, but an error occurred in the last part. The log is quite long, so I've saved it to this file → output.log. The command I used was as follows:
```bash
egs/svc/MultipleContentsSVC/run.sh --stage 3 --gpu "0" \
    --config "ckpts/svc/vocalist_l1_contentvec+whisper/args.json" \
    --infer_expt_dir "ckpts/svc/vocalist_l1_contentvec+whisper" \
    --infer_output_dir "ckpts/svc/vocalist_l1_contentvec+whisper/result" \
    --infer_source_audio_dir /opt/input/ \
    --infer_vocoder_dir "pretrained/bigvgan_singing" \
    --infer_target_speaker "vocalist_l1_TaylorSwift" \
    --infer_key_shift "autoshift"
```
Please tell me how to resolve this error. Additionally, could you tell me where I can download the singer data you used to train your model? For instance, the dry vocal data of the several well-known Chinese-language singers. Thank you for your patient explanation!
Hi @c469591, is the whole duration of your training data just 15-30 min? If so, it is indeed difficult to learn a good model. You can utilize some academic datasets such as Opencpop, M4Singer, and OpenSinger as a supplement (download instruction).
```
Loading Whisper Model...
Traceback (most recent call last):
  File "/root/miniconda3/envs/amphion/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/root/miniconda3/envs/amphion/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/root/miniconda3/envs/amphion/lib/python3.9/site-packages/accelerate/commands/launch.py", line 994, in launch_command
    simple_launcher(args)
  File "/root/miniconda3/envs/amphion/lib/python3.9/site-packages/accelerate/commands/launch.py", line 636, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
```
It seems that your output.log is incomplete: there is no error from Amphion's internal code, only the external accelerate subprocess failure. I suspect there is an issue around "Loading Whisper Model...". A common cause is that the server tries to automatically download OpenAI's Whisper checkpoint into the cache but hits a network connection timeout.
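As a possible workaround (an assumption on my part, not a confirmed fix: openai-whisper's default cache directory is `~/.cache/whisper`), you could pre-place an already-downloaded checkpoint so nothing needs to be fetched at run time:

```bash
# Hypothetical workaround: copy a locally available Whisper checkpoint into the
# default cache (~/.cache/whisper) so whisper.load_model() skips the download.
# /path/to/medium.pt is a placeholder for wherever your checkpoint lives.
mkdir -p ~/.cache/whisper
cp /path/to/medium.pt ~/.cache/whisper/medium.pt
```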
Thank you for your response. So, am I right to infer that the Whisper model it used was also loaded from the pretrained folder? Since the Whisper model already exists at /root/Amphion/pretrained/whisper/medium.pt, it shouldn't need to be downloaded again. What else could have gone wrong?
Hi @c469591 !
Could you try again and capture the complete error message in your terminal? We need complete lines like `File "/Amphion/{some file}", line {any number}` to pinpoint where the problem is.
Hello, is there a good way to write the output into a file? I tried using tee, but the capture wasn't complete. I also used screen to launch the .sh scripts and write their output to files, which was also incomplete.
I'm still not sure why the output in your logs would be incomplete. If the terminal logs don't work well, would you mind providing a simple screenshot? Then we can work on it.
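One common culprit (a guess, since the full pipeline you used isn't shown): Python writes tracebacks to stderr, which a plain `tee` pipeline does not capture. Redirecting stderr into stdout first usually records everything:

```bash
# "2>&1" merges stderr (where Python tracebacks go) into stdout, so tee
# records both streams; [other flags] stands for the flags used earlier.
sh egs/svc/MultipleContentsSVC/run.sh --stage 3 --gpu "0" [other flags] 2>&1 | tee output.log
```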
Hi @c469591, if you have any further questions, feel free to re-open this issue. We are glad to follow up!
Hello, I tried to use the provided checkpoints to run inference with NS2, and it also reports `Monotonic align not found. Please make sure you have compiled it.` The quality of the reconstructed waveform is also very poor. Is there any relationship between the two?
Hi @CreepJoye, the monotonic_align module can be built with:

```bash
cd modules/monotonic_align
python setup.py build_ext --inplace
```

But this may not resolve NS2's inference quality, since this issue is about the SVC task.
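To verify the build succeeded (a hedged check; the exact artifact name varies with platform and Python version):

```bash
# After `build_ext --inplace`, a compiled extension (e.g. a *.so file on Linux)
# should sit next to the sources; if none appears, the build did not succeed.
ls modules/monotonic_align/
```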
Thank you for your reply. I also want to know why there is such a big difference between the quality of the waveform reconstructed from the reference audio downloaded from the demo page https://speechresearch.github.io/naturalspeech2/ and from the LJSpeech sample LJ001-0001.wav. Could it be due to the different quality of the reference audio?
Hi, is MultipleContentsSVC better than the following setup: so-vits-4.1 with whisper feature, nsf-hifigan, pretrained model on the VCTK and M4Singer datasets, shallow diffusion, 330k pretrain steps, 110k finetune steps?

Our team urgently needs a model that can surpass so-vits-4.1. We have very high requirements for voice-conversion quality and have tried many options, such as RVC, so-vits 4.1, so-vits 5, Kits AI, and Audimee, all of which have some problems with articulation or emotional flatness after conversion. If MultipleContentsSVC is the most advanced solution, we would like to try it.
Hello, I would like to run fine-tuning training for the model, but I ran into an error. I have already completed the preprocessing with this command: `egs/svc/VitsSVC/run.sh --stage 1`

Afterward, I intended to run fine-tuning training. Following the README.md, I uploaded the well-trained 400000.pt to the server and specified the model's absolute path in my command.

But in the command-line output, the following error appeared first: `Monotonic align not found. Please make sure you have compiled it.`

After that, a series of error messages followed. How should I resolve this? Thank you.