open-mmlab / Amphion

Amphion (/æmˈfaɪən/) is a toolkit for Audio, Music, and Speech Generation. Its purpose is to support reproducible research and help junior researchers and engineers get started in the field of audio, music, and speech generation research and development.
https://openhlt.github.io/amphion/
MIT License
7.61k stars 570 forks

Monotonic align not found. Please make sure you have compiled it. #91

Closed c469591 closed 7 months ago

c469591 commented 10 months ago

Hello, I would like to run fine-tuning training for the model, but I ran into an error. I have already completed the preprocessing with this command: egs/svc/VitsSVC/run.sh --stage 1

Afterward, I intend to run fine-tuning training with the following command. According to the README.md, I have also uploaded the pretrained 400000.pt to the server and specified the model's absolute path in my command, as follows:

sh egs/svc/VitsSVC/run.sh --stage 2 --name tingting \
    --resume true \
    --resume_from_ckpt_path "/root/Amphion/pretrained/bigvgan/400000.pt" \
    --resume_type "finetune"

But the command-line output first showed the following error: Monotonic align not found. Please make sure you have compiled it.
After that, a series of error messages followed. How should I resolve this? Thank you.

viewfinder-annn commented 10 months ago

Hi @c469591 !

  1. The "finetune" feature means fine-tuning from a checkpoint produced by another experiment, not from a pretrained vocoder/ContentVec model file. For example, if you (or someone else) have trained a VitsSVC model well and you want to use it as a base to train another model on different singers, you can set --resume_from_ckpt_path to something like "[AnotherExperiment]/checkpoint/[SpecificCheckpoint]" (see the sketch after this list). You need to ensure that the model field in exp_config.json matches the checkpoint you want to use.
  2. VitsSVC's resume-training feature is under development; you should be able to use it in about a week! In the meantime, I recommend training your model from scratch or trying MultipleContentsSVC first.
  3. The "Monotonic align not found" warning does not affect the VitsSVC task: VITS is originally a text-to-speech model, so it uses monotonic alignment to align text features with audio features, which SVC does not need. I'll revise this message along with the resume-training feature development.
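For reference, a minimal sketch of what such a fine-tune invocation could look like (this reuses the command from above; the bracket placeholders are yours to fill in with a real experiment name and checkpoint):

# Sketch: fine-tune from a checkpoint produced by another Amphion experiment,
# not from a standalone vocoder file such as pretrained/bigvgan/400000.pt.
sh egs/svc/VitsSVC/run.sh --stage 2 --name tingting \
    --resume true \
    --resume_from_ckpt_path "[AnotherExperiment]/checkpoint/[SpecificCheckpoint]" \
    --resume_type "finetune"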
c469591 commented 10 months ago

Hello, thank you for your reply and explanation, but I still have some questions.

So what you're saying is that currently there are only three training modes available: training from scratch, resuming training, and fine-tuning? We can't train from a pre-trained model like we did with svc4.1, is that correct? I'm not very clear on the difference between fine-tuning a model and training from a pre-trained model; I thought they were the same thing. I have also used svc5 and trained from scratch without pre-trained models, and even though I spent several times longer, the resulting model still performed poorly. So I've always felt that training from scratch requires far more time to reach the same quality as training from a pre-trained model. Is the current project the same? Additionally, how much audio data would be required to get good results if I want to train on my own voice?

viewfinder-annn commented 10 months ago

Thank you for your question!

Difference between fine-tuning / using a pre-trained model: Yes, they are actually the same thing.

Can Amphion use a pre-trained model like svc4.1? For VitsSVC we haven't trained a pre-trained model yet, and since the codebase and architecture are totally different, we cannot convert svc4.1/svc5 pre-trained models to ours. However, we have provided a pretrained checkpoint for our MultipleContentsSVC model. You can use it as an experiment directory and run inference directly with the famous singers listed on the website, or fine-tune a new model from it.
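For example, a minimal sketch of fetching that checkpoint from Hugging Face with git-lfs (the repository name matches the Hugging Face page linked later in this thread; the local target directory ckpts/svc is an assumption based on the inference commands below):

# Sketch: download the pretrained MultipleContentsSVC experiment directory.
# The ckpts/svc target path is an assumption, not a requirement.
git lfs install
git clone https://huggingface.co/amphion/singing_voice_conversion ckpts/svc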

Will training from scratch hurt performance? I personally don't think it will hurt much, as long as you have high-quality data. Here are some audio samples I collected by training VitsSVC from scratch, compared with fine-tuning a pretrained so-vits-4.1 model:

| Model | Description | test 1 | test 2 |
| --- | --- | --- | --- |
| Original Sample | / | 1_gt | 2_gt |
| so-vits-4.1 | whisper feature, nsf-hifigan, use pretrained model on VCTK + M4Singer dataset, shallow diffusion, pretrain 330k steps, finetune 110k steps | 1_sovits4.1 | 2_sovits4.1 |
| vitssvc | ContentVec feature, hifigan, train from scratch for 200k steps | 1_vitssvc | 2_vitssvc |

How much audio data is sufficient? This is a very good research problem. Many factors, such as the diversity of the data distribution, the recording quality, and the speaker characteristics, all have an impact. In our opinion, 15-20 minutes of good-quality recordings may produce a decent result. As evidence, you can play with our Hugging Face demo, whose training data can be seen in this table.

c469591 commented 10 months ago

Hello, the demo you provided sounds very good, but I think most users probably don't have access to several hours of high-quality data. The RVC, SVC 4.1, and SVC 5 datasets I've used are roughly only 15 to 30 minutes long, and the models I trained can't reach the naturalness of your demo. Additionally, I just tried the instructions you provided at https://huggingface.co/amphion/singing_voice_conversion. I followed the preliminary steps, but an error occurred in the last part. The log is quite long, so I've saved it to this file → output.log. The command I used was:

egs/svc/MultipleContentsSVC/run.sh --stage 3 --gpu "0" \
    --config "ckpts/svc/vocalist_l1_contentvec+whisper/args.json" \
    --infer_expt_dir "ckpts/svc/vocalist_l1_contentvec+whisper" \
    --infer_output_dir "ckpts/svc/vocalist_l1_contentvec+whisper/result" \
    --infer_source_audio_dir /opt/input/ \
    --infer_vocoder_dir "pretrained/bigvgan_singing" \
    --infer_target_speaker "vocalist_l1_TaylorSwift" \
    --infer_key_shift "autoshift"

Please tell me how to resolve this error. Additionally, could you tell me where I can download the singer data you used to train your model? For instance, the dry vocal data of several well-known Chinese-language singers. Thank you for your patient explanation!

RMSnow commented 10 months ago

Hi @c469591, is the whole duration of your training data just 15-30 minutes? If so, it is indeed difficult to learn a good model. You can utilize some academic datasets such as Opencpop, M4Singer, and OpenSinger as a supplement (download instruction).

Loading Whisper Model...
Traceback (most recent call last):
  File "/root/miniconda3/envs/amphion/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/root/miniconda3/envs/amphion/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/root/miniconda3/envs/amphion/lib/python3.9/site-packages/accelerate/commands/launch.py", line 994, in launch_command
    simple_launcher(args)
  File "/root/miniconda3/envs/amphion/lib/python3.9/site-packages/accelerate/commands/launch.py", line 636, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)

It seems that your output.log is incomplete (there is no error report from Amphion's internal code, only the external accelerate subprocess error). I suspect there is an issue around "Loading Whisper Model...". A common cause is that the server tries to automatically download OpenAI's Whisper checkpoint into the cache, and the network connection times out.
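If that is the cause, one workaround is to trigger the download manually on a machine with network access, for example via the openai-whisper package (a sketch; whether Amphion reads the checkpoint from this download_root in your setup is an assumption):

# Sketch: pre-fetch OpenAI's Whisper "medium" checkpoint into a local folder
# so later runs don't need network access (the download_root path is an assumption).
python -c "import whisper; whisper.load_model('medium', download_root='pretrained/whisper')"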

c469591 commented 10 months ago

Thank you for your response. So, am I right to infer that the Whisper model it uses is also loaded from the pretrained folder? Since the Whisper model already exists at /root/Amphion/pretrained/whisper/medium.pt on my machine, it shouldn't need to be downloaded again. What else could have gone wrong?

Adorable-Qin commented 10 months ago

> Thank you for your response. So, am I right to infer that the Whisper model it uses is also loaded from the pretrained folder? Since the Whisper model already exists at /root/Amphion/pretrained/whisper/medium.pt on my machine, it shouldn't need to be downloaded again. What else could have gone wrong?

Hi @c469591 !

Could you try again and capture a complete error message in your terminal? We need complete messages like File "/Amphion/{some file}", line {any number} to pinpoint where the problem is.

c469591 commented 10 months ago

Hello, is there a good way to write the output into a file? I tried using tee to capture it, but the result wasn't complete. I also used screen to launch the .sh scripts and write the output to files, which was also incomplete.

Adorable-Qin commented 10 months ago

> Hello, is there a good way to write the output into a file? I tried using tee to capture it, but the result wasn't complete. I also used screen to launch the .sh scripts and write the output to files, which was also incomplete.

I'm still not sure why the output in your logs would be incomplete. If the terminal logs don't work well, would you mind providing a simple screenshot? Then we can work on it.
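In the meantime, one common reason for incomplete logs is that Python tracebacks go to stderr while a plain tee only captures stdout; merging the two streams usually records everything (a sketch, reusing the inference command from earlier in this thread):

# 2>&1 merges stderr into stdout so tee writes both streams to the log file.
egs/svc/MultipleContentsSVC/run.sh --stage 3 ... 2>&1 | tee output.log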

RMSnow commented 7 months ago

Hi @c469591, if you have any further questions, feel free to re-open this issue. We are glad to follow up!

CreepJoye commented 6 months ago

Hello, I tried to use the provided checkpoints to run inference with NS2, but it also shows "Monotonic align not found. Please make sure you have compiled it." and the quality of the reconstructed waveform is very poor. Is there any relationship between them?

viewfinder-annn commented 6 months ago

> Hello, I tried to use the provided checkpoints to run inference with NS2, but it also shows "Monotonic align not found. Please make sure you have compiled it." and the quality of the reconstructed waveform is very poor. Is there any relationship between them?

Hi @CreepJoye! The monotonic_align module can be built by:

cd modules/monotonic_align
python setup.py build_ext --inplace

But this may not fix NS2's inference quality, since this issue thread is about the SVC task.
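If the build itself fails, the extension is, as far as I know, compiled with Cython, so a missing build dependency is a common cause; installing it first may help (a sketch, assuming a standard pip environment):

# Sketch: install the build dependencies, then compile the extension in place.
pip install cython numpy
cd modules/monotonic_align
python setup.py build_ext --inplace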

CreepJoye commented 6 months ago

> Hello, I tried to use the provided checkpoints to run inference with NS2, but it also shows "Monotonic align not found. Please make sure you have compiled it." and the quality of the reconstructed waveform is very poor. Is there any relationship between them?

> Hi @CreepJoye! The monotonic_align module can be built by:
>
> cd modules/monotonic_align
> python setup.py build_ext --inplace
>
> But this may not fix NS2's inference quality, since this issue thread is about the SVC task.

Thank you for your reply. I also want to know why there is such a big difference in quality between the waveform reconstructed from the reference audio downloaded from the demo page https://speechresearch.github.io/naturalspeech2/ and the one reconstructed from the LJSpeech sample LJ001-0001.wav. Maybe the reference audio quality differs?

ILG2021 commented 4 months ago

Hi, is MultipleContentsSVC better than the following solution: so-vits-4.1 whisper feature, nsf-hifigan, use pretrained model on VCTK + M4Singer dataset, shallow diffusion, pretrain 330k steps, finetune 110k steps?

Our team urgently needs a model that can surpass so-vits-4.1. We have very high requirements for voice conversion quality and have tried many options, such as rvc, sovits 4.1, sovits 5, kits ai, and audimee, all of which have some problems with articulation or emotional flatness after conversion. If MultipleContentsSVC is the most advanced solution, we would like to try it.

RMSnow commented 4 months ago

@ILG2021 Thanks for your attention! We have tried the multiple content features strategy on different base models, including diffusion models and SoVITS. Some results can be found here (Table 4a).