sh-lee-prml / HierSpeechpp

The official implementation of HierSpeech++
MIT License

Which VC model shall I use for crosslingual VC? #52

Open sekkit opened 1 month ago

sekkit commented 1 month ago

As the tutorial says:

--ckpt "logs/hierspeechpp_libritts460/hierspeechpp_lt460_ckpt.pth" \ LibriTTS-460

--ckpt "logs/hierspeechpp_libritts960/hierspeechpp_lt960_ckpt.pth" \ LibriTTS-960

--ckpt "logs/hierspeechpp_eng_kor/hierspeechpp_v1_ckpt.pth" \ Large_v1 epoch 60 (paper version)

--ckpt "logs/hierspeechpp_eng_kor/hierspeechpp_v1.1_ckpt.pth" \ Large_v1.1 epoch 200 (20. Nov. 2023)

    CUDA_VISIBLE_DEVICES=0 python3 inference_vc.py \
        --ckpt "logs/hierspeechpp_eng_kor/hierspeechpp_v1.1_ckpt.pth" \
        --output_dir "vc_results_eng_kor_v2" \
        --noise_scale_vc "0.333" \
        --noise_scale_ttv "0.333" \
        --denoise_ratio "0"

Which VC model is best for cross-lingual VC?

sh-lee-prml commented 1 month ago

hierspeechpp_v1.1_ckpt.pth would be great.

Additionally, you can control noise_scale_vc during voice conversion (0.333 is more robust; 0.667 is more diverse).

sekkit commented 1 month ago

Thanks. I also want to ask what the denoise ratio is for. When I set it larger than 0, it always throws a CUDA out-of-memory exception, even though I'm using a 4090 with 24 GB of VRAM.

sh-lee-prml commented 1 month ago

This might solve that issue:

        # If you run out of memory while denoising the prompt, try denoising it on the CPU before TTS.
        # We plan to replace this with a memory-efficient denoiser.
        if denoise_ratio == 0:  # renamed from "denoise", which shadowed the denoise() function called below
            audio = torch.cat([audio.cuda(), audio.cuda()], dim=0)
        else:
            with torch.no_grad():

                if ori_prompt_len > 80000:
                    # Denoise in 80,000-sample chunks to bound peak GPU memory
                    denoised_audio = []
                    for i in range((ori_prompt_len//80000)):
                        denoised_audio.append(denoise(audio.squeeze(0).cuda()[i*80000:(i+1)*80000], denoiser, hps_denoiser))

                    denoised_audio.append(denoise(audio.squeeze(0).cuda()[(i+1)*80000:], denoiser, hps_denoiser))
                    denoised_audio = torch.cat(denoised_audio, dim=1)
                else:
                    denoised_audio = denoise(audio.squeeze(0).cuda(), denoiser, hps_denoiser)

            audio = torch.cat([audio.cuda(), denoised_audio[:,:audio.shape[-1]]], dim=0)

        audio = audio[:,:ori_prompt_len]  # 20231108 We found that large padding decreases performance, so we remove the padding after denoising.
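The chunked loop above can be sketched in isolation. The following is a minimal, self-contained illustration of the 80,000-sample chunking pattern, not the repo's actual code: `fake_denoise` is a stand-in for the real `denoise(audio, denoiser, hps_denoiser)` call, everything runs on the CPU, and the empty-tail guard is an addition for the case where the prompt length is an exact multiple of the chunk size.

```python
import torch

CHUNK = 80000  # samples per chunk, matching the snippet above


def fake_denoise(x: torch.Tensor) -> torch.Tensor:
    # Stand-in for the repo's denoise(); here it just returns the
    # input reshaped to (1, T), like the real denoiser's output.
    return x.unsqueeze(0)


def chunked_denoise(audio: torch.Tensor) -> torch.Tensor:
    # audio: (1, T) waveform; process it in CHUNK-sized pieces so that
    # only one chunk is ever resident in the denoiser at a time.
    wav = audio.squeeze(0)
    n = wav.shape[-1]
    if n <= CHUNK:
        return fake_denoise(wav)
    pieces = []
    for i in range(n // CHUNK):
        pieces.append(fake_denoise(wav[i * CHUNK:(i + 1) * CHUNK]))
    tail = wav[(n // CHUNK) * CHUNK:]
    if tail.numel() > 0:  # guard against an empty final slice
        pieces.append(fake_denoise(tail))
    return torch.cat(pieces, dim=1)


audio = torch.randn(1, 200000)
out = chunked_denoise(audio)
print(out.shape)  # same length as the input
```

Processing the prompt chunk by chunk (or moving the denoiser to the CPU, as the comment suggests) bounds peak memory, which is what avoids the out-of-memory error reported above.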
sekkit commented 1 month ago

OK, thanks, it worked.