Allessyer opened this issue 5 months ago
Hi, which checkpoint are you using? You can follow:
import torch
from huggingface_hub import hf_hub_download

from Amphion.models.codec.ns3_codec import FACodecEncoderV2, FACodecDecoderV2

# Same parameters as FACodecEncoder/FACodecDecoder
fa_encoder_v2 = FACodecEncoderV2(...)
fa_decoder_v2 = FACodecDecoderV2(...)

encoder_v2_ckpt = hf_hub_download(repo_id="amphion/naturalspeech3_facodec", filename="ns3_facodec_encoder_v2.bin")
decoder_v2_ckpt = hf_hub_download(repo_id="amphion/naturalspeech3_facodec", filename="ns3_facodec_decoder_v2.bin")
fa_encoder_v2.load_state_dict(torch.load(encoder_v2_ckpt))
fa_decoder_v2.load_state_dict(torch.load(decoder_v2_ckpt))

with torch.no_grad():
    # Encode the source utterance (a) and the reference/prompt utterance (b)
    enc_out_a = fa_encoder_v2(wav_a)
    prosody_a = fa_encoder_v2.get_prosody_feature(wav_a)
    enc_out_b = fa_encoder_v2(wav_b)
    prosody_b = fa_encoder_v2.get_prosody_feature(wav_b)

    vq_post_emb_a, vq_id_a, _, quantized_a, spk_embs_a = fa_decoder_v2(
        enc_out_a, prosody_a, eval_vq=False, vq=True
    )
    vq_post_emb_b, vq_id_b, _, quantized_b, spk_embs_b = fa_decoder_v2(
        enc_out_b, prosody_b, eval_vq=False, vq=True
    )

    # Voice conversion: content/prosody codes from a, speaker embedding from b
    vq_post_emb_a_to_b = fa_decoder_v2.vq2emb(vq_id_a, use_residual=False)
    recon_wav_a_to_b = fa_decoder_v2.inference(vq_post_emb_a_to_b, spk_embs_b)
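For context, the snippet above assumes `wav_a` and `wav_b` are already loaded. The FACodec usage examples in the Amphion repo feed the encoder a 16 kHz mono waveform shaped as a `(1, 1, T)` float tensor; a minimal sketch of that preparation follows (the `to_codec_input` helper and the random stand-in waveform are illustrative, not part of the Amphion API):

```python
import numpy as np
import torch

def to_codec_input(wav_np: np.ndarray) -> torch.Tensor:
    """Shape a mono numpy waveform into a (batch=1, channel=1, T) float tensor,
    the layout used by the FACodec usage examples."""
    wav = torch.from_numpy(wav_np.astype(np.float32))
    return wav.unsqueeze(0).unsqueeze(0)

# Stand-in for one second of a real 16 kHz recording;
# in practice, load with e.g. librosa.load(path, sr=16000, mono=True)[0]
dummy = np.random.randn(16000).astype(np.float32)
wav_a = to_codec_input(dummy)
print(wav_a.shape)  # torch.Size([1, 1, 16000])
```

Verify the expected input layout against the ns3_codec README in the Amphion repo before relying on this shape.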
Hi, I tried this code but the quality of the reconstructed wav seems to be poor. How should I adjust the parameters to get the best results? FACodec_test.zip
same here
Hi, since our model is trained on 16 kHz English data, voice-conversion performance in other languages may not be as good as shown on the demo page.
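Because the model expects 16 kHz input, a common cause of poor reconstructions is feeding audio at a different sample rate. In practice you would resample with librosa or torchaudio; purely as an illustration of what resampling does, here is a naive linear-interpolation resampler (not production-quality, as it does no anti-aliasing):

```python
import numpy as np

def resample_linear(wav: np.ndarray, sr_in: int, sr_out: int = 16000) -> np.ndarray:
    """Naive linear-interpolation resampler; fine as a sanity check,
    but prefer librosa/torchaudio resampling for real audio."""
    if sr_in == sr_out:
        return wav
    n_out = int(round(len(wav) * sr_out / sr_in))
    t_in = np.arange(len(wav)) / sr_in    # input sample times (seconds)
    t_out = np.arange(n_out) / sr_out     # output sample times (seconds)
    return np.interp(t_out, t_in, wav)

# e.g. a one-second 440 Hz tone recorded at 48 kHz, brought down to 16 kHz
sr_in = 48000
tone = np.sin(2 * np.pi * 440 * np.arange(sr_in) / sr_in)
tone_16k = resample_linear(tone, sr_in)
print(tone_16k.shape)  # (16000,)
```

Checking the sample rate of your source and prompt files before encoding is a cheap first debugging step when reconstruction quality is worse than expected.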
Is it possible to train with a new language? And how can I do it? Thanks. @HeCheng0625
Hi, you can train the codec with other languages if you have some aligned phonemes and waveforms.
But when I use the English source and prompt provided on the demo page for zero-shot generation, the voice quality is worse than that of the demo page. May I ask why?
Hi @wosyoo, could you attach your input and generated samples here?
Hi, you can train the codec with other languages if you have some aligned phonemes and waveforms.
Would love to do this, but how? I haven't seen any training code so far. And I have to say: in my target language (Icelandic), the results with the pretrained models are really bad.
Problem Overview
I tried to recreate the results from the FACodec demo page (Voice Conversion Samples), but my results are worse than the examples provided there. Why is that, and how can I achieve the same quality as the demo page samples?
Steps Taken
Expected Outcome
The results of voice conversion are worse than in the examples.
Environment Information