open-mmlab / Amphion

Amphion (/æmˈfaɪən/) is a toolkit for Audio, Music, and Speech Generation. Its purpose is to support reproducible research and help junior researchers and engineers get started in the field of audio, music, and speech generation research and development.
https://openhlt.github.io/amphion/
MIT License

[Help]: FACodec. How to recreate demo examples for voice conversion? #161

Open Allessyer opened 5 months ago

Allessyer commented 5 months ago

Problem Overview

I tried to recreate the results from the demo page for FACodec (Voice Conversion Samples), but my results are worse than the examples provided on the demo page. Why is that? And how can I achieve the same quality as the demo page samples?

Steps Taken

  1. I used the code from here. I didn't change any parameters of the encoder or decoder; everything is as provided in the code examples.
  2. I downloaded 4 wav files for the prompt and the source from the demo page (a loading sketch follows this list).
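
For reference, a minimal loading sketch that shapes the inputs the way the repo's FACodec example expects (librosa at 16 kHz, shape [1, 1, T]; the file names here are placeholders):

import librosa
import torch

# Load source and prompt at 16 kHz mono, then add batch and channel
# dimensions -> shape [1, 1, T]. File names are placeholders.
wav_a = torch.from_numpy(librosa.load("source.wav", sr=16000)[0]).float()
wav_a = wav_a.unsqueeze(0).unsqueeze(0)
wav_b = torch.from_numpy(librosa.load("prompt.wav", sr=16000)[0]).float()
wav_b = wav_b.unsqueeze(0).unsqueeze(0)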

Expected Outcome

The results of voice conversion are worse than in the examples.

Environment Information

HeCheng0625 commented 5 months ago

Hi, which checkpoint are you using? You can follow:

import torch
from huggingface_hub import hf_hub_download

from Amphion.models.codec.ns3_codec import FACodecEncoderV2, FACodecDecoderV2

# Same parameters as FACodecEncoder/FACodecDecoder
fa_encoder_v2 = FACodecEncoderV2(...)
fa_decoder_v2 = FACodecDecoderV2(...)

# Download the v2 checkpoints from the Hugging Face Hub
encoder_v2_ckpt = hf_hub_download(repo_id="amphion/naturalspeech3_facodec", filename="ns3_facodec_encoder_v2.bin")
decoder_v2_ckpt = hf_hub_download(repo_id="amphion/naturalspeech3_facodec", filename="ns3_facodec_decoder_v2.bin")

fa_encoder_v2.load_state_dict(torch.load(encoder_v2_ckpt))
fa_decoder_v2.load_state_dict(torch.load(decoder_v2_ckpt))

with torch.no_grad():
  # Encode both utterances and extract their prosody features
  enc_out_a = fa_encoder_v2(wav_a)
  prosody_a = fa_encoder_v2.get_prosody_feature(wav_a)
  enc_out_b = fa_encoder_v2(wav_b)
  prosody_b = fa_encoder_v2.get_prosody_feature(wav_b)

  # Quantize; spk_embs_* are the utterance-level speaker embeddings
  vq_post_emb_a, vq_id_a, _, quantized, spk_embs_a = fa_decoder_v2(
      enc_out_a, prosody_a, eval_vq=False, vq=True
  )
  vq_post_emb_b, vq_id_b, _, quantized, spk_embs_b = fa_decoder_v2(
      enc_out_b, prosody_b, eval_vq=False, vq=True
  )

  # Convert: codes from utterance a, speaker embedding from utterance b
  vq_post_emb_a_to_b = fa_decoder_v2.vq2emb(vq_id_a, use_residual=False)
  recon_wav_a_to_b = fa_decoder_v2.inference(vq_post_emb_a_to_b, spk_embs_b)
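
# A minimal sketch for auditioning the result (assumes soundfile is
# installed and a 16 kHz output rate; both are assumptions, not part
# of the original snippet):
import soundfile as sf
sf.write("recon_a_to_b.wav", recon_wav_a_to_b.squeeze().cpu().numpy(), 16000)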
Approximetal commented 5 months ago

> Hi, which checkpoint are you using? You can follow: (code above)

Hi, I tried this code, but the quality of the reconstructed wav seems poor. How should I adjust the parameters to get the best results? FACodec_test.zip

ATtendev commented 5 months ago

Same here.

HeCheng0625 commented 5 months ago

Hi, since our model is trained on 16 kHz English data, VC performance in other languages may not be as good as shown on the demo page.
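
If your audio is not already 16 kHz mono, resampling before encoding is a likely prerequisite. A minimal sketch using torchaudio (the library choice and the file name are assumptions, not part of this reply):

import torchaudio
import torchaudio.functional as AF

wav, sr = torchaudio.load("my_input.wav")        # wav: [channels, T]
wav = wav.mean(dim=0, keepdim=True)              # downmix to mono
if sr != 16000:
    wav = AF.resample(wav, orig_freq=sr, new_freq=16000)
wav = wav.unsqueeze(0)                           # [1, 1, T], as FACodec expects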

ATtendev commented 5 months ago

Is it possible to train it with a new language? And how can I do it? Thanks. @HeCheng0625

HeCheng0625 commented 5 months ago

Hi, you can train the codec on other languages if you have aligned phonemes and waveforms.
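
For illustration only, one aligned training item might look like the following; every field name here is hypothetical, since the thread does not specify Amphion's data format:

# Hypothetical aligned item: field names are illustrative, not Amphion's
# actual training format.
aligned_item = {
    "wav_path": "data/spk001/utt0001.wav",  # 16 kHz waveform
    "phonemes": ["HH", "EH", "L", "OW"],    # phoneme sequence
    "durations": [5, 7, 6, 9],              # frames per phoneme (the alignment)
}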

wosyoo commented 5 months ago

But when I use the English source and prompt provided on the demo page, the zero-shot voice conversion quality is still worse than the demo page. May I ask why?

RMSnow commented 4 months ago

Hi @wosyoo, could you attach your input and generated samples here?

lumpidu commented 4 months ago

> Hi, you can train the codec on other languages if you have aligned phonemes and waveforms.

Would love to do this, but how can I? I haven't seen any training code so far, and I have to say: in the target language I am using (Icelandic), the results with the pretrained models are really bad.