Why does the BNE-PPG-VC model in your demo perform better than the pre-trained model given in the original paper?

Hi, @jiazj-jiazj !

No, we didn't fine-tune the HiFi-GAN vocoder. We took the code from this repo as is, with the vocoder checkpoint they provided and the recommended bneSeq2seqMoL-vctk-libritts460-oneshot model. I'm not sure what might have gone wrong or why your voice conversion results with this model were worse than the ones in our demo.

I used this model to reproduce the experiment you described (I just took the source and reference voice samples from our demo) and got the same BNE-PPG-VC results as in our demo. See the results of my experiments here.

Perhaps the reason is that the outputs of the voice conversion model from the mentioned repo were loudness-normalized before being added to our demo, so the audio in the demo may be quieter, which can make the quality seem better. Also note that our demo uses 16 kHz audio while the BNE-PPG-VC model outputs 24 kHz, so we also downsampled the audio before adding it to the demo. These are the only things I can think of.
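If you want to compare your outputs with the demo on equal terms, something like the following might help. This is only a rough sketch, not the exact script used for the demo: the RMS-based normalization, the -23 dBFS target, the order of the two steps, and the function names `rms_normalize` and `to_demo_format` are all assumptions made for illustration. It uses NumPy and `scipy.signal.resample_poly` for the 24 kHz to 16 kHz conversion.

```python
import numpy as np
from scipy.signal import resample_poly


def rms_normalize(wave, target_dbfs=-23.0):
    """Scale a waveform so its RMS level equals target_dbfs (dB re. full scale).

    The -23 dBFS default is an assumed value, not the one used for the demo.
    """
    wave = np.asarray(wave, dtype=np.float64)
    rms = np.sqrt(np.mean(wave ** 2))
    if rms == 0.0:
        # Pure silence: nothing to scale.
        return wave.copy()
    target_rms = 10.0 ** (target_dbfs / 20.0)
    return wave * (target_rms / rms)


def to_demo_format(wave_24k, target_dbfs=-23.0):
    """Loudness-normalize a 24 kHz waveform, then downsample it to 16 kHz.

    Doing normalization before resampling is an assumption; the reply above
    does not state the order.
    """
    normalized = rms_normalize(wave_24k, target_dbfs)
    # 24000 * 2 / 3 = 16000; resample_poly applies an anti-aliasing filter.
    return resample_poly(normalized, up=2, down=3)
```

For example, after loading a 24 kHz BNE-PPG-VC output as a float array `wave_24k`, calling `to_demo_format(wave_24k)` gives a 16 kHz array that you can save and listen to next to the demo sample.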

huawei-noah / Speech-Backbones
