quickvc / QuickVC-VoiceConversion

QuickVC: Any-to-many Voice Conversion Using Inverse Short-time Fourier Transform for Faster Conversion
MIT License

Deviation from paper: mel-spec from same utterance #10

Closed tarepan closed 1 year ago

tarepan commented 1 year ago

Summary

The paper says that the conditioning mel comes from a different utterance.
The implementation uses the same utterance's mel during training.
Which is correct?

Current status

The paper says that the conditioning mel input to the SpeakerEncoder is extracted from a different utterance:

> In the training process, the speaker encoder is first fed with a different utterance from the same target speaker.

But the implementation uses the same utterance even at epoch 1:

https://github.com/quickvc/QuickVC-VoiceConversion/blob/277118de9c81d1689e16be8a43408eda4223553d/train.py#L148-L149

So there is a deviation from the paper in the implementation.

Question

- Which behavior is correct, the paper's (different utterance) or the implementation's (same utterance)?
- If both are used, when does the speaker encoder input switch from a different utterance to the same one? (e.g. at step X)

Others

I am so impressed by the result of QuickVC, thanks for your great work!

quickvc commented 1 year ago

I'm sorry for the incomplete code. Here is the code to select a different utterance as the input to the speaker encoder:

    import os
    import random

    # Pick a random utterance from the same speaker's folder as the reference.
    file_path = filename
    folder_path = os.path.dirname(file_path)
    filenames = os.listdir(folder_path)
    # Keep only .wav files, excluding the current utterance itself.
    filenames = [f for f in filenames if f.endswith('.wav') and f != os.path.basename(file_path)]
    random_filename = random.choice(filenames)
    random_file_path = os.path.join(folder_path, random_filename)
    # The cached spectrogram sits next to the wav with a .spec.pt suffix.
    ref_spec_filename = random_file_path.replace(".wav", ".spec.pt")

Use this as the `ref_spec`, and you also need to change the `TextAudioSpeakerCollate` in data_utils accordingly.

I will add a new data_util.py in the future.
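For reference, here is a minimal sketch of the kind of padding the collate step would need for the extra reference spec, assuming the dataset's `__getitem__` returns it alongside the usual items (the function name and shapes below are illustrative, not the repo's actual `TextAudioSpeakerCollate`):

    import torch

    # Illustrative helper: pad a batch of variable-length reference spectrograms
    # (each of shape [n_freq, T_i]) into one [batch, n_freq, T_max] tensor,
    # the same way the other spectrograms are padded in the collate function.
    def collate_ref_specs(ref_specs):
        max_len = max(s.size(1) for s in ref_specs)
        n_freq = ref_specs[0].size(0)
        padded = torch.zeros(len(ref_specs), n_freq, max_len)
        lengths = torch.LongTensor(len(ref_specs))
        for i, s in enumerate(ref_specs):
            padded[i, :, :s.size(1)] = s
            lengths[i] = s.size(1)
        return padded, lengths

The padded tensor and its lengths would then be returned from the collate function together with the existing batch items.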

For your question, the result in the paper is obtained just as the paper says: the speaker encoder is first fed with a different utterance from the same target speaker. In the final steps of training, the target speech input is used to fine-tune the speaker encoder. So the answer is a combination of both.

About "When does it switch from different to same? (e.g. at step X)": roughly the final 50k steps are trained with the same utterance as the speaker encoder input. Actually, I'm not sure when the switch is best. I was just curious whether the result gets better or worse if I change the speaker encoder's input from a different utterance to the same utterance. The result is nearly the same, so I guess the change matters only a little.
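To make the schedule concrete, here is a minimal sketch of the switch inside the training step, assuming a global step counter and a configurable switch point (the names `switch_step`, `mel`, and `ref_mel` are assumptions, not the repo's actual variables):

    # Illustrative schedule: condition the speaker encoder on a different
    # utterance for most of training, then on the target utterance itself
    # for the final stretch (e.g. roughly the last 50k steps).
    def pick_speaker_encoder_input(global_step, switch_step, mel, ref_mel):
        if global_step < switch_step:
            return ref_mel  # mel from a different utterance of the same speaker
        return mel          # mel from the target utterance (fine-tuning phase)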

tarepan commented 1 year ago

Thank you for the response!

Can I check my understanding?
In summary:

- The paper's result is obtained by first training with a different utterance from the same target speaker as the speaker encoder input.
- Around the final 50k steps, the same (target) utterance is used as the speaker encoder input to fine-tune it.

Is this correct?

quickvc commented 1 year ago

Yes, you are right. Thank you for your summary!