quickvc / QuickVC-VoiceConversion

QuickVC: Any-to-many Voice Conversion Using Inverse Short-time Fourier Transform for Faster Conversion
MIT License

Deviation from paper: mel-spec from same utterance #10

Closed tarepan closed 1 year ago

tarepan commented 1 year ago

Summary

The paper says that the conditioning mel comes from a different utterance.
The implementation uses the same utterance's mel during training.
Which is correct?

Current status

The paper says that the conditioning mel input to the SpeakerEncoder is extracted from a different utterance:

> In the training process, the speaker encoder is first fed with a different utterance from the same target speaker.

But the implementation uses the same utterance even at epoch 1:

https://github.com/quickvc/QuickVC-VoiceConversion/blob/277118de9c81d1689e16be8a43408eda4223553d/train.py#L148-L149

So there is a deviation from the paper in the implementation.

Question

- Which behavior is correct, the paper's (different utterance) or the implementation's (same utterance)?
- If both are used, when does the speaker encoder input switch from a different utterance to the same one? (e.g. at step X)

Others

I am so impressed by the result of QuickVC, thanks for your great work!

quickvc commented 1 year ago

I'm sorry for the incomplete code. Here is the code to select a different utterance as the input to the speaker encoder:

    import os
    import random

    # Pick a random utterance from the same speaker's folder as the reference.
    file_path = filename
    folder_path = os.path.dirname(file_path)
    filenames = os.listdir(folder_path)
    # Keep only .wav files, excluding the current utterance itself.
    filenames = [f for f in filenames if f.endswith('.wav') and f != os.path.basename(file_path)]
    random_filename = random.choice(filenames)
    random_file_path = os.path.join(folder_path, random_filename)
    # The cached spectrogram sits next to the wav with a .spec.pt suffix.
    ref_spec_filename = random_file_path.replace(".wav", ".spec.pt")

Use this as the `ref_spec`, and you also need to change the `TextAudioSpeakerCollate` in data_utils accordingly.

I will add a new data_util.py in the future.
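For reference, here is a minimal sketch of the kind of padding the collate step would need for the extra reference spec, assuming the dataset's `__getitem__` returns it alongside the usual items (the function name and shapes below are illustrative, not the repo's actual `TextAudioSpeakerCollate`):

    import torch

    # Illustrative helper: pad a batch of variable-length reference spectrograms
    # (each of shape [n_freq, T_i]) into one [batch, n_freq, T_max] tensor,
    # the same way the other spectrograms are padded in the collate function.
    def collate_ref_specs(ref_specs):
        max_len = max(s.size(1) for s in ref_specs)
        n_freq = ref_specs[0].size(0)
        padded = torch.zeros(len(ref_specs), n_freq, max_len)
        lengths = torch.LongTensor(len(ref_specs))
        for i, s in enumerate(ref_specs):
            padded[i, :, :s.size(1)] = s
            lengths[i] = s.size(1)
        return padded, lengths

The padded tensor and its lengths would then be returned from the collate function together with the existing batch items.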

For your question, the result in the paper is obtained just as the paper says: the speaker encoder is first fed with a different utterance from the same target speaker. In the final steps of training, the target speech input is used to fine-tune the speaker encoder. So the answer is a combination of both.

About "When does it switch from different to same? (e.g. at step X)": roughly the final 50k steps are trained with the same utterance as the speaker encoder input. Actually, I'm not sure when the switch is best. I was just curious whether the result gets better or worse if I change the speaker encoder's input from a different utterance to the same utterance. The result is nearly the same, so I guess the change matters only a little.
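To make the schedule concrete, here is a minimal sketch of the switch inside the training step, assuming a global step counter and a configurable switch point (the names `switch_step`, `mel`, and `ref_mel` are assumptions, not the repo's actual variables):

    # Illustrative schedule: condition the speaker encoder on a different
    # utterance for most of training, then on the target utterance itself
    # for the final stretch (e.g. roughly the last 50k steps).
    def pick_speaker_encoder_input(global_step, switch_step, mel, ref_mel):
        if global_step < switch_step:
            return ref_mel  # mel from a different utterance of the same speaker
        return mel          # mel from the target utterance (fine-tuning phase)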

tarepan commented 1 year ago

Thank you for the response!

Can I check my understanding?
In summary:

- The paper's result is obtained by first training with a different utterance from the same target speaker as the speaker encoder input.
- Around the final 50k steps, the same (target) utterance is used as the speaker encoder input to fine-tune it.

Is this correct?

quickvc commented 1 year ago

Yes, you are right. Thank you for your summary!