I'm sorry for the incomplete code. To select a different utterance as the input to the speaker encoder:

```python
import os
import random

# Pick a random .wav from the same speaker's folder and map it to its
# precomputed spectrogram file. The target utterance itself is excluded,
# which assumes the speaker has more than one utterance in the folder.
file_path = filename
folder_path = os.path.dirname(file_path)
filenames = [f for f in os.listdir(folder_path) if f.endswith('.wav')]
filenames = [f for f in filenames if f != os.path.basename(file_path)]
random_filename = random.choice(filenames)
random_file_path = os.path.join(folder_path, random_filename)
ref_spec_filename = random_file_path.replace(".wav", ".spec.pt")
```

Use this as the `ref_spec`; you also need to change `TextAudioSpeakerCollate` in `data_util`.
I will add a new data_util.py in the future.
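Until then, here is a minimal sketch of what the modified collate might look like. It assumes each dataset item becomes a `(spec, wav, ref_spec)` tuple; the class name, tuple layout, and tensor shapes are assumptions, not the repo's actual code:

```python
import torch

class TextAudioSpeakerCollateWithRef:
    """Sketch: zero-pad each field, including the extra reference spec."""

    def __call__(self, batch):
        # batch: list of (spec [n_mels, T], wav [1, L], ref_spec [n_mels, T_ref])
        max_spec_len = max(x[0].size(1) for x in batch)
        max_wav_len = max(x[1].size(1) for x in batch)
        max_ref_len = max(x[2].size(1) for x in batch)

        spec_padded = torch.zeros(len(batch), batch[0][0].size(0), max_spec_len)
        wav_padded = torch.zeros(len(batch), 1, max_wav_len)
        ref_padded = torch.zeros(len(batch), batch[0][2].size(0), max_ref_len)
        spec_lengths = torch.LongTensor(len(batch))
        wav_lengths = torch.LongTensor(len(batch))

        for i, (spec, wav, ref) in enumerate(batch):
            spec_padded[i, :, :spec.size(1)] = spec
            wav_padded[i, :, :wav.size(1)] = wav
            ref_padded[i, :, :ref.size(1)] = ref
            spec_lengths[i] = spec.size(1)
            wav_lengths[i] = wav.size(1)

        return spec_padded, spec_lengths, wav_padded, wav_lengths, ref_padded
```

The point is only that `ref_spec` must be padded and returned alongside the existing fields so the training loop can pass it to the speaker encoder.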
For your question, the result in the paper is just as the paper says: "the speaker encoder is first fed with a different utterance from the same target speaker... In the final steps of training, the target speech input is used to fine-tune the speaker encoder." So the answer is: a combination of both.
About "When to switch from different to same? (e.g., at step X)":
The final ~50k steps are trained with the same utterance as the speaker encoder input. Actually, I'm not sure when the best point to switch is. I was just curious whether the result would be better or worse if I changed the speaker encoder's input from a different utterance to the same utterance. The results are nearly the same, so I guess the change makes only a small difference.
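For illustration only, this schedule could be expressed as a step-based switch; `SWITCH_STEP` and the function name here are hypothetical, not taken from the actual training code:

```python
import os
import random

SWITCH_STEP = 150_000  # hypothetical: after this step, use the target utterance itself

def select_ref_spec_path(filename, global_step):
    """Return the reference-spec path for the speaker encoder input.

    Before SWITCH_STEP: a random utterance from the same speaker's folder.
    From SWITCH_STEP on: the target utterance itself (fine-tuning phase).
    """
    if global_step >= SWITCH_STEP:
        return filename.replace(".wav", ".spec.pt")
    folder = os.path.dirname(filename)
    candidates = [f for f in os.listdir(folder) if f.endswith(".wav")]
    pick = random.choice(candidates)
    return os.path.join(folder, pick).replace(".wav", ".spec.pt")
```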
Thank you for the response!
Can I check my understanding?
In summary:
- For most of training, `ref_spec` comes from a different utterance of the same speaker.
- For the final ~50k steps, `ref_spec` comes from the same utterance as the target.

Is this correct?
Yes, you are right. Thank you for your summary.
Summary
The paper says that the conditioning mel is from a different utterance.
The implementation uses the same utterance's mel for training.
Which is correct?
Current status
The paper says that the conditioning mel input to the SpeakerEncoder is extracted from a different utterance.
But the implementation uses the same utterance, even at epoch 1.
https://github.com/quickvc/QuickVC-VoiceConversion/blob/277118de9c81d1689e16be8a43408eda4223553d/train.py#L148-L149
So there is a deviation from the paper in the implementation.
Question
Which is correct, the paper (different utterance) or the implementation (same utterance)? If both are used, when do you switch from different to same? (e.g., at step X)
Others
I am so impressed by the results of QuickVC. Thanks for your great work!