HandsLing opened 2 years ago
@tuanh123789 Hi, can you give me some help?
Hi, I have the same question. Did you solve it?
Maybe it's because AdaSpeech only has a phoneme-level predictor but not an utterance-level one, so at inference time you still need to input a reference mel spectrogram to obtain the utterance-level vector. I am not sure.
The original AdaSpeech paper says: "In the inference process, the utterance-level acoustic conditions are extracted from another reference speech of the speaker, and the phoneme-level acoustic conditions are predicted from phoneme-level acoustic predictor."
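For context, that sentence roughly corresponds to an inference flow like the sketch below. The module names (`utterance_encoder`, `phoneme_encoder`, `phoneme_level_predictor`, `decoder`) are placeholders I made up to illustrate the idea, not this repository's actual API:

```python
# Hypothetical sketch of the AdaSpeech-style inference flow described above
# (PyTorch-style pseudocode; names are assumptions, not this repo's code).
def infer(model, phonemes, ref_mel):
    # Utterance-level acoustic condition: extracted from a reference mel
    # spectrogram of the target speaker (any utterance of that speaker).
    utt_vec = model.utterance_encoder(ref_mel)          # [B, 1, d]

    # Phoneme-level acoustic conditions: predicted from the phoneme encoder
    # output by the phoneme-level acoustic predictor (no reference needed).
    h = model.phoneme_encoder(phonemes)                 # [B, T_phon, d]
    phn_vec = model.phoneme_level_predictor(h)          # [B, T_phon, d]

    # Both conditions are added to the encoder output before decoding.
    h = h + utt_vec + phn_vec
    return model.decoder(h)
```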
Thank you for your answer. I'd like to know whether I can just pick any reference audio without specifying its content, or whether the text of the reference audio must match the content of the synthesized audio.
I don't think we need a reference audio with exactly the same content (otherwise text-to-speech would be useless). In my understanding, the reference audio only provides some information about the acoustic condition (and maybe also speaker information), so providing an arbitrary utterance of the target speaker is already reasonable.
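As a rough illustration of what "an arbitrary utterance of the target speaker" means in practice, something like this would prepare the reference mel (the sample rate and mel parameters here are assumptions, not necessarily this repo's exact preprocessing):

```python
import librosa
import numpy as np

def get_reference_mel(path, sr=22050, n_mels=80):
    # Load any utterance of the target speaker; its text does not need to
    # match the sentence being synthesized.
    wav, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_mels=n_mels)
    return np.log(np.clip(mel, a_min=1e-5, a_max=None))  # log-mel, [n_mels, T]

# ref_mel = get_reference_mel("any_utterance_of_target_speaker.wav")
# audio = infer(model, phonemes, ref_mel)  # hypothetical call from the sketch above
```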
I used an utterance-level encoder during training, but I removed the reference audio when synthesizing speech. Does the utterance-level encoder still have an effect on the final audio?
Doesn't including an utterance-level encoder further enrich the modeling information?
@freshwindy When you removed the reference audio, do you mean you replaced the utterance-level vector with all zeros? There still needs to be some vector to fill that slot. I haven't done any corresponding experiments yet. As for the utterance-level encoder, I can't come up with a reason why it wouldn't enrich the modeling information; the enrichment just may not be very noticeable. I'm not sure :)
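If it helps, this is the kind of quick check I mean by "replace with all 0" (hypothetical code, not from this repo; the embedding dimension is an assumption):

```python
import torch

def utterance_condition(model, ref_mel=None, batch=1, dim=256):
    if ref_mel is not None:
        # Normal path: utterance-level vector extracted from the reference mel.
        return model.utterance_encoder(ref_mel)   # [B, 1, d]
    # No reference audio: fill the slot with zeros of the same shape.
    # The effect on final audio quality is untested here.
    return torch.zeros(batch, 1, dim)
```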
Hi, I want to know what "reference_audio" is used for during inference.