HandsLing opened 2 years ago
@tuanh123789 Hi, can you give me some help?
Hi, I have the same question. Did you solve it?
Maybe it's because AdaSpeech only has a phoneme-level predictor but not an utterance-level one, so at inference time you still need to input a reference mel spectrogram to obtain the utterance-level vector. I am not sure.
The original AdaSpeech paper says: "In the inference process, the utterance-level acoustic conditions are extracted from another reference speech of the speaker, and the phoneme-level acoustic conditions are predicted from phoneme-level acoustic predictor."
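For context, that sentence roughly corresponds to an inference flow like the sketch below. The module names (`utterance_encoder`, `phoneme_encoder`, `phoneme_level_predictor`, `decoder`) are placeholders I made up to illustrate the idea, not this repository's actual API:

```python
# Hypothetical sketch of the AdaSpeech-style inference flow described above
# (PyTorch-style pseudocode; names are assumptions, not this repo's code).
def infer(model, phonemes, ref_mel):
    # Utterance-level acoustic condition: extracted from a reference mel
    # spectrogram of the target speaker (any utterance of that speaker).
    utt_vec = model.utterance_encoder(ref_mel)          # [B, 1, d]

    # Phoneme-level acoustic conditions: predicted from the phoneme encoder
    # output by the phoneme-level acoustic predictor (no reference needed).
    h = model.phoneme_encoder(phonemes)                 # [B, T_phon, d]
    phn_vec = model.phoneme_level_predictor(h)          # [B, T_phon, d]

    # Both conditions are added to the encoder output before decoding.
    h = h + utt_vec + phn_vec
    return model.decoder(h)
```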
Thank you for your answer. I'd like to know whether I can just pick any reference audio without specifying its content, or whether the text of the reference audio must match the content of the synthesized audio.
I don't think we need a reference audio with exactly the same content (otherwise text-to-speech would be useless). In my understanding, the reference audio only provides some information about the acoustic condition (and maybe also speaker information), so providing an arbitrary utterance of the target speaker is already reasonable.
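As a rough illustration of what "an arbitrary utterance of the target speaker" means in practice, something like this would prepare the reference mel (the sample rate and mel parameters here are assumptions, not necessarily this repo's exact preprocessing):

```python
import librosa
import numpy as np

def get_reference_mel(path, sr=22050, n_mels=80):
    # Load any utterance of the target speaker; its text does not need to
    # match the sentence being synthesized.
    wav, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_mels=n_mels)
    return np.log(np.clip(mel, a_min=1e-5, a_max=None))  # log-mel, [n_mels, T]

# ref_mel = get_reference_mel("any_utterance_of_target_speaker.wav")
# audio = infer(model, phonemes, ref_mel)  # hypothetical call from the sketch above
```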
I used an utterance-level encoder during training, but I removed the reference audio when synthesizing speech. Does the utterance-level encoder still have an effect on the final audio?
Doesn't including an utterance-level encoder further enrich the modeling information?
@freshwindy When you removed the reference audio, do you mean you replaced the utterance-level vector with all zeros? There still needs to be some vector to fill that slot. I haven't done any corresponding experiments yet. As for the utterance-level encoder, I can't come up with a reason why it wouldn't enrich the modeling information; the enrichment just may not be very noticeable. I'm not sure :)
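If it helps, this is the kind of quick check I mean by "replace with all 0" (hypothetical code, not from this repo; the embedding dimension is an assumption):

```python
import torch

def utterance_condition(model, ref_mel=None, batch=1, dim=256):
    if ref_mel is not None:
        # Normal path: utterance-level vector extracted from the reference mel.
        return model.utterance_encoder(ref_mel)   # [B, 1, d]
    # No reference audio: fill the slot with zeros of the same shape.
    # The effect on final audio quality is untested here.
    return torch.zeros(batch, 1, dim)
```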
Hi, I want to know what "reference_audio" is used for during inference.