liusongxiang / ppg-vc

PPG-Based Voice Conversion
Apache License 2.0

question about the training of encoder-decoder #19

Closed · jardnzm closed 2 years ago

jardnzm commented 2 years ago

Hi, the paper mentions an MSE loss between the predicted and ground-truth mel-spectrograms. I am wondering if the example below is correct. A, our source speaker, has an audio clip saying "12345". B, our target speaker, also has an audio clip saying "12345", plus some other audio. During training, A's "12345" is converted to B's voice using one of B's clips (any clip). The output is then compared with B's "12345" to compute the MSE loss.

madosma commented 2 years ago

I think that during training only the source speaker A is involved: A's "12345" is converted to A's own voice, and the predicted audio is compared with the ground-truth A.

jardnzm commented 2 years ago

Sorry, I did not get it… do you mean that during training A's voice is first converted to B's voice (the target speaker), and then converted back to A's voice to compare with the original?

madosma commented 2 years ago

I mean: the source audio is the speech you want to convert, and the target audio is the speech whose voice you want to convert it into. In training, only the target audio is involved, because by default the source speaker is unseen. So the training process computes the loss between the target's real speech and the predicted speech, and the testing process is the conversion of the source audio toward the target.
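
In symbols, the setup described above is roughly the following (placeholder notation, not taken from the paper): let $E_c$ be the content encoder, $E_s$ the speaker encoder, $D$ the decoder, $x_{\mathrm{src}}$ and $x_{\mathrm{tgt}}$ a source and a target utterance, and $M_{\mathrm{tgt}}$ the target's ground-truth mel-spectrogram.

$$\hat{M}_{\mathrm{tgt}} = D\big(E_c(x_{\mathrm{tgt}}),\, E_s(x_{\mathrm{tgt}})\big), \qquad \mathcal{L}_{\mathrm{train}} = \mathrm{MSE}\big(\hat{M}_{\mathrm{tgt}},\, M_{\mathrm{tgt}}\big)$$

$$\hat{M}_{\mathrm{conv}} = D\big(E_c(x_{\mathrm{src}}),\, E_s(x_{\mathrm{tgt}})\big) \quad \text{(test time only)}$$

Training is pure reconstruction of the target utterance from itself; the source speaker only appears at test time.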

jardnzm commented 2 years ago

Thanks for the quick reply. I suppose the source audio (even though its speaker can be unseen) needs to have the same content as the "real speech of the target". Is that true?

madosma commented 2 years ago

No, this paper is about non-parallel voice conversion, so the same content is not needed. The principle of the algorithm is to extract a content embedding using an ASR encoder.
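
As a rough sketch of the "content embedding from an ASR encoder" idea (not the repo's actual code; the tiny GRU below is only a stand-in for the CTC-trained ASR model mentioned later in this thread), the key point is that only a frame-level hidden representation is kept, never the transcript, which is why source and target utterances do not need to share the same text:

```python
import torch
import torch.nn as nn

# Stand-in for the pretrained ASR acoustic model; in the real pipeline this
# would be the CTC-trained encoder loaded from a checkpoint and kept frozen.
N_MELS, D_CONTENT = 80, 144
asr_encoder = nn.GRU(input_size=N_MELS, hidden_size=D_CONTENT, batch_first=True)

@torch.no_grad()
def extract_content_embedding(mel: torch.Tensor) -> torch.Tensor:
    """mel: (T, n_mels) -> frame-level content features of shape (T, d_content)."""
    asr_encoder.eval()
    hidden, _ = asr_encoder(mel.unsqueeze(0))  # (1, T, d_content)
    return hidden.squeeze(0)

# Works on any utterance, regardless of what is being said.
content = extract_content_embedding(torch.randn(200, N_MELS))
print(content.shape)  # torch.Size([200, 144])
```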

jardnzm commented 2 years ago

Sorry, I might have put it the wrong way. I definitely agree that during inference the contents of the source audio and the target audio don't need to be the same; we only "borrow" the fingerprint from the target audio.

However, the training loss is the MSE loss, which I think would make no sense if the two mel-spectrograms being compared had different contents.

If my assumption that the two mel-spectrograms have the same content is correct, then one of them is the "real speech of the target" and the other is the converted speech. Is that true?

If so, that means the converted speech needs to have the same content as the "real speech of the target". Since the converted speech has the same content as the source speech (when converting, we only change the voice, not the content), I suppose we would need "parallel" instances when preparing the training data, right?

Thanks!

madosma commented 2 years ago

Your assumption that the two mel-spectrograms have the same content is correct. One of them is the "real speech of the target", but the other is the "predicted speech of the target": during training, the converted speech's voice is the target itself. You can think of it this way: during training, the conversion model combines the target's content with the target's speaker identity to generate the target's speech. So there is a speaker involved in the training process, and it is the target itself.
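
To make the reconstruction setup concrete, here is a minimal training-step sketch of that idea (placeholder modules, not the repo's code: `content_encoder` stands for the pretrained ASR encoder, `speaker_encoder` for the d-vector model, and `decoder` for the trainable mel decoder):

```python
import torch
import torch.nn as nn

N_MELS, D_CONTENT, D_SPK = 80, 144, 256

# Placeholder modules: the first two stand in for the pretrained ASR encoder
# and d-vector model (frozen), the third for the mel decoder being trained.
content_encoder = nn.GRU(N_MELS, D_CONTENT, batch_first=True)
speaker_encoder = nn.Linear(N_MELS, D_SPK)
decoder = nn.GRU(D_CONTENT + D_SPK, N_MELS, batch_first=True)

optimizer = torch.optim.Adam(decoder.parameters(), lr=1e-4)

def training_step(target_mel: torch.Tensor) -> torch.Tensor:
    """target_mel: (B, T, n_mels). Only the target speaker's own audio is used."""
    with torch.no_grad():                                      # pretrained parts stay fixed
        content, _ = content_encoder(target_mel)               # (B, T, D_CONTENT)
        spk = speaker_encoder(target_mel.mean(dim=1))          # (B, D_SPK) "fingerprint"
    spk = spk.unsqueeze(1).expand(-1, target_mel.size(1), -1)  # broadcast over frames
    pred_mel, _ = decoder(torch.cat([content, spk], dim=-1))   # (B, T, n_mels)
    loss = nn.functional.mse_loss(pred_mel, target_mel)        # reconstruct the input itself
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss

print(float(training_step(torch.randn(4, 200, N_MELS))))
```

Because the content and speaker features are computed under `torch.no_grad()`, only the decoder is updated, which matches the idea that the ASR and d-vector models are trained beforehand.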

jardnzm commented 2 years ago

Got it! So after finishing the training of the CTC and d-vector models, during the encoder-decoder training we extract, for each training instance, the ASR representation and concatenate it with the audio's own fingerprint. The output is then compared with the input audio to get the loss. Cool, this makes data collection much easier. Really appreciate your help 😄
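
For completeness, the corresponding conversion step at inference time would then look roughly like this (same placeholder modules as in the training sketch, repeated so the snippet runs on its own; a vocoder is still needed to turn the predicted mel-spectrogram into a waveform):

```python
import torch
import torch.nn as nn

# Same placeholders as in the training sketch above; in the real pipeline
# these would be the trained ASR encoder, d-vector model, and mel decoder.
N_MELS, D_CONTENT, D_SPK = 80, 144, 256
content_encoder = nn.GRU(N_MELS, D_CONTENT, batch_first=True)
speaker_encoder = nn.Linear(N_MELS, D_SPK)
decoder = nn.GRU(D_CONTENT + D_SPK, N_MELS, batch_first=True)

@torch.no_grad()
def convert(source_mel: torch.Tensor, target_mel: torch.Tensor) -> torch.Tensor:
    """source_mel supplies the content, target_mel only the speaker fingerprint."""
    content, _ = content_encoder(source_mel.unsqueeze(0))        # content from SOURCE
    spk = speaker_encoder(target_mel.mean(dim=0, keepdim=True))  # identity from TARGET
    spk = spk.unsqueeze(1).expand(-1, source_mel.size(0), -1)
    pred_mel, _ = decoder(torch.cat([content, spk], dim=-1))     # (1, T_src, n_mels)
    return pred_mel.squeeze(0)

converted = convert(torch.randn(300, N_MELS), torch.randn(250, N_MELS))
print(converted.shape)  # torch.Size([300, 80])
```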