Closed: jardnzm closed this issue 2 years ago
Hi, the paper mentions an MSE loss between the predicted mel-spectrogram and the ground-truth mel-spectrogram. I am wondering whether the following example is correct. A, our source speaker, has an audio clip saying "12345". B, our target speaker, also has an audio clip saying "12345", plus some other clips. During training, A's "12345" will be converted to B's voice using one of B's clips (any clip), and the output will be compared with B's "12345" to compute the MSE loss.
I think during training only the source speaker is involved: A's "12345" is converted back into A's own voice, and the predicted audio for A is compared with the ground-truth audio of A.
Sorry, I did not get it… do you mean that during training, A's voice is first converted to B's voice (the target speaker) and then converted back to A's voice to be compared with the original?
I mean: the source audio is the speech you want to convert, and the target audio is speech from the speaker you want to convert to. In training, only the target audio is involved, because by default the source speaker is unseen. So during training the loss is computed between the target's real speech and the predicted speech, and during testing the source audio is converted into the target's voice.
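To put that in a concrete form, here is a minimal sketch of the two phases in Python. The module names (`asr_encoder`, `speaker_encoder`, `decoder`) are placeholders for illustration, not this repo's actual API:

```python
import torch.nn.functional as F

def training_step(target_mel, asr_encoder, speaker_encoder, decoder):
    """Training: reconstruct the target's own utterance and compare it with itself."""
    content = asr_encoder(target_mel)             # what was said, from the target's utterance
    identity = speaker_encoder(target_mel)        # who said it (d-vector), also from the target
    predicted_mel = decoder(content, identity)    # predicted mel-spectrogram of the same utterance
    return F.mse_loss(predicted_mel, target_mel)  # ground truth is the input itself

def convert(source_mel, target_mel, asr_encoder, speaker_encoder, decoder):
    """Testing: keep the source's content, borrow the target's voice."""
    content = asr_encoder(source_mel)             # content from the (possibly unseen) source speaker
    identity = speaker_encoder(target_mel)        # identity from any utterance of the target
    return decoder(content, identity)
```

The only place where two different speakers appear is in `convert`, which is used at test time.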
Thanks for the quick reply. I suppose the source audio (even though its speaker can be unseen) needs to have the same content as the "real speech of the target". Is that true?
No, this paper is about non-parallel voice conversion, so the same content is not needed. The core idea is to extract a content embedding with an ASR encoder.
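For intuition, a stand-in for that content encoder might look like the sketch below; the layer sizes and names are illustrative assumptions, not the ones used in this repo. Because the ASR-derived features describe what was said and the d-vector describes who said it, the source and target utterances never need to share a transcript.

```python
import torch
import torch.nn as nn

class ContentEncoder(nn.Module):
    """Illustrative stand-in for a CTC-trained ASR encoder: mel frames in,
    frame-level linguistic features (largely speaker-independent) out."""
    def __init__(self, n_mels=80, hidden=256, content_dim=144):
        super().__init__()
        self.rnn = nn.GRU(n_mels, hidden, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, content_dim)

    def forward(self, mel):            # mel: (batch, frames, n_mels)
        features, _ = self.rnn(mel)
        return self.proj(features)     # (batch, frames, content_dim)

# Any utterance from any speaker can be encoded; no parallel pairs are needed.
content = ContentEncoder()(torch.randn(1, 120, 80))
print(content.shape)                   # torch.Size([1, 120, 144])
```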
Sorry, I might have put it the wrong way. I definitely agree that during inference the contents of the source audio and the target audio don't need to be the same; we only "borrow" the voice fingerprint from the target audio.
However, the training loss is an MSE loss, which I think would make no sense if the two mel-spectrograms being compared had different contents.
If my assumption that the two mel-spectrograms have the same content is correct, then one of them is the "real speech of the target" and the other is the converted speech. Is that true?
If so, the converted speech needs to have the same content as the real speech of the target. Since the converted speech has the same content as the source speech (conversion changes only the voice, not the content), I suppose that when preparing the training data we would need "parallel" instances, right?
Thanks!
Your assumption that the two mel-spectrograms have the same content is correct. One of them is the real speech of the target, but the other is the predicted speech of the target: during training, the voice of the converted speech is the target itself. You can think of it this way: during training, the conversion model combines the target's content with the target's speaker identity to generate the target's speech. So there is only one speaker involved in the training process, and it is the target itself.
Got it! So after finishing the training of the CTC and d-vector models, during the encoder-decoder training we take each training utterance, extract its ASR representation, concatenate it with the utterance's own speaker fingerprint, and compare the output with the input audio to get the loss. Cool, this makes the data collection much easier. Really appreciate your help 😄
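Putting the whole thread together, the training step I have in mind looks roughly like the sketch below. The class and variable names are illustrative assumptions, and I am assuming the pretrained CTC/ASR encoder and d-vector model stay frozen while only the conversion model is updated:

```python
import torch
import torch.nn.functional as F

def train_epoch(loader, asr_encoder, dvector, conversion_model, optimizer):
    """One self-reconstruction epoch: every utterance is its own ground truth."""
    asr_encoder.eval()            # pretrained with CTC, kept frozen
    dvector.eval()                # pretrained speaker encoder, kept frozen
    conversion_model.train()
    for mel in loader:            # mel: (batch, frames, n_mels)
        with torch.no_grad():
            content = asr_encoder(mel)        # linguistic content of this utterance
            fingerprint = dvector(mel)        # the utterance's own d-vector
        predicted = conversion_model(content, fingerprint)
        loss = F.mse_loss(predicted, mel)     # compare with the input itself, so no parallel data
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```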