Closed by youngsuenXMLY 4 years ago
Yes.
Our method uses a speaker encoder in the pre-training stage. Once pre-training is finished, you can, in theory, infer a speaker code by passing a Mel-spectrogram through the speaker encoder.
However, as described in our paper, we introduce a speaker embedding during fine-tuning, and the output of the speaker encoder is used only to initialize the weights of that speaker embedding. We chose this because we found it produced better results. Our speaker encoder may not be powerful enough to produce a universal speaker embedding. Also, as far as I know, d-vector extractors are usually trained on datasets with thousands of speakers, so the training data is another important factor to take into account.
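To illustrate the initialization idea described above, here is a minimal sketch in plain Python. It is not the repository's code: `mean_pool_encoder` is a hypothetical stand-in for a trained speaker encoder, and `init_speaker_embeddings` and `sgd_step` are illustrative names. The sketch assumes that each speaker's embedding starts as the average of the encoder outputs over that speaker's utterances and is then updated freely during fine-tuning.

```python
from statistics import fmean
from typing import Callable, Dict, List, Sequence

Vector = List[float]
Mel = Sequence[Sequence[float]]  # frames x mel bins


def mean_pool_encoder(mel: Mel) -> Vector:
    """Hypothetical stand-in for a trained speaker encoder: averages frames over time."""
    return [fmean(frame[i] for frame in mel) for i in range(len(mel[0]))]


def init_speaker_embeddings(
    utterances: Dict[str, List[Mel]],
    encoder: Callable[[Mel], Vector],
) -> Dict[str, Vector]:
    """Build the initial embedding table: one vector per speaker,
    taken as the mean of the encoder outputs over that speaker's utterances."""
    table: Dict[str, Vector] = {}
    for speaker, mels in utterances.items():
        codes = [encoder(mel) for mel in mels]
        table[speaker] = [fmean(dim) for dim in zip(*codes)]
    return table


def sgd_step(embedding: Vector, grad: Vector, lr: float = 0.1) -> Vector:
    """One fine-tuning update. After initialization the encoder is no longer
    consulted; the embedding is an ordinary trainable parameter."""
    return [w - lr * g for w, g in zip(embedding, grad)]


# Toy data: one speaker, two utterances, two frames each, two mel bins.
utterances = {"spk_a": [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]}
table = init_speaker_embeddings(utterances, mean_pool_encoder)
table["spk_a"] = sgd_step(table["spk_a"], grad=[1.0, -1.0], lr=0.5)
```

In a deep-learning framework, the table would be the weight matrix of an embedding layer and the update would come from the optimizer. The point of the sketch is only that the encoder output serves as a starting value, not as a fixed conditioning input.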
The one-hot speaker embedding is simple, but it is applicable only in limited scenarios. Is there any method for obtaining a universal speaker embedding?