Open · Liujingxiu23 opened 3 years ago
In my opinion, the utterance-level encoder is an alternative to an external speaker encoder model. So if you can use an external speaker encoder model to extract the speaker embedding, the results may be better.
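A minimal sketch of what I mean, assuming a FastSpeech-style encoder output of shape `(batch, time, model_dim)` and a pre-extracted speaker embedding (e.g. from a speaker-verification model); the class and dimension names here are just illustrative, not from this repo:

```python
import torch
import torch.nn as nn

class ExternalSpeakerConditioning(nn.Module):
    """Project a pre-extracted speaker embedding and broadcast-add it
    to the phoneme encoder outputs (one common way to condition on it)."""
    def __init__(self, spk_dim: int = 256, model_dim: int = 256):
        super().__init__()
        self.proj = nn.Linear(spk_dim, model_dim)

    def forward(self, encoder_out: torch.Tensor, spk_emb: torch.Tensor) -> torch.Tensor:
        # encoder_out: (batch, time, model_dim)
        # spk_emb:     (batch, spk_dim), e.g. from a speaker-verification model
        return encoder_out + self.proj(spk_emb).unsqueeze(1)

# usage sketch with dummy tensors
cond = ExternalSpeakerConditioning(spk_dim=256, model_dim=256)
enc = torch.randn(2, 50, 256)   # dummy phoneme encoder outputs
spk = torch.randn(2, 256)       # dummy speaker embeddings
out = cond(enc, spk)            # (2, 50, 256)
```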
@Liujingxiu23 https://github.com/CyberZHG/torch-layer-normalization/blob/master/torch_layer_normalization/layer_normalization.py works well. Yes, a speaker embedding generated by a speaker encoder used for speaker verification also works.
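For the Conditional Layer Normalization itself, here is a small sketch of how it could look, based on the AdaSpeech paper's description (scale and bias predicted from the speaker embedding by two linear layers); this is my own reading of the paper, not code from this repo, and the names are assumptions:

```python
import torch
import torch.nn as nn

class ConditionalLayerNorm(nn.Module):
    """LayerNorm whose scale and bias are predicted from a speaker embedding,
    in the spirit of AdaSpeech's Conditional Layer Normalization."""
    def __init__(self, hidden_dim: int, spk_dim: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.scale = nn.Linear(spk_dim, hidden_dim)  # predicts per-channel gamma
        self.bias = nn.Linear(spk_dim, hidden_dim)   # predicts per-channel beta

    def forward(self, x: torch.Tensor, spk_emb: torch.Tensor) -> torch.Tensor:
        # x:       (batch, time, hidden_dim)
        # spk_emb: (batch, spk_dim)
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        x_norm = (x - mean) / torch.sqrt(var + self.eps)
        gamma = self.scale(spk_emb).unsqueeze(1)  # (batch, 1, hidden_dim)
        beta = self.bias(spk_emb).unsqueeze(1)
        return gamma * x_norm + beta
```

During fine-tuning on a new speaker, only these two linear layers (plus the speaker embedding) would need to be updated, which is the point of the technique.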
@rishikksh20 Thank you for your reply. I am trying this and other similar methods to realize personalized TTS, where users record their own audio with a mobile phone. But the results are not very good; shaking and instability are the main problems in the synthesized wavs. I am wondering whether this is a vocoder problem, since I could not find a universal deep-learning vocoder.
My experiments showed that in a multi-speaker scenario the phoneme-level mel encoder encodes too much information. As a consequence, if the phoneme-level predictor is not capable enough, the performance drops a lot.
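One way to limit how much the phoneme-level encoder can leak is to average the mel frames within each phoneme (using the alignment durations) and squeeze the result through a very narrow bottleneck. The sketch below is just an illustration of that idea; the function names, network shape and the latent size of 4 are my own assumptions, not this repo's implementation:

```python
import torch
import torch.nn as nn

def average_mel_by_phoneme(mel: torch.Tensor, durations: torch.Tensor) -> torch.Tensor:
    """Average mel frames within each phoneme's duration.
    mel:       (time, n_mels)
    durations: (n_phonemes,) integer frame counts summing to time
    returns:   (n_phonemes, n_mels)
    """
    out, start = [], 0
    for d in durations.tolist():
        out.append(mel[start:start + d].mean(dim=0))
        start += d
    return torch.stack(out)

class PhonemeLevelEncoder(nn.Module):
    """Compress phoneme-averaged mels into a small latent, so the decoder
    cannot simply copy fine-grained acoustic detail from it."""
    def __init__(self, n_mels: int = 80, latent_dim: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_mels, 256),
            nn.ReLU(),
            nn.Linear(256, latent_dim),
        )

    def forward(self, phoneme_mels: torch.Tensor) -> torch.Tensor:
        # phoneme_mels: (n_phonemes, n_mels) -> (n_phonemes, latent_dim)
        return self.net(phoneme_mels)
```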
Hi, I have followed your work for several months and am really pleasantly impressed by how quickly you track new algorithms. For AdaSpeech, have you verified that the two acoustic encoders really help training for custom speakers? How does it compare to a speaker embedding generated by a speaker encoder used for a speaker verification task? And for the "Conditional Layer Normalization", you have not implemented it yet, right? Are the following references suitable if I implement it myself? Or can you give any suggestions on how to do this? https://github.com/exe1023/CBLN/blob/e395edc2d6d952497b411f81eae4aafb96749bc2/model/cbn.py https://github.com/CyberZHG/torch-layer-normalization/blob/master/torch_layer_normalization/layer_normalization.py