Conditional Layer Normalization

rishikksh20 / AdaSpeech

AdaSpeech: Adaptive Text to Speech for Custom Voice

Apache License 2.0


Conditional Layer Normalization #2

Open Liujingxiu23 opened 3 years ago

Liujingxiu23 commented 3 years ago

Hi, I have followed your work for several months and am pleasantly surprised at how quickly you pick up new algorithms. For AdaSpeech, have you verified that the two acoustic encoders really help training for custom speakers? How do they compare to a speaker embedding generated by a speaker encoder of the kind used in speaker verification tasks? And as for "Conditional Layer Normalization", you have not implemented it, right? Are the following references suitable if I implement it myself? Or can you give any suggestions on how to do this? https://github.com/exe1023/CBLN/blob/e395edc2d6d952497b411f81eae4aafb96749bc2/model/cbn.py https://github.com/CyberZHG/torch-layer-normalization/blob/master/torch_layer_normalization/layer_normalization.py

hoyden commented 3 years ago

In my opinion, the utterance-level encoder is an alternative to an external speaker encoder model. So if you can use an external speaker encoder model to extract the speaker embedding, that may work better.

rishikksh20 commented 3 years ago

@Liujingxiu23 https://github.com/CyberZHG/torch-layer-normalization/blob/master/torch_layer_normalization/layer_normalization.py this works well. Yes, a speaker embedding generated by a speaker encoder of the kind used in speaker verification works.
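
For anyone implementing this themselves, here is a minimal PyTorch sketch of conditional layer normalization as the AdaSpeech paper describes it: the layer norm's scale and bias are predicted from the speaker embedding by two linear layers instead of being fixed learned vectors. It is not code from this repo or from the linked references. The class name, the dimensions, and the identity-style initialization are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConditionalLayerNorm(nn.Module):
    """Layer norm whose scale and bias are predicted from a speaker embedding.

    Hypothetical sketch; names and initialization are illustrative choices.
    """

    def __init__(self, hidden_dim: int, speaker_dim: int, eps: float = 1e-5):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.eps = eps
        # Two linear layers map the speaker embedding to per-channel scale and bias.
        self.scale = nn.Linear(speaker_dim, hidden_dim)
        self.bias = nn.Linear(speaker_dim, hidden_dim)
        # Assumed init: start as a plain layer norm (scale = 1, bias = 0).
        nn.init.zeros_(self.scale.weight)
        nn.init.ones_(self.scale.bias)
        nn.init.zeros_(self.bias.weight)
        nn.init.zeros_(self.bias.bias)

    def forward(self, x: torch.Tensor, speaker_emb: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, hidden_dim); speaker_emb: (batch, speaker_dim)
        # Normalize over the hidden dimension without any learned affine terms.
        normed = F.layer_norm(x, (self.hidden_dim,), eps=self.eps)
        gamma = self.scale(speaker_emb).unsqueeze(1)  # (batch, 1, hidden_dim)
        beta = self.bias(speaker_emb).unsqueeze(1)    # (batch, 1, hidden_dim)
        # Broadcast the per-speaker scale and bias over the time axis.
        return gamma * normed + beta


# Usage with made-up sizes: 2 utterances, 50 frames, 256 hidden channels.
cln = ConditionalLayerNorm(hidden_dim=256, speaker_dim=64)
x = torch.randn(2, 50, 256)
spk = torch.randn(2, 64)
out = cln(x, spk)
```

With this design, adapting to a new speaker only needs to touch the speaker embedding and the two small linear layers, which is the motivation the paper gives for fine-tuning few parameters. The speaker embedding could come from either the utterance-level encoder or an external speaker encoder, as discussed above.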

Liujingxiu23 commented 3 years ago

@rishikksh20 Thank you for your reply. I am trying this and other similar methods to build personalized TTS from audio that users record on their mobile phones. But the results are not very good: shakiness and instability are the main problems in the synthesized wavs. I wonder whether the vocoder is the problem; I could not find a universal deep-learning-based vocoder.

MMingabc commented 2 years ago

My experiments showed that in a multi-speaker scenario the phoneme-level mel encoder encodes too much information. As a consequence, if the phoneme-level predictor is not capable enough, performance drops a lot.