Reproducing good results (as claimed in paper)

liusongxiang / efficient_tts

PyTorch implementation of "EfficientTTS: An Efficient and High-Quality Text-to-Speech Architecture"

MIT License

Reproducing good results (as claimed in paper) #6

Open ctlaltdefeat opened 3 years ago

ctlaltdefeat commented 3 years ago

Somewhat related to issue #2 which was closed, but I think it's safe to say that the latest samples posted do not seem to be close to converging towards the strong results that were claimed by the paper's authors, and it would be good to have an issue tracking speech quality.

It's somewhat puzzling given that the implementation seems to be on point except for the missing sigma hyperparameter values that you mentioned. I'm running my own hyperparameter experiments but so far haven't achieved anything competitive. If you have any ideas about what could be tried, let me know.

liusongxiang commented 3 years ago

Thank you very much for your attention, and sorry for the late reply. I contacted the authors of the paper and would like to post their reply here for your reference:

The sigmas in Equations 14 and 17 are 0.2 and 0.1, respectively.
The text encoder does not have two output streams, i.e., key = value.
The hidden dimensionalities of the position predictor are 384 and 256.
Input text sequences have <space> as leading and trailing tokens.
The authors use a dropout rate of 0.2.
LeakyReLU has a negative slope of 0.2.

However, the generated samples uploaded in this repo are the best ones I have obtained so far (the YAML config file is in the egs folder). I hope this helps us obtain better results.
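
For anyone trying to reproduce this, the values relayed from the authors above could be collected in one place, roughly as in the sketch below. This is only an illustration: the class name, field names, and the pad_with_space helper are hypothetical and are not keys or functions from this repo or its YAML config. Where dropout is applied is not stated by the authors, so the sketch records only the rate.

```python
from dataclasses import dataclass
from typing import List, Tuple

SPACE_TOKEN = "<space>"  # token name as quoted in the authors' reply


@dataclass(frozen=True)
class ReportedHParams:
    """Values relayed from the paper's authors in this thread.

    Field names are hypothetical; they are NOT keys from the repo's YAML config.
    """

    sigma_eq14: float = 0.2  # sigma in Equation 14 of the paper
    sigma_eq17: float = 0.1  # sigma in Equation 17 of the paper
    share_key_value: bool = True  # text encoder has one output stream (key = value)
    position_predictor_hidden: Tuple[int, int] = (384, 256)
    dropout_rate: float = 0.2  # which layers use dropout is not stated
    leaky_relu_negative_slope: float = 0.2


def pad_with_space(tokens: List[str]) -> List[str]:
    """Return a copy of tokens with <space> as the first and last token."""
    return [SPACE_TOKEN, *tokens, SPACE_TOKEN]


if __name__ == "__main__":
    hp = ReportedHParams()
    print(hp)
    # The phoneme symbols below are only example input.
    print(pad_with_space(["HH", "AH", "L", "OW"]))
```

In a PyTorch model, leaky_relu_negative_slope would correspond to the negative_slope argument of torch.nn.LeakyReLU and dropout_rate to the p argument of torch.nn.Dropout.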

ctlaltdefeat commented 3 years ago

I've been trying these values without much luck. Do we know where the authors used dropout? Perhaps dropout was used only in some of the layers of some of the components.

attitudechunfeng commented 3 years ago

I'm seeing similar results; there may be other tricks that are not described in the paper.

Liujingxiu23 commented 3 years ago

@liusongxiang Did you train on the Biaobei dataset using the same config as ./egs/lj/conf/efficient_tts_cnn_phnseq_noDropout.v1.yaml?

liusongxiang commented 3 years ago

@Liujingxiu23 Yes, exactly.

liusongxiang commented 3 years ago

@Liujingxiu23 Thanks for your attention. I haven't tried end-to-end training yet since I have been tied up with other things. If you are interested, I think you could try combining this repo with the ParallelWaveGAN repo.

attitudechunfeng commented 3 years ago

@liusongxiang I see you implemented two delta_e prediction methods. Which one do you use: delta_e_method_1 or the other one?

liusongxiang commented 3 years ago

Hi all, all the parameters are as shown in efficient_tts/egs/lj/conf/efficient_tts_cnn_phnseq_noDropout.v1.yaml. I have tried other settings, but this seems to be the best one for both the LJSpeech and Biaobei datasets.