ttsunion / Deep-Expression

An Attention Based Open-Source End to End Speech Synthesis Framework, No CNN, No RNN, No MFCC!!!

What are the inputs of the network? #4

Closed luweishuang closed 6 years ago

luweishuang commented 6 years ago

In train.py, you get the predicted wav with "ypred = sess.run(yhat, feed_dict = {x: labels, y: wavs})". I don't understand: a TTS system's input is text and its output is a wav, so is your project doing a TTS task?

FonzieTree commented 6 years ago

Hi @luweishuang, thank you for your interest. During training, the inputs are both the text (X) and the shifted target signals (shifted Y); during inference, the input is only the text (X), and we predict one frame at each step, like an RNN, using the attended X and the frames already predicted.

This idea was proposed in "Attention Is All You Need" (https://arxiv.org/pdf/1706.03762.pdf), where the authors write: "Most competitive neural sequence transduction models have an encoder-decoder structure. Here, the encoder maps an input sequence of symbol representations (x1, ..., xn) to a sequence of continuous representations z = (z1, ..., zn). Given z, the decoder then generates an output sequence (y1, ..., ym) of symbols one element at a time. At each step the model is auto-regressive, consuming the previously generated symbols as additional input when generating the next."

You can also visit my earlier project https://github.com/FonzieTree/Attention-is-all-you-need to see how attention can be implemented with only numpy, without any deep learning framework. I hope this answers your question. I will write an inference function for Deep-Expression soon, when I am less busy.
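The training/inference split described above can be sketched in numpy. This is only an illustration, not the project's actual code: `shift_right` shows the teacher-forced decoder input built from the targets during training, and `autoregressive_decode` shows the frame-by-frame inference loop; the `step_fn` here is a toy stand-in for the real attention decoder.

```python
import numpy as np

def shift_right(y, start_frame=0.0):
    """Training-time decoder input (teacher forcing): shift the target
    frames right by one step, so frame t is predicted from frames < t."""
    shifted = np.roll(y, 1, axis=0)        # np.roll returns a copy
    shifted[0] = start_frame               # hypothetical start-of-sequence frame
    return shifted

def autoregressive_decode(step_fn, n_frames, frame_dim):
    """Inference: no target signal is available, so predict one frame per
    step and feed the already-predicted frames back in, as in the
    Transformer's auto-regressive decoder."""
    y = np.zeros((n_frames, frame_dim))
    for t in range(n_frames):
        # step_fn sees only the frames generated so far
        y[t] = step_fn(y[:t])
    return y

# Toy step function: each new frame is the mean of the previous frames
# plus one (a placeholder for attending over X and the previous Y).
demo = autoregressive_decode(
    lambda prev: (prev.mean(axis=0) if len(prev) else np.zeros(2)) + 1.0,
    n_frames=3, frame_dim=2)
```

At inference time the only real input is the encoded text; the decoder bootstraps from a start frame and consumes its own outputs, which is why no wav is fed to the network.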