r9y9 / wavenet_vocoder

WaveNet vocoder
https://r9y9.github.io/wavenet_vocoder/

About generation and input&output types #93

Closed evinpinar closed 6 years ago

evinpinar commented 6 years ago

Hey, thank you for your code; it is very helpful for understanding how the model works. I am implementing WaveNet on my own from scratch and have some questions:

r9y9 commented 6 years ago

For the second question, it's just sampling from the categorical distribution conditioned on previously generated samples: x_{t} ~ p(x_{t} | x_1, x_2, ..., x_{t-1}, c_1, c_2, ..., c_{T}), where x_{t} and c_{t} are the sample and the conditional feature at time t, respectively. https://towardsdatascience.com/the-softmax-function-neural-net-outputs-as-probabilities-and-ensemble-classifiers-9bd94d75932 might help you understand what the softmax output does.
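To make the sampling step concrete, here is a minimal sketch in plain Python (not the code from this repository). The names `softmax` and `sample_categorical` and the four example logits are made up for illustration; a real WaveNet output layer typically has many more classes (for example 256 for 8-bit mu-law audio).

```python
import math
import random

def softmax(logits):
    """Convert raw network outputs (logits) into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_categorical(probs, rng=random):
    """Draw one class index i with probability probs[i]."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Hypothetical logits over 4 quantization classes (illustration only).
logits = [2.0, 1.0, 0.1, -1.0]
probs = softmax(logits)

rng = random.Random(0)  # fixed seed so the run is repeatable
draws = [sample_categorical(probs, rng) for _ in range(10000)]
counts = [draws.count(i) for i in range(len(probs))]
print(probs)
print(counts)  # more likely classes are drawn more often, but all classes appear
```

The most probable class is drawn most often, yet lower-probability classes still show up in proportion to their probability, which is what x_{t} ~ p(x_{t} | ...) means.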

As for the first question, you might want to check that your incremental generation is correct. It could also happen if your dataset has a lot of silent regions and your model has fitted to them, resulting in generating only silence.
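One rough way to check the silence hypothesis is to measure how much of each training waveform is near zero. The sketch below is only an illustration: `silence_fraction` is a hypothetical helper, the threshold of 0.01 is an arbitrary assumption, and the waveform is assumed to be a list of floats in [-1, 1].

```python
def silence_fraction(waveform, threshold=0.01):
    """Fraction of samples whose absolute amplitude is below `threshold`.

    Assumes `waveform` is a sequence of floats in [-1, 1]; the threshold
    is an arbitrary illustrative choice, not a recommended value.
    """
    if not waveform:
        return 0.0
    quiet = sum(1 for s in waveform if abs(s) < threshold)
    return quiet / len(waveform)

# Synthetic example: 800 silent samples followed by 200 louder ones.
example = [0.0] * 800 + [0.5, -0.4] * 100
frac = silence_fraction(example)
print(frac)
```

If a large share of the training samples is silent, the most likely target class is almost always the one for zero amplitude, and a model can reach a low loss by predicting mostly that class.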

evinpinar commented 6 years ago

For the second question: yeah, I did not get why you choose a random sample instead of the one with the maximum value, which has the highest probability.

First question: even though I do standard trimming on the LJSpeech dataset, I might need to check again. Otherwise, I probably have an implementation error.

r9y9 commented 6 years ago

We want to sample from a generative model. Choosing a sample with the highest probability makes more sense in classification tasks.

https://deepmind.com/blog/wavenet-generative-model-raw-audio/

At training time, the input sequences are real waveforms recorded from human speakers. After training, we can sample the network to generate synthetic utterances. At each step during sampling a value is drawn from the probability distribution computed by the network. This value is then fed back into the input and a new prediction for the next step is made.
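The feedback loop described in the quote can be sketched with a toy stand-in for the network. Everything here is hypothetical: `toy_model` is not a trained WaveNet, it simply puts probability 0.6 on repeating the previous value and 0.2 on each of the other two classes. The sketch contrasts drawing from the distribution with always taking the most probable value.

```python
import random

NUM_CLASSES = 3

def toy_model(history):
    """Hypothetical stand-in for a trained network: returns p(x_t | history).

    Puts 0.6 on repeating the last value and 0.2 on each other class.
    """
    last = history[-1]
    return [0.6 if k == last else 0.2 for k in range(NUM_CLASSES)]

def generate(num_steps, greedy, seed=0):
    """Autoregressive generation: each new value is fed back as input."""
    rng = random.Random(seed)
    samples = [0]  # arbitrary initial value
    for _ in range(num_steps):
        probs = toy_model(samples)
        if greedy:
            # Always take the most probable class (argmax).
            x = max(range(NUM_CLASSES), key=lambda k: probs[k])
        else:
            # Draw a class according to the predicted distribution.
            x = rng.choices(range(NUM_CLASSES), weights=probs, k=1)[0]
        samples.append(x)
    return samples

greedy_out = generate(50, greedy=True)
sampled_out = generate(50, greedy=False)
print(greedy_out)   # stays on the initial class forever
print(sampled_out)  # moves between classes
```

With this toy model, the argmax version never leaves its starting value, while the sampled version varies over time. That is one intuition for why drawing from the distribution suits generation better than picking the maximum.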

evinpinar commented 6 years ago

Oh I see, you are randomly sampling from the distribution, not directly taking the value with the maximum probability. Thanks!