About generation and input&output types

r9y9 / wavenet_vocoder

WaveNet vocoder

https://r9y9.github.io/wavenet_vocoder/

Other

2.31k stars 500 forks source link

About generation and input&output types #93

Closed evinpinar closed 6 years ago

evinpinar commented 6 years ago

Hey, thank you for your code, it is very helpful to understand how the model works. I am implementing the wavenet on my own from scratch and have some questions:

I give scalar inputs to the model, and quantized targets to find out the cross entropy loss. Then in generation, i decode the output by mulaw into scalar value in range [-1,1], expand the generation audio and refeed into model for next sample. It works when I try to overfit a small data and generate. As well, when I train with whole dataset, the loss decreases down. However, when generating after ~700K steps without conditioning, it only produces a constant value. What could be the problem?
In your code, I see this part in incremental generation: x = F.softmax(x.view(B, -1), dim=1) if softmax else x.view(B, -1) if quantize: sample = np.random.choice( np.arange(self.out_channels), p=x.view(-1).data.cpu().numpy()) I am not sure why you randomly choose a value here.

r9y9 commented 6 years ago

For the second question, it's just a sampling from the categorical distribution conditioned on previously generated samples; x_{t} ~ p(x_{t} | x_1, x_2, ..., x_{t-1}, c_1, c_2, ..., c_{T}), where x_{t} and c_{t} is a sample and conditional feature at time t, respectively. https://towardsdatascience.com/the-softmax-function-neural-net-outputs-as-probabilities-and-ensemble-classifiers-9bd94d75932 might help you understand what the softmax output does.

As for the first question, you might want to see if your incremental generation is correct, but it could happen if your dataset has a lot of silence regions and your model might fitted to the silence regions, resulting in generating only silences.

evinpinar commented 6 years ago

For second question: Yeah i did not get why you choose a random sample instead of the one with maximum value, which has the highest probability.

First question: Even though I do standard trimming on LJspeech dataset, I might need to check again. Otherwise, probably i have an implementation error..

r9y9 commented 6 years ago

We want to sample from a generative model. Choosing a sample with the highest probability makes more sense in classification tasks.

https://deepmind.com/blog/wavenet-generative-model-raw-audio/

At training time, the input sequences are real waveforms recorded from human speakers. After training, we can sample the network to generate synthetic utterances. At each step during sampling a value is drawn from the probability distribution computed by the network. This value is then fed back into the input and a new prediction for the next step is made.

evinpinar commented 6 years ago

Oh i see, you are randomly sampling from the distribution, not directly getting the result with the maximum value. Thanks!