ibab / tensorflow-wavenet

A TensorFlow implementation of DeepMind's WaveNet paper
MIT License

argmax or random.choice in generate? #347

Open HyperGD1994 opened 6 years ago

HyperGD1994 commented 6 years ago

In generate.py, random.choice is used with scaled_prediction to pick the next sample. I'm confused about why it doesn't use argmax to choose the highest prediction every time.

I tried argmax, but it doesn't work: the output is silence the whole time. Does anyone have any idea? Thanks.

joe-antognini commented 6 years ago

WaveNet predicts a probability distribution for the next sample of the waveform. In general, always picking the mode of a probability distribution will result in a sample which is very unrealistic.

To see why you are getting silence, suppose that you train on a dataset where the first 500 ms is silence and the second 500 ms is speech. If WaveNet sees that there was silence in its input, it will predict that the next sample is very likely to be silence, but there is some small probability that it is not silence, because the speech has to start somewhere, after all. If you randomly sample from this probability distribution, you will get silence for a little while, and then at some point you will get something other than silence (which will hopefully sound like speech). But if you always pick the most likely value, you will always pick silence and you will never get speech.
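The argument above can be tried out on a toy model. This is only a sketch under made-up assumptions: the "waveform" has just two values (0 for silence, 1 for sound), and the table `NEXT_PROBS` holds invented probabilities standing in for the network's prediction. It is not taken from generate.py.

```python
import numpy as np

# Hypothetical next-sample distributions for a toy two-value "waveform":
# index 0 = silence, index 1 = sound. Each row is conditioned on the
# previous sample. The numbers are invented for illustration.
NEXT_PROBS = np.array([
    [0.99, 0.01],  # after silence: almost always more silence
    [0.05, 0.95],  # after sound: usually more sound
])


def generate(n_steps, pick, start=0):
    """Generate n_steps samples, choosing each one with `pick(probs)`."""
    samples = [start]
    for _ in range(n_steps):
        probs = NEXT_PROBS[samples[-1]]
        samples.append(pick(probs))
    return np.array(samples)


rng = np.random.default_rng(0)

# Greedy decoding: always take the mode of the predicted distribution.
greedy = generate(2000, pick=lambda p: int(np.argmax(p)))

# Random sampling: draw the next value according to its probability.
sampled = generate(2000, pick=lambda p: int(rng.choice(len(p), p=p)))

print("greedy non-silent samples: ", int(greedy.sum()))
print("sampled non-silent samples:", int(sampled.sum()))
```

In this toy setup the greedy run starts in silence and can never leave it, because silence is always the most likely next value. The sampled run occasionally draws the unlikely "sound" value and then tends to stay there.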

Ahapy commented 4 years ago

How do I avoid choosing the most likely value?
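One way to avoid always taking the most likely value is to draw an index at random, weighted by the predicted probabilities, which is what the original question describes generate.py doing with random.choice. Below is a minimal sketch with NumPy. The function name `sample_next` and the `temperature` parameter are assumptions made for illustration, and the scaling shown is not necessarily how scaled_prediction is computed in the repository.

```python
import numpy as np


def sample_next(probs, temperature=1.0, rng=None):
    """Draw an index from `probs` instead of taking the argmax.

    `temperature` is a hypothetical knob: values below 1 sharpen the
    distribution (closer to argmax), values above 1 flatten it.
    """
    if rng is None:
        rng = np.random.default_rng()
    probs = np.asarray(probs, dtype=np.float64)
    # Rescale in log space, then renormalise so the result sums to 1.
    logits = np.log(probs + 1e-12) / temperature
    logits -= logits.max()  # for numerical stability
    scaled = np.exp(logits)
    scaled /= scaled.sum()
    return int(rng.choice(len(scaled), p=scaled))


rng = np.random.default_rng(0)
probs = [0.90, 0.07, 0.03]  # made-up prediction; index 0 is the mode

draws = [sample_next(probs, rng=rng) for _ in range(1000)]
print("argmax would always give:", int(np.argmax(probs)))
print("indices seen when sampling:", sorted(set(draws)))
```

With sampling, the less likely indices still appear from time to time, roughly in proportion to their probability, rather than never.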

b7amine commented 3 years ago

The TensorFlow translation example (encoder-decoder) brought me here: https://colab.research.google.com/github/tensorflow/text/blob/master/docs/tutorials/nmt_with_attention.ipynb

I'm not sure how @joe-antognini's answer applies to translation, but I like the idea that "always picking the mode of a probability distribution will result in a sample which is very unrealistic", especially when we're talking about human-related themes such as language.