r9y9 / wavenet_vocoder

WaveNet vocoder
https://r9y9.github.io/wavenet_vocoder/

why not use greedy decoding? #111

Closed wenyong-h closed 5 years ago

wenyong-h commented 6 years ago

Similar issues: #93 #3. In most supervised sequence-to-sequence tasks, e.g. neural machine translation, we use greedy decoding or beam search to find the most likely output sequence. It seems odd to use sampling-based decoding here, since we risk sampling a bad (low-probability) wav sequence. Since at training time we maximize the probability of the target wav sequence, at inference time we should also try to select the sequence with the maximum probability.
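
For concreteness, here is a minimal sketch (not this repo's code) of the two decoding choices being discussed for one autoregressive step, assuming a softmax output over 256 mu-law classes and a placeholder `logits` tensor:

```python
# Minimal sketch (not this repo's code): one autoregressive step with a
# softmax output over 256 mu-law classes; `logits` here is a stand-in.
import torch
import torch.nn.functional as F

logits = torch.randn(1, 256)                        # placeholder for one step of model output
probs = F.softmax(logits, dim=-1)

sampled = torch.multinomial(probs, num_samples=1)   # sampling-based decoding (current behavior)
greedy = probs.argmax(dim=-1, keepdim=True)         # greedy decoding (what this issue suggests)
```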

wenyong-h commented 6 years ago

As for the DeepMind blog post that mentions decoding based on sampling (https://deepmind.com/blog/wavenet-generative-model-raw-audio/): I think it describes the unconditional version of WaveNet, and sampling is reasonable in the unconditional case.

jiqizaisikao commented 6 years ago

Yes, I have the same doubts as you, but the peak of the distribution is usually sharp, so there are typically no serious problems. Still, I think we should choose the center value of the peak as the output.

geneing commented 6 years ago

For a single Gaussian as the output distribution (for parallel_wavenet_encoder) I have tried greedy decoding for synthesis. The resulting sound is of much lower quality than when sampling from the Gaussian distribution. It makes no sense!
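
As a minimal sketch of what "greedy" means for a single-Gaussian output head (parameter names assumed, not this repo's code), greedy decoding collapses to emitting the mean, while normal synthesis draws a random sample:

```python
# Minimal sketch (assumed parameter names, not this repo's code).
import torch

mu, log_scale = torch.tensor(0.1), torch.tensor(-3.0)  # stand-ins for one step of model output
scale = torch.exp(log_scale)

greedy_out = mu                              # deterministic: mean (= mode) of the Gaussian
sampled_out = mu + scale * torch.randn(1)    # stochastic: what synthesis normally does
```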

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

TheButlah commented 5 years ago

Wouldn't the ideal distribution be multimodal? If so, we don't want to greedily take the argmax of the probabilities; we want to sample instead.

I agree, however, that merely sampling from the probabilities gives a chance of getting some low-quality outputs. Here is an idea for how to avoid that; let me know what you think:

Goal: threshold the probabilities so we only accept samples above some predetermined threshold, ensuring high-quality samples.

Issue: how do you set the threshold? More importantly, what happens if the probabilities are unusually spread out and the threshold causes you to only pick from the largest peak, or, even worse, no probabilities get past the threshold?

There are two solutions I could think of for this, and one simpler one that does not seem to fully address the issues I just listed:

Solution Approach 1 (a code sketch follows this list):

  1. Find the maximum probability value p_max. Multiply it by some fraction f that plays the role of the originally proposed threshold; c = p_max * f is the cutoff below which probabilities are considered too low.
  2. Generate a boolean mask that is 1 for elements of the tensor at or above c and 0 for elements below it. The zeroed-out elements are the ones we do not want to select.
  3. Sum up the values of the elements below the threshold. We will redistribute this mass to the remaining probabilities in a later step.
  4. Multiply the mask with the original tensor to zero out the probabilities that fall below the cutoff.
  5. Sum the elements of the mask. This gives the number of elements that were at or above the cutoff.
  6. Divide the sum from step 3 by the count from step 5 to compute the average probability to redistribute.
  7. Add that amount to each element of the tensor that was at or above the cutoff.
  8. Sample from this new probability tensor. This ensures that all probabilities sum to one and that you never select a probability below the cutoff!
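
A PyTorch sketch of Approach 1 (hypothetical helper, not part of this repo), assuming `probs` is a 1-D tensor of class probabilities:

```python
# Sketch of Approach 1: zero out probabilities below c = p_max * f and
# redistribute their mass evenly over the surviving entries before sampling.
import torch

def sample_above_cutoff(probs: torch.Tensor, f: float = 0.1) -> torch.Tensor:
    c = probs.max() * f                           # step 1: cutoff derived from the peak probability
    keep = (probs >= c).float()                   # step 2: 1 for entries at/above the cutoff, 0 below
    dropped_mass = (probs * (1.0 - keep)).sum()   # step 3: total mass below the cutoff
    kept = probs * keep                           # step 4: zero out the low-probability entries
    n_kept = keep.sum()                           # step 5: number of surviving entries
    kept = kept + keep * (dropped_mass / n_kept)  # steps 6-7: redistribute the dropped mass evenly
    return torch.multinomial(kept, num_samples=1) # step 8: sample from the truncated distribution
```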

Solution Approach 2 (sketched in code after this list): instead of a probability threshold, use some top percentage of the entries.

  1. Select the top k entries (using torch.topk or tf.top_k), where k is based on your own judgement of what proportion/number of the probabilities will be high. Note that top-k selection is cheaper than a full sort: roughly O(n log k) rather than O(n log n).
  2. Sum those k probabilities and call the result s. This will be less than one because the remaining probabilities are not in the sum.
  3. Compute the average probability to redistribute as (1 - s) / k.
  4. Add this value to each of the k probabilities. The top k probabilities now sum to one.
  5. Sample from the k items!
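
A PyTorch sketch of Approach 2 (hypothetical helper), again assuming `probs` is a 1-D tensor of class probabilities:

```python
# Sketch of Approach 2: keep only the top-k probabilities, redistribute the
# leftover mass evenly, then sample among them.
import torch

def sample_top_k(probs: torch.Tensor, k: int = 16) -> torch.Tensor:
    top_vals, top_idx = torch.topk(probs, k)             # step 1: k largest probabilities
    s = top_vals.sum()                                   # step 2: mass covered by the top k
    top_vals = top_vals + (1.0 - s) / k                  # steps 3-4: spread the leftover mass evenly
    choice = torch.multinomial(top_vals, num_samples=1)  # step 5: sample among the k entries
    return top_idx[choice]                               # map back to the original class index
```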

Fast (?) but bad method (sketched in code after this list):

  1. Sample from the probabilities
  2. If the probability you sampled is lower than a cutoff c, resample
  3. Do this until you get a probability higher than c. If you have resampled several times, use the argmax as a fallback option.
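
A sketch of this rejection-sampling variant (hypothetical helper), with the cutoff `c` and the retry limit chosen arbitrarily:

```python
# Sketch of the resample-then-fallback method: resample until the drawn class
# has probability at least c, falling back to argmax after a fixed number of tries.
import torch

def sample_or_fallback(probs: torch.Tensor, c: float, max_tries: int = 10) -> torch.Tensor:
    for _ in range(max_tries):
        idx = torch.multinomial(probs, num_samples=1)
        if probs[idx] >= c:                    # accept only sufficiently likely classes
            return idx
    return probs.argmax(dim=-1, keepdim=True)  # fallback after repeated rejection: greedy choice
```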

You would have to benchmark these approaches to see which is fastest, and to see whether the added computational cost is worth the benefit of avoiding low-probability samples.

There might be some other more mathematically sound way of doing this, but this makes sense to me :)

TheButlah commented 5 years ago

OK, after learning about mixture models, I've realized that none of the above is necessary as long as you are using the Mixture of Logistics (MoL) version of WaveNet rather than the softmax one; MoL is active by default as long as the input type is not "mulaw-quantize". With out_channels = 10 * 3, the output distribution is a mixture of ten logistic distributions, where each component has 3 parameters: mu, log_scale, and the mixture coefficient (the weighting coefficient for that component) pi.

Each logistic distribution can be seen as accounting for one of the modes in a potentially multimodal prediction: pi controls the relative likelihood of each mode, mu controls where the mode is located, and log_scale controls the width of the peak.

Ideally, a properly trained network will produce small (very negative) log_scale values, making the modes narrow. In that case simply sampling from the distribution is not a big deal, as @jiqizaisikao has stated: the peaks are narrow, and the likelihood of getting a bad sample is low.

However, if you're still concerned, then simply do the following:

The approach outlined previously for softmax distributions is slow, but we can take advantage of the structure of the MoL distribution to adapt the ideas above (sketched in code after the list):

  1. Take the top k of the output channels corresponding to the different pi parameters, where k is the number of peaks you want to consider.
  2. Normalize the resulting k probabilities to sum to one.
  3. Sample from a k-way categorical distribution parameterized by the normalized pi values to decide which peak to use (this is fast).
  4. Use the mu (mean = mode = median) of the selected peak by indexing into the output channels.
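
A PyTorch sketch of this MoL shortcut. The channel layout assumed here ([logit_pi_1..10, mu_1..10, log_scale_1..10]) is an assumption for illustration and may differ from the repo's actual ordering:

```python
# Sketch of the MoL peak-selection idea: pick among the k most likely mixture
# components, then emit that component's mu directly instead of sampling.
import torch
import torch.nn.functional as F

def mol_peak_output(params: torch.Tensor, num_mix: int = 10, k: int = 3) -> torch.Tensor:
    logit_pi = params[:num_mix]                   # mixture weights (unnormalized)
    mu = params[num_mix:2 * num_mix]              # per-component means
    top_logit, top_idx = torch.topk(logit_pi, k)  # step 1: k most likely peaks
    pi = F.softmax(top_logit, dim=-1)             # step 2: renormalize over those k peaks
    choice = torch.multinomial(pi, num_samples=1) # step 3: pick one peak
    return mu[top_idx[choice]]                    # step 4: emit that peak's mu directly
```
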
stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.