r9y9 / wavenet_vocoder

WaveNet vocoder
https://r9y9.github.io/wavenet_vocoder/

Better Quality? #40

Closed PetrochukM closed 5 years ago

PetrochukM commented 6 years ago

The samples from this repo sound too real to be true. What's the difference between these repos?

https://github.com/kan-bayashi/PytorchWaveNetVocoder

bpotard commented 6 years ago

The acoustic conditioning features are different: r9y9 uses 80 mel-spectrogram features with a 62 ms step, while kan-bayashi uses 28 acoustic features (mcep + f0 + noise) with a 5 ms step. The WaveNet setup is also a bit different; if I am not mistaken, kan-bayashi uses the equivalent of: "layers": 30, "stacks": 3, "kernel_size": 2
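
For concreteness, here is a minimal sketch of how that setup might be expressed with this repo's hyperparameters. The key names (layers, stacks, kernel_size, cin_channels) and the hparams.parse override mechanism do exist in r9y9's hparams.py; the exact values below just mirror the numbers quoted above and are assumptions, not the repo's defaults:

```python
# Hedged sketch: override this repo's hparams to approximate the
# kan-bayashi-equivalent setup described above.
from hparams import hparams  # this repo's hparams.py

# cin_channels=28 is the WORLD conditioning dimension mentioned above;
# layers/stacks/kernel_size match the quoted config.
hparams.parse("cin_channels=28,layers=30,stacks=3,kernel_size=2")
print(hparams.layers, hparams.stacks, hparams.kernel_size, hparams.cin_channels)
```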

So kan-bayashi uses denser acoustic conditioning (a smaller vector for each frame, so lower spectral precision, but a much higher frame rate, so better time precision) and a slightly larger and deeper WaveNet. I believe the higher frame rate of the input conditioning explains most of the difference in acoustic quality.
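
To put rough numbers on that trade-off (pure arithmetic from the figures quoted above; the 16 kHz sample rate is an assumption):

```python
# Compare conditioning frame rates and the implied upsampling factor
# needed to reach the waveform rate.
sr = 16000  # assumed sample rate
for name, step_ms, dim in [("r9y9 (mel)", 62, 80),
                           ("kan-bayashi (WORLD)", 5, 28)]:
    frames_per_s = 1000.0 / step_ms
    hop = int(sr * step_ms / 1000)           # samples per conditioning frame
    print(f"{name}: {dim}-dim x {frames_per_s:.0f} frames/s "
          f"(upsampled x{hop} to the waveform rate)")
```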

But you can reproduce this setup fairly easily with https://github.com/r9y9/wavenet_vocoder by using WORLD for the acoustic feature extraction, along the lines of the sketch below.
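
A hedged sketch of WORLD-style feature extraction, assuming pyworld and pysptk are installed. The dimensions and parameters are illustrative; kan-bayashi's exact 28-dim recipe may differ, and the "noise" component (e.g. band aperiodicity via pyworld.d4c) is omitted here for brevity:

```python
import numpy as np
import pyworld
import pysptk

fs = 16000
frame_period = 5.0  # ms, matching the 5 ms step mentioned above

def world_features(x, order=24, alpha=0.41):
    """Extract f0 + mel-cepstrum conditioning features with WORLD.
    order/alpha are illustrative values, not kan-bayashi's exact ones."""
    x = x.astype(np.float64)
    f0, t = pyworld.harvest(x, fs, frame_period=frame_period)  # F0 contour
    sp = pyworld.cheaptrick(x, f0, t, fs)                      # spectral envelope
    mcep = pysptk.sp2mc(sp, order=order, alpha=alpha)          # -> mel-cepstrum
    return np.hstack([f0[:, None], mcep])                      # (T, order + 2)
```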

From my own experiments, the vocoding sounds better with the wavenet setup above, but it requires a lot of memory and time to train.

r9y9 commented 6 years ago

Another difference is that they use a noise-shaping technique to reduce quantization noise, while we don't.

You can find the paper here: Mel-cepstrum based quantization noise shaping applied to speech synthesis based on WaveNet.
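
For reference, a rough sketch of the idea, assuming pysptk (MLSADF, Synthesizer, and mc2b are real pysptk APIs, but the procedure below is an illustration, not the paper's exact implementation): inverse-filter the waveform with an averaged-mel-cepstrum MLSA filter before mu-law quantization, then apply the forward filter to the generated waveform, so the quantization noise is spectrally shaped like speech and masked by it.

```python
import numpy as np
import pysptk
from pysptk.synthesis import MLSADF, Synthesizer

ALPHA, ORDER, HOP = 0.41, 24, 80  # illustrative 16 kHz-ish settings

def _mlsa_filter(x, mc, invert):
    # mc: a single averaged mel-cepstrum vector of length ORDER + 1.
    # Negating the mel-cepstrum gives the inverse filter (mc2b is linear).
    b = pysptk.mc2b(-mc if invert else mc, ALPHA)
    B = np.tile(b, (len(x) // HOP, 1))  # time-invariant: same coefs per frame
    synth = Synthesizer(MLSADF(order=ORDER, alpha=ALPHA), HOP)
    return synth.synthesis(x[: len(B) * HOP], B)

def shape(x, mc_mean):    # apply before mu-law quantization / training
    return _mlsa_filter(x, mc_mean, invert=True)

def restore(y, mc_mean):  # apply to WaveNet's generated waveform
    return _mlsa_filter(y, mc_mean, invert=False)
```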

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.