twidddj / tf-wavenet_vocoder

Wavenet and its applications with Tensorflow
MIT License

Synthesis results of vocoder #1

Open twidddj opened 6 years ago

twidddj commented 6 years ago

Single speaker

We stopped training at 680K steps. You can find some results at https://twidddj.github.io/docs/vocoder.

We tested the vocoder on two groups of samples: 1) samples from the dataset and 2) samples generated by Tacotron.

This is because of a mistake on my part (so sorry, I did not set aside separate test data).

However, I believe the results show the performance to some extent. See the first section on the page.

In the other sections, you can gauge the performance of the vocoder.

It can generate audio comparable to the target using only the target's mel-spectrum.
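
As a rough illustration, here is a minimal sketch of how such a mel-spectrum local condition can be extracted with librosa; the FFT size, hop length, and number of mel bands below are illustrative assumptions, not the exact hyperparameters of this repository.

```python
# Sketch: extract a log-mel-spectrogram local condition with librosa.
# The hyperparameters are assumptions, not necessarily those of this repo.
import librosa
import numpy as np

def extract_mel(path, sr=22050, n_fft=1024, hop_length=256, n_mels=80):
    y, _ = librosa.load(path, sr=sr)  # load and resample to the vocoder's rate
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels)
    return np.log(np.maximum(mel, 1e-5)).T  # (frames, n_mels), log-compressed

mel = extract_mel("target.wav")
```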

Moreover, some parts of the results have better quality than the target (I hope you think so too). Note that Tacotron was trained on audio with a 24 kHz sample rate, while our vocoder was trained at 22 kHz. This means the vocoder has never seen frequencies above 11 kHz, the Nyquist limit of its training data. Therefore, if you match the sample rates, your results should be better than the ones we reported.
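
For example, one way to match the rates with librosa before feature extraction (the file name and rates here are illustrative):

```python
# Sketch: resample Tacotron-side audio to the vocoder's training rate.
import librosa

y, _ = librosa.load("tacotron_out.wav", sr=24000)  # Tacotron-side rate
y_22k = librosa.resample(y, orig_sr=24000, target_sr=22050)  # vocoder-side rate
```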

By the way, we believe the pre-trained model can be used as a teacher model for parallel wavenet.

Parallel Wavenet - Single speaker

Not yet tested.

Multi speaker

Not yet tested.

SPGB commented 6 years ago

Impressive results! Do you have any samples trained on music or guidance for someone seeking to create those samples?

twidddj commented 6 years ago

@SPGB Welcome! Although we haven't tested this model on music data, we may be able to give you some guidance. What is the purpose of your model?

SPGB commented 6 years ago

@twidddj Thank you for the response. I'm hoping to use it to generate musical instrument sounds (for example, a drum loop or bass sounds). So far my results have mostly been static.

I know I won't be able to replicate the reference samples such as the piano (https://storage.googleapis.com/deepmind-media/pixie/making-music/sample_1.wav) but maybe with the right parameters and enough time an approximation is possible?

twidddj commented 6 years ago

You are probably interested in neural synthesizers.

Then a WaveNet vocoder can help you generate the sounds when the right encoded features are given as the local condition (like what they did). For example, you could use pitch as a local condition and timbre as a global condition for the vocoder.
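
As a rough sketch of what those conditioning features might look like (the names and shapes are assumptions for illustration, not this repository's API), using librosa's pyin pitch tracker:

```python
# Illustrative conditioning features for instrument synthesis
# (names/shapes are assumptions, not this repository's API).
import librosa
import numpy as np

def pitch_local_condition(path, sr=22050, hop_length=256):
    y, _ = librosa.load(path, sr=sr)
    f0, voiced, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"),
        sr=sr, hop_length=hop_length)
    f0 = np.where(voiced, f0, 0.0)         # zero out unvoiced/silent frames
    return f0[:, None].astype(np.float32)  # (frames, 1) local condition

local_cond = pitch_local_condition("bass_note.wav")
global_cond = np.int32(3)  # e.g. an instrument id, embedded as the timbre/global condition
```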

SPGB commented 6 years ago

Thanks for sharing the link @twidddj. I've been using ibab's wavenet implementation with some moderate success, using a wide receptive field and minimal layers.

Is it possible to turn off local conditioning altogether and just create unconditional sounds? Something similar to `python train.py --data-root=./data/cmu_arctic/ --hparams="cin_channels=-1,gin_channels=-1"` from r9y9/wavenet_vocoder.
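
My understanding of that convention is that cin_channels=-1 / gin_channels=-1 simply skip building the conditioning branches, roughly like this sketch (illustrative pseudocode of the pattern, not the actual code of either repo):

```python
# Sketch of the "channels = -1 disables conditioning" pattern
# (illustrative, not the actual code of tf-wavenet_vocoder or r9y9's repo).
import tensorflow as tf

def gated_block(x, c=None, cin_channels=-1, filters=64, dilation=1):
    # assumes x already has `filters` channels for the residual connection
    conv = lambda: tf.keras.layers.Conv1D(filters, 3, padding="causal",
                                          dilation_rate=dilation)
    h_f, h_g = conv()(x), conv()(x)        # filter and gate branches
    if cin_channels > 0 and c is not None:
        # local conditioning: 1x1-project c (upsampled to the audio rate)
        # and add it to both branches before the gated activation
        h_f += tf.keras.layers.Conv1D(filters, 1)(c)
        h_g += tf.keras.layers.Conv1D(filters, 1)(c)
    z = tf.tanh(h_f) * tf.sigmoid(h_g)     # gated activation unit
    return tf.keras.layers.Conv1D(filters, 1)(z) + x  # residual output
```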

It would be really interesting if there was a way to condition it on MIDI, but I wouldn't know where to begin for an addition like that.
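
If I were to attempt it, one plausible starting point might be a frame-level piano roll from pretty_midi, aligned to the vocoder's frame rate and used in place of the mel local condition. This is purely a hypothetical sketch, not a feature of either repo:

```python
# Hypothetical sketch: MIDI -> frame-level local-conditioning feature
# via a piano roll (not a feature of tf-wavenet_vocoder or r9y9's repo).
import pretty_midi
import numpy as np

def midi_local_condition(path, sr=22050, hop_length=256):
    fs = sr / hop_length  # conditioning frames per second
    roll = pretty_midi.PrettyMIDI(path).get_piano_roll(fs=fs)  # (128, frames)
    return (roll.T > 0).astype(np.float32)  # (frames, 128) note-on indicators

cond = midi_local_condition("bass_line.mid")
```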