r9y9 / deepvoice3_pytorch

PyTorch implementation of convolutional neural networks-based text-to-speech synthesis models
https://r9y9.github.io/deepvoice3_pytorch/

Any plan for WORLD vocoder? #7

Closed lsq357 closed 5 years ago

lsq357 commented 7 years ago

Any plan for WORLD vocoder for Multi-Speaker TTS

r9y9 commented 7 years ago

Not currently planned. I wish I had more time..

r9y9 commented 6 years ago

I will leave this open to track progress on it. Not currently planned, though.

r9y9 commented 6 years ago

Seems like there's a fork trying to support the WORLD vocoder: https://github.com/geneing/deepvoice3_pytorch

DarkDefender commented 6 years ago

@r9y9 Thanks for the heads up!

I'm actually really interested in how this turns out, as the WORLD vocoder is used in the "UTAU" music software. If the network can be trained successfully with it, then I think we might be able to get rid of the "sound compression" artifacts that are present in most of the current deepvoice/tacotron implementations...

An example of the sound quality possible with UTAU (and therefore WORLD): https://www.youtube.com/watch?v=Es_5kvVtiNA

@geneing would you mind keeping us updated with your progress? Even if the results are not good.

geneing commented 6 years ago

Replacing Griffin-Lim with the WORLD vocoder seems to be fairly straightforward. The full transform for a 22 kHz signal has length 1027, vs. 80 for the mel output. The WORLD vocoder includes an encoder for the aperiodicity and the spectral envelope, which reduces the output to length 131.
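The dimension arithmetic above can be checked with a small sketch. The FFT size of 1024 and the 128-dimensional coded envelope are my assumptions, chosen to match the numbers in the comment; pyworld's actual defaults may differ:

```python
# Hedged sketch: WORLD feature dimensions at fs = 22050 Hz.
# fft_size = 1024 and coded_sp = 128 are assumptions that reproduce
# the 1027 / 131 figures quoted above, not confirmed repo settings.
fs = 22050
fft_size = 1024                     # typical CheapTrick FFT size near 22 kHz

spec_bins = fft_size // 2 + 1       # 513 spectral-envelope bins
ap_bins = spec_bins                 # 513 aperiodicity bins
full_dim = 1 + spec_bins + ap_bins  # f0 + envelope + aperiodicity = 1027

coded_sp = 128                      # e.g. a coded (compressed) envelope
# WORLD's coded aperiodicity keeps roughly one band per 3 kHz above 3 kHz:
coded_ap = int((fs / 2.0 - 3000) / 3000)  # 2 bands at 22.05 kHz
coded_dim = 1 + coded_sp + coded_ap       # 131

print(full_dim, coded_dim)
```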

lsq357 commented 6 years ago

In my view, with the WORLD vocoder the network only needs to change its output shape and produce multiple outputs, since the WORLD vocoder needs at least three parameters (f0, aperiodicity, spectrogram). Moreover, the WORLD parameters (f0, aperiodicity, spectrogram) could be added to the loss function alongside the mel outputs, which might speed up convergence. (This is just my guess!)
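The guessed multi-output loss could be sketched in PyTorch roughly as follows. All names, stream shapes, and weights here are hypothetical illustrations, not code from this repo:

```python
import torch
import torch.nn as nn

def world_loss(pred, target, weights=(1.0, 1.0, 1.0, 1.0)):
    """Hypothetical combined loss over WORLD streams plus the mel output.

    pred / target: tuples of (f0, aperiodicity, spectrogram, mel) tensors.
    The stream names and weights are illustrative assumptions.
    """
    l1 = nn.L1Loss()
    return sum(w * l1(p, t) for w, p, t in zip(weights, pred, target))

# Toy usage with random tensors (batch = 2, 100 frames per stream):
pred = (torch.rand(2, 100, 1),    # f0
        torch.rand(2, 100, 2),    # coded aperiodicity
        torch.rand(2, 100, 128),  # coded spectral envelope
        torch.rand(2, 100, 80))   # mel output
target = tuple(torch.rand_like(p) for p in pred)
loss = world_loss(pred, target)   # scalar tensor
```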

DarkDefender commented 6 years ago

BTW if anyone is interested in singing neural networks. Then I just found this: http://www.dtic.upf.edu/~mblaauw/NPSS/

The Spanish output sounds really awesome, I think. The English and Japanese sound a little too stilted, but I guess that depends on what kind of dataset and music you throw at it.

Edit: forgot to mention that it seems to use the WORLD vocoder

geneing commented 6 years ago

In view of the Tacotron 2 paper, it appears that WaveNet may be a better choice. Looking into it.

lsq357 commented 6 years ago

For me it would take many more GPUs to train WaveNet (Tacotron 2 used 32 GPUs), whereas the WORLD vocoder can run on just a CPU.

r9y9 commented 6 years ago

Does anybody have experience working on WaveNet? Is it impossible to train WaveNet with only 1 GPU in practice?

lsq357 commented 6 years ago

I tried WaveNet on two 1080 Ti GPUs; it only trains 3k+ steps per day (async updates) with batch size = 32.

I tried QuasiRNN + WaveNet as in DeepVoice 2 / DeepVoice, but my TensorFlow QuasiRNN code did not speed things up! I only trained for a week, without success.

r9y9 commented 6 years ago

I started to implement the WaveNet vocoder. Check out https://github.com/r9y9/wavenet_vocoder/issues/1#issuecomment-354586299 if you are interested.

MlWoo commented 6 years ago

@geneing Have you trained your model with "world"? Could you provide some audio samples?

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

hash2430 commented 5 years ago

I made one myself: https://github.com/hash2430/dv3_world Anyone who needs it is welcome to use it. I will upload sample audio soon.