@m-toman you used n_fft (or fft_size) = 2048 for WaveRNN but 1024 for Tacotron-2; I think both should be the same. Also, lws is required to preserve phase.
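Concretely, I mean analysis settings like these, which I'd expect to be identical for both models (a rough sketch; the fft_size values come from the two repos, the rest are illustrative assumptions):

```python
# Analysis settings that should be identical between Tacotron's
# preprocessing and WaveRNN's training data, otherwise the vocoder
# sees mismatched features. Values besides fft_size are illustrative.
shared_audio_params = dict(
    fft_size=2048,      # n_fft; currently 2048 for WaveRNN vs 1024 for Tacotron-2
    num_mels=80,        # mel channels Tacotron predicts and WaveRNN consumes
    sample_rate=22050,  # must match the training audio
    hop_size=275,       # frame shift in samples
)
```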
I haven't even added the Tacotron code to the repository yet, but I'm training a Taco model with the default settings (fft_size=2048, as in https://github.com/Rayhane-mamah/Tacotron-2/blob/master/hparams.py#L27).
I'm not extracting mel-spectra separately for Taco and WaveRNN but training WaveRNN from the (transposed) GTA output of Tacotron.
My first quick experiments already produced speech, but I now have to run everything for longer; then I can continue implementing the glue code to make it more convenient.
Ok got it. @m-toman, if you need some computation help, I have two GTX 1080 Tis; I can also train your model on my PC if you write the training and re-training code. Otherwise, I am also working on a Tacotron + WaveRNN project, but I am currently more focused on Tacotron 1 (by keithito) + WaveRNN.
Nice, yeah, I think nearly all the repos started out with the Keithito implementation. Just noticed that the sample rate in the implementation I use is now 24000 (https://github.com/Rayhane-mamah/Tacotron-2/blob/master/hparams.py#L30), which should explain my slightly weird results: not only because of the rate mismatch, but probably also because of the upsampling network in https://github.com/fatchord/WaveRNN/blob/master/NB5b%20-%20Alternative%20Model%20(Training).ipynb
I've now taken the https://github.com/Rayhane-mamah/Tacotron-2 repo, changed the params back to 22050 (+ hop_size etc.), am training on the LJ dataset to some reasonable state, and will then convert the GTA features like this: https://github.com/m-toman/tacorn/blob/5f851665cdac82b6434c8983d588cc85a9a2296e/wavernn/preprocess.py#L84
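Roughly, that conversion boils down to transposing and re-saving the Tacotron GTA mels (a condensed sketch of the linked preprocess.py; directory layout and naming are assumptions):

```python
import glob
import os

import numpy as np

# Convert Tacotron-2 GTA mel outputs, stored as (frames, num_mels),
# into the (num_mels, frames) layout WaveRNN expects.
# Directory names are assumptions for illustration.
gta_dir = "Tacotron-2/tacotron_output/gta"
out_dir = "wavernn/data/mel"
os.makedirs(out_dir, exist_ok=True)

for path in glob.glob(os.path.join(gta_dir, "*.npy")):
    mel = np.load(path)             # (frames, num_mels)
    mel = mel.T.astype(np.float32)  # (num_mels, frames)
    np.save(os.path.join(out_dir, os.path.basename(path)), mel)
```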
I know it's still extremely messy, but I want to see/hear some results first before putting in more work.
I'm running on a single GTX 1080 Ti at the moment, but that's probably still better than transferring all the GTA features.
Getting the following error while training:
```
Traceback (most recent call last):
  File "train.py", line 111, in <module>
    x, m, y = next(iter(data_loader))
  File "/home/humonics/.virtualenvs/tf16/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 314, in __next__
    batch = self.collate_fn([self.dataset[i] for i in indices])
  File "train.py", line 92, in collate
    coarse = np.stack(coarse).astype(np.int64)
  File "/home/humonics/.virtualenvs/tf16/lib/python3.6/site-packages/numpy/core/shape_base.py", line 354, in stack
    raise ValueError('all input arrays must have the same shape')
ValueError: all input arrays must have the same shape
```
Is some kind of padding required?
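I suspect the collate function needs to crop (or pad) every example to a common length before stacking, something like this (a rough sketch, assuming each dataset item is a (mel, coarse) pair as in train.py):

```python
import numpy as np

# np.stack requires identical shapes, so crop every utterance in the
# batch to the shortest one before stacking. Sketch only: assumes each
# dataset item is a (mel, coarse) pair with mel shaped
# (num_mels, frames) and coarse a 1-D array of quantized samples.
def collate(batch):
    min_frames = min(mel.shape[-1] for mel, _ in batch)
    min_samples = min(len(c) for _, c in batch)
    mels = np.stack([mel[:, :min_frames] for mel, _ in batch])
    coarse = np.stack([c[:min_samples] for _, c in batch]).astype(np.int64)
    return mels, coarse
```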
Hi, I've trained a new Taco model over the weekend to 80k iterations and am now training WaveRNN on the GTA mels, both at 22050Hz. Not sure yet if it will produce something legit, but I described my current process in the README, and at 17k steps the speech is at least intelligible.
Looks good. By the way, could you share your 80k Tacotron 2 pretrained model? I would like to train it further, to around 300k.
I have started training WaveRNN with Tacotron 1. By the way, what is the inference time of this WaveRNN? Is it real-time? Also, are 900 epochs enough for a good result?
Here is my pretrained Taco model: https://www.dropbox.com/s/5svv16eolba0i7o/logs-Tacotron-2.zip?dl=0 It should work if you just put the contents into the Tacotron-2 folder and use the hparams from this repo (https://github.com/m-toman/tacorn/blob/master/config/hparams.py). And of course you'll have to get the LJ corpus and run Tacotron's preprocess.py.
Thanks for that. I'm about to finish training my model with Taco1 and am just starting on inference; is this model's inference real-time?
From my limited experience so far: no, it seemed to take about a minute for a longer sentence. But that's still much faster than most WaveNet implementations out there.
I haven't tried the NVIDIA real-time implementation yet.
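For a rough number, timing the generation loop is enough, something like this (a sketch; `model.generate` is a stand-in for this repo's actual synthesis call, not its exact API):

```python
import time

def samples_per_second(model, mel, sample_rate=22050):
    """Rough throughput: audio samples generated per wall-clock second.

    Assumes `model.generate(mel)` returns a 1-D array of samples;
    the method name is a stand-in for this repo's synthesis code.
    """
    start = time.time()
    audio = model.generate(mel)
    elapsed = time.time() - start
    print(f"{len(audio) / elapsed:.0f} samples/sec "
          f"({len(audio) / sample_rate:.1f}s of audio in {elapsed:.1f}s)")
    return audio
```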
My first sample : https://drive.google.com/open?id=1xsflF0OPu2f2JISUBOfprxqov6ljgmvZ
Model:
- Tacotron 1 (https://github.com/keithito/tacotron) with the pretrained model provided in its README
- WaveRNN from this repo, trained for 1000 epochs (205k steps)

The generated sample is a bit noisy; I think it requires more training.
Hmm, do you use a smaller batch size or less data? Because I'm at step 469k and that's only epoch 576 (816 batches per epoch).
I'm now seeing pretty nice improvements. Here are the samples generated from the GTA input: https://www.dropbox.com/sh/2gtunx8d1r92fqb/AADh9CJEtvHnQ7YlwNClk8X5a?dl=0 I have not run it end-to-end yet.
Here are my current WaveRNN models: https://www.dropbox.com/sh/ruq9elymhh9cyjl/AAD8u_PefFz_qwiAckqwqGzwa?dl=0
I used the LJSpeech dataset with batch size 64 (204 batches per epoch). It seems that you first used Tacotron 2 to predict mel files for all LJSpeech sentences and then trained WaveRNN on those predicted mels, rather than on the original mels used to train Tacotron 2.
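That would also explain the epoch difference above: steps per epoch is just dataset size over batch size. A quick check, assuming all ~13,100 LJSpeech utterances are used:

```python
import math

# LJSpeech has ~13,100 utterances, so steps (batches) per epoch differ
# with batch size alone:
utterances = 13100
print(math.ceil(utterances / 64))  # 205 -> my ~204 batches/epoch at batch size 64
print(math.ceil(utterances / 16))  # 819 -> consistent with ~816 batches/epoch above
```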
@m-toman is there any way to do sampling in real time? In the WaveRNN paper they mention that it requires some kind of GPU optimization and the subscale scheme. Right now I get around 1500 samples/sec; the paper mentions 1600 samples/sec, but after optimization they get 96,000 samples/sec with WaveRNN-896 on a P100 GPU. Do you have any idea what kind of optimization they did? I read the paper but didn't get much out of the GPU-optimization and subscale parts.
@m-toman hi, how do I generate samples from the pretrained model?
@zhf459 just pushed a synthesis script.
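In essence it boils down to this (a sketch; `build_model`, the file paths, and the `generate` call are assumptions here, the pushed script is the authoritative version):

```python
import numpy as np
import soundfile as sf
import torch

# Sketch of synthesis from a trained checkpoint: restore the weights,
# load a (num_mels, frames) GTA mel, and run autoregressive generation.
# `build_model`, the paths, and `generate` are assumptions; see the
# actual synthesis script in the repo.
model = build_model()  # hypothetical helper constructing the WaveRNN net
model.load_state_dict(torch.load("checkpoints/latest.pt"))
model.eval()

mel = np.load("gta/example_utterance.npy")
with torch.no_grad():
    audio = model.generate(mel)

sf.write("out.wav", audio, 22050)
```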
@rishikksh20 I fear I won't have time to really dig into this, as it's just a rather quick experiment. Another option would be to try https://github.com/NVIDIA/nv-wavenet
@m-toman do you know how to use nv-wavenet with Tacotron-2? I trained nv-wavenet in the past but was unable to integrate it with https://github.com/Rayhane-mamah/Tacotron-2. If you manage to do that, please tell me.
@rishikksh20 Unfortunately I haven't found the time yet to look into the NVIDIA implementation :(. I've now uploaded two samples and linked them in the README, so I'll close this for now.
I plan to upload pretrained models and samples once I get it to work.