m-toman / tacorn

2018/2019 TTS framework integrating state of the art open source methods
MIT License

Please upload samples generated by this project #1

Closed rishikksh20 closed 5 years ago

m-toman commented 5 years ago

I plan to upload pretrained models and samples once I get it to work.

rishikksh20 commented 5 years ago

@m-toman you used n_fft/fft_size = 2048 for WaveRNN and 1024 for Tacotron-2; I think both should be the same.
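A quick way to surface this kind of configuration drift is to diff the two hparam sets directly; a minimal sketch (the dicts below are illustrative stand-ins, not the repos' actual hparams modules):

```python
# Hypothetical sketch: the acoustic-feature settings of both models must agree,
# since WaveRNN is conditioned on the mel spectrograms Tacotron-2 produces.
taco_hparams = {"n_fft": 1024, "hop_size": 256, "sample_rate": 22050, "num_mels": 80}
wavernn_hparams = {"n_fft": 2048, "hop_size": 256, "sample_rate": 22050, "num_mels": 80}

# Any key with differing values is a mismatch between the two pipelines.
mismatched = {k for k in taco_hparams if taco_hparams[k] != wavernn_hparams[k]}
print(mismatched)  # {'n_fft'} -> the inconsistency pointed out above
```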

rishikksh20 commented 5 years ago

And lws is also required to preserve phase.

m-toman commented 5 years ago

I haven't even added the Tacotron code to the repository yet, but I'm training a Taco model with the default settings (fft_size=2048 by default, as in https://github.com/Rayhane-mamah/Tacotron-2/blob/master/hparams.py#L27)

I'm not extracting mel spectrograms separately for Taco and WaveRNN, but training WaveRNN on the (transposed) GTA output of Tacotron.
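The transpose mentioned above is just a shape adaptation between the two repos' mel conventions; a hedged sketch (the exact shapes are assumptions, not confirmed from either codebase):

```python
import numpy as np

# Sketch of the GTA hand-off described above: assume Tacotron-2 saves GTA mels
# as (frames, num_mels) while WaveRNN expects (num_mels, frames), hence the
# transpose when feeding one into the other.
gta_mel = np.random.rand(812, 80).astype(np.float32)  # hypothetical GTA output
wavernn_mel = gta_mel.T  # same data, axes swapped for WaveRNN conditioning
print(wavernn_mel.shape)  # (80, 812)
```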

My first quick experiments already produced speech, but I now have to run everything for longer; then I can continue implementing the glue code to make it more convenient.

rishikksh20 commented 5 years ago

Ok, got it. @m-toman, if you need some computation help, I have two GTX 1080 Ti cards; I can also train your model on my PC if you write the training and re-training code. Otherwise, I am also working on a Tacotron + WaveRNN project, but I am currently more focused on Tacotron 1 (by keithito) + WaveRNN.

m-toman commented 5 years ago

Nice, yeah, I think nearly all the repos started out with the Keithito implementation. I just noticed that the sample rate in the implementation I use is now 24000: https://github.com/Rayhane-mamah/Tacotron-2/blob/master/hparams.py#L30. This should explain my slightly weird results, and not only because of the mismatch, but probably also because of the upsampling network in https://github.com/fatchord/WaveRNN/blob/master/NB5b%20-%20Alternative%20Model%20(Training).ipynb

I've now taken the https://github.com/Rayhane-mamah/Tacotron-2 repo, changed the params back to 22050 (+ hop_size etc.), am training the LJ dataset to a reasonable state, and will then convert the GTA features like this: https://github.com/m-toman/tacorn/blob/5f851665cdac82b6434c8983d588cc85a9a2296e/wavernn/preprocess.py#L84
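Changing the sample rate forces dependent parameters like hop_size to change with it; a small sketch of that relationship (the 12.5 ms frame shift is a value commonly used in these repos, assumed here rather than confirmed from the thread):

```python
# Why the sample-rate change forces other params to change too: the hop size
# in samples depends on both the frame shift and the sample rate.
frame_shift_ms = 12.5  # assumed frame shift
for sample_rate in (24000, 22050):
    hop = sample_rate * frame_shift_ms / 1000
    print(sample_rate, hop)  # 24000 -> 300.0, 22050 -> 275.625 (must round to int)
```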

I know it's still extremely messy, but I want to see/hear some results before putting in more work.

I'm running on a single GTX 1080 Ti at the moment, but that's probably still better than transferring all the GTA features.

rishikksh20 commented 5 years ago

Getting the following error while training:

Traceback (most recent call last):
  File "train.py", line 111, in <module>
    x, m, y = next(iter(data_loader))
  File "/home/humonics/.virtualenvs/tf16/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 314, in __next__
    batch = self.collate_fn([self.dataset[i] for i in indices])
  File "train.py", line 92, in collate
    coarse = np.stack(coarse).astype(np.int64)
  File "/home/humonics/.virtualenvs/tf16/lib/python3.6/site-packages/numpy/core/shape_base.py", line 354, in stack
    raise ValueError('all input arrays must have the same shape')
ValueError: all input arrays must have the same shape

Is some kind of padding required?
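The usual fix for this np.stack shape error is to pad (or crop) every item in the batch to a common length inside the collate function; a minimal sketch, not the repo's actual collate code (the function name, sequence length, and batch contents are assumptions):

```python
import numpy as np

# Pad short sequences with zeros and crop long ones, so np.stack receives
# arrays of identical shape.
def collate(batch, seq_len=960):
    padded = []
    for coarse in batch:  # each `coarse` is a 1-D int array of audio samples
        if len(coarse) < seq_len:
            coarse = np.pad(coarse, (0, seq_len - len(coarse)))
        padded.append(coarse[:seq_len])
    return np.stack(padded).astype(np.int64)

batch = [np.arange(700), np.arange(1200)]  # hypothetical unequal-length items
print(collate(batch).shape)  # (2, 960)
```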

m-toman commented 5 years ago

Hi, I've trained a new Taco model over the weekend to 80k iterations and am now training WaveRNN on GTA mels, both at 22050 Hz. Not sure yet whether it will produce something legit, but I described my current process in the README, and at 17k steps the speech is at least intelligible.

rishikksh20 commented 5 years ago

Looks good. By the way, could you share the 80k Tacotron 2 pretrained model? I would like to train it further, to around 300k.

rishikksh20 commented 5 years ago

I have started training WaveRNN with Tacotron 1. By the way, what is the inference time of this WaveRNN, is it real-time? Also, are 900 epochs enough for a good result?

m-toman commented 5 years ago

Here is my pretrained Taco model: https://www.dropbox.com/s/5svv16eolba0i7o/logs-Tacotron-2.zip?dl=0 It should work if you just put the contents into the Tacotron-2 folder, along with the hparams from this repo (https://github.com/m-toman/tacorn/blob/master/config/hparams.py). And of course you'll have to get the LJ corpus and run preprocess.py from Tacotron.

rishikksh20 commented 5 years ago

Thanks for that. I am about to complete training my model with Taco1 and have just started working on inference; is this model's inference real-time?

m-toman commented 5 years ago

From my limited experience so far: no, it seemed to take about a minute for a longer sentence. But that's still much faster than most WaveNet implementations out there.

I haven't tried the NVIDIA real-time implementation yet.

rishikksh20 commented 5 years ago

My first sample : https://drive.google.com/open?id=1xsflF0OPu2f2JISUBOfprxqov6ljgmvZ

Model:
- Tacotron 1 (https://github.com/keithito/tacotron) with the pretrained model provided in its README
- WaveRNN from this repo, trained for 1000 epochs (205k steps)

The generated sample is a bit noisy; I think it requires more training.

m-toman commented 5 years ago

Hmm, do you use a smaller batch size or less data? Because I'm at step 469k and this is only epoch 576 (816 batches per epoch).

I'm now seeing pretty nice improvements; here are the samples generated from the GTA input: https://www.dropbox.com/sh/2gtunx8d1r92fqb/AADh9CJEtvHnQ7YlwNClk8X5a?dl=0 I did not run it end-to-end yet.

Here are my current WaveRNN models: https://www.dropbox.com/sh/ruq9elymhh9cyjl/AAD8u_PefFz_qwiAckqwqGzwa?dl=0

rishikksh20 commented 5 years ago

I used the LJSpeech dataset with batch size 64 (204 batches per epoch). It seems that you first used Tacotron 2 to predict mel files for all LJSpeech sentences and then used those predicted mels to train WaveRNN, rather than the original mels used to train Tacotron 2.
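The epoch/step discrepancy falls out of the batch-size arithmetic; a quick sanity check (LJSpeech's ~13,100 clips, and the batch size of 16 for the GTA run, are assumptions inferred from the numbers quoted, not confirmed in the thread):

```python
# Steps per epoch = dataset size / batch size; epochs = total steps / steps per epoch.
dataset_size = 13_100  # approximate LJSpeech clip count (assumption)
for batch_size, total_steps in ((16, 469_000), (64, 205_000)):
    steps_per_epoch = dataset_size // batch_size
    epochs = total_steps // steps_per_epoch
    print(batch_size, steps_per_epoch, epochs)
```

With batch size 16 this gives roughly 818 steps per epoch and ~573 epochs at 469k steps; with batch size 64, ~204 steps per epoch and ~1000 epochs at 205k steps, matching both posters' figures.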

rishikksh20 commented 5 years ago

@m-toman is there any way to do sampling in real time? In the WaveRNN paper they mention that this requires some GPU optimization and the subscale scheme. Right now I get around 1500 samples/sec; the paper reports 1600 samples/sec unoptimized, but 96000 samples/sec after optimization, with WaveRNN-896 on a P100 GPU. Do you have any idea what kind of optimizations they did? Reading the paper, I didn't get much out of the GPU-optimization and subscale parts.
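For reference, the quoted generation speeds translate into real-time factors like this (assuming 22050 Hz output audio, as used earlier in this thread):

```python
# Real-time factor = samples generated per second / samples needed per second
# of audio; a value below 1.0 means slower than real time.
sample_rate = 22050
for label, samples_per_sec in (("unoptimized", 1500), ("paper, optimized", 96000)):
    rtf = samples_per_sec / sample_rate
    print(label, round(rtf, 3))
```

So 1500 samples/sec is roughly 0.07x real time (about 15 seconds of compute per second of audio), while the paper's optimized 96000 samples/sec would be about 4.4x faster than real time.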

zhf459 commented 5 years ago

@m-toman hi, how do I generate samples from the pretrained model?

m-toman commented 5 years ago

@zhf459 just pushed a synthesis script.

@rishikksh20 I fear I won't have time to really dig into this, as it's just a rather quick experiment. Another option would be to try https://github.com/NVIDIA/nv-wavenet

rishikksh20 commented 5 years ago

@m-toman do you know how to use nv-wavenet with Tacotron-2? I trained nv-wavenet in the past but was unable to integrate it with https://github.com/Rayhane-mamah/Tacotron-2. If you are able to do that, please tell me.

m-toman commented 5 years ago

@rishikksh20 Unfortunately I haven't found time yet to look into the NVIDIA implementation :(. I've now uploaded two samples and linked them in the README, so I'll close this for now.