keithito / tacotron

A TensorFlow implementation of Google's Tacotron speech synthesis with pre-trained model (unofficial)
MIT License
2.96k stars 956 forks

Nancy Corpus pre-trained #87

Open LearnedVector opened 6 years ago

LearnedVector commented 6 years ago

Is the Nancy corpus pre-trained model available anywhere for use? I think the one provided in the README is the LJ Speech-trained model.

keithito commented 6 years ago

@MXGray: would you be willing to share your pre-trained model on the Nancy corpus?

MXGray commented 6 years ago

@keithito @Mn0491

No problem - Here you go: https://github.com/keithito/tacotron/issues/15#issuecomment-342632496

LearnedVector commented 6 years ago

@MXGray Thank you so much!

The one that you linked to doesn't seem to perform as well as the one on https://keithito.github.io/audio-samples/. Are they the same model? Here is a sample of what the model output.

This is me using the default demo_server.py and pointing the checkpoint at the Nancy corpus model you provided.

Thanks again for posting the link, and thank you @keithito for this awesome project.

MXGray commented 6 years ago

@Mn0491

Oh, my bad - That's the Tagalog model that I trained on top of the Nancy model. I'll upload the correct Nancy model later tonight when I get in front of my laptop and will post it here.

geneing commented 6 years ago

@MXGray could you please upload the English model? The model at your Drive link doesn't generate English speech.

MXGray commented 6 years ago

@Mn0491 @geneing Sorry guys, crazy holidays. :) Here you go - Happy 2018! https://drive.google.com/file/d/1c_O-Gha03_erKbilsFCvs9QJ8faJ7ou8/view?usp=sharing

LearnedVector commented 6 years ago

@MXGray thank you! Happy 2018!

geneing commented 6 years ago

@MXGray Thank you. It works now.

MXGray commented 6 years ago

@t3t3t3 Are you training a model on top of this? If so, then after preprocessing your training data, you'll get the max output length. Divide this by outputs_per_step, and use that for max_iters in hparams.py ... Hope this helps. :)

t3t3t3 commented 6 years ago

Thanks! I think I messed up the parameters somewhere. I reinstalled the source code from a clean copy and it works now!

begeekmyfriend commented 6 years ago

Excuse me. Can you tell me how to download the Nancy corpus dataset from the Blizzard 2011 page on CSTR? I can't even find the registration page. @MXGray @keithito

gloriouskilka commented 6 years ago

@begeekmyfriend Hello! You need to click "license", which redirects to the page below. Fill in the form and wait a few days for it to be approved. You'll get 2 emails in total.

http://www.cstr.ed.ac.uk/projects/blizzard/2011/lessac_blizzard2011/license.html

begeekmyfriend commented 6 years ago

@gloriouskilka Thanks a lot. The Nancy corpus dataset seems better for training English.

js0nwu commented 6 years ago

@MXGray Thanks! Do you mind if I upload your trained model to a public GitHub repository? I'd like to make a Docker container for running Tacotron. (curl doesn't play well with Google Drive links)

MXGray commented 6 years ago

@ArkaneCow No, I don't mind. It'll be very helpful. Please share the link here once it's up. Thanks. :)

js0nwu commented 6 years ago

@MXGray Thanks! I created a Docker for running this repository here: https://hub.docker.com/r/arkanecow/dockerfile-keithito-tacotron/ The Docker file is here: https://github.com/ArkaneCow/dockerfile-keithito-tacotron The repository where the models are hosted is here: https://github.com/ArkaneCow/tacotron-models

quadraaa commented 6 years ago

@MXGray Could you please clarify, did you use the default parameters for training on Nancy corpus? Thanks in advance!

ghost commented 6 years ago

@MXGray Thanks for your contribution. I listened to the demo audio for both LJSpeech and Nancy, and found that Nancy is better.

I downloaded the files from the official Nancy website you provided, but I didn't know how to handle them. Therefore, I downloaded the wavs and text from here:

https://github.com/barronalex/Tacotron/blob/master/download_data.sh

I changed the contents of prompts.data into the metadata.csv format, e.g.: APDC2-017-01|Children act prematurely.|Children act prematurely.
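For anyone doing the same conversion, a minimal sketch is below. It assumes the Blizzard prompts.data file uses festival-style lines such as `( APDC2-017-01 "Children act prematurely." )`; if your copy uses a different layout, adjust the regex accordingly. The function name and the assumption about the line format are mine, not from the repo.

```python
import re

def prompts_to_metadata(lines):
    """Convert festival-style prompt lines, assumed to look like
        ( APDC2-017-01 "Children act prematurely." )
    into keithito-style metadata.csv rows: id|text|text
    (the row format is taken from the example in the comment above)."""
    rows = []
    for line in lines:
        m = re.match(r'\(\s*(\S+)\s+"(.*)"\s*\)', line.strip())
        if m:
            utt_id, text = m.groups()
            rows.append('%s|%s|%s' % (utt_id, text, text))
    return rows

print(prompts_to_metadata(['( APDC2-017-01 "Children act prematurely." )']))
# -> ['APDC2-017-01|Children act prematurely.|Children act prematurely.']
```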

However, I got an error during running train.py:

```
Starting new training run at commit: None
Generated 32 batches of size 32 in 2.455 sec
Traceback (most recent call last):
  File "/home/chris/new_2018/u41_2_nancy/tacotron/tacotron/datasets/datafeeder.py", line 74, in run
    self._enqueue_next_group()
  File "/home/chris/new_2018/u41_2_nancy/tacotron/tacotron/datasets/datafeeder.py", line 96, in _enqueue_next_group
    self._session.run(self._enqueue_op, feed_dict=feed_dict)
  File "/home/chris/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 889, in run
    run_metadata_ptr)
  File "/home/chris/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1096, in _run
    % (np_val.shape, subfeed_t.name, str(subfeed_t.get_shape())))
ValueError: Cannot feed value of shape (32, 665, 64) for Tensor 'datafeeder/mel_targets:0', which has shape '(?, ?, 80)'
```

Is there anything I should edit in the original code written by @keithito, given the differences between the LJ Speech and Nancy corpora?
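The shape mismatch in that traceback (64 mel channels in the data vs. 80 expected by the graph) usually means the preprocessing step and hparams.py disagree on the number of mel bands. A minimal sanity check, sketched here with a hypothetical helper name and an array shaped like the one in the error, might look like:

```python
import numpy as np

def check_mel_channels(mel, num_mels=80):
    """Raise if a preprocessed mel target's channel count disagrees with
    num_mels (80 is the default in keithito's hparams.py). `mel` is a
    (batch, frames, channels) or (frames, channels) array."""
    if mel.shape[-1] != num_mels:
        raise ValueError('mel targets have %d channels but num_mels is %d; '
                         're-run preprocessing and training with matching settings'
                         % (mel.shape[-1], num_mels))

# An array shaped like the one in the traceback above triggers the check:
bad = np.zeros((32, 665, 64))
try:
    check_mel_channels(bad)
except ValueError as e:
    print(e)
```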

MXGray commented 6 years ago

@Quadraaa @DavidAksnes

All default parameters in hparams.py, except max_iters. This value should be set to the max output length divided by outputs_per_step. For example, after preprocessing the Nancy dataset, let's say you get 1605 as the max output length; the default outputs_per_step is 5, so:

1605 / 5 = 321 (this should be the value of max_iters)

Hope this helps!

marcom48 commented 6 years ago

Hi @MXGray,

I've been training on the Nancy corpus as you did, using the default parameters. I've trained to 233,000 steps, but whenever I synthesise a sentence it has lots of echoing afterwards, whereas your model doesn't generate echoes. I was wondering if you had any suggestions for fixing this? Find an example below using the prompt "This is the example recording." c9aeb4d2-b929-4c46-bdef-d644a36280f3.wav.zip

Thanks

navidnadery commented 6 years ago

Hi @MXGray, Thank you for sharing such a well-trained model. I used your model as a pre-trained model for synthesizing in my own language. My data is about 2.2 hours from a single speaker, and most phonemes map to CMUDict phonemes. After 950K iterations, the model can only align the first half of medium and long sentences; the second half is very bad, and it seems the model cannot learn to align it. Why does this happen? How can I fix it?

Thanks

navidnadery commented 6 years ago

@keithito Do you have any idea about the problem I mentioned above?

keithito commented 6 years ago

@navidnadery I'm not sure why this would happen. Maybe there's a lack of long sentences in your training data? You could also try Location Sensitive Attention (or hybrid attention) to see if that yields a better result.

Shikherneo2 commented 6 years ago

Hi guys!! Does anyone know if the sample_rate in hparams.py needs to be changed to the sampling frequency of the sound files in the dataset, e.g. 16 kHz for the Nancy dataset?
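It's easy to check what rate your wavs actually use before touching hparams.py. A minimal sketch with Python's stdlib wave module is below; the file path and the comparison against hparams.sample_rate are hypothetical usage, not code from the repo.

```python
import wave

def wav_sample_rate(path):
    """Read a WAV header and return its sampling frequency in Hz."""
    with wave.open(path, 'rb') as w:
        return w.getframerate()

# Hypothetical usage: compare against hparams.sample_rate before training,
# and resample the audio (or change the hparam) if they disagree.
# rate = wav_sample_rate('Nancy/wavn/APDC2-017-01.wav')
# assert rate == hparams.sample_rate
```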

yoosif0 commented 6 years ago

@Shikherneo2 you mean sample_rate?

marymirzaei commented 6 years ago

It seems that with the recent updates to the code, the Nancy pre-trained model can no longer be used. Does anyone have a similar issue?

yoosif0 commented 6 years ago

I think you cannot use a checkpoint after the hparams have changed.

AyaLahlou commented 4 years ago

Can anyone share some code or directions for how to use this pre-trained model (with the same code as this repo)?

I tried running this : !python3 eval.py --checkpoint nancy_model/model.ckpt-250000

but I get the following error:

```
[...]
NotFoundError (see above for traceback): Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

Key model/inference/decoder/output_projection_wrapper/multi_rnn_cell/cell_0/output_projection_wrapper/concat_output_and_attention_wrapper/decoder_prenet_wrapper/attention_wrapper/bahdanau_attention/attention_v not found in checkpoint
  [[node save/RestoreV2 (defined at /home/ec2-user/SageMaker/tacotron/synthesizer.py:24) = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]
```