syang1993 / gst-tacotron

A tensorflow implementation of the "Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis"

eval from checkpoints #4

marymirzaei commented 6 years ago

Thanks for your nice work. I have trained the model on the Blizzard 2013 dataset. The synthesized files from the 185k and 385k checkpoints are available at the following link. I used samples from LJ Speech (LJ001-0001.wav) and Nancy (nancy.wav) as reference files for checking the performance. I also included the model checkpoint files and the audio files at each step (step-185000-audio.wav, step-385000-audio.wav). https://www.dropbox.com/sh/jhcynw65o1tmj7r/AABJN4cBotdbs-A5-Rk89vt0a?dl=0 Any idea how to improve the shaky voice?

syang1993 commented 6 years ago

Yes, the voice in the link is a little shaky. Can you share the hyper-parameters and the alignments of the test sentences from your experiments? Besides, in my earlier experiments I found that character-level inputs work better than phoneme-level inputs, though the paper used phoneme inputs.

syang1993 commented 6 years ago

@lapwing I updated some code today, and with it the quality of the generated speech is much better than before (87K steps). It also works with a small reduction factor. eval-87k-r2.zip

I will also evaluate the performance in the coming days.

fazlekarim commented 6 years ago

The new update seems to have set cmu-dict to false. Is that what you used to get those results?

marymirzaei commented 6 years ago

Thank you very much! Sorry for the delay... I have uploaded the files you wanted to the link above, in case you still want to take a look. The quality is very good! Do you have any plans to release the checkpoints?

syang1993 commented 6 years ago

@fazlekarim Yes, the hparams in the repo match my experiments exactly.

syang1993 commented 6 years ago

@lapwing I didn't see the settings in your link; anyway, I suggest you try the new code.

Since I only have a single GPU, it will take several days to test the new code, so I'd be very grateful if you could help test the performance. For now, I find the quality is better, but the style is learned more slowly than before (100K steps). I will continue training to see whether it reaches stable results with and without style attention. I will also upload the checkpoints and new samples once I finish these experiments in a few days.

marymirzaei commented 6 years ago

Sure! I will do the same and will upload the results so that we can compare. Thanks for your nice work!

butterl commented 6 years ago

@syang1993 Are those samples generated directly from Tacotron? The audio quality is amazing.

syang1993 commented 6 years ago

@butterl Which samples do you mean? The samples attached in this issue were generated directly from the gst-tacotron repo using Blizzard 2011 data. The samples on the demo page were also generated directly from gst-tacotron, using Blizzard 2013 data. I also did experiments with Tacotron on the BC2011 data; those samples can be found in https://github.com/keithito/tacotron/pull/182

butterl commented 6 years ago

@syang1993 Thanks for reaching out. I've tried keithito/tacotron and Rayhane-mamah/Tacotron-2; both seem to generate wavs with shaking and echo like @lapwing's samples, or even worse (even with a WaveNet trained to 300K steps as the vocoder), while your attached sample wav is much clearer. You posted "I updated some codes today" 15 days ago, but I couldn't find the exact patch.

I will try this repo and attempt to reproduce the results.

syang1993 commented 6 years ago

@butterl Maybe you can try the modified keithito tacotron in my repo, which is forked from the original and fixes some issues so it supports a small reduction factor. @fazlekarim may have tried this repo; I'm not sure whether he got good results. The commit for "I updated some code today" is https://github.com/syang1993/gst-tacotron/commit/ba10ee1a30044d2582f8bf49d7ab158ff0846dd1

fazlekarim commented 6 years ago

@butterl I was satisfied with my results. I can show them to you if you are interested.

butterl commented 6 years ago

@fazlekarim Thanks for reaching out. I'd be very interested in your samples, because mine are much worse with the other repos even when trained to 400K steps. I will switch to this one now and give feedback.

fazlekarim commented 6 years ago

This is the only one I have saved on this computer. Let me know what you think of it.

eval-227300_ref-original.zip

butterl commented 6 years ago

@fazlekarim Thanks for sharing. The wav is good, but it seems to have more shaking than the eval-87k-r2.zip @syang1993 shared.

@syang1993 I trained on my machine and the result is good, but eval fails some of the time (2/3 of runs; alignment screenshot attached).

And with use_gst=False, eval returns this error:

Use random weight for GST.
Traceback (most recent call last):
  File "eval.py", line 65, in <module>
    main()
  File "eval.py", line 61, in main
    run_eval(args)
  File "eval.py", line 25, in run_eval
    synth.load(args.checkpoint, args.reference_audio)
  File "/home/public/gst-tacotron/synthesizer.py", line 29, in load
    self.model.initialize(inputs, input_lengths, mel_targets=mel_targets, reference_mel=reference_mel)
  File "/home/public/gst-tacotron/models/tacotron.py", line 88, in initialize
    style_embeddings = tf.matmul(random_weights, tf.nn.tanh(gst_tokens))
UnboundLocalError: local variable 'gst_tokens' referenced before assignment

syang1993 commented 6 years ago

@butterl How many steps did you train? And did you use the BC2013 or the BC2011 data?

If you set use_gst=False, the model does not use style attention at all, so you must feed a reference audio to the model during eval.
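
A minimal sketch of why that eval crash happens (a simplification with hypothetical names, not the actual code in models/tacotron.py):

    # With use_gst=False and no reference audio, the random-weight branch
    # references a name that was only ever bound in the GST branch.
    def build_style_embedding(use_gst, reference_mel=None):
        if use_gst:
            gst_tokens = "style token table"  # created only for the GST model
        if reference_mel is not None:
            return "embedding from the reference encoder"
        print("Use random weight for GST.")
        # With use_gst=False, gst_tokens was never bound, so Python raises
        # UnboundLocalError here, exactly as in the traceback above.
        return "embedding from random weights over " + gst_tokens

    build_style_embedding(use_gst=False)  # raises UnboundLocalError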

butterl commented 6 years ago

@syang1993 The training ran for 77k steps. I tried two experiments on eval:

  1. use_gst=True, feeding a wav from the training set: the output sometimes fails (the alignment breaks and the output wav is poor).
  2. use_gst=False, with a reference_audio path fed: the error below appears; it seems the network shapes don't match:
    
    Loading checkpoint: ./logs-tacotron/model.ckpt-77000
    Traceback (most recent call last):
    File "/home/public/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1361, in _do_call
    return fn(*args)
    File "/home/public/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1340, in _run_fn
    target_list, status, run_metadata)
    File "/home/public/.local/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__
    c_api.TF_GetCode(self.status.status))
    tensorflow.python.framework.errors_impl.InvalidArgumentError: Assign requires shapes of both tensors to match. lhs shape= [384,256] rhs shape= [512,256]
         [[Node: save/Assign_152 = Assign[T=DT_FLOAT, _class=["loc:@model/inference/memory_layer/kernel"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](model/inference/memory_layer/kernel, save/RestoreV2/_213)]]
         [[Node: save/RestoreV2/_154 = _Send[T=DT_FLOAT, client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_160_save/RestoreV2", _device="/job:localhost/replica:0/task:0/device:CPU:0"](save/RestoreV2:169)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "eval.py", line 65, in <module>
    main()
  File "eval.py", line 61, in main
    run_eval(args)
  File "eval.py", line 25, in run_eval
    synth.load(args.checkpoint, args.reference_audio)
  File "/home/public/gst-tacotron/synthesizer.py", line 37, in load
    saver.restore(self.session, checkpoint_path)
  File "/home/public/.local/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 1755, in restore
    {self.saver_def.filename_tensor_name: save_path})
  File "/home/public/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 905, in run
    run_metadata_ptr)
  File "/home/public/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1137, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/public/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1355, in _do_run
    options, run_metadata)
  File "/home/public/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1374, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Assign requires shapes of both tensors to match. lhs shape= [384,256] rhs shape= [512,256]
     [[Node: save/Assign_152 = Assign[T=DT_FLOAT, _class=["loc:@model/inference/memory_layer/kernel"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](model/inference/memory_layer/kernel, save/RestoreV2/_213)]]
     [[Node: save/RestoreV2/_154 = _Send[T=DT_FLOAT, client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_160_save/RestoreV2", _device="/job:localhost/replica:0/task:0/device:CPU:0"](save/RestoreV2:169)]]

Caused by op 'save/Assign_152', defined at:
  File "eval.py", line 65, in <module>
    main()
  File "eval.py", line 61, in main
    run_eval(args)
  File "eval.py", line 25, in run_eval
    synth.load(args.checkpoint, args.reference_audio)
  File "/home/public/gst-tacotron/synthesizer.py", line 36, in load
    saver = tf.train.Saver()
  File "/home/public/.local/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 1293, in __init__
    self.build()
  File "/home/public/.local/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 1302, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "/home/public/.local/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 1339, in _build
    build_save=build_save, build_restore=build_restore)
  File "/home/public/.local/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 796, in _build_internal
    restore_sequentially, reshape)
  File "/home/public/.local/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 471, in _AddRestoreOps
    assign_ops.append(saveable.restore(saveable_tensors, shapes))
  File "/home/public/.local/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 161, in restore
    self.op.get_shape().is_fully_defined())
  File "/home/public/.local/lib/python3.5/site-packages/tensorflow/python/ops/state_ops.py", line 280, in assign
    validate_shape=validate_shape)
  File "/home/public/.local/lib/python3.5/site-packages/tensorflow/python/ops/gen_state_ops.py", line 58, in assign
    use_locking=use_locking, name=name)
  File "/home/public/.local/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/public/.local/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 3271, in create_op
    op_def=op_def)
  File "/home/public/.local/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1650, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): Assign requires shapes of both tensors to match. lhs shape= [384,256] rhs shape= [512,256]
     [[Node: save/Assign_152 = Assign[T=DT_FLOAT, _class=["loc:@model/inference/memory_layer/kernel"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](model/inference/memory_layer/kernel, save/RestoreV2/_213)]]
     [[Node: save/RestoreV2/_154 = _Send[T=DT_FLOAT, client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_160_save/RestoreV2", _device="/job:localhost/replica:0/task:0/device:CPU:0"](save/RestoreV2:169)]]

syang1993 commented 6 years ago

@butterl Since the model is more complex than plain Tacotron, it may need more data and training steps to converge. The use_gst flag selects between two different models, so you must train a new model with use_gst=False; a checkpoint trained with use_gst=True cannot be restored into the non-GST graph, which is what the shape mismatch above is telling you.
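
A rough sketch of where the 512-vs-384 mismatch likely comes from; the depths below are assumptions based on common gst-tacotron hparams and the error message, not values read from this checkpoint:

    # Hypothetical depths: the decoder memory concatenates the encoder
    # outputs with a style vector, so its width depends on use_gst.
    encoder_depth = 256
    gst_memory_width = encoder_depth + 256  # 256-dim style embedding -> 512 (checkpoint, rhs)
    ref_memory_width = encoder_depth + 128  # 128-dim reference embedding -> 384 (new graph, lhs)
    print(gst_memory_width, ref_memory_width)  # 512 384: the kernels cannot match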

butterl commented 6 years ago

@syang1993 Thanks! I'll wait for good results. BTW, could we feed the eval mels to r9y9's WaveNet?

butterl commented 6 years ago

@syang1993 I tried with the 100K model and the output is good, but the eval text gets cut at a ",". E.g., "he'd like to help the girl, who's wearing the red coat." only produces the wav before the ", ", and produces everything when the "," is removed. I added some prints:

    wav = audio.inv_preemphasis(wav)
    print("wav len=" + str(len(wav)))     # length before endpoint trimming
    end_point = audio.find_endpoint(wav)  # finds the first long silence
    wav = wav[:end_point]
    print("wav len=" + str(len(wav)))     # length after endpoint trimming

wav len=400600, then wav len=102400, so it seems the wav is being cut here by the silence detection.

syang1993 commented 6 years ago

@butterl Without the endpoint trimming (the last two lines), does the generated wav contain the latter part of the speech? If it does, we may need to increase min_silence_sec (default 0.8) in the find_endpoint function. Thanks for pointing this out.
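
If the latter speech is there, a minimal sketch of the change (assuming find_endpoint exposes min_silence_sec as a keyword argument, as in keithito's audio.py):

    # Require a longer silence before trimming, so a pause at a comma is
    # not mistaken for the end of the utterance (default is 0.8 seconds).
    end_point = audio.find_endpoint(wav, min_silence_sec=2.0)
    wav = wav[:end_point]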

marymirzaei commented 6 years ago

I think the new code works very well. I trained up to 437k steps, and you can find the samples generated using your reference_audio-2.wav file at the following link: https://www.dropbox.com/sh/8cbrog2mtc8h8xw/AABOTLi0j8-06At3zdrHeQNra?dl=0

However, I found that every time I run eval from the same checkpoint I get different results. Why is that?

syang1993 commented 6 years ago

@lapwing Thanks for sharing; it sounds good. I'm not sure why it generates different results; there may be a generation issue. I'm on summer vacation these weeks and cannot test it, but I will look into what causes this problem later. If you figure it out, could you let me know? Thanks.

ZohaibAhmed commented 6 years ago

@syang1993 - is it possible to get the trained model that you used to generate the samples for eval-87k-r2.zip?

syang1993 commented 6 years ago

@ZohaibAhmed Hi, since I'm on summer vacation these weeks, I will send it to you after I go back to school. Besides, you can train this model yourself on the Blizzard 2011 database; it will not take that long.

peter05010402 commented 5 years ago

Thanks for your nice work. I have trained the model on the Blizzard 2013 dataset. The synthesized files from the 185k and 385k checkpoints are available at the following link. I used samples from LJ Speech (LJ001-0001.wav) and Nancy (nancy.wav) as reference files for checking the performance. I also included the model checkpoint files and the audio files at each step (step-185000-audio.wav, step-385000-audio.wav). https://www.dropbox.com/sh/jhcynw65o1tmj7r/AABJN4cBotdbs-A5-Rk89vt0a?dl=0 Any idea how to improve the shaky voice?

@lapwing Could you share the hyper-parameters? The pretrained model couldn't be reloaded with the default hyper-parameters. Thank you!

peter05010402 commented 5 years ago

@ZohaibAhmed Hi, since I'm on summer vacation these weeks, I will send it to you after I go back to school. Besides, you can train this model yourself on the Blizzard 2011 database; it will not take that long.

@syang1993 Hi, Could you send me the trained model that you used to generate the samples for eval-87k-r2.zip? Thank you!

ishandutta2007 commented 5 years ago

@lapwing Can you share the 437k model?

renerocksai commented 5 years ago

@lapwing Thanks for sharing; it sounds good. I'm not sure why it generates different results; there may be a generation issue. [...]

At the top of eval.py, before anything else is imported, I put

import random
random.seed(42)                  # Python's built-in RNG
import numpy
numpy.random.seed(42)            # NumPy's global RNG
from tensorflow import set_random_seed
set_random_seed(42)              # TensorFlow's graph-level seed

This sets a fixed seed for every random number generator that could be involved, and it does the trick. Now, I don't see any random numbers used in the gst-tacotron code itself that would cause randomness at inference time, but maybe something is going on in an imported lib. In any case, the fixed seeds give reproducible results.
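
Given the "Use random weight for GST." log and the tf.matmul(random_weights, tf.nn.tanh(gst_tokens)) line in the traceback earlier in this thread, one plausible source is the random style-token weights drawn when no reference audio is supplied. A hypothetical sketch of that kind of sampling (shapes and the sampling op are assumptions, not the repo's actual code):

    import tensorflow as tf

    # Without a fixed graph-level seed, this draw differs on every run,
    # which would yield a different style embedding each time.
    num_gst, batch_size = 10, 1
    random_weights = tf.random_uniform([batch_size, num_gst])
    # style_embeddings = tf.matmul(random_weights, tf.nn.tanh(gst_tokens))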

luantunez commented 3 years ago

Hello! Thank you for your work! Could you send me the pretrained model please? luantunez95@gmail.com