ibab / tensorflow-wavenet

A TensorFlow implementation of DeepMind's WaveNet paper
MIT License

Which features to implement now? #189

Open fehiepsi opened 7 years ago

fehiepsi commented 7 years ago

After @jyegerlehner's implementation of global conditioning in #168, I would like to ask which features of the model we should focus on now, and which approach we should take for each. Here are some options which I selected from several threads:

nakosung commented 7 years ago

Local conditions. :+1:

fehiepsi commented 7 years ago

Agreed. jyegerlehner has mentioned that he has obtained good results generating sound and has implemented train and test sets. As for factoring, we can do that later, after these pull requests are accepted. I will try to implement local conditioning from now on.

jyegerlehner commented 7 years ago

@fehiepsi Whoa, let's not oversell things :) When I said "good results" just now in the other thread, I was only talking about scalar input being as good as discretized input. I don't think the results are as good as DeepMind's yet. I'm hopeful to see what effect koz4k's VALID convolutions have. Or I may try doing that myself.

@nakosung Regarding local conditions: it's easy enough to add local conditioning. But the big unsolved problem I see is in making it accomplish what we want: e.g. generic text-to-speech, or maybe I should say linguistic-features-to-speech. I had never heard of linguistic features until the Wavenet paper. The problem to me seems to come in figuring out how the time sequence of linguistic features is made to map to the local condition that is supplied to the WaveNet. Does one merely equally space the linguistic features over time, and then do a transposed convolution to upsample them to the audio wavenet local condition? That seems wrong in that one linguistic feature might be short in duration while another is longer. And you'd have to hand-engineer how long the generated sample is based on your knowledge of how long the text is that you want converted to speech. Maybe it's just my ignorance. Somehow DeepMind made it work for at least a cherry-picked case.

My understanding of the basic problem is: given sequence A (linguistic features) and sequence B (the generated audio), and you want to condition B on A, how do you decide which parts of A match up with which parts of B? I think this is the problem addressed by Lu et al and then carried on by DeepMind in this one. Something related to the Viterbi algorithm to find the jointly-most probable way to match up the two sequences. Which leaves me a bit disheartened because it's complicated and I don't know if I'm up for all that.

If someone understands all this better or has a clear idea of how to go about it, please chime in.

jyegerlehner commented 7 years ago

Another thing I'd add to the list is: replace the simple product of gate (sigmoid) and filter (tanh) with the fancy multiplicative unit described in the bytenet paper. I like the idea of a more expressive series of operations that doesn't increase the number of parameters.
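
For reference, the thing a multiplicative unit would replace is just this element-wise combination; a minimal numpy sketch (shapes are assumed, not the names used in this repo):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_unit(filter_out, gate_out):
    # Current WaveNet activation: element-wise tanh(filter) * sigmoid(gate).
    # Both inputs are the outputs of a layer's dilated convolutions,
    # assumed shape [batch, time, dilation_channels].
    # A ByteNet-style multiplicative unit would replace this combination.
    return np.tanh(filter_out) * sigmoid(gate_out)
```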

nakosung commented 7 years ago

@jyegerlehner I think we could learn a mapping from a series of letters to local conditioning info with proper timestamps (e.g. "Good morning" --> g....oo.....d.........m...o....rn....i......ng). There is some nice wave-to-text research (https://github.com/buriburisuri/speech-to-text-wavenet) by @buriburisuri, which learns a mapping from a wave to a series of characters (without timing information) by applying MFCC and a CTC loss.

So my idea is like below:

  1. Speech-to-text
    • Make speech-to-text-wavenet generate (text, MFCC, text w/ timing information) tuples for training.
  2. Text-to-aligned-local-condition
    • Learn a mapping from text --> (MFCC, text w/ timing information).
  3. WaveNet
    • Learn a mapping from (MFCC, text w/ timing info) --> wave.

fehiepsi commented 7 years ago

As I understand it, we can extract phonetic/linguistic features from the text using existing models: for every 5 ms we have one phone's features. In addition, another model is used to compute each phone's duration (an integer) such that the sum of all phone durations matches the audio duration (sample count). Once we have the phone durations, the remaining job is to arrange these phones into a list, each repeated according to its duration. Each phone frame lasts 5 ms, which corresponds to 80 audio samples, and we then use a transposed convolution to map from 1 to 80. In my understanding, because the phone durations are trained so that their sum matches the audio sample count, the alignment of A and B just works out.
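
To make the 1->80 mapping concrete, here is a rough sketch of the two options (plain repetition vs. a learned transposed convolution). The shapes, variable names, and TF 1.x-style code are my assumptions, not code that exists in this repo:

```python
import numpy as np
import tensorflow as tf

SAMPLES_PER_FRAME = 80  # one 5 ms frame = 80 samples at 16 kHz

def upsample_repeat(frame_features):
    # Simplest option: copy each frame's feature vector 80 times.
    # frame_features: [num_frames, feature_dim] -> [num_frames * 80, feature_dim]
    return np.repeat(frame_features, SAMPLES_PER_FRAME, axis=0)

def upsample_transposed_conv(frame_features, out_channels):
    # Learned option: a 1-D transposed convolution with stride 80, expressed
    # via conv2d_transpose by treating time as the "width" dimension.
    # frame_features: [batch, num_frames, in_channels], static shapes assumed.
    batch, num_frames, in_channels = frame_features.get_shape().as_list()
    x = tf.expand_dims(frame_features, 1)  # [batch, 1, num_frames, in_channels]
    filt = tf.Variable(tf.random_normal(
        [1, SAMPLES_PER_FRAME, out_channels, in_channels], stddev=0.05))
    out = tf.nn.conv2d_transpose(
        x, filt,
        output_shape=[batch, 1, num_frames * SAMPLES_PER_FRAME, out_channels],
        strides=[1, 1, SAMPLES_PER_FRAME, 1])
    return tf.squeeze(out, [1])  # [batch, num_frames * 80, out_channels]
```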

nakosung commented 7 years ago

@fehiepsi I think the phone-duration calculation can be done by a speech-to-text-wavenet-like model.

fehiepsi commented 7 years ago

@nakosung Thank you! I have looked at the speech-to-text-wavenet repo. It is a kind of speech-recognition network (which uses dilated convolutions instead of an LSTM). It is reasonable to follow your suggested steps during training. One question: they use a CTC loss (meaning the network tries to predict the most probable text for an input audio wave). Is the intermediate information, i.e. the phonemes (with blanks and repeats), taken from the intermediate output with the highest probability?

jyegerlehner commented 7 years ago

Thanks for those ideas and explanations. You are both way ahead of me.

@nakosung Wouldn't the "text-to-aligned-local-condition" portion of your scheme be unnecessary with the CTC idea you mentioned here?

[Edit] Never mind, that wouldn't make sense.

buriburisuri commented 7 years ago

@fehiepsi and @nakosung I think it's a challenging job to build a model that can learn the TTS task in an end-to-end fashion (sentence text -> sentence wave), but it'll be worth trying. As @fehiepsi mentioned above, I guess the softmax probability outputs (before the CTC beam search) could be used as timing information. But I'm afraid the full architecture may be too complex to run on GPUs. IMHO... T.T
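
To sketch how the pre-beam-search softmax could give timing (just my rough idea, not something speech-to-text-wavenet provides as-is): take the per-frame argmax, collapse repeated labels into runs, and drop the CTC blanks; each run's frame span is then a crude timing estimate for that character.

```python
import numpy as np

def rough_alignment(softmax_probs, blank_id):
    # softmax_probs: [num_frames, num_labels] per-frame CTC output (before beam search).
    # Returns (label, start_frame, end_frame) runs, i.e. crude per-character timing.
    frame_labels = np.argmax(softmax_probs, axis=1)  # greedy per-frame decode
    runs, start = [], 0
    for t in range(1, len(frame_labels) + 1):
        if t == len(frame_labels) or frame_labels[t] != frame_labels[start]:
            if frame_labels[start] != blank_id:      # drop CTC blanks
                runs.append((int(frame_labels[start]), start, t))
            start = t
    return runs
```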

nakosung commented 7 years ago

@buriburisuri glad to see you here! I think we can break down a monster end-to-end wavenet model into two smaller networks.

  1. seq2seq - Text (series of chars) --> Local condition w/timing(like phonemes)
  2. ibab/wavenet - Local condition w/timing --> Wave

The second network may fit within a TitanXP. IMO, the first network should be smaller than the WaveNet, so it might fit as well. :)

buriburisuri commented 7 years ago

@nakosung I agree that we need to break the overall network into smaller ones and train each independently first (and do a final joint training).

fehiepsi commented 7 years ago

Thank you all for the discussion! I have learnt many things from you in the past few days. ^^

I want to present my thoughts a little bit. According to the WaveNet paper, they use an LSTM with a mean-squared-error loss to predict durations (and a CNN for F0). So I think they do three steps. First, create phonetic features from the text, along with durations and F0, using an HMM. Then they build two models, an LSTM and a CNN: one to predict durations (input: phonetic features; output: the durations obtained from the HMM), the other to predict F0. Finally, feed the linguistic features, the durations (and their consequence: position in frame), and F0 into the local conditioning of the WaveNet.

For generation, put in the text, use some NLP model to get a sequence of phonetic features, then use the two models above (LSTM and CNN) to get durations and F0, and finally use the WaveNet model to generate raw audio. This way, I think the generation step is still fast enough.

I think I will follow the approach of the WaveNet authors first. For the HMM, we can use HTS. I am figuring out how to use it from thread #92.

Edit: I think that after we get the results from the HMM, we can feed them directly into the local conditioning of the WaveNet (no need to build the LSTM and CNN for training). We only need to build the LSTM and CNN for the generation step.
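
Roughly, the generation path I have in mind looks like this. Every object below (frontend, duration_lstm, f0_cnn, wavenet) is a hypothetical placeholder for a model we would still have to build; nothing here is existing code in this repo:

```python
import numpy as np

SAMPLES_PER_FRAME = 80  # 5 ms frames at 16 kHz

def build_local_condition(linguistic, durations, f0):
    # Repeat each phone's features for its predicted duration (in frames),
    # attach the F0 contour, then upsample to audio-sample resolution.
    per_frame = np.repeat(linguistic, durations, axis=0)       # [num_frames, feat_dim]
    per_frame = np.concatenate([per_frame, f0.reshape(-1, 1)], axis=1)
    return np.repeat(per_frame, SAMPLES_PER_FRAME, axis=0)     # [num_samples, feat_dim + 1]

def synthesize(text, frontend, duration_lstm, f0_cnn, wavenet):
    linguistic = frontend.text_to_features(text)    # [num_phones, feat_dim], e.g. from HTS
    durations = duration_lstm.predict(linguistic)   # [num_phones] integer frame counts
    f0 = f0_cnn.predict(linguistic, durations)      # [num_frames] F0 values
    local_condition = build_local_condition(linguistic, durations, f0)
    return wavenet.generate(local_condition)        # autoregressive sampling of raw audio
```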

GuangChen2016 commented 7 years ago

@fehiepsi I largely agree with you. But do you have ideas on how to feed these features (linguistic features, durations, and F0) into the local conditioning of the WaveNet so that they stay matched to the audio (since longer audio files are cut into small pieces)? Do you plan to transform each frame's (5 ms) information into 80 points?

fehiepsi commented 7 years ago

@GuangChen2016: In my opinion, for each audio file we first extract the sequence of local conditions and save it to a file with the same name (different extension). Then, whenever we load an audio file, we also read its corresponding local-conditioning file, and whenever we enqueue a piece from a file, we also enqueue the corresponding local condition (with local-condition length = piece length / 80). For the transform, I think we will use a transposed convolution (I have not looked at its details yet, but information about this kind of layer is available around the internet, e.g. here and here). These features are then fed into the dilated layers in the same way as the global condition, which is implemented in #168.
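
A rough sketch of the bookkeeping I mean; the file naming (.lc.npy), the loader, and the shapes are all my assumptions, not existing reader code:

```python
import numpy as np

SAMPLES_PER_FRAME = 80  # local-condition length = piece length / 80

def load_pair(wav_path, read_audio):
    # read_audio is whatever loader the reader already uses (placeholder here).
    # The local-condition features are assumed to be pre-extracted and saved
    # next to the wav under a different extension.
    audio = read_audio(wav_path)                         # [num_samples]
    lc = np.load(wav_path.replace('.wav', '.lc.npy'))    # [num_frames, lc_channels]
    return audio, lc

def slice_pair(audio, lc, sample_start, piece_size):
    # Enqueue a piece of audio together with the matching local-condition rows.
    # piece_size is assumed to be a multiple of SAMPLES_PER_FRAME.
    frame_start = sample_start // SAMPLES_PER_FRAME
    frame_count = piece_size // SAMPLES_PER_FRAME
    return (audio[sample_start:sample_start + piece_size],
            lc[frame_start:frame_start + frame_count])
```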

root20 commented 7 years ago

@GuangChen2016 Maybe Figure 1 of this paper can help your understanding: https://arxiv.org/abs/1606.06061

They extract the phoneme duration and copy the linguistic feature for that duration, and some additional features are concatenated to the copied linguistic features (e.g. the position of the current frame in the current phoneme). Then you can match the number of feature vectors (by copying) to the number of frames.
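
A small numpy sketch of that scheme (shapes assumed): copy each phoneme's linguistic feature vector for its duration and append the current frame's position within the phoneme as an extra feature.

```python
import numpy as np

def frame_level_features(linguistic, durations):
    # linguistic: [num_phonemes, feature_dim]; durations: [num_phonemes] frame counts.
    # Returns [total_frames, feature_dim + 1].
    rows = []
    for feats, dur in zip(linguistic, durations):
        for i in range(dur):
            position = i / float(dur)  # position of this frame in the phoneme, in [0, 1)
            rows.append(np.concatenate([feats, [position]]))
    return np.array(rows)
```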

fanskyer commented 7 years ago

Perhaps try voice conversion before text synthesis? For voice conversion you can easily obtain the phonemes, durations, F0, and much other information, which saves some hassle on the synthesis front end.

jyegerlehner commented 7 years ago

I was wondering if we couldn't just go straight from characters to speech and skip the whole alignment/phonemes problem. The ByteNet German/English translations show the net can deal with some misalignment across time, as word order is not always the same in the two languages. And Sejnowski did TTS to a certain extent in the '80s, although I don't know that the alignment was learnt there: 1 2.

Zeta36 commented 7 years ago

I wonder whether, for the alignment problem, we could train an autoencoder with our window size as the "code" layer size. That way we could extract the patterns, reduce the dimensionality, and compress the main linguistic features into a fixed-window-size tensor to feed the local conditioning model.

The autoencoder could use as input, for example, something like #92, and after training we would have an encoder with a fixed-size output (the shape of the "code" layer = the shape of the WaveNet window) for each speaker in the corpus.
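
Something like this toy TF sketch is what I'm imagining (purely speculative; the code-layer size would match the WaveNet window):

```python
import tensorflow as tf

def autoencoder(features, code_size):
    # features: [batch, input_dim] linguistic features (e.g. derived from #92);
    # code_size: the WaveNet window / receptive-field size.
    input_dim = features.get_shape().as_list()[-1]
    w_enc = tf.Variable(tf.random_normal([input_dim, code_size], stddev=0.05))
    w_dec = tf.Variable(tf.random_normal([code_size, input_dim], stddev=0.05))
    code = tf.tanh(tf.matmul(features, w_enc))       # fixed-size "code" = local condition
    reconstruction = tf.matmul(code, w_dec)
    loss = tf.reduce_mean(tf.square(reconstruction - features))
    return code, loss
```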

I'm just speculating. I don't know if I'm saying something silly.

Regards.

tsiaras commented 7 years ago

The easiest way to implement local conditioning is the following (as rockyrmit commented on 12 Oct, using Merlin and the CMU_ARCTIC dataset):

In order to use this dataset, download Merlin (http://www.cstr.ed.ac.uk/projects/merlin/). Then go to merlin/egs/slt_arctic/s1 and run run_full_voice.sh. You should edit run_full_voice.sh and run only steps 1, 2 and 3; avoid steps 4 and 5, because step 4 changes the duration of the binary label files from the original duration to the predicted duration. After step 3 has created the binary labels in merlin/egs/slt_arctic/s1/experiments/slt_arctic_full/acoustic_model/data/binary_label_425, you should either min-max normalize the data or discard data fields 417:425. A naive pseudo-code for the normalization: for i = 1:425, if (max_values(i) - min_values(i) > 0) then new_labels(i, :) = 0.98*(labels(i, :) - min_values(i))/(max_values(i) - min_values(i)) + 0.01.

The wav files are time-aligned with the normalized data. However, the normalized data are per frame, so we must either upsample them by 80 (e.g. np.repeat(labels, samples_per_frame, 0)) or use a transposed convolution. The final program is similar to the one with global conditioning (see #168).
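
In Python the normalization and the upsampling look roughly like this; I'm assuming the labels are loaded as a [num_frames, 425] float array, and for brevity the min/max here are per utterance, whereas in practice they should be computed over the whole training set:

```python
import numpy as np

SAMPLES_PER_FRAME = 80

def minmax_normalize(labels):
    # labels: [num_frames, 425] Merlin binary-label features.
    # Squash each feature column into [0.01, 0.99]; constant columns stay 0.
    min_v = labels.min(axis=0)
    max_v = labels.max(axis=0)
    span = max_v - min_v
    out = np.zeros_like(labels, dtype=np.float32)
    keep = span > 0
    out[:, keep] = 0.98 * (labels[:, keep] - min_v[keep]) / span[keep] + 0.01
    return out

def to_sample_rate(labels):
    # Upsample per-frame labels to per-sample local conditions;
    # a transposed convolution would be the learned alternative.
    return np.repeat(labels, SAMPLES_PER_FRAME, axis=0)
```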

GuangChen2016 commented 7 years ago

@tsiaras So, have you implemented local condition in this way?

tsiaras commented 7 years ago

Yes, but I do not have a GPU and the program is still running on CPU. As of now (21000 iterations) you can hear the synthesized sentence, but the quality is not very good.

Zeta36 commented 7 years ago

@tsiaras, could you please share a fork of that implementation with some datasets included? We have people here with lots of great GPUs to help with the training, and it would be wonderful to finally have a synthesized sentence sample.

tsiaras commented 7 years ago

Tomorrow I will upload the code and some test data. However, the code is in a very experimental state and needs improvement from the community.

Zeta36 commented 7 years ago

Thank you, @tsiaras :).

GuangChen2016 commented 7 years ago

@jyegerlehner I am using your implementation of global conditioning, but I got very poor results compared with yours. I just use --gc_channels=32 to train the global-condition model and keep the other parameters the same as yours. Do you have any suggestions for generating a good voice like yours? Thank you very much.

tsiaras commented 7 years ago

You can find a very basic implementation at https://github.com/tsiaras/tensorflow-wavenet/. From the speech I have synthesized so far, it seems there is poor synchronization between labels and audio at the beginning of generation, but the speech progressively improves.

Please read the README.txt files in the base directory and in data directory.

GuangChen2016 commented 7 years ago

@tsiaras Can you share some examples you have synthesized so far?

jyegerlehner commented 7 years ago

@GuangChen2016 I think I produced those results with this branch. It has a few more changes beyond just the global-conditioning changes; I'm not certain that is what makes the difference. The script I ran to train is traingc.sh in that branch, though the learning rate may have been adjusted.

tomlepaine commented 7 years ago

Has anyone here tried out a bytenet style model for text to speech yet? I'm really curious how well it would perform :smile:

GuangChen2016 commented 7 years ago

@jyegerlehner Thank you very much. I have used the branch you suggested. However, the results I have got so far are still far from yours and seem very bad. Any suggestions for me? Thank you very much.

jyegerlehner commented 7 years ago

@GuangChen2016 Does your loss curve look like this?

[screenshot: training loss curve]

This is the curve I got training using that branch.

I see a difference between that working directory and what's in the branch you referred to: resumegc.sh shows a switch from rmsprop to adam. I dropped the learning rate every now and then by an order of magnitude as training progressed.

I was running a daily version of TF just a bit earlier than the 0.12 release.

GuangChen2016 commented 7 years ago

@jyegerlehner No, my final loss is a bit higher than yours, as this picture shows: [screenshot: training loss curve]. I don't know what makes the difference. Are you sure you used that branch and traingc.sh for training? Or can you tell me about any other changes you made? Thank you.

jyegerlehner commented 7 years ago

@GuangChen2016 traingc.sh is only for starting training. To resume training after changing the learning rate I used resumegc.sh, but you would have to adjust the learning rate in that file because it's set to the last value I used.

A diff in the working directory shows only minor differences:

jim@sc1:~/dev/models/tensorflow/tensorflow-wavenet$ git pull origin fullcorpus_lg 
From https://github.com/jyegerlehner/tensorflow-wavenet
 * branch            fullcorpus_lg -> FETCH_HEAD
Already up-to-date.
jim@sc1:~/dev/models/tensorflow/tensorflow-wavenet$ git status
On branch fullcorpus_lg
nothing to commit, working directory clean
jim@sc1:~/dev/models/tensorflow/tensorflow-wavenet$ git diff fullcorpus_lg origin/fullcorpus_lg 
diff --git a/.gitignore b/.gitignore
index b1c19c5..231d390 100644
--- a/.gitignore
+++ b/.gitignore
@@ -1,4 +1,4 @@
 logdir/
 VCTK-Corpus/
 *.pyc
-*.wav
+
diff --git a/generategc.sh b/generategc.sh
index c8bb5ca..f81ed88 100644
--- a/generategc.sh
+++ b/generategc.sh
@@ -1,2 +1,2 @@
-python generate.py --samples 80000 --wav_out_path generated_r_280_500k.wav --fast_generation false --gc_channels=48 --gc_id=280 --gc_cardinality=377 /media/mass1/logs/gc/train/201
+python generate.py --samples 80000 --wav_out_path generated_s_p280_112K.wav --fast_generation false --gc_channels=32 --gc_id=280 --gc_cardinality=377 /media/mass1/logs/gc/train/20

diff --git a/resumegc.sh b/resumegc.sh
index ed12f23..f9b94b1 100644
--- a/resumegc.sh
+++ b/resumegc.sh
@@ -1 +1 @@
-python train.py --data_dir=/media/mass1/audio_data/VCTK-Corpus --logdir=/media/mass1/logs/gc/train/2016-11-19T01-48-23 --learning_rate=0.0005 --momentum=0.9 --optimizer=adam --gc_
+python train.py --data_dir=/media/mass1/audio_data/VCTK-Corpus --logdir=/media/mass1/logs/gc/train/2016-11-19T15-25-07 --learning_rate=0.00008 --momentum=0.9 --optimizer=adam --gc
diff --git a/wavenet_params.json b/wavenet_params.json
index b9da360..22acb03 100644
--- a/wavenet_params.json
+++ b/wavenet_params.json
@@ -10,7 +10,7 @@
     "residual_channels": 48,
     "dilation_channels": 48,
     "quantization_channels": 256,
-    "skip_channels": 1578,
+    "skip_channels": 1536,
     "use_biases": true,
     "scalar_input": true,
     "initial_filter_width": 2,
jim@sc1:~/dev/models/tensorflow/tensorflow-wavenet$ 

The --silence_threshold changed. Skip channels is slightly different but I don't think that would make a big difference.

GuangChen2016 commented 7 years ago

@jyegerlehner OK, thank you very much. I will try that.

weixsong commented 7 years ago

@tsiaras, could you share some generated waves with local conditioning?

liangmin0020 commented 7 years ago

@tsiaras May I ask several questions about your implementation of local conditioning? 1) Why use a coefficient of 0.1 in the initialization of the local-condition weights? 2) Why not use biases for the local condition? 3) What is the purpose of lines 94-95 in audio_reader.py for the label features?

iv-ivan commented 7 years ago

Is it possible to implement WaveNet for stereo input (for example, a .wav with 2 channels that depend on each other)?

Zeta36 commented 7 years ago

https://arxiv.org/pdf/1703.10135.pdf

Wonderful!

https://google.github.io/tacotron/ (Audio samples)

nakosung commented 7 years ago

We need to implement this one! :)

fehiepsi commented 7 years ago

Things change so fast! >"<

getnamo commented 7 years ago

looks like someone started: https://github.com/Kyubyong/tacotron

Spotlight0xff commented 7 years ago

There's also this repository: https://github.com/MU94W/Tacotron

greigs commented 7 years ago

And this one... https://github.com/barronalex/Tacotron

weixsong commented 7 years ago

@nakosung, have you ever tried Parrot? Could you get it to run correctly? From the repo, it seems the author has not updated it for a long time.