ibab / tensorflow-wavenet

A TensorFlow implementation of DeepMind's WaveNet paper
MIT License

Global condition and Local conditioning #112

Open thomasmurphycodes opened 7 years ago

thomasmurphycodes commented 7 years ago

In the white paper, they mention conditioning on a particular speaker as an input that is conditioned on globally, and the TTS linguistic features as an input that is conditioned on locally after up-sampling (deconvolution). For the latter, they also mention that they tried simply repeating the values, but found it worked less well than the deconvolutions.

Is there effort underway to implement either of these? Practically speaking, implementing the local conditioning would allow us to begin to have this implementation speak recognizable words.
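
For concreteness, here is a minimal numpy sketch of the two upsampling options the paper compares for local conditioning: repeating each conditioning frame versus a transposed convolution with stride equal to the frame hop (shown here with a fixed triangular kernel instead of a learned one; `hop_length` and the feature shapes are illustrative assumptions).

```python
import numpy as np

# Illustrative shapes only: 20 local-conditioning frames with 80 channels
# (e.g. linguistic or mel features), upsampled to one vector per audio sample.
frames = np.random.randn(20, 80).astype(np.float32)
hop_length = 256  # assumed number of audio samples per feature frame

# Option 1 from the paper: simply repeat each frame hop_length times.
repeated = np.repeat(frames, hop_length, axis=0)        # (20 * 256, 80)

# Option 2: a toy transposed convolution with stride = hop_length.
# In a real model the kernel would be learned; this fixed triangular kernel
# amounts to linear interpolation between neighbouring frames.
kernel = 1.0 - np.abs(np.linspace(-1, 1, 2 * hop_length))            # (512,)
upsampled = np.zeros((frames.shape[0] * hop_length + hop_length, 80), np.float32)
for t, frame in enumerate(frames):
    start = t * hop_length
    upsampled[start:start + 2 * hop_length] += kernel[:, None] * frame
upsampled = upsampled[hop_length // 2 : hop_length // 2 + repeated.shape[0]]

print(repeated.shape, upsampled.shape)  # both (5120, 80)
```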

wilsoncai1992 commented 7 years ago

@alexbeloi Any idea how we could verify the local conditioning, and then proceed to completing .wav output?

rafaelvalle commented 7 years ago

@wilsoncai1992 by verify do you mean checking that it's working appropriately? One could, for example, iteratively slice the audio into N equal sized regions and interpolate between 2 conditions, increasing N at each iteration but keeping the number of conditions constant. I would choose 2 conditions that are supposedly easy to train and perceived as dissimilar, e.g. speech and music or Portuguese and German speech or speech and non-speech.
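
A rough numpy sketch of that check, assuming the model accepts a per-sample (or per-region) condition embedding; the embedding size, the number of regions, and the region length are arbitrary:

```python
import numpy as np

def interpolated_condition(cond_a, cond_b, n_regions, samples_per_region):
    """Per-sample conditioning that moves from cond_a to cond_b in n_regions steps.

    cond_a, cond_b: 1-D condition embeddings (e.g. two speaker/genre vectors).
    Returns an array of shape (n_regions * samples_per_region, embedding_dim).
    """
    weights = np.linspace(0.0, 1.0, n_regions)               # one weight per region
    regions = [(1 - w) * cond_a + w * cond_b for w in weights]
    return np.repeat(np.stack(regions), samples_per_region, axis=0)

# Toy usage: two 16-dim "conditions" (say, speech vs. music embeddings).
speech, music = np.random.randn(16), np.random.randn(16)
cond = interpolated_condition(speech, music, n_regions=8, samples_per_region=4000)
print(cond.shape)  # (32000, 16)
```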

beppeben commented 7 years ago

Hi guys,

I've given a little thought to local conditioning, with the final goal of training the network to do TTS. I am by no means an expert in this domain; I've just tried to come up with some ideas for the task that seemed reasonable to me (but could very well be trivial or wrong). I would love to hear your opinions on this.

So here's how I imagine this to work:

Let's take an input text T that we would like to use as a local condition to the net. Each entry T(i) is a one-hot encoding of a character (so T(i) has dimension 30 or something). We might add two special characters START and END. We could first compute a vector TT in order to roughly identify phonemes from small groups of contiguous letters. Say we use 3 letters, so phoneme i will be defined as

TT(i) = f(T(i-1)*w1 + T(i)*w2 + T(i+1)*w3)

where f is some nonlinear function (sigmoid or similar) acting pointwise on the input vector. We can associate a feature vector H(i) to each phoneme i, intuitively describing how it sounds. So we can write

H = f(W_H*TT)

Now we need to find a scalar duration S(i) for each phoneme i, giving a measure of how long it will last. We could do this by quantizing a set of possible durations and then defining

S = soft_max(W_S*TT)
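
To make the three formulas above concrete, here is a small numpy sketch; all dimensions, weight shapes, and the sigmoid/softmax choices are assumptions, since the proposal leaves them open:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

n_chars, char_dim = 12, 30          # text length, one-hot size (incl. START/END)
feat_dim, n_durations = 64, 16      # assumed sizes of H(i) and the duration grid

T = np.eye(char_dim)[np.random.randint(char_dim, size=n_chars)]  # one-hot text

# TT(i) = f(T(i-1)*w1 + T(i)*w2 + T(i+1)*w3): a width-3 "phoneme" window.
w1, w2, w3 = (np.random.randn(char_dim, char_dim) for _ in range(3))
T_pad = np.vstack([np.zeros(char_dim), T, np.zeros(char_dim)])
TT = sigmoid(T_pad[:-2] @ w1 + T_pad[1:-1] @ w2 + T_pad[2:] @ w3)  # (12, 30)

# H = f(W_H * TT): a feature vector per phoneme describing how it sounds.
W_H = np.random.randn(char_dim, feat_dim)
H = sigmoid(TT @ W_H)                                              # (12, 64)

# S = softmax(W_S * TT): a distribution over quantized durations per phoneme,
# collapsed here to an expected duration in samples (grid values are assumed).
W_S = np.random.randn(char_dim, n_durations)
duration_grid = np.linspace(200, 4000, n_durations)
S = softmax(TT @ W_S) @ duration_grid                              # (12,)
```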

Now we need to transform this into a local condition at sample t. This conditioning L(t) could be defined as a "weighted" mixture between neighbouring phonemes

L(t) = sum_i H(i)*N((t-M(i,t))/S(i))

where N denotes the Gaussian pdf and M(i,t) is the position of phoneme i in the sample, as seen from sample t. We could instead model the increments MM(i,t) = M(i,t) - M(i-1,t), so that we can impose that all components are positive.
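
Continuing the sketch above, the mixture L(t) could look like the following; this is only illustrative, with the phoneme positions M(i) held fixed rather than evolving with t as proposed:

```python
import numpy as np

def gaussian_pdf(x):
    return np.exp(-0.5 * x ** 2) / np.sqrt(2.0 * np.pi)

def local_condition(H, S):
    """L(t) = sum_i H(i) * N((t - M(i)) / S(i)), one conditioning vector per sample."""
    M = np.cumsum(S) - 0.5 * S           # centre of phoneme i on the sample axis
    n_samples = int(np.ceil(M[-1] + 3 * S[-1]))
    t = np.arange(n_samples)[:, None]    # (n_samples, 1)
    weights = gaussian_pdf((t - M[None, :]) / S[None, :])   # (n_samples, n_phonemes)
    return weights @ H                   # (n_samples, feat_dim)

# Toy stand-ins for H and S from the previous block: 12 phonemes, 64-dim
# features, durations between 500 and 2000 samples.
H = np.random.randn(12, 64)
S = np.random.uniform(500, 2000, size=12)
L = local_condition(H, S)
print(L.shape)   # (n_samples, 64)
```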

The process could be initialized with uniform spacing of phonemes over the wave, i.e. MM(i,0) = sample_size/text_size. But then it can evolve as the wave progresses. So for example the position of the first phoneme M(1, t) could get delayed over time with respect to the initial value, if the wave realizes that it's mostly generating silence at the beginning.

The processes MM(i,·) could be modeled using an RNN (or a set of RNNs?) in a way that I haven't completely figured out. The idea, however, is that their generation be conditioned on the input and dilation layers of the original WaveNet, and that everything be trained as one single net.

A potential issue I see is that this has to be trained with a sample size that contains all the given text, since cutting the input text in the middle of a sentence is always somewhat arbitrary: you can never be sure that the sound wave after the split contains the corresponding text. So this can be computationally challenging.

What do you think? Has something like this already been done?

Whytehorse commented 7 years ago

@beppeben That's similar to the hidden Markov model approach taken by the likes of Carnegie Mellon University in CMU Sphinx. It produces speech that sounds like Stephen Hawking. The newer approach uses recurrent neural networks (LSTMs). In this approach you need parameters and hyper-parameters. The hyper-parameters can be anything, but for speech they are likely to be frequency range, pitches, intonations, durations, etc. These are what are actually being trained, though they could also be formulated via FFT and other functions to get a 100% accurate speech synthesizer without training. The parameters could be anything, but for speech they are things like text (what to say) and voice (how to say it). We could even extend these models to include many more parameters, like mood, speed, etc.

rafaelvalle commented 7 years ago

I assume conditioning can be done in the same way as in char2wav, where a decoder learns vocoder features from a sequence of characters and feeds them into WaveNet for training. Note that char2wav is trained end-to-end. https://mila.umontreal.ca/en/publication/char2wav-end-to-end-speech-synthesis/

beppeben commented 7 years ago

@rafaelvalle Thanks for pointing that out. char2wav seems to use a type of local conditioning that was first introduced in a paper by Graves in the context of handwriting generation. I agree that the same attention-based local conditioning (working directly on the text input, with no separate feature extraction, trained jointly with the main net) could be used for the WaveNet.

I was surprised to notice that this conditioning works quite similarly to what I proposed above, being based on a mixture of Gaussians that optimally weights the part of the text to focus on at each point in the sample.

In my proposal, every character (or group of 3 characters) has its own Gaussian weight, while in Graves' paper the weight of each character is determined by the sum of K Gaussian functions (which are the same for all characters).

Also, in my proposal the Gaussians are defined and centered on the sample space, while in Graves' they live on the character space. I imagined their means to represent the location of each character (or phoneme) on the wave, while in Graves' paper the means \kappa have a less clear interpretation. Graves' approach is most probably more convenient than mine, though, since the number of Gaussians K is fixed and does not depend on the number of characters.

Implementing that on the WaveNet would require deciding which variables to use to determine the conditioning weights. Graves uses the first layer of the same RNN that generates the handwriting. char2wav probably does something similar, since its wave generation is also based on an RNN. WaveNet could use some channels of the dilation layers for that purpose, though I wonder whether they hold enough memory for the task.
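
For reference, a minimal numpy sketch of the Graves-style window being discussed: K Gaussians live on the character axis, with means kappa that (in the full model) increase monotonically over time. Everything else here, the shapes and parameter values, is an illustrative assumption.

```python
import numpy as np

def graves_window(char_embeddings, alpha, beta, kappa):
    """One step of the Graves (2013) attention window.

    char_embeddings: (n_chars, dim) one-hot or embedded text.
    alpha, beta, kappa: (K,) mixture weights, widths, and positions on the
    character axis (kappa is cumulative across timesteps in the full model).
    Returns the window vector w_t of shape (dim,).
    """
    u = np.arange(char_embeddings.shape[0])           # character indices
    phi = (alpha[:, None] * np.exp(-beta[:, None] * (kappa[:, None] - u) ** 2)).sum(0)
    return phi @ char_embeddings

# Toy usage: 12 characters, 30-dim one-hots, K = 3 Gaussians.
chars = np.eye(30)[np.random.randint(30, size=12)]
K = 3
alpha, beta = np.full(K, 1.0), np.full(K, 0.5)
kappa = np.array([2.0, 2.5, 3.0])   # in the full model: kappa_t = kappa_{t-1} + exp(...)
w_t = graves_window(chars, alpha, beta, kappa)
print(w_t.shape)  # (30,)
```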

beppeben commented 7 years ago

I implemented a Graves-style local conditioning in my own fork, by introducing another WaveNet at the bottom of the main one which computes the character attention weights to be fed into the main net.

I couldn't get any satisfactory results yet; it doesn't seem easy to learn the correct text/wave alignment without a prior segmentation step (as they do in Deep Voice, for example).

But it's also true that I only have an old CPU, so I had to greatly reduce the network size and sampling rate to get any training done at all; maybe better results could be achieved with some extra computing power.

dp-aixball commented 7 years ago

@beppeben Any samples? Thanks!

jakeoverbeek commented 7 years ago

Hello guys.

Any progress on the local conditioning? We are working on a thesis about TTS, and the WaveNet model looks pretty interesting. We have the computing power to test things. @beppeben @alexbeloi @jyegerlehner did you guys make any progress?

matanox commented 6 years ago

@beppeben I think the original WaveNet article unfortunately discloses little of the alignment method. It seems to say not much more than "External models predicting log F0 values and phone durations from linguistic features were also trained for each language.".

Table 2 of the article also seems to imply that without these alignments, their own MOS results were not impressive compared to the legacy methods. I find that a little troubling in terms of scientific disclosure. The Baidu Deep Voice papers do include some guidance, but this still looks to me like one of the hardest parts of the architecture to reproduce for a given input dataset!

In stark contrast, the Tacotron paper/architecture eliminates the need for this data-preparation step, or at least it doesn't require phoneme-level alignment between the (audio, text) pairs as part of its input.

potrepka commented 6 years ago

Hey, just an idea, since I'm only starting to get into all this (and may actually be thinking of doing a PhD now!): it seems to me that pitch detection in music would be a much easier route if you're looking to generate a corpus to test local conditioning.

kastnerkyle commented 6 years ago

For interested parties: Merlin (https://github.com/CSTR-Edinburgh/merlin) has a tutorial on how to extract both text features and vocoder features from audio, as well as some code doing HMM-based alignment from text features to audio features. Alternatively, a weaker version of the WaveNet pipeline could take the per-timestep recognition information out of Kaldi, maybe using something like Gentle (https://github.com/lowerquality/gentle). I have experimented with these some for "in-the-wild" data. The appendix of the WaveNet paper also discusses the LSTM setups, and Deep Voice has a similar setup, which could be helpful reading.

r9y9 commented 6 years ago

Hi, I have also implemented global and local conditioning. See https://github.com/r9y9/wavenet_vocoder for details. Audio samples are found at https://r9y9.github.io/wavenet_vocoder/. I think that is ready to use as a mel-spectrogram vocoder with DeepVoice or Tacotron.

EDIT: you can find some discussion at https://github.com/r9y9/wavenet_vocoder/issues/1.

rafaelvalle commented 6 years ago

This is to confirm that we also got global and local conditioning to work for a WaveNet decoder conditioned on mel spectrograms, the same way @r9y9 has done in his repo, i.e. upsampling the conditioning with upsampling layers and adding the projected features inside the tanh and sigmoid of the gated units. It would be very useful to have a successful implementation of local conditioning on linguistic features!
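
For anyone reading along, the "addition inside the tanh and sigmoid" is the conditional gated activation from the WaveNet paper, here with the global and local terms combined: z = tanh(W_f*x + V_f*y + U_f*h) ⊙ sigmoid(W_g*x + V_g*y + U_g*h). A minimal numpy sketch follows; all shapes and names are illustrative, not taken from any particular repo.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_unit(x, local_cond, global_cond, W_f, W_g, V_f, V_g, U_f, U_g):
    """Conditional gated activation: z = tanh(W_f*x + V_f*y + U_f*h) * sigmoid(...).

    x:           (T, residual_channels)   output of the dilated convolution
    local_cond:  (T, cond_channels)       already upsampled to one frame per sample
    global_cond: (gc_channels,)           e.g. a speaker embedding, broadcast over time
    The W/V/U matrices play the role of the 1x1 convolutions in the paper.
    """
    filt = x @ W_f + local_cond @ V_f + global_cond @ U_f
    gate = x @ W_g + local_cond @ V_g + global_cond @ U_g
    return np.tanh(filt) * sigmoid(gate)

# Toy shapes: 100 timesteps, 32 residual channels, 80 mel channels, 16-dim speaker code.
T, R, C, G = 100, 32, 80, 16
z = gated_unit(np.random.randn(T, R), np.random.randn(T, C), np.random.randn(G),
               *(np.random.randn(*s) for s in
                 [(R, R), (R, R), (C, R), (C, R), (G, R), (G, R)]))
print(z.shape)  # (100, 32)
```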

aleixcm commented 6 years ago

Hi @rafaelvalle, is this version available somewhere? Best.

rafaelvalle commented 6 years ago

@aleixcm NVIDIA has recently released standalone CUDA code with faster-than-real-time inference for WaveNet conditioned on mel spectrograms. https://github.com/NVIDIA/nv-wavenet/

candlewill commented 5 years ago

This thread is very long now and has not seen discussion for nearly a year. Here is some of my understanding; I hope someone can point out any mistakes.

It is not well explained how to implement the up-sampling (deconvolution or repetition) used to add linguistic local-conditioning features. Since the local conditioning has to be up-sampled to the same time-series length as the audio, we need to know the audio length before up-sampling. During training the audio length is known, but it is not known at prediction time, which makes the up-sampling impossible. One way around this is an attention mechanism, but that is not mentioned in the paper. Another is to use other local-conditioning features (e.g., mel features, as in r9y9's implementation), where each frame corresponds to a fixed number of samples.
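
To make the last point concrete: with frame-based features such as mel spectrograms, the number of output samples is fixed by the conditioning itself (n_frames * hop_length), so the up-sampling can be done before generation starts even at prediction time. A tiny numpy sketch of that bookkeeping, with hop_length, the shapes, and the model call all as assumptions:

```python
import numpy as np

hop_length = 256                        # assumed number of audio samples per mel frame
mel = np.random.randn(40, 80)           # 40 frames of 80-band mel features

# At prediction time the output length is fixed by the conditioning itself:
n_samples = mel.shape[0] * hop_length               # 10240 samples to generate
cond = np.repeat(mel, hop_length, axis=0)           # (10240, 80), one row per sample

# During autoregressive generation, step t simply reads cond[t];
# the model call below is hypothetical and left as a comment.
for t in range(n_samples):
    local_cond_t = cond[t]
    # sample_t = wavenet_step(previous_samples, local_cond_t)
```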

SatyamKumarr commented 5 years ago

@alexbeloi Great work on implementing local conditioning. You have done this for text-to-speech synthesis by providing (text, speech) pairs for the local conditioning. I want to know how local conditioning is done for voice conversion. Any idea how to update the parameters in your code in order to pass acoustic features?

@Zeta36 @alexbeloi Is it necessary to use an LSTM, or an RNN with GRUs, to preserve the sequence? https://github.com/ibab/tensorflow-wavenet/issues/117#issuecomment-251682838