ibab / tensorflow-wavenet

A TensorFlow implementation of DeepMind's WaveNet paper
MIT License

Trying to condition on F0 (problem with embedding) #198

Open belevtsoff opened 7 years ago

belevtsoff commented 7 years ago

I'm using a simple implementation of local conditioning (similar to @alexbeloi's) to try to condition the input data on its fundamental frequency. I'm currently working with toy data, i.e. feeding in sinusoids with time-varying frequencies as inputs and their frequency profiles as the local conditioning stream. Now, if I use the local condition as a single-channel stream with the floating-point value of the current frequency (a.k.a. F0), the loss almost refuses to decrease, so the network can't really learn. On the other hand, if I one-hot encode each F0 value in the stream, then it sort of works (the loss drops very low and the generated results are reasonable). But of course one-hot encoding limits the resolution of possible F0 values.

My intuition is that one-hot encoding the local conditioning stream lets the network use different weights to activate particular subsets of filters (and inhibit the others) depending on the local condition, instead of simply adding some time-varying number to all filters. I guess the same intuition applies to global conditioning as well.

Now, does this all sound like bullshit to you guys? And does anybody know how DeepMind might have tackled conditioning on F0? Maybe their "learned upsampling" also served as a mini-network that creates useful F0 embeddings?

jyegerlehner commented 7 years ago

That sounds like a plausible speculation to me.

One thing you could do to make a higher-resolution discrete encoding of your F0 value require fewer resources is to encode it (probably linearly, not mu-law) into a relatively large number of discrete values (say 1024, or even 65536, or somewhere in between), and then use that integer to look up an encoding of the desired size (say, a size-16 vector) with tf.nn.embedding_lookup, instead of using a one-hot encoding. That might retain the better behavior you found with the one-hot encoding while not letting the size of the net get out of hand. Just a thought.
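
Something along these lines, just as a sketch (all the sizes, ranges, and names here are made up):

```python
import tensorflow as tf

# Sketch only: quantize F0 linearly into discrete bins, then look up a small
# learnable embedding instead of using a one-hot encoding.
quant_levels = 1024           # number of discrete F0 bins (made up)
embedding_size = 16           # size of the learned F0 embedding (made up)
f0_min, f0_max = 50.0, 500.0  # assumed F0 range in Hz

f0 = tf.placeholder(tf.float32, [None, None])  # [batch, time] scalar F0 track

# Linear (not mu-law) quantization to integer bin indices.
f0_clipped = tf.clip_by_value(f0, f0_min, f0_max)
f0_ids = tf.cast((f0_clipped - f0_min) / (f0_max - f0_min) * (quant_levels - 1),
                 tf.int32)

# The embedding table is itself a trainable variable.
f0_table = tf.get_variable('f0_embedding', [quant_levels, embedding_size])

# [batch, time, embedding_size] stream to use as the local condition.
f0_embedded = tf.nn.embedding_lookup(f0_table, f0_ids)
```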

In the branch I'm working on, I changed the audio input from a one-hot encoding followed by a conv filter into just such an embedding lookup, which I think is simpler, and... well... I never liked that thing where we performed a conv on one-hot values.

belevtsoff commented 7 years ago

@jyegerlehner Thanks for the hints! Yeah, doing convolutions on one-hot channels is really hard to interpret... I was thinking about doing gradient descent on a white-noise input to see what kind of filters the network develops, but then it would probably be hard to interpret them given the one-hot encoded nature of the input. Does using a scalar input hurt the quality of learning / output?

Regarding the embedding, would you suggest using a binary encoding of that integer as the embedding matrix? What about adding a "fan-out" layer instead, so that the network can learn its optimal embedding itself? I'm planning to try the latter as well.

GuangChen2016 commented 7 years ago

@belevtsoff @jyegerlehner Hi, I am working on local conditioning as well, and the speech I have synthesized so far is not that good. Do you have some speech examples synthesized from specific text? Thank you.

belevtsoff commented 7 years ago

@GuangChen2016 I didn't try conditioning on text features. We're now trying to condition on spectral features along with F0. I'll post some samples when we get any.

jyegerlehner commented 7 years ago

@belevtsoff "Hints" suggests I know the answers; I was merely tossing out a couple speculative ideas. I doubt I know any more than you.

> Regarding the embedding, would you suggest using a binary encoding of that integer as the embedding matrix? What about adding a "fan-out" layer instead, so that the network can learn its optimal embedding itself?

I think an embedding table provided to tf.nn.embedding_lookup is already a learnable parameter tensor. So I'm not sure how it's different from your fan-out (or fan-in) layer.

@GuangChen2016 No, not yet. What I'm doing is a long shot and might well not pan out. If I get interesting results I'll post about it, but don't hold your breath.

belevtsoff commented 7 years ago

@jyegerlehner

> I think an embedding table provided to tf.nn.embedding_lookup is already a learnable parameter tensor

Oh, I see now... that's the part I was missing. Indeed, this way it must be similar to what I was thinking. It's a ton of learnable parameters though... I'll see how it works out

sonach commented 7 years ago

@belevtsoff @jyegerlehner Thank you for the discussion of F0 conditioning. I am using just the floating-point F0 value, followed by 1x1 conv filters, in every layer, like this:

    Wf_lf0_ctx = self.weight_variable([1, 1, dilation_channels], 'ctx_lf0_filter')
    Wg_lf0_ctx = self.weight_variable([1, 1, dilation_channels], 'ctx_lf0_gate')
    lf0_filter = tf.nn.conv1d(lf0_ctx, Wf_lf0_ctx, stride=1, padding='SAME')
    lf0_gate = tf.nn.conv1d(lf0_ctx, Wg_lf0_ctx, stride=1, padding='SAME')
    filter_output = filter_output + lf0_filter
    gate_output = gate_output + lf0_gate

So I think I should use embeddings instead of floating-point values.

@GuangChen2016 I am working on local conditioning as well (for Chinese), and I am not getting good results either. My text context is about 900 dimensions (current phone, left phone, left-left phone, etc.). I am still struggling with this. My code for adding the text context to every layer looks like this:

    Wf_text_ctx = self.weight_variable([1, text_ctx_dim, dilation_channels], 'ctx_text_filter')
    Wg_text_ctx = self.weight_variable([1, text_ctx_dim, dilation_channels], 'ctx_text_gate')
    ctx_filter = tf.nn.conv1d(text_ctx, Wf_text_ctx, stride=1, padding='SAME')
    ctx_gate = tf.nn.conv1d(text_ctx, Wg_text_ctx, stride=1, padding='SAME')
    filter_output = filter_output + ctx_filter
    gate_output = gate_output + ctx_gate

lemonzi commented 7 years ago

@sonach For local conditioning, unless you have a very large network I would try with a simpler context, like left+current+right, or even just the current phoneme to see if the conditioning works at all.

ucasyouzhao commented 7 years ago

@lemonzi I have tried to concatenate F0 (floating point) with linguistic features. Finally, I got some Chinese results, but they don't sound very good. In particular, the tones are not very good.

weixsong commented 7 years ago

@ucasyouzhao, could you share some example waveforms generated with local conditioning? What kind of local conditioning did you add to the WaveNet network?

belevtsoff commented 7 years ago

Hey guys (@sonach, @ucasyouzhao), as described in some paper, we've tried to "vectorize" F0 by concatenating the F0 values of neighboring frames into a vector. This gives the network a "broader view" of what it's being conditioned on and also relaxes the problem of aligning the local condition with the corresponding sound. We've observed better results with this approach.

nakosung commented 7 years ago

Feeding the network with left-shifted conditions enables it to see the future. The receptive field can summarize neighboring frames, so it is equivalent to concatenating frames, but at lower cost.

belevtsoff commented 7 years ago

> Feeding the network with left-shifted conditions enables it to see the future

@nakosung Yeah, we tried that as well, but the network seems to be quite sensitive to the size of the shift. Providing it with a short F0 history should allow the network to pick an optimal shift itself.

nakosung commented 7 years ago

@belevtsoff The network cannot pick an optimal shift value for the future because it is only allowed to see the next two frames. If you provide shifted local conditions, the network can pick the optimal shift value.

belevtsoff commented 7 years ago

@nakosung I guess we're talking about different things. Let me clarify: if you supply F0(t + T) as a scalar value (where T is the left shift you suggested), then for every activation you have a single weight w to deal with F0(t + T), and by adjusting T you can search for an optimal value. Now, if you vectorize F0 as F(t) := [F0(t - T), F0(t - T + 1), ..., F0(t), ..., F0(t + T - 1), F0(t + T)], then each activation will take an inner product of F with a weight vector w, which should allow it to give preference to certain shift values by adjusting the individual elements of w. Equivalently, this is the same as doing a 1 x (2T+1) convolution over the local condition F0(t) instead of the 1 x 1 convolution from the paper. Does that make more sense now?
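
To make it concrete, here's a minimal sketch of the two variants (sizes and names are made up, this is not our actual code):

```python
import tensorflow as tf

# Minimal sketch (made-up names): scalar vs. vectorized F0 conditioning.
dilation_channels = 32   # assumed
T = 8                    # half-width of the F0 context window (assumed)

# [batch, time, 1] scalar F0 track, already upsampled to the sample rate.
lc = tf.placeholder(tf.float32, [None, None, 1])

# 1x1 conv: each activation sees only F0(t), one weight per channel.
w_scalar = tf.get_variable('lc_scalar', [1, 1, dilation_channels])
lc_scalar = tf.nn.conv1d(lc, w_scalar, stride=1, padding='SAME')

# 1x(2T+1) conv: each activation sees F0(t-T)..F0(t+T) and can weight
# each offset separately, which is the "vectorized F0" described above.
w_vector = tf.get_variable('lc_vector', [2 * T + 1, 1, dilation_channels])
lc_vector = tf.nn.conv1d(lc, w_vector, stride=1, padding='SAME')

# Either result is then added to the filter/gate pre-activations.
```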

nakosung commented 7 years ago

@belevtsoff I didn't explain what I modified: in my version the local condition is fed in together with the input, so it gets convolved by the network itself. This way, the local condition doesn't need to be vectorized as you describe.

weixsong commented 7 years ago

Looking at Heiga's slide, I'm confused about my understanding of local conditioning.

[screenshot of Heiga's slide]

My previous understanding of local conditioning was: we use 4094 samples (the receptive field) and the corresponding local condition of each of these samples to predict the next sample.

But from the slide, it seems that we use the 4094 samples and the local condition of the next sample to predict the next sample.

I'm not quite clear on this. If the local condition is added in the second way, the local condition should be left-shifted by one step.

Any suggestions?

nakosung commented 7 years ago

I fed the local condition, shifted by several frames, along with the one-hot encoded waveform.

weixsong commented 7 years ago

@nakosung, if you shift the local condition by several frames, for example 2 frames (which might be 2 * 80 samples), do you also left-shift the input samples by the same number of samples, so that the samples and the local condition stay aligned as before?

And if you shift by several frames, does this mean that you are not predicting the next sample, but actually predicting the sample several frames later (say, the sample after 2 * 80 samples)?
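
For example (a toy sketch, all names and numbers made up), is this the kind of shift you mean?

```python
import numpy as np

# Suppose 80 samples per frame, and we shift the local condition by 2 frames.
shift = 2 * 80

audio = np.random.randn(16000)   # raw samples, [time] (dummy data)
lc = np.random.randn(16000, 1)   # upsampled local condition, [time, channels]

# (a) Shift only the local condition: sample t is now paired with lc[t + shift],
#     i.e. the network sees 2 frames of "future" conditioning.
audio_a, lc_a = audio[:-shift], lc[shift:]

# (b) Shift both by the same amount: the pairing is unchanged, nothing is gained.
audio_b, lc_b = audio[shift:], lc[shift:]
```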

nakosung commented 7 years ago

@weixsong I don't think the local conditions and the input samples have to be aligned exactly. The network is trained to predict the next sample from the history of input samples and local conditions, so a shift between the input and the local condition doesn't cause a problem.

As you mentioned, shifting by several frames informs the network about "the future", which lets it shape the next samples more accurately. Since we know the entire future of the local conditions anyway, there seems to be no reason not to provide as much information as possible.

weixsong commented 7 years ago

@nakosung, thanks for the reply.

I have aligned my samples and local condition. I'm still not quite clear how to use the local condition. From Heiga's slide, it seems that we train the model to predict the next sample and also use the local condition of that next sample.

So, for my aligned samples and local condition, I need to shift both the samples and the local condition left by one step: shifting the samples aligns the labels with the network output, and shifting the local condition means the next sample's local condition is used to predict that next sample.

Is my understanding correct?

weixsong commented 7 years ago

@nakosung, one more question.

Currently the network takes 100,000 input samples, and we get 100,000 outputs. Is the first input sample used in the convolution to predict the first output? If the first sample is used, then the first output's target label is actually the second sample, so we need to shift the samples left by one to get the correct labels (targets) and then compute the cross entropy.
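
In other words, something like this (a toy sketch with made-up names, not necessarily how model.py actually does it):

```python
import tensorflow as tf

# Toy sketch of the shift I mean (made-up names).
quantization_channels = 256
one_hot = tf.placeholder(tf.float32, [None, None, quantization_channels])     # encoded inputs
raw_output = tf.placeholder(tf.float32, [None, None, quantization_channels])  # network logits

# If output[t] is a prediction of the *next* sample, its target is input[t+1]:
# drop the last output and the first input sample before the cross entropy.
logits = raw_output[:, :-1, :]
labels = one_hot[:, 1:, :]
loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=logits))
```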

Is that correct?