ibab / tensorflow-wavenet

A TensorFlow implementation of DeepMind's WaveNet paper
MIT License

Global condition and Local conditioning #112

Open thomasmurphycodes opened 7 years ago

thomasmurphycodes commented 7 years ago

In the white paper, they mention conditioning on a particular speaker as an input that is conditioned on globally, and the TTS component as an up-sampled (via deconvolution) input that is conditioned on locally. For the latter, they also mention that they tried just repeating the values, but found it worked less well than the deconvolutions.

Is there effort underway to implement either of these? Practically speaking, implementing the local conditioning would allow us to begin to have this implementation speak recognizable words.

ibab commented 7 years ago

Yeah, it's definitely a planned feature. I'll get to it eventually, but I'd also accept contributions if someone is interested. A solution to this should also integrate with the AudioReader interface.

Zeta36 commented 7 years ago

Is somebody working on this already?

alexbeloi commented 7 years ago

I'm starting to work on it; I think I can get a basic implementation working over the next couple of days. The global part should be easy, and a dumb implementation of local conditioning (upsampling by repeating values) should be fast to implement as well.

This way, we can get to a stage where the net produces some low-quality speech, then work on improving the quality by adding more sophisticated upsampling methods.

The white paper also talks about using local conditioning features beyond just the text data; they do some preprocessing to compute phonetic features from the text. That would be nice to add later as well.

thomasmurphycodes commented 7 years ago

I agree global will be easier; it should just be a one-hot vector representing the speaker. Am I wrong in thinking that local conditioning requires us to train on datasets that contain the phonetic data as a feature vector in addition to the waveform? What dataset are you thinking of using?


alexbeloi commented 7 years ago

I was thinking to just use the raw text from the corpus data for local conditioning to start: just encode each character into a vector and upsample it (by repeats) to the number of samples in the audio file. Not ideal, but it's a start. Characters should be able to act as a very rough proxy for phonetic features.

Ideally, the raw text should be processed (perhaps via some other model) into a sequence of phonetic features and then that would be upsampled to the size of the audio sample.
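A minimal sketch of that rough approach, character ids repeated out to the audio length (all names here are illustrative, not from any fork; you would one-hot or embed the ids before feeding them in):

```python
import numpy as np

def chars_to_local_condition(text, num_audio_samples,
                             vocab="abcdefghijklmnopqrstuvwxyz '"):
    """Map each character to an integer id, then upsample by repetition so
    there is one conditioning id per audio sample."""
    char_ids = np.array([vocab.find(c) + 1 for c in text.lower()])  # 0 = unknown
    # Repeat each character id roughly num_audio_samples / len(text) times.
    repeats = np.full(len(char_ids), num_audio_samples // len(char_ids))
    repeats[: num_audio_samples % len(char_ids)] += 1  # spread the remainder
    return np.repeat(char_ids, repeats)  # shape: [num_audio_samples]

ids = chars_to_local_condition("hello world", num_audio_samples=16000)
```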

thomasmurphycodes commented 7 years ago

I mean, let's give it a shot and see what happens. Google Research has a bunch of papers on their page about HMM-based mapping of characters to phonemes, so we could look into a subproject where we try to implement that.


nakosung commented 7 years ago

@thomasmurphycodes Could you post the list of papers?

thomasmurphycodes commented 7 years ago

Yeah, will do tomorrow when I'm in the office; they're on a box I have there.


ibab commented 7 years ago

I've also thought about just plugging in the raw text, but I'm pretty sure we would need at least some kind of attention mechanism if we want it to work properly (i.e. some way for the network to figure out which parts of the text correspond to which sections of the waveform).

thomasmurphycodes commented 7 years ago

I think that's the case for sure. They explicitly mention the convolution up-sampling (zero-padding) in the paper.


wuaalb commented 7 years ago

In #92, HMM-aligned phonetic features are already provided. The upsampling/repeating-values step is for going from a feature vector per HMM frame to a feature vector per time-domain sample.

rockyrmit commented 7 years ago

Found Merlin online. Has anyone here used their training data, the CMU_ARCTIC datasets, as linguistic features to train the WaveNet?

rockyrmit commented 7 years ago

(from one of the WaveNet co-authors): Linguistic features which we used were similar to those listed in this document. https://wiki.inf.ed.ac.uk/twiki/pub/CSTR/F0parametrisation/hts_lab_format.pdf

AFAIK there is no publicly available large TTS speech database containing linguistic features :-( So TTS research community (especially universities) often uses small ones.

One candidate is the CMU ARCTIC database together with the HTS demo. CMU ARCTIC has 4 US English speakers (about 1 hour per speaker) and is distributed with phoneme-level segmentations. The HTS demo shows how to extract the other linguistic features (described in the above-mentioned document) from raw text using festival. If you have any TTS experts / PhD researchers around, they may be familiar with how to use festival / the HTS demo.

Let me know if anyone wants to start working on the linguistic features and local conditioning.

Zeta36 commented 7 years ago

I think it is very important for this project not to die, so somebody should publish or share their implementation of local or global conditioning (even if it is unfinished). I'm afraid this project will get stuck in its current state if no one takes the next step.

I've done my best, but I'm afraid I have neither the equipment (no GPU) nor the knowledge to do much more than what I've already done.

alexbeloi commented 7 years ago

@Zeta36, @ibab Apologies for the delays, the local/global conditioning has been taking a bit longer than expected.

I can push my progress to my fork by tonight. What I have right now runs, though for some reason training stalls at exactly iteration 116 (i.e. the process will not continue to the next iteration, despite the default num_steps = 4000).

One of the main time sinks is that it takes a long time to train and then generate wav files to check if the conditioning is doing anything at all. No real way around that.

thomasmurphycodes commented 7 years ago

Possibly a memory overhead issue? Or is it converging?


alexbeloi commented 7 years ago

I figured out the issue with that. It was related to the file reader and queue: I created a second queue for the text files and was dequeuing text/audio together, but they became mismatched over time because of the audio slicing.

jyegerlehner commented 7 years ago

@alexbeloi

they became mismatched over time because of the audio slicing.

That sounds good. I'm glad you stumbled over that tripwire before I got to it :P.

I fear we may have duplicated some effort, but you are ahead of me. I hadn't got to the audio reader part yet. I've spent most of the time building out model_test.py so that we can test training and "speaker id"-conditioned generation. So perhaps we can combine your global conditioning with my test, or pick the better parts of both.

Have you by any chance incorporated speaker shuffling in your audio reader changes? I think we're going to need that so you might keep it in mind as you write that code, if not implement in the first PR.

alexbeloi commented 7 years ago

@jyegerlehner

The shuffling has been in the back of my mind. I haven't worked on it yet, definitely needs to get implemented at some point for the data to be closer to IID.

@ibab and all, I've caught up my changes with upstream/master and pushed them to my fork. So far I have the model and training part done for both global and local conditioning, but not the generation. I haven't been able to verify that the conditioning is working since I haven't gotten the generation working yet.

I want to clean it up more and modularize the embedding/upsampling before making a PR but if anyone wants to hack away at it in parallel, feel free.

https://github.com/alexbeloi/tensorflow-wavenet

Running the following will train the model with global conditioning on the speaker_id from the VCTK corpus data, and local conditioning on the corresponding text data: `python train.py --vctk`

The way I've implemented it, the conditioning gets applied to each dilation layer (not just the initial one); it's not clear to me from the paper if that's the intended method.
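For reference, a sketch of what "conditioning at every dilation layer" can look like (TF 1.x style; the variable names are illustrative and not the ones in the fork, and causal padding is glossed over):

```python
import tensorflow as tf

def gated_dilated_layer(x, h, dilation, residual_channels,
                        dilation_channels, condition_channels):
    """One WaveNet-style layer. x: [batch, time, residual_channels] audio features;
    h: [batch, 1, condition_channels] global condition embedding, which broadcasts
    over time. A local condition would instead have a time axis aligned with x."""
    w_filter = tf.get_variable('w_filter', [2, residual_channels, dilation_channels])
    w_gate = tf.get_variable('w_gate', [2, residual_channels, dilation_channels])
    v_filter = tf.get_variable('v_filter', [1, condition_channels, dilation_channels])
    v_gate = tf.get_variable('v_gate', [1, condition_channels, dilation_channels])

    # Dilated convolution of the audio (left-only causal padding omitted here) ...
    conv_filter = tf.nn.convolution(x, w_filter, padding='VALID',
                                    dilation_rate=[dilation])
    conv_gate = tf.nn.convolution(x, w_gate, padding='VALID',
                                  dilation_rate=[dilation])
    # ... plus a 1x1 convolution of the condition, added inside every layer.
    cond_filter = tf.nn.conv1d(h, v_filter, stride=1, padding='SAME')
    cond_gate = tf.nn.conv1d(h, v_gate, stride=1, padding='SAME')

    return tf.tanh(conv_filter + cond_filter) * tf.sigmoid(conv_gate + cond_gate)
```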

jyegerlehner commented 7 years ago

@alexbeloi I'm contemplating working from your branch and adding my test on top of it. Looking at your branch, I notice a few things:

https://github.com/alexbeloi/tensorflow-wavenet/blob/master/wavenet/model.py#L560

Here I was using tf.nn.embedding_lookup, not tf.one_hot, to go from the integer that specifies the "speaker_id" to a dense vector.

Compactness: I think one problem with using tf.one_hot instead of tf.nn.embedding_lookup is its effect on the size of the 'gcond_filter' and 'gcond_gate' parameter tensors. These occur in every dilation layer, and the size of each is global_condition_channels x dilation_channels. When using tf.one_hot, global_condition_channels equals the number of mutually exclusive categories, whereas with tf.nn.embedding_lookup, global_condition_channels specifies the embedding size and can be chosen independently of the number of mutually exclusive categories. This might be a size-16 or size-32 embedding, as opposed to a size-109 vector (to cover the speakers in the VCTK corpus).
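To illustrate the size difference, a sketch with 109 VCTK speakers and an assumed 32-channel embedding (names are placeholders):

```python
import tensorflow as tf

num_speakers = 109          # mutually exclusive categories in VCTK
embedding_channels = 32     # chosen independently of num_speakers

speaker_id = tf.placeholder(tf.int32, [None])  # [batch]

# One-hot: the condition vector is forced to num_speakers channels, so every
# per-layer V_{f,k}/V_{g,k} weight is [109, dilation_channels].
h_one_hot = tf.one_hot(speaker_id, depth=num_speakers)              # [batch, 109]

# Embedding lookup: the condition vector is embedding_channels wide, so the
# per-layer weights shrink to [32, dilation_channels].
embedding_table = tf.get_variable('speaker_embedding',
                                  [num_speakers, embedding_channels])
h_embedded = tf.nn.embedding_lookup(embedding_table, speaker_id)     # [batch, 32]
```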

Generality: Another problem is generality: one might wish to do global conditioning where there isn't an enumeration of mutually exclusive categories to condition on. Your approach works fine when there are only 109 speakers in the VCTK corpus, but what if one wishes to condition on some embedding vector produced by, say, seq2seq, or a context stack (2.6 in the paper)? I don't think the number of possible character sequences that correspond to valid sentences in a language could feasibly be enumerated, but you can produce a dense embedding vector of fixed size (say, 1000) that represents any sentence. The h in the equation at the bottom of page 4 of the paper can be any vector you want to condition on, but with tf.one_hot it can only enter the WaveNetModel as an integer enumerating all possible values.

Local conditioning, separate PR? I think it's usually good practice to break up large changes into smaller ones, so as not to try to "eat the elephant" all in one sitting. Global and local conditioning are each complicated enough that I think they are better in separate PRs. I'd suggest putting them in their own named branches rather than your master.

Local conditioning, hard-wired to strings: https://github.com/alexbeloi/tensorflow-wavenet/blob/master/wavenet/model.py#L566

I'm guessing your use of tf.string_to_hash_bucket_fast() is intended to process linguistic features (which come as strings? I don't really know). But the paper also mentions local conditioning for context stacks (section 2.6), which will not be strings but a dense embedding vector y, as in the equation at the top of page 5.

Local conditioning, upsampling/deconvolution: Your tf.image.resize_images does, I think, what they said doesn't work as well (page 5, last paragraph of 2.5). I think this needs to be a strided transposed convolution (AKA deconvolution).
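A sketch of the strided transposed-convolution upsampling being suggested here (using conv2d_transpose on a height-1 tensor; the function name, factor and channel counts are placeholders, not code from either branch):

```python
import tensorflow as tf

def upsample_conditions(lc, upsample_factor, channels):
    """Learned upsampling of local-condition features lc: [batch, frames, channels]
    to [batch, frames * upsample_factor, channels] via a strided transposed conv,
    instead of nearest-neighbour resizing / plain repetition."""
    batch = tf.shape(lc)[0]
    frames = tf.shape(lc)[1]
    lc = tf.expand_dims(lc, 1)  # [batch, 1, frames, channels] for conv2d_transpose
    filt = tf.get_variable('upsample_filter',
                           [1, upsample_factor, channels, channels])
    output_shape = tf.stack([batch, 1, frames * upsample_factor, channels])
    up = tf.nn.conv2d_transpose(lc, filt, output_shape,
                                strides=[1, 1, upsample_factor, 1], padding='SAME')
    return tf.squeeze(up, axis=[1])  # [batch, frames * upsample_factor, channels]
```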

So in short, what I'm proposing is that the global_condition vector h and the local_condition vector y come into the WaveNetModel class as dense vectors of any size from any source, and that any encoding (e.g. tf.one_hot or tf.nn.embedding_lookup) be done outside the WaveNetModel. Then, when we're working with VCTK we can do one_hot or embedding_lookup to produce global_condition, but when we're dealing with other things that produce a dense vector we can accommodate that too.

I think the approach you are taking works as long as all we care about is the VCTK corpus (or a few music genres) without context stacks. But context stacks are definitely on my roadmap, so I'd prefer not to see local conditioning hard-wired to strings.

Maybe the wider community is happy with your approach and if so perhaps they can speak up.

BTW these are my initial thoughts; I often miss things and am very persuadable.

Zeta36 commented 7 years ago

@alexbeloi you are doing a great job!!

I have replicated my text WaveNet implementation (#117) but using your model modifications for global and local conditioning. After training the model on texts in Spanish and English (with ID = 1 for the Spanish texts and ID = 2 for the English ones), I could later generate text in either language independently by setting the parameter --speaker_id to 1 or 2!!

This means that your global conditioning is working perfectly!!

Keep working on it!!

I would like to mention one thing about your code. In the AudioReader, when we iterate after reading the audio and cut the audio into buffers of self.sample_size, the ID and the text sometimes start to get mixed up.

Imagine, for example, that we read from a folder with 5 wav files and that load_vctk_audio() returns a tuple with the raw audio data, the ID of the speaker, and the plain text. If we set self.sample_size to None, then everything works fine, because we feed sample_placeholder, id_placeholder and text_placeholder correctly (the whole raw audio is fed into the sample placeholder at once). But, and this is important, if we set a sample_size, then the audio is going to be cut, and in some cases the id and the text start to get mixed up and the placeholders are fed incorrectly: for example, a sample placeholder gets fed raw data from two different wav files, with the ID and text wrongly assigned.

I had this problem with my text experiment, where at some points the sample placeholder contained both Spanish and English text at the same time.

alexbeloi commented 7 years ago

@jyegerlehner Thanks for the feedback, I agree with everything you've pointed out. My plan was to do hacky VCTK-specific embeddings, get the math right, then go back and replace them with more generic embeddings/upsampling.

@Zeta36 Thanks for verifying that some of these things work! I'll have to look at what you say regarding the sample_size. I thought that, the way I had it, it was queuing the same global condition for each piece that is sliced and queued from the sample.

Zeta36 commented 7 years ago

@alexbeloi, imagine you have 5 wav files, each with a different size. load_vctk_audio() yields the raw audio vector, the speaker id and the text the wav is saying. If you fill the sample placeholder, the id placeholder and the text placeholder all at once (self.sample_size equal to None), everything is correct. But if you set a sample_size to cut the audio into pieces, you have this problem:

1) We have 5 raw audio files and we start the first iteration with the buffer clean. We append the first raw audio vector to the buffer and cut the first piece of the buffer at sample_size, after which we feed the three placeholders. We then repeat, cutting another sample_size piece and feeding again.

2) We repeat this process while len(buffer_) > self.sample_size, so when cutting a piece leaves len(buffer_) less than or equal to self.sample_size, we ignore this last piece (this is the real problem) and restart the loop with a new raw audio file, a new speaker id and new text. But now the buffer is NOT clean as in the first loop; it still contains the remaining piece of the previous raw audio.

In other words, when we cut an audio vector, the last piece is ignored and stays in buffer_ for the next iteration. This is not a major problem when working without conditioning, as we have until now, but it cannot stay this way with conditioning, because in the second iteration you begin to mix raw audio data from different speakers and texts.

A quick solution would be to simply clear the buffer at the beginning of each iteration, in the line right after `for audio, extra in iterator:`, using `buffer_ = np.array([])`.

This would work, but it ignores the last piece of every audio file, which may not be a good idea.

Regards, Samu.

alexbeloi commented 7 years ago

@Zeta36 Ah, I see now.

If we don't want to drop the tail piece of audio, we can pad it with silence, queue it, and have the buffer cleared as you suggest. Or the choice (between dropping and padding) can be determined by whether silence_threshold is set or not.
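A sketch of that pad-or-drop tail handling in the reader loop (buffer_ and sample_size follow the discussion above; the function and enqueue_fn are illustrative, not code from the fork):

```python
import numpy as np

def slice_and_enqueue(audio, sample_size, enqueue_fn, pad_tail=True):
    """Cut one utterance into sample_size pieces, keeping the tail inside the same
    utterance so conditioning never mixes across speakers/texts."""
    buffer_ = np.array(audio, dtype=np.float32)
    while len(buffer_) >= sample_size:
        enqueue_fn(buffer_[:sample_size])
        buffer_ = buffer_[sample_size:]
    if len(buffer_) > 0 and pad_tail:
        # Pad the remainder with silence instead of carrying it into the next file.
        padded = np.pad(buffer_, (0, sample_size - len(buffer_)), mode='constant')
        enqueue_fn(padded)
    # else: drop the tail; either way the buffer starts clean for the next file.
```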

sonach commented 7 years ago

@alexbeloi Good job! (1) I notice your code for using local_condition: `conv_filter = conv_filter + causal_conv(local_condition, weights_lcond_filter, dilation)`. I think the local_condition doesn't need dilation, it is just a 1x1 conv, and it doesn't need causal_conv; plain conv1d is OK. So what was your consideration here? (2)

The way I've implemented it, the conditioning gets applied to each dilation layer (not just the initial one), it's not clear to me from the paper if that's the intended method.

I think what you've done is the intended method. V_{g,k} * y means that every layer (k is the layer index) has separate weights.
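For reference, the conditioned gated activation from the paper that this notation refers to (h is a single global vector, y is the upsampled local time series; the V_{f,k} * y and V_{g,k} * y terms are 1x1 convolutions):

```latex
% Global conditioning (section 2.5 of the paper):
z = \tanh\left(W_{f,k} \ast x + V_{f,k}^{\top} h\right) \odot \sigma\left(W_{g,k} \ast x + V_{g,k}^{\top} h\right)
% Local conditioning:
z = \tanh\left(W_{f,k} \ast x + V_{f,k} \ast y\right) \odot \sigma\left(W_{g,k} \ast x + V_{g,k} \ast y\right)
```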

chrisnovello commented 7 years ago

Exciting thread here!

fwiw: I re-recorded one of the entries from VCTK in my own voice and got decent babble results (https://soundcloud.com/paperkettle/wavenet-babble-test-trained-a-neural-network-to-speak-with-my-voice). I used the VCTK txt and recording style with the intention of later training on the full corpus + my voice in the mix. I'm planning to do more recordings, and I'd be happy to do them in a way that helps generate data with linguistic features (by adding markup myself, and/or reading passages designed with them in mind, etc.). I might be able to find some others to help with this as well. Let me know if any of this would be useful!

thomasmurphycodes commented 7 years ago

That's a great idea Chris. I wonder if we could create an expanded multi-speaker set on the VCTK text within this project.


linVdcd commented 7 years ago

@alexbeloi Hi, I used your code to train VCTK. But when I tried to generate a wav file, I got an error. This is the way I used the generate.py file: python generate.py --wav_out_path=out.wav --speaker_id=2 --speaker_text='hello world' --samples=16000 --logdir=./logdir/train/2016-10-18T12-35-15 ./logdir/train/2016-10-18T12-35-15/model.ckpt-2000

And I got the error: Shape must be rank 2 but is rank 3 for 'wavenet_1/dilated_stack/layer0/MatMul_6' (op: 'MatMul') with input shapes: [?,?,32], [32,32].

Did I miss something? Thank you.

alexbeloi commented 7 years ago

@lin5547 Hi, thanks for testing things. You haven't missed anything; the generation part is still a work in progress, unfortunately. I'm looking to have things working by the end of the week.

@sonach You're right, the paper says this should be just a 1x1 conv, will make the change.

sonach commented 7 years ago

@alexbeloi

The way I've implemented it, the conditioning gets applied to each dilation layer (not just the initial one), it's not clear to me from the paper if that's the intended method.

I discussed this with an ASR expert. In speaker adaptation applications, the speaker ID vector is applied to every layer instead of only the first layer. So your implementation should be OK :)

bryandeng commented 7 years ago

@rockyrmit If we use linguistic features in HTS label format, Merlin's front-end provides an out-of-the-box solution to the conversion from labels to NumPy feature vectors.

https://github.com/CSTR-Edinburgh/merlin/blob/master/src/frontend/label_normalisation.py#L45

jyegerlehner commented 7 years ago

@alexbeloi @Zeta36

I think you two were both working on some global conditioning code. I've got a branch with an implementation of global conditioning and a working test here. It does not implement anything in the file reader that reports back the ID. I implemented the test with toy data: we globally condition on a speaker id of 0, 1 or 2, and the model generates a sine wave of a different frequency depending on which ID is chosen.
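A sketch of how such toy data can be generated (this is not the actual test code; the frequencies and names are placeholders):

```python
import numpy as np

SAMPLE_RATE = 16000
FREQUENCIES = {0: 220.0, 1: 440.0, 2: 660.0}  # one frequency per toy speaker id

def make_toy_example(speaker_id, seconds=1.0):
    """Return (audio, speaker_id): a sine wave whose frequency depends only on the
    global condition, so a trained net should reproduce it from the id alone."""
    t = np.arange(int(SAMPLE_RATE * seconds)) / SAMPLE_RATE
    audio = 0.5 * np.sin(2.0 * np.pi * FREQUENCIES[speaker_id] * t)
    return audio.astype(np.float32), speaker_id
```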

I'd be just as happy to use one of your implementations instead of mine, especially since I think you've probably got the AudioReader modified to report speaker id and I don't have that. But I'd like to preserve the tests that I wrote.

So I'd like to know if you are still planning on contributing global conditioning, and have ideas on how to merge our various contributions.

vasquez75 commented 7 years ago

Greetings @alexbeloi and/or @Zeta36!

Can you give me detailed steps for how I can make this thing say actual words? I downloaded the repository to my local machine, created a subfolder named "VCTK-Corpus" in the tensorflow-wavenet-master directory, threw some wav files into the VCTK-Corpus folder I created, and ran python train.py --data_dir=VCTK-Corpus. I am now able to generate the alien sounds when I run python generate.py --samples 16000 model.ckpt-1000, but really I'd like to hear it talk. Note: I'm not using the real VCTK corpus. I have a bunch of wav files of my own and the text to go with them. Is there a specific set of pre-processing that I need to do? A step-by-step would be great. Let me know, thanks!

NickShahML commented 7 years ago

I just wanted to comment on this real quick with a few naive ideas.

From a ByteNet perspective, the decoder is conditioned on both the source network's output AND the output of the decoder's previous timestep.

Idea 1

Therefore, one simple strategy I had for local conditioning is just to sum the source network's output and the regular input. This does prevent the network from fully learning their distinct features, though.

Idea 2

Idea two would be to concatenate the inputs. In this way we would expand the "width" of the image by a factor of 2. However, the same convolutional kernel would be applied to both types of inputs.

Idea 3

Use tf.atrous_conv2d to include a height dimension rather than just width. The height could incorporate multiple signals (not just 2). This, in my opinion, would be the best option, and it comes at the cost of doubling parameter sizes.

Thoughts on these?

jyegerlehner commented 7 years ago

@LeavesBreathe It's frustrating that the ByteNet paper doesn't spell out how the decoder brings the source network's output into the decoder's "residual multiplicative" unit; it just shows a single vector coming in. Or did I just miss it? I guess you didn't see it either, which is why we're contriving our own way to do it.

Your ideas sound plausible to me.

On #3, if I understand you, the "extra" convolution seems redundant. The s vector is already the result of a time convolution, and the decoder network then does its own convolutions. I don't imagine it would hurt anything, but it makes the code more complicated.

A simpler thing, and probably my favorite at the moment, is like your idea #2, except couldn't we just concatenate s and t along the channels? So in Figure 1, if the s_8 value has m channels and the t_8 value has n channels, then the concatenated result has m+n channels, and that concatenation result is what flows through the res blocks in Figure 3, such that 2d = m+n, where 2d is as labelled in Figure 3.
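A sketch of that channel-wise concatenation (the shapes and channel counts here are placeholders, just to show the axis):

```python
import tensorflow as tf

# s: source-network output, [batch, time, m]; t: decoder/target input, [batch, time, n]
s = tf.placeholder(tf.float32, [None, None, 512])   # m = 512 (placeholder value)
t = tf.placeholder(tf.float32, [None, None, 512])   # n = 512 (placeholder value)

# Concatenate along the channel axis so the residual blocks see 2d = m + n channels.
decoder_input = tf.concat([s, t], axis=2)            # [batch, time, m + n]
```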

NickShahML commented 7 years ago

@jyegerlehner I'm with you -- idea 2 is the simplest to implement that also shows some promise. Code-wise, idea 3 would be an entire rewrite, as suddenly all tensors would need to be 4d.

I do believe, if you look at fig 3 in the ByteNet paper, that they do have 2d channels which they then reduce to d with a 1x1 conv. I honestly don't understand why it is done this way -- why don't they just keep the 2d the whole way? To keep the computation cheaper?

To be clear, we would concatenate these two inputs along dimension 2 of the input_batch tensor, correct?

Right now for text, my inputs into the entire WaveNet are [batch_size, timesteps, embedding_size]. If we concatenate the inputs, then we have [batch_size, timesteps, embedding_size*2].

I will also work on implementing symmetric padding in the meantime for those interested in the source network. I'm currently working on bytenet here: https://github.com/LeavesBreathe/bytenet_tensorflow

jyegerlehner commented 7 years ago

@LeavesBreathe

2d channels which they then reduce to d with a 1x1 conv. I honestly don't understand why it is done this way -- why don't they just keep the 2d the whole way? To keep the computation cheaper?

I bet it's mostly to save memory. Each of those many sigmoid, tanh and element-wise addition and multiplication ops in the multiplicative residual unit produces another tensor (that's at least umpteen of them), so making them smaller is a big memory saving. Plus, I think residual units are different from non-residual layers or blocks: a lower-dimensional bottleneck in a non-residual layer forces you to throw away information if you can't compress it, but you don't have that problem with residual layers, since everything that was in the input is still there at the output. Anyway, that's my theory.

To be clear, we would concatenate these two inputs along dimension 2 in the input_batch tensor correct?

Yeah, that's what I was thinking. But I think we missed something. Notice in Figure 2 where they talk about what happens when the source and target streams are of different lengths, as when the German source sentence ends before the English target. They say they simply don't condition the decoder on the source any more, which means the source contribution just goes away. So what would we do in our concatenation scheme? Maybe we could just set the missing source contribution to zeros. Or we could go back to your idea 1 and, instead of concatenating, sum s and t. That "feels" like it might be better to me, since they would have the same embedding space... and it would make more sense for the source contribution to just go away... maybe?

NickShahML commented 7 years ago

They say they simply don't condition the decoder on the source any more, which means the source contribution just goes away. So what would we do in our concatenation scheme? Maybe we could just set the missing source contribution to zeros. Or we could go back to your idea 1 and, instead of concatenating, sum s and t. That "feels" like it might be better to me, since they would have the same embedding space... and it would make more sense for the source contribution to just go away... maybe?

@jyegerlehner I'm pretty confident that in the case where you run out of source input timesteps, they pad the actual source network inputs with zeros. From the paper:

At each step the target network takes as input the corresponding column of the source representation until the target network produces the end-of-sequence symbol. The source representation is zero-padded on the fly: if the target network produces symbols beyond the length of the source sequence, the corresponding conditioning column is set to zero. In the latter case the predictions of the target network are conditioned on source and target representations from previous steps. Figure 2 represents the dynamic unfolding process.

So I don't believe summing is the way to go. Instead, I think the second idea, concatenating them, is the way to approach it.
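A sketch of that on-the-fly zero padding before concatenating (the function name and shapes are illustrative assumptions):

```python
import tensorflow as tf

def pad_source_to_target(source, target):
    """Zero-pad source [batch, src_time, channels] along time to match
    target [batch, tgt_time, channels]; columns past the source length are all
    zeros, so the decoder is conditioned only on its own past there."""
    src_len = tf.shape(source)[1]
    tgt_len = tf.shape(target)[1]
    pad_amount = tf.maximum(tgt_len - src_len, 0)
    padding = tf.zeros(tf.stack([tf.shape(source)[0], pad_amount,
                                 tf.shape(source)[2]]), dtype=source.dtype)
    padded = tf.concat([source, padding], axis=1)[:, :tgt_len, :]
    return tf.concat([padded, target], axis=2)  # channel-wise concat as above
```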


What you said about the residual units and memory savings makes a lot of sense to me, especially since d was reported to be 892 (not sure why they chose that number). I think I'll build this block on Sunday in my fork.


Also... I'm struggling to convert the causal convolution so that it will accept dilations on both sides, just like the source network. Maybe this is simpler than I think, but if you could help with that, it would be useful for those who want to use WaveNet as a classifier.

If the dilation rate were set to 8, then 4 holes would go to the left and 4 to the right. This is depicted in Figure 1 for the source network. I'm working on it here:

https://github.com/LeavesBreathe/bytenet_tensorflow/blob/master/bytenet/convolution_ops.py
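For the symmetric case, a sketch of a non-causal dilated 1-D convolution (TF 1.x; the function name and filter width are placeholders, not code from that repo):

```python
import tensorflow as tf

def non_causal_dilated_conv(x, out_channels, dilation, filter_width=3,
                            name='source_conv'):
    """x: [batch, time, channels]. 'SAME' padding with an odd filter width lets the
    output at time t see dilated inputs both to the left and to the right of t,
    which is what the non-causal source network needs (unlike the masked decoder)."""
    in_channels = x.get_shape()[-1].value
    with tf.variable_scope(name):
        w = tf.get_variable('filter', [filter_width, in_channels, out_channels])
        return tf.nn.convolution(x, w, padding='SAME', dilation_rate=[dilation])
```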

jyegerlehner commented 7 years ago

I'm pretty confident that in the case where you run out of source input timesteps, they pad the actual source network inputs with zeros.

Well sure, zeros. But that merely begs the question: are the zeros being added, or concatenated, to the target embedding?

especially since d was reported to be 892

Thanks. I never noticed that. I must have the reading-comprehension/attention-span of a gnat.

Also...I'm struggling to convert the causal convolution so that it will accept dilations from both sides just like the source network.

I don't see what the problem is. Sure, the decoder/target network is a WaveNetModel; it does causal/masked convolutions. The encoder/source network is not causal. It's just a run-of-the-mill conv net (doing convolutions in time). Well, not quite run-of-the-mill: I think you can work out the filter width, stride and dilation from the red part of Figure 1, or maybe from the words they wrote. I dunno; I tend to look at the pictures. They talk about how the input is an n-gram encoding (which sounds complicated) and so on, blah blah. I haven't looked into it closely. And there's that whole "sub-batch-normalization", which I would probably skip the first time around because who knows what that means. I haven't found batch normalization to be very helpful, but that could just be me.

if you could help with that

I'm still working on global conditioning for WaveNet, which we haven't merged yet. I intend to train WaveNet on the full VCTK corpus next. One thing at a time. So I'm afraid in the near term you're going to have to deal with this without me, but if you can wait long enough I might get back to it.

NickShahML commented 7 years ago

@jyegerlehner I'll respond to all of this on Sunday; unfortunately, I can't work on it today. I will try using tf.atrous_conv2d for the source network, as I believe it is more efficient; with tf.conv1d you're still using tf.conv2d internally.

Also, I've read this paper probably 10 times, so I definitely missed the 892 the first 5 times.

ibab commented 7 years ago

@LeavesBreathe: I might be misunderstanding how you want to use tf.atrous_conv2d, but note that its rate parameter sets the dilation rate for both dimensions at the same time, so it might not give you what you want.

NickShahML commented 7 years ago

@ibab and @jyegerlehner I'm back working on this. I wanted to try tf.atrous_conv2d because it would save on computation (and coding). I understand that the rate parameter is used for both height and width. In our case, we are only interested in the width.

However, if we set the height to just 1 (as we are currently doing), wouldn't this op still work? It can dilate all it wants in the height dimension, but there is only one height position it can receive values from. Perhaps this is a bad strategy to approach this with.

I'll think about this more -- perhaps I need to do the conv1d approach that you led with. I'm just confused as to how to make it a non-causal yet atrous (dilated) conv1d in code. The whole point of this would be for the source network. Let me work on this more and get back to you tomorrow.
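A sketch of the height-1 trick being discussed (the rate does apply to both dimensions, as ibab notes, but dilating a size-1 filter height is a no-op, so only the time axis is effectively dilated; shapes and channel counts are placeholders):

```python
import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 16000, 256])       # [batch, time, channels]
x4d = tf.expand_dims(x, 1)                                # [batch, 1, time, channels]
filters = tf.get_variable('filters', [1, 3, 256, 256])    # height 1, width 3

# rate dilates both spatial dims, but with filter height 1 only time dilation matters.
y = tf.nn.atrous_conv2d(x4d, filters, rate=8, padding='SAME')
y = tf.squeeze(y, axis=[1])                               # back to [batch, time, channels]
```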

alexbeloi commented 7 years ago

@ibab @jyegerlehner @Zeta36

Apologies for my absence/delay; I was in the middle of moving when I started working on this. Hopefully we can get the ball rolling again.

Zeta36 commented 7 years ago

Hello again, @alexbeloi :).

@jyegerlehner has done a lot of work on global conditioning; maybe you can use his work instead of continuing with your (now old) branch.

Regards.

GuangChen2016 commented 7 years ago

@alexbeloi @jyegerlehner @Zeta36 Hello, guys! Has any one of you worked out the local conditioning part? I am still working on this, but I don't have any idea how to insert the textual information into the network. One problem is that we cut the audio into fixed-size pieces; how can the textual information be matched with those fixed pieces? I am confused by that, but I am quite interested in this. Can you give me some advice or suggestions?

Best Regards.

Whytehorse commented 7 years ago

Input -> HMM/CNN -> Output Text -> HMM/CNN -> Speech

The training data should be the actual text, phonetic data, wav files and speaker id.

From the article https://deepmind.com/blog/wavenet-generative-model-raw-audio/, "Knowing What to Say":

In order to use WaveNet to turn text into speech, we have to tell it what the text is. We do this by transforming the text into a sequence of linguistic and phonetic features (which contain information about the current phoneme, syllable, word, etc.) and by feeding it into WaveNet. This means the network’s predictions are conditioned not only on the previous audio samples, but also on the text we want it to say.

If we train the network without the text sequence, it still generates speech, but now it has to make up what to say. As you can hear from the samples below, this results in a kind of babbling, where real words are interspersed with made-up word-like sounds:

ttslr commented 7 years ago

@Whytehorse Have you implemented local conditioning for WaveNet? Can you share the code?

Thank you very much! Best Regards!

Whytehorse commented 7 years ago

I don't have any code yet, since I'm still working on porting TensorFlow to my non-NVIDIA GPU. Anyway, my understanding is that you need a pre-made dataset of recorded speech that is time-aligned with text. Such datasets already exist; you can also use movies with subtitles and/or any TTS API.

JahnaviChowdary commented 7 years ago

@alexbeloi Is the local conditioning for the generation part done?

AlvinChen13 commented 7 years ago

Any progress on the local conditioning part?