ibab opened this issue 7 years ago
I've got a laundry list of things I'd like to try and plan to explore the space of possibilities. Here are a few that come to mind.
> 2. I've also been thinking that the number of channels can now vary throughout the convolutional stack
Perhaps you're seeing something I'm not, but I don't see how the channels can vary. The res block's summation forces the input shape to equal the output shape, so num channels can't change. Oh, or perhaps you are saying within a single block, the channels can vary, so long as we end up with the same shape at input and output of any single block?
> 3. That would conflict with the presence of the gated activation unit and the 1x1 convolution inside each of the layers, right?
Err.. Sorry I probably wasn't expressing clearly what I intended. I don't see any conflict at all. Everything would be exactly as it is now. Except one thing: the input to the net would be n samples x 1 scalar amplitude, instead of n samples x 256 one-hot encoding. The initial causal conv filter will still produce the same number of channels it does now, so nothing downstream would see any impact.
2: Yeah, I was thinking about changing the number that's referred to as dilation_channels in the config file on a per-block basis, but got confused. This would require #48.
4: Ah, so basically we would allow the network to learn its own encoding of the floating point samples. Wouldn't that make the quantization procedure unnecessary?
In your one-hot proposal, we would be assuming that the network tries to learn how to perform a quantization. In the current implementation, the network learns how to encode a random variable with a multinomial distribution that has temporal dependencies across trials.
The obvious model that we would all like is using the float value with "classical" filters, but then we need to choose a loss function. The authors said that most loss functions on floats assume a particular distribution of the possible values (I think the squared loss corresponds to a normal distribution), while the multinomial from the one-hot encoding makes no assumptions at the expense of having a finite set of possible values. Apparently, lifting this constraint gives better results despite the quantization noise.
The downside is that the SNR will always be limited by the 8-bit resolution, so at some point we should be able to find a better model -- or scale it up to a one-hot encoding with 60k+ categories.
@ibab
> Wouldn't that make the quantization procedure unnecessary?
No, we'd still be using 1) softmax at the output to provide a discrete probability distribution over the 256 quantized values, 2) the quantization procedure for producing target output during training, and 3) and sampling from the discrete prob distribution of the softmax in order to produce the output (followed by the inverse of the companding quantization to get it back to a scalar amplitude).
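For reference, the companding quantization from 2) and its inverse from 3) can be sketched in numpy roughly like this (a sketch of mu-law companding as described in the paper, not the repo's exact code):

```python
import numpy as np

def mu_law_encode(audio, quantization_channels=256):
    """Map float amplitudes in [-1, 1] to integer bins via mu-law companding."""
    mu = quantization_channels - 1
    # Compression: allocates more bins to quiet amplitudes than loud ones.
    magnitude = np.log1p(mu * np.abs(audio)) / np.log1p(mu)
    signal = np.sign(audio) * magnitude
    # Shift from [-1, 1] to integer bins [0, mu].
    return ((signal + 1) / 2 * mu + 0.5).astype(np.int32)

def mu_law_decode(bins, quantization_channels=256):
    """Invert the companding quantization back to float amplitudes."""
    mu = quantization_channels - 1
    signal = 2 * (bins.astype(np.float32) / mu) - 1
    return np.sign(signal) * (np.expm1(np.abs(signal) * np.log1p(mu)) / mu)

audio = np.array([-1.0, -0.1, 0.0, 0.1, 1.0])
recovered = mu_law_decode(mu_law_encode(audio))
# The round trip is lossy (quantization noise) but close at 8-bit resolution.
```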
@lemonzi
> In your one-hot proposal...
I sincerely think what I'm proposing is being misunderstood. It's probably my fault :). I understand and like the rationale given for the softmax output in section 2.2 of the paper; I wasn't proposing getting rid of it. I don't have a one-hot proposal; the current code uses a one-hot encoding as an input.
But no matter; at some point perhaps I'll get around to trying what I propose, and bring it back here if it works any better. It does seem more likely, based on my reading of the paper, that it's what they are doing than what the current code does. I don't have high confidence though, and of course could easily be wrong.
This is what I understood:
- floating-point input with raw audio (no mu-law)
- convolution with 1 input channel and N output channels
- N-to-M channel layers
- layer that aggregates the skip connections with 256 outputs
- softmax
- cross-entropy against a mu-law + quantization encoding of the input
It's worth a shot, but the network is no longer an auto-encoder.
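A shape-level sketch of what that first-layer change amounts to (the channel counts and the width-2 convolution are illustrative, not taken from the repo's config):

```python
import numpy as np

n_samples, residual_channels = 1000, 32

# Current scheme: one-hot input, the first filter maps 256 channels -> 32.
one_hot_input = np.zeros((n_samples, 256))
w_one_hot = np.random.randn(2, 256, residual_channels)  # filter_width = 2

# Proposed scheme: scalar input, the first filter maps 1 channel -> 32.
scalar_input = np.random.uniform(-1, 1, (n_samples, 1))
w_scalar = np.random.randn(2, 1, residual_channels)

def causal_conv(x, w):
    """Minimal width-2 causal convolution: y[t] = x[t-1] @ w[0] + x[t] @ w[1]."""
    padded = np.vstack([np.zeros((1, x.shape[1])), x[:-1]])
    return padded @ w[0] + x @ w[1]

# Both variants emit the same output shape, so nothing downstream changes.
assert causal_conv(one_hot_input, w_one_hot).shape == (n_samples, residual_channels)
assert causal_conv(scalar_input, w_scalar).shape == (n_samples, residual_channels)
```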
BTW, I don't like sampling from the multinomial in generate.py; I'd rather generate floats from a given distribution and quantize them, which is closer to feeding it a given raw audio seed.
@ibab, how about a tag for these "strategical issues" and one issue per idea?
@jyegerlehner Or maybe you meant mu-law encoding the input but skipping the quantization to avoid damaging the SNR?
> This is what I understood
OK you understood me well then. Perhaps I was misunderstanding you.
> It's worth a shot, but the network is no longer an auto-encoder.
I'm trying to see in what sense it was ever an auto-encoder. I don't think it is/was.
> BTW, I don't like sampling from the multinomial in generate.py; I'd rather generate floats from a given distribution and quantize them
Not sure I follow your alternative to the softmax. I was mostly trying to stick to figuring out what the authors had most likely done in their implementation. I bet we all have ideas about different approaches we'd like to try out.
> Or maybe you meant mu-law encoding the input but skipping the quantization to avoid damaging the SNR?
No I never thought of that.
> how about a tag for these "strategical issues" and one issue per idea?
Right, I feel bad about hijacking ibab's thread. I like the strategic issues tag idea. I prefer not to clutter this thread any more with this one topic.
As the topic of this issue is just a general "What else should we try?" I think the discussion is perfectly fine 👍 But feel free to open new issues to discuss strategies. I can tag them with a "strategy" label.
We should limit ourselves to a single speaker for now. That will allow us to perform multiple epochs on the train dataset.
Right. I also think when we train on multiple speakers we need to shuffle the clips. I fear we may be experiencing catastrophic forgetting. That little sine-wave unit test I wrote shows how quickly it can learn a few frequencies, which makes me think once it starts getting sentences from a single speaker, it forgets about the pitch and other characteristics of the previous speaker(s).
But single-speaker training is less ambitious and an easier first step.
My two cents:
Scalar input: WaveNet treats audio generation as an autoregressive classification task. The model requires the last step's output to be provided at the input. I don't think there's much to be gained by providing scalar floating-point values at the input. They would still need to be reduced to 8-bit resolution (or, as @lemonzi mentions, you'd be asking the model to learn quantization). You might save some computational cycles at the first layer. However, I think the scale of the input would then need to be considered more closely.
Input shuffling: this would probably be very useful.
Silence trimming: Shouldn't the model be allowed to see strings of silent samples? Otherwise it will learn to generate waveforms having more discontinuities. I suggest that the degree of trimming is decided as a function of the size of the receptive field. E.g. truncate silences to no less than 75% of the receptive field.
Oh, that makes sense. It's classifying the next sample, not encoding the sequence as a whole.
The trimming is currently applied to the beginning and end of the samples, not to the gaps in between speech. If there are long silence chunks in the samples, what could make sense is to split them in two rather than stripping out the silence.
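Splitting clips on interior silence rather than trimming could look something like this (a minimal energy-based sketch; the frame size and threshold are placeholder values, not tuned against VCTK):

```python
import numpy as np

def split_on_silence(audio, threshold=0.02, frame_size=512):
    """Return (start, end) sample indices of non-silent chunks, judged
    frame-by-frame on mean absolute amplitude."""
    n_frames = len(audio) // frame_size
    loud = [np.mean(np.abs(audio[i*frame_size:(i+1)*frame_size])) > threshold
            for i in range(n_frames)]
    chunks, start = [], None
    for i, is_loud in enumerate(loud):
        if is_loud and start is None:
            start = i * frame_size
        elif not is_loud and start is not None:
            chunks.append((start, i * frame_size))
            start = None
    if start is not None:
        chunks.append((start, n_frames * frame_size))
    return chunks

# Tone - silence - tone: should yield two chunks instead of one trimmed clip.
t = np.arange(16000) / 16000.0
clip = np.concatenate([np.sin(2*np.pi*440*t), np.zeros(16000),
                       np.sin(2*np.pi*440*t)])
print(len(split_on_silence(clip)))  # -> 2
```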
I've just managed to generate a sample that I think sounds pretty decent: https://soundcloud.com/user-952268654/wavenet-28k-steps-of-100k-samples
This is using Tanh instead of ReLU to avoid the issue that the ReLU activations eventually cut off the network.
I stopped it at one point to reduce the learning rate from 0.02 to 0.01, but it doesn't look like it had a large impact.
I started generating when the curve was at about 28k steps.
I used only two stacks of 9 dilation layers each:
{
"filter_width": 2,
"quantization_steps": 256,
"sample_rate": 16000,
"dilations": [1, 2, 4, 8, 16, 32, 64, 128, 256,
1, 2, 4, 8, 16, 32, 64, 128, 256],
"residual_channels": 32,
"dilation_channels": 16,
"use_biases": false
}
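For reference, one common way to count the receptive field of such a config (assuming the initial causal convolution contributes one extra filter width; this matches the repo's discussion, but treat it as a sketch):

```python
def receptive_field(filter_width, dilations):
    """Each dilated layer widens the field by (filter_width - 1) * dilation;
    the initial causal convolution adds the final filter_width."""
    return (filter_width - 1) * sum(dilations) + filter_width

two_stacks = [1, 2, 4, 8, 16, 32, 64, 128, 256] * 2
print(receptive_field(2, two_stacks))  # -> 1024 samples, i.e. 64 ms at 16 kHz
```

With this count, the two-stack config above comes out at 1024 samples, consistent with the "~1000 samples" figure quoted in this thread; five full stacks of [1..512] plus [1..64] come out at 5244 samples, i.e. ~328 ms.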
Nice!
I've noticed that generating from the same model doesn't always produce interesting output. But if I start off with an existing recording, it seems to work very reliably: https://soundcloud.com/user-952268654/bootstrapping-wavenet-with-an-existing-recording
Considering that the receptive field of this network is only ~1000 samples, I think the results sound quite promising.
Can you test with argmax instead of random.choice?
@lemonzi: After swapping out random.choice with argmax, it always returns the same value. I think that makes sense, as staying at the same amplitude is the most likely thing to happen at the resolution we work with.
Interesting...
When calculating the mean amplitude with
sample = np.int32(np.sum(np.arange(quantization_steps) * prediction))
it just produces noise for me.
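A toy comparison of the three decoding strategies discussed here (the distribution is synthetic and the code is a sketch, not the repo's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

# A synthetic predictive distribution over 256 quantized amplitudes,
# peaked around bin 128 (silence after mu-law companding).
logits = -0.5 * ((np.arange(256) - 128.0) / 3.0) ** 2
prediction = np.exp(logits) / np.exp(logits).sum()

sampled = rng.choice(256, p=prediction)                  # what generate.py does
greedy = int(np.argmax(prediction))                      # always picks the mode
mean = int(round(np.sum(np.arange(256) * prediction)))   # expected bin index

# Argmax collapses to the mode every step, which is why generation sticks
# at one amplitude; the mean averages in the *quantized* domain, so after
# the non-linear mu-law decode it no longer tracks the mean amplitude.
print(greedy, mean)  # -> 128 128
```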
Very cool work, guys! As a text-to-speech person, I am excited to see where this effort may lead.
As far as generating good-sounding output, I believe I have some thoughts to add regarding point 3 in @jyegerlehner's list, on the use of floating point values vs. one-hot vectors for the network inputs. I hope this is the right issue in which to post them.
I met with Heiga Zen, one of the authors of the WaveNet paper, at a speech synthesis workshop last week. I quizzed him quite a bit on the paper when I had the chance. My understanding is that there are two key motivations for using (mu-law companded) one-hot vectors for the single-sample network output:
Note that both these key concerns only are relevant at the output layer, not at the input layer. As far as the input representation goes, scalar floating-point values have several advantages over a one-hot vector discrete representation:
Seeing that WaveNet is based on PixelCNNs, it might be instructive to consider how the latter handle and encode their inputs. There appears to be a working implementation of PixelCNNs on GitHub, but I haven't looked sufficiently deeply into it to tell how they encode their input.
Has everyone been reproducing ibab's results? I got a result similar to his, but I think it sounds a bit smoother; I'm guessing because the receptive field is a little bigger than his.
2 seconds: https://soundcloud.com/user-731806733/speaker-p280-from-vctk-corpus-1
10 seconds: https://soundcloud.com/user-731806733/speaker-280-from-vctk-corpus-2
{
"filter_width": 2,
"sample_rate": 16000,
"dilations": [1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
1, 2],
"residual_channels": 32,
"dilation_channels": 32,
"quantization_channels": 256,
"skip_channels": 1024,
"use_biases": true
}
[Edit] After mortont's comment below: I used learning_rate=0.001.
I made a copy of the corpus directory, except I only copied over the directory for speaker p280. I stopped training at about 28K steps, to follow ibab's example. Loss was a bit lower than his, around 2.0-2.1.
I think to get pauses between words and such we need a wider receptive field. That's my next step.
By the way, anyone know how to make soundcloud loop the playback, instead of playing music at the end of the clip, like ibab did? Pro account needed for that?
[Edit] Here's one from a model that has about a 250 ms receptive field, trained for about 16 hours: https://soundcloud.com/user-731806733/generated-larger-1
Those results sound great!
We should consider linking to them from the README.md to demonstrate what the network can do.
It seems likely that we will be able to reproduce the quality of the DeepMind samples with a higher receptive field.
On soundcloud, you can set an audio clip to repeat in the bar at the bottom, but I don't think this will affect other listeners. Not sure why my clip was on repeat by default for you.
This is definitely the best result yet! What commit did you use to achieve this @jyegerlehner? I tried reproducing it using the same hyperparameters and only speaker p280 from the corpus, but my model hasn't gone under a loss of 5 after 26k steps.
@mortont I'm not sure exactly which commit to this branch it was exactly: https://github.com/jyegerlehner/tensorflow-wavenet/tree/single-speaker But most are trivial and frankly I don't think it matters. I haven't observed it breaking at any point.
I've started training a newer model with latest updates from master and it is working fine. I don't have any "special sauce" or changes to the code relative to master that I can think of. The only reason for a separate branch for it is to allow me to change the .json file and add shell scripts, and be able to switch back to master without losing files.
I'm trying to imagine why you would have loss stuck at 5 and... can't think of a good reason. Perhaps compare the train.sh and resume.sh in my branch to the command-line arguments you are supplying and see if there's an important difference? Learning rate perhaps?
[Edit]: I observe the loss to start dropping right away, within the first 50 steps. Loss drops to < 3 rapidly well before 1K steps. So if you don't see that, I think something is wrong.
Looks like it was learning rate, I changed it from 0.02 to 0.001 and it's now steadily dropping, thanks!
I also got some good results using the following wavenet_params.json, only training on speaker p280:
{
"filter_width": 2,
"sample_rate": 16000,
"dilations": [1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
1, 2],
"residual_channels": 32,
"dilation_channels": 32,
"quantization_channels": 256,
"skip_channels": 1024,
"use_biases": true
}
with a learning rate of 0.001. I trained it for ~15k steps with a sample size of 70k: https://soundcloud.com/travis-morton-654696702/mortont-generated-audio
good job!
Another hyperparameter datapoint: https://soundcloud.com/travis-morton-654696702/mortont-generated-audio2
This clip was generated using the following wavenet_params.json
{
"filter_width": 2,
"sample_rate": 16000,
"dilations": [1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
1, 2, 4, 8, 16, 32, 64],
"residual_channels": 32,
"dilation_channels": 32,
"quantization_channels": 256,
"skip_channels": 1024,
"use_biases": true
}
and used a seed to kick off generation (p280_001.wav, specifically). The model was trained for ~12k steps only on speaker p280 and got to a loss of ~2.1 with learning rates of 0.001, 0.0005, and 0.0002 in that order. Sample size was 100k and the receptive field should be 328 ms.
I'm thinking the receptive field is large enough now, but this still doesn't sound nearly as good as the DeepMind examples. Maybe increasing residual, dilation, and skip channels further would help. Or maybe it needs to train on multiple speakers instead of just one. Thoughts?
@mortont: Yeah, I think training on multiple speakers (and using the conditioning technique from the paper) might make a difference. We should also try leaving some of the silence between the samples, as the DeepMind results sound less "busy" than the ones we are generating. Regularization might also make a difference in how well we can train the network.
The paper did mention that training on multiple speakers improved each single speaker's quality.
Here's one with 500 mSec receptive field, as of about 40K training steps (speaker 280). https://soundcloud.com/user-731806733/tensorflow-wavenet-500-msec
I trained with learn rate=0.001 up to about 28K then switched to learn_rate=0.0001. I think it's still improving, and will switch it to learn_rate= 0.00001 in a while.
{
"filter_width": 2,
"sample_rate": 16000,
"dilations": [1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
1, 2, 4, 8, 16, 32, 64, 128, 256, 512],
"residual_channels": 32,
"dilation_channels": 32,
"quantization_channels": 256,
"skip_channels": 1024,
"use_biases": true
}
@jyegerlehner: That's definitely the best sample I've heard so far. Maybe we can improve the result if we try processing the input a bit (e.g. apply compression).
With the 27th September code base, this is what we have got: 5 second sample after 53050 steps on complete VCTK @ 0.001 LR
53050 steps but over complete VCTK corpus with 75000 sample size. Learning rate fixed at 0.001 for now.
Parameters JSON:
{
"filter_width": 2,
"sample_rate": 16000,
"dilations": [1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
1, 2],
"residual_channels": 32,
"dilation_channels": 32,
"quantization_channels": 256,
"skip_channels": 1024,
"use_biases": true
}
@ibab I'll try training on multiple speakers like the paper described and see if that gives any improvement.
And @jyegerlehner, that's the best sample I've heard yet, is it still ~2.0-2.1 loss?
@mortont The loss is around 1.7-1.9.
@mortont , @adroit91 : With regard to multiple speakers: I think it may be important to shuffle the speakers. I think the way it is now we mow down all 400-some files from a single speaker. I'm guessing it forgets about previous speakers by the time it gets to the end of the 400 clips.
Also, when DeepMind did multiple speakers, I think they provided the speaker ID as an input:
..in a multi-speaker setting we can choose the speaker by feeding the speaker identity to the model as an extra input.
which I think might be important to get the transfer learning they alluded to, as the speaker-specific qualities will be driven by the speaker embedding, as in the unnumbered equation at the bottom of page 4:
Interestingly, we found that training on many speakers made it better at modelling a single speaker than training on that speaker alone, suggesting a form of transfer learning.
They were clearly providing the speaker ID as an input when they trained on multiple speakers because they were able to select which speaker to generate the sample for in the samples under the heading "Knowing What To Say" on the blog post.
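A minimal sketch of that conditioned gated unit from the paper, with the dilated convolutions reduced to per-timestep matmuls for brevity; all dimensions are illustrative (109 speakers roughly matches VCTK, but that's an assumption):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_unit(x, h, Wf, Wg, Vf, Vg):
    """Gated activation with global conditioning, following the unnumbered
    equation at the bottom of page 4 of the paper:
        z = tanh(Wf * x + Vf^T h) * sigmoid(Wg * x + Vg^T h)
    where h is the (broadcast) speaker conditioning vector."""
    return np.tanh(x @ Wf + h @ Vf) * sigmoid(x @ Wg + h @ Vg)

n_samples, residual, dilation, n_speakers = 100, 32, 16, 109
rng = np.random.default_rng(1)
x = rng.standard_normal((n_samples, residual))
speaker = np.zeros((1, n_speakers))
speaker[0, 42] = 1.0  # one-hot speaker ID, broadcast over all timesteps

Wf, Wg = rng.standard_normal((2, residual, dilation))
Vf, Vg = rng.standard_normal((2, n_speakers, dilation))
z = gated_unit(x, speaker, Wf, Wg, Vf, Vg)
assert z.shape == (n_samples, dilation)
```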
@ibab I still hold out hope for scalar input.
About comparison with DeepMind's results:
1) I think they trained on paragraphs instead of single sentences, so that the learnt model has breaks between sentences. Ours will never learn that with VCTK because it never sees gaps between sentences in the middle of a clip; in VCTK, every clip is only a single sentence. You can hear those breaks in the DeepMind results.
2) Their recorded speech presumably had more conversational inflection. By contrast, the VCTK speakers seem to be reading text in a monotone, which makes it a bit less interesting.
3) We don't know how many GPU farms they threw at this. If it's anything like AlphaGo, it's a lot, and I don't see any reason why they would hold back here. We might do everything right and not know it, just because what took them a week to train might take us years (assuming we're training on 1 or 2 GPUs).
And does anyone have a clear idea of the "context stack" scheme they are describing section 2.6 of the paper? I'd be interested to know if they were using context stack(s) when generating the samples on their blog.
Has somebody tried training the model with a music dataset? For example with this piano set: http://iwk.mdw.ac.at/goebl/mp3.html
I think it could be a good (and easy) way to check the model.
@jyegerlehner I should have clarified that by training on multiple speakers, I meant implementing what they refer to as "global conditioning", specifically on speaker identity (but could be extended in the future). I agree that shuffling input would help too.
As far as DeepMind's free-form speech generation, they mentioned in section 3.1 that they used the VCTK corpus for their experiments so I'm not sure how they would get paragraphs of speech unless they concatenated the samples or something. Doing that might not be a bad idea though, as a way to force gaps in the speech. I think the question boils down to how much preprocessing of the VCTK corpus they did, if any. You also have a good point about hardware, I think getting close to their results would be sufficient.
@mortont
> As far as DeepMind's free-form speech generation, they mentioned in section 3.1 that they used the VCTK corpus for their experiments so I'm not sure how they would get paragraphs of speech unless they concatenated the samples or something.
Thanks! I completely missed that, obviously. I guess that kills that theory.
@adroit91's results above sound more similar to the DeepMind example in terms of speech to silence ratio. Maybe you've trained it without the silence trimming?
I've opened #104 for music-related discussion.
@ibab @jyegerlehner Good pointers about multiple speakers. I will try to randomize the order. I am at completely default settings. I had reduced the threshold for silence trimming earlier, but for this run I forgot to do even that! It's still at 0.3.
Also, I am looking at how to best enable multi GPU as I have 4 Titan X at hand. Would share code and results as soon as I have them, possibly as my first contribution to this project.
I kicked off a longer sample generation yesterday with a slightly later model and got 38 seconds out by now (not sure if the current fast generation with biases will be compatible with the older models). Sounds like how a child starts gurgling before speech! But, there's pretty much no noise behind them anymore! generated_59800_27Sep.wav.zip
Also observing that after around 80000 steps, the loss dropped to around 1.5 for a bit but grew back to around 2 again. I will also look at silence trimming based on maximum (or maybe RMS) amplitude of the audio to make the threshold adaptable. Also, perhaps look at other ways of describing the loss.
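An RMS-based adaptive threshold along those lines might look like this (the fraction is a placeholder; a sketch, not tested against VCTK):

```python
import numpy as np

def adaptive_trim_threshold(audio, fraction=0.1):
    """A possible adaptive threshold: a fixed fraction of the clip's RMS
    amplitude, so quiet recordings are not trimmed to nothing."""
    rms = np.sqrt(np.mean(audio ** 2))
    return fraction * rms

loud_clip = 0.8 * np.sin(np.linspace(0, 100, 16000))
quiet_clip = 0.05 * np.sin(np.linspace(0, 100, 16000))
# The threshold scales with the material instead of being fixed at 0.3.
print(adaptive_trim_threshold(loud_clip) / adaptive_trim_threshold(quiet_clip))  # ≈ 16
```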
lalaalalalalalaaladlalalalalallalaooninadlalalalaoolalaooieinalalaalaaloosefoualalalalaaalalalalalalaalaallaallalalalalalalalallalalalallaalalalalanskalalalalaslalalalalaajuslalalalanasanisenalalalalalalalalalallalalalalalaalallalalalal...
fyi, this is my generated output from korean corpus(single female speaker). (I haven't raised receptive field size yet).
@nakosung That output file has been generated using the WaveNet model developed in this git? It sounds pretty good for me. Did you use the master branch? What parameters did you use?
I'm having a really hard time getting comprehensible results out of this -- I've tried the default settings and everything recommended here (added more layers, reduced the learning rate, etc.), but the training will be going smoothly one moment, and then the loss will explode and shoot back up to 5. I'm only getting what amounts to white noise or incoherent screeching even after 1,000,000 steps. Any thoughts on what I might be doing wrong?
This is the contents of wavenet_params.json:
{
"filter_width": 2,
"sample_rate": 16000,
"dilations": [1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
1, 2],
"residual_channels": 32,
"dilation_channels": 32,
"quantization_channels": 256,
"skip_channels": 1024,
"use_biases": true
}
[edit] @nakosung Seconding Zeta here. How did you get those results? I haven't had any luck at all.
@Nyrt: The exploding loss is definitely something I've seen as well. I'm still not sure what's causing it, but the following things seem to help:
@ibab Might not the exploding loss be due to not shuffling the audio?
@Nyrt what learning rate are you using?
[Edit] I see you said default values. I think the default learning rate 0.02 is too high. I suggest trying 0.001 initially.
I don't say that the following is the best or good solution, but seems to be working OK: initial lr =0.001, then around 28K iterations drop to 0.0001, and around 50K drop to 0.00002. There's never enough time to explore the hyperparameter space :).
My latest with this is here. I think it sounds pretty good, but generates too much silence. Fussing with the silence trimming threshold seems to be the challenge at the moment. Also, I think you have a very small receptive field (as determined by your dilations).
@Zeta36 @Nyrt I've been playing with hyper parameters. Hyper parameter I used was 'default setting' plus smaller lr rate. After I found plateau on loss curve, I dropped lr-rate by /5, /10 from 0.002. :)
btw, I have a medium-sized network-brewing environment (~20 nodes, 2x GPU), so playing around in hyper-parameter space isn't so painful (with docker -H=remote-host gcr.io/tensorflow/tensorflow:0.10.0-gpu!).
In my experience, I think that i.i.d. input is critical to stable performance, but the current implementation's input doesn't seem to be i.i.d. As @jyegerlehner mentioned above, shuffling would help the performance.
This is the latest result (250ms. loss ~2.5) https://soundcloud.com/nako-sung/wavenet-korean-corpus-female-receptive-250msloss-25
I did test it using a lower learning rate, with the same problem. Hmm. Currently I've changed the activation function and optimizer and am using regularization, but I think I set the regularization too high.
Of note: Given how much success people have been having with learning rate annealing, it might be worth it to build that in as a feature.
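A piecewise-constant schedule matching the rates people report in this thread would be trivial to add (boundaries and rates taken from the comments above; the function name is hypothetical):

```python
def annealed_learning_rate(step, boundaries=(28000, 50000),
                           rates=(0.001, 0.0001, 0.00002)):
    """Piecewise-constant schedule along the lines reported to work here:
    start at 1e-3, drop around 28K steps, drop again around 50K."""
    for boundary, rate in zip(boundaries, rates):
        if step < boundary:
            return rate
    return rates[-1]

assert annealed_learning_rate(0) == 0.001
assert annealed_learning_rate(30000) == 0.0001
assert annealed_learning_rate(60000) == 0.00002
```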
Let's discuss strategies for producing audio samples. When running over the entire dataset, I've so far only managed to reproduce recording noise and clicks.
Some ideas I've had to improve on this:
librosa.