ibab / tensorflow-wavenet

A TensorFlow implementation of DeepMind's WaveNet paper
MIT License

Generating good audio samples #47

Open ibab opened 7 years ago

ibab commented 7 years ago

Let's discuss strategies for producing audio samples. When running over the entire dataset, I've so far only managed to reproduce recording noise and clicks.

Some ideas I've had to improve on this:

jyegerlehner commented 7 years ago

I've got a laundry list of things I'd like to try and plan to explore the space of possibilities. Here are a few that come to mind.

  1. The paper mentions a ~300 ms receptive field for speech generation at one point. Given our current params, we get closer to 63 ms, if my arithmetic is correct (see the sketch after this list).
  2. Maybe we were a bit too draconian in cutting back dilation channels and residual channels. Bump those up?
  3. Does anyone else feel weird about performing a convolution on a time series where each element of the series is a one-hot vector, like we do at the input? I had thought the quantization into a one-hot/softmax representation was intended only for the output, as the rationale involved avoiding having the learnt distribution put probability mass outside the range of possible values. Encoding the input as one-hot has to at the very least add quantization noise. I'd rather feed the input signal as a single floating-point channel. Then the filters would be more like the digital filters we've always dealt with since the days of yore.
  4. That last 1x1 conv before the output has #channels = average of the dilation channels and the quantization levels. I'd rather make that a configurable number of channels, and bump it up to > quantization levels (e.g. 1024). We're trying to go from a small dense representation to choice from amongst 256 quantization levels, so it's almost like a classification problem where we need to create a complicated decision surface and thus maybe need more decision boundaries from more units, and maybe a bit deeper too.
  5. Issue 48.
  6. PR 39.
  7. Something else I can't remember at the moment.
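
For reference on point 1, here is a back-of-the-envelope sketch of the receptive-field arithmetic. This is a rough estimate: the exact sample count depends on how the initial causal layer is counted, and the helper name is just illustrative.

def receptive_field_ms(dilations, filter_width=2, sample_rate=16000):
    # Each dilated layer with dilation d and filter width w widens the
    # receptive field by (w - 1) * d samples; add the initial causal layer.
    samples = (filter_width - 1) * sum(dilations) + filter_width
    return 1000.0 * samples / sample_rate

# Two stacks of 1..256 (the current params): ~64 ms.
print(receptive_field_ms([2 ** i for i in range(9)] * 2))
# Five stacks of 1..512: ~320 ms, in the ballpark of the paper's ~300 ms.
print(receptive_field_ms([2 ** i for i in range(10)] * 5))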
ibab commented 7 years ago
  1. The ~300ms sounds like 4 or 5 stacks of dilation layers ranging from 1-512.
  2. I've also been thinking that the number of channels can now vary throughout the convolutional stack, as we're not tied to the number of input channels that we have when combining the outputs.
  3. That would conflict with the presence of the gated activation unit and the 1x1 convolution inside each of the layers, right? But I can see how one might build something useful with this idea. Basically, you would extract time-dependent activations at different time scales from the layers, and then feed them through several 1x1 layers to make sense of what you've seen at the different time scales. But you probably wouldn't want to add up the outputs of each layer, as in this architecture.
  4. Yeah, that makes perfect sense. The "postprocessing" layers probably aren't doing a lot at the moment.
jyegerlehner commented 7 years ago

2 I've also been thinking that the number of channels can now vary throughout the convolutional stack

Perhaps you're seeing something I'm not, but I don't see how the channels can vary. The res block's summation forces the input shape to equal the output shape, so num channels can't change. Oh, or perhaps you are saying within a single block, the channels can vary, so long as we end up with the same shape at input and output of any single block?

3 That would conflict with the presence of the gated activation unit and the 1x1 convolution inside each of the layers, right?

Err.. Sorry I probably wasn't expressing clearly what I intended. I don't see any conflict at all. Everything would be exactly as it is now. Except one thing: the input to the net would be n samples x 1 scalar amplitude, instead of n samples x 256 one-hot encoding. The initial causal conv filter will still produce the same number of channels it does now, so nothing downstream would see any impact.
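For illustration, here is a minimal TensorFlow-style sketch of what that input change would look like at the initial causal convolution. The placeholder and variable names are made up for the example, not code from this repo; everything after this first layer is unchanged either way.

import tensorflow as tf

filter_width = 2
residual_channels = 32

# Current scheme: mu-law quantized audio, one-hot encoded -> 256 input channels.
audio_onehot = tf.placeholder(tf.float32, [None, None, 256])
w_onehot = tf.Variable(tf.truncated_normal(
    [filter_width, 256, residual_channels], stddev=0.05))
h = tf.nn.conv1d(audio_onehot, w_onehot, stride=1, padding='VALID')

# Proposed scheme: raw scalar amplitude -> a single input channel.
audio_scalar = tf.placeholder(tf.float32, [None, None, 1])
w_scalar = tf.Variable(tf.truncated_normal(
    [filter_width, 1, residual_channels], stddev=0.05))
h = tf.nn.conv1d(audio_scalar, w_scalar, stride=1, padding='VALID')

Both versions produce [batch, time, residual_channels], so the gated units, 1x1 convolutions, skip connections and the 256-way softmax output are untouched.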

ibab commented 7 years ago

2: Yeah, I was thinking about changing the number that's referred to as dilation_channels in the config file on a per-block basis, but got confused. This would require #48.

4: Ah, so basically we would allow the network to learn its own encoding of the floating point samples. Wouldn't that make the quantization procedure unnecessary?

lemonzi commented 7 years ago

In your one-hot proposal, we would be assuming that the network tries to learn how to perform a quantization. In the current implementation, the network learns how to encode a random variable with a multinomial distribution that has temporal dependencies across trials.

The obvious model that we would all like is using the float value with "classical" filters, but then we need to choose a loss function. The authors said that most loss functions on floats assume a particular distribution of the possible values (I think the squared loss corresponds to a normal distribution), while the multinomial from the one-hot encoding makes no assumptions at the expense of having a finite set of possible values. Apparently, lifting this constraint gives better results despite the quantization noise.

The downside is that the SNR will always be limited by the 8-bit resolution, so at some point we should be able to find a better model -- or scale it up to a one-hot encoding with 60k+ categories.
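
For context, here is a minimal numpy sketch of the mu-law companding and 8-bit quantization being discussed (mu = 255 as in the paper; the function names are illustrative and not necessarily what the repo uses):

import numpy as np

def mu_law_encode(audio, quantization_channels=256):
    # audio: float array in [-1, 1]; returns integer bin indices in [0, 255].
    mu = quantization_channels - 1
    magnitude = np.log1p(mu * np.abs(audio)) / np.log1p(mu)
    signal = np.sign(audio) * magnitude          # companded, still in [-1, 1]
    return ((signal + 1) / 2 * mu + 0.5).astype(np.int32)

def mu_law_decode(output, quantization_channels=256):
    # Inverse operation: bin index -> approximate amplitude in [-1, 1].
    mu = quantization_channels - 1
    signal = 2.0 * output / mu - 1.0
    return np.sign(signal) * np.expm1(np.abs(signal) * np.log1p(mu)) / mu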


jyegerlehner commented 7 years ago

@ibab

Wouldn't that make the quantization procedure unnecessary?

No, we'd still be using 1) the softmax at the output to provide a discrete probability distribution over the 256 quantized values, 2) the quantization procedure for producing target outputs during training, and 3) sampling from the discrete probability distribution of the softmax to produce the output (followed by the inverse of the companding quantization to get it back to a scalar amplitude).
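
In other words, the output-side pipeline stays exactly as it is now. As a self-contained sketch of that generation step (prediction here is just a placeholder for the network's softmax output):

import numpy as np

quantization_channels = 256
mu = quantization_channels - 1

# Placeholder for the softmax output over the 256 quantized values.
prediction = np.full(quantization_channels, 1.0 / quantization_channels)

# 1) Sample a bin index from the discrete distribution.
bin_index = np.random.choice(np.arange(quantization_channels), p=prediction)

# 2) Invert the companding quantization to recover a scalar amplitude in [-1, 1].
y = 2.0 * bin_index / mu - 1.0
next_amplitude = np.sign(y) * np.expm1(np.abs(y) * np.log1p(mu)) / mu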

@lemonzi

In your one-hot proposal...

I sincerely think what I'm proposing is being misunderstood. It's probably my fault :). I understand and like the rationale given for the softmax output in section 2.2 of the paper; I wasn't proposing getting rid of it. I don't have a one-hot proposal; the current code uses a one-hot encoding as an input.

But no matter; at some point perhaps I'll get around to trying what I propose, and will bring it back here if it works any better. Based on my reading of the paper, it does seem more likely that this is what they are doing than what the current code does. I don't have high confidence though, and of course could easily be wrong.

lemonzi commented 7 years ago

This is what I understood:

- floating-point input with raw audio (no mu-law)
- convolution with 1 input channel and N output channels
- N-to-M channel layers
- layer that aggregates the skip connections with 256 outputs
- softmax
- cross-entropy against a mu-law + quantization encoding of the input

It's worth a shot, but the network is no longer an auto-encoder.

BTW, I don't like sampling from the multinomial in generate.py; I'd rather generate floats from a given distribution and quantize them, which is closer to feeding it a given raw audio seed.

@ibab, how about a tag for these "strategical issues" and one issue per idea?

lemonzi commented 7 years ago

@jyegerlehner Or maybe you meant mu-law encoding the input but skipping the quantization to avoid damaging the SNR?

jyegerlehner commented 7 years ago

This is what I understood

OK you understood me well then. Perhaps I was misunderstanding you.

It's worth a shot, but the network is no longer an auto-encoder.

I'm trying to see in what sense it was ever an auto-encoder. I don't think it is/was.

BTW, I don't like sampling from the multinomial in generate.py; I'd rather generate floats from a given distribution and quantize them

Not sure I follow your alternative to the softmax. I was mostly trying to stick to figuring out what the authors had most likely done in their implementation. I bet we all have ideas about different approaches we'd like to try out.

Or maybe you meant mu-law encoding the input but skipping the quantization to avoid damaging the SNR?

No I never thought of that.

how about a tag for these "strategical issues" and one issue per idea?

Right, I feel bad about hijacking ibab's thread. I like the strategic issues tag idea. I prefer not to clutter this thread any more with this one topic.

ibab commented 7 years ago

As the topic of this issue is just a general "What else should we try?" I think the discussion is perfectly fine 👍 But feel free to open new issues to discuss strategies. I can tag them with a "strategy" label.

jyegerlehner commented 7 years ago

We should limit ourselves to a single speaker for now. That will allow us to perform multiple epochs on the train dataset.

Right. I also think when we train on multiple speakers we need to shuffle the clips. I fear we may be experiencing catastrophic forgetting. That little sine-wave unit test I wrote shows how quickly it can learn a few frequencies, which makes me think once it starts getting sentences from a single speaker, it forgets about the pitch and other characteristics of the previous speaker(s).
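
One simple way to do that shuffling (just a sketch; the corpus path is an assumption, and this is not what the current reader code does):

import glob
import random

# Gather clips from all speakers, then shuffle globally so that consecutive
# training examples come from different speakers instead of exhausting one
# speaker's ~400 clips before moving on to the next.
files = glob.glob('./VCTK-Corpus/wav48/*/*.wav')
random.shuffle(files)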

But single-speaker training is less ambitious and an easier first step.

woodshop commented 7 years ago

My two cents:

Scalar input: WaveNet treats audio generation as an autoregressive classification task. The model requires the last step's output to be provided at the input. I don't think there's much to be gained by providing scalar floating point values at the input. They would still need to be reduced to 8-bit resolution (or, as @lemonzi mentions, you'd be asking the model to learn quantization). You might save some computational cycles at the first layer. However, I think the scale of the input would then need to be considered more closely.

Input shuffling: this would probably be very useful.

Silence trimming: Shouldn't the model be allowed to see strings of silent samples? Otherwise it will learn to generate waveforms with more discontinuities. I suggest that the degree of trimming be decided as a function of the size of the receptive field, e.g. truncate silences to no less than 75% of the receptive field (see the sketch below).
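
A rough numpy sketch of that idea (the helper name and the amplitude threshold are placeholders, not the repo's existing trimming code):

import numpy as np

def truncate_silence(audio, receptive_field, threshold=0.01, keep_fraction=0.75):
    # Cap each run of near-silent samples at keep_fraction * receptive_field
    # instead of stripping silence entirely.
    max_run = int(keep_fraction * receptive_field)
    quiet = np.abs(audio) < threshold
    keep = np.ones(len(audio), dtype=bool)
    run_start = None
    for i, q in enumerate(np.append(quiet, False)):
        if q and run_start is None:
            run_start = i
        elif not q and run_start is not None:
            if i - run_start > max_run:
                keep[run_start + max_run:i] = False
            run_start = None
    return audio[keep]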

lemonzi commented 7 years ago

Oh, that makes sense. It's classifying the next sample, not encoding the sequence as a whole.

The trimming is currently applied to the beginning and end of the samples, not to the gaps in between speech. If there are long silent chunks within a sample, what could make sense is to split the clip in two rather than stripping out the silence.

ibab commented 7 years ago

I've just managed to generate a sample that I think sounds pretty decent: https://soundcloud.com/user-952268654/wavenet-28k-steps-of-100k-samples

This is using tanh instead of ReLU to avoid the issue of the ReLU activations eventually cutting the network off. I stopped it at one point to reduce the learning rate from 0.02 to 0.01, but it doesn't look like that had a large impact. I started generating when the loss curve was at about 28k steps.

[Screenshot: training loss curve, 2016-09-21 21:08]

I used only two stacks of 9 dilation layers each:

{
    "filter_width": 2,
    "quantization_steps": 256,
    "sample_rate": 16000,
    "dilations": [1, 2, 4, 8, 16, 32, 64, 128, 256,
                  1, 2, 4, 8, 16, 32, 64, 128, 256],
    "residual_channels": 32,
    "dilation_channels":16,
    "use_biases": false
}
woodshop commented 7 years ago

Nice!

ibab commented 7 years ago

I've noticed that generating from the same model doesn't always produce interesting output. But if I start off with an existing recording, it seems to work very reliably: https://soundcloud.com/user-952268654/bootstrapping-wavenet-with-an-existing-recording

Considering that the receptive field of this network is only ~1000 samples, I think the results sound quite promising.

lemonzi commented 7 years ago

Can you test with argmax instead of random.choice?


ibab commented 7 years ago

@lemonzi: After swapping out random.choice with argmax, it always returns the same value. I think that makes sense, as staying at the same amplitude is the most likely thing to happen at the resolution we work with.

lemonzi commented 7 years ago

Interesting...


ibab commented 7 years ago

When calculating the mean amplitude with

sample = np.int32(np.sum(np.arange(quantization_steps) * prediction))

it just produces noise for me.
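
For reference, here are the three decoding strategies discussed so far, side by side (prediction is a placeholder for the network's softmax output over the 256 bins):

import numpy as np

quantization_steps = 256
bins = np.arange(quantization_steps)
prediction = np.full(quantization_steps, 1.0 / quantization_steps)  # placeholder

sampled = np.random.choice(bins, p=prediction)  # stochastic sampling: works best so far
mode = np.argmax(prediction)                    # argmax: tends to get stuck on one value
mean = np.int32(np.sum(bins * prediction))      # expected value: produced noise here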

ghenter commented 7 years ago

Very cool work, guys! As a text-to-speech person, I am excited to see where this effort may lead.

As far as generating good-sounding output, I believe I have some thoughts to add regarding point 3 in @jyegerlehner's list, on the use of floating point values vs. one-hot vectors for the network inputs. I hope this is the right issue in which to post them.

I met with Heiga Zen, one of the authors of the WaveNet paper, at a speech synthesis workshop last week. I quizzed him quite a bit on the paper when I had the chance. My understanding is that there are two key motivations for using (mu-law companded) one-hot vectors for the single-sample network output:

  1. This turns the problem from a regression task to a classification task. For some reason, DNNs have seen greater success in classification than in regression. (This has motivated the research into generative adversarial networks, which is another hot topic at the moment.) Up until now, most DNN-based waveform/audio generation approaches were formulated as regression problems.
  2. A softmax output layer allows a flexible representation of the distribution of possible output values, from which the next value is generated by sampling. Empirically, this worked better than parametrising the output distribution using GMMs (i.e., a mixture density network).

Note that both these key concerns only are relevant at the output layer, not at the input layer. As far as the input representation goes, scalar floating-point values have several advantages over a one-hot vector discrete representation:

Seeing that WaveNet is based on PixelCNNs, it might be instructive to consider how the latter handle and encode their inputs. There appears to be a working implementation of PixelCNN on GitHub, but I haven't looked sufficiently deeply into it to tell how they encode their input.

jyegerlehner commented 7 years ago

Has everyone been reproducing ibab's results? I got a result similar to his, but I think it sounds a bit smoother; I'm guessing because the receptive field is a little bigger than his.

2 seconds: https://soundcloud.com/user-731806733/speaker-p280-from-vctk-corpus-1

10 seconds: https://soundcloud.com/user-731806733/speaker-280-from-vctk-corpus-2

{
    "filter_width": 2,
    "sample_rate": 16000,
    "dilations": [1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
                  1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
                  1, 2],
    "residual_channels": 32,
    "dilation_channels":32,
    "quantization_channels": 256,
    "skip_channels": 1024,
    "use_biases": true
}

[Edit] After mortont's comment below: I used learning_rate=0.001.

I made a copy of the corpus directory, except I only copied over the directory for speaker p280. I stopped training at about 28K steps, to follow ibab's example. Loss was a bit lower than his, around 2.0-2.1.

I think that to get pauses between words and such, we need a wider receptive field. That's my next step.

By the way, anyone know how to make soundcloud loop the playback, instead of playing music at the end of the clip, like ibab did? Pro account needed for that?

[Edit] Here's one from a model that has about a 250 ms receptive field, trained for about 16 hours: https://soundcloud.com/user-731806733/generated-larger-1

ibab commented 7 years ago

Those results sound great! We should consider linking to them from the README.md to demonstrate what the network can do. It seems likely that we will be able to reproduce the quality of the DeepMind samples with a higher receptive field.

On soundcloud, you can set an audio clip to repeat in the bar at the bottom, but I don't think this will affect other listeners. Not sure why my clip was on repeat by default for you.

mortont commented 7 years ago

This is definitely the best result yet! What commit did you use to achieve this @jyegerlehner? I tried reproducing it using the same hyperparameters and only speaker p280 from the corpus, but my model hasn't gone under a loss of 5 after 26k steps.

jyegerlehner commented 7 years ago

@mortont I'm not sure exactly which commit to this branch it was: https://github.com/jyegerlehner/tensorflow-wavenet/tree/single-speaker But most of the changes are trivial and frankly I don't think it matters. I haven't observed it breaking at any point.

I've started training a newer model with latest updates from master and it is working fine. I don't have any "special sauce" or changes to the code relative to master that I can think of. The only reason for a separate branch for it is to allow me to change the .json file and add shell scripts, and be able to switch back to master without losing files.

I'm trying to imagine why you would have loss stuck at 5 and... can't think of a good reason. Perhaps compare the train.sh and resume.sh in my branch to the command-line arguments you are supplying and see if there's an important difference? Learning rate perhaps?

[Edit]: I observe the loss to start dropping right away, within the first 50 steps. Loss drops to < 3 rapidly well before 1K steps. So if you don't see that, I think something is wrong.

mortont commented 7 years ago

Looks like it was learning rate, I changed it from 0.02 to 0.001 and it's now steadily dropping, thanks!

mortont commented 7 years ago

I also got some good results using the following wavenet_params.json only training on speaker p280:

{
    "filter_width": 2,
    "sample_rate": 16000,
    "dilations": [1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
                  1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
                  1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
                  1, 2],
    "residual_channels": 32,
    "dilation_channels": 32,
    "quantization_channels": 256,
    "skip_channels": 1024,
    "use_biases": true
}

with a learning rate of 0.001. I trained it for ~15k steps with a sample size of 70k: https://soundcloud.com/travis-morton-654696702/mortont-generated-audio

ucasyouzhao commented 7 years ago

good job!

mortont commented 7 years ago

Another hyperparameter datapoint: https://soundcloud.com/travis-morton-654696702/mortont-generated-audio2 This clip was generated using the following wavenet_params.json

{
    "filter_width": 2,
    "sample_rate": 16000,
    "dilations": [1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
                  1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
                  1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
                  1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
                  1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
                  1, 2, 4, 8, 16, 32, 64],
    "residual_channels": 32,
    "dilation_channels": 32,
    "quantization_channels": 256,
    "skip_channels": 1024,
    "use_biases": true
}

and used a seed to kick off generation (p280_001.wav specifically). The model was trained for ~12k steps only on speaker p280 and got to a loss of ~2.1 with learning rates of 0.001, 0.0005, and 0.0002 in that order. Sample size was 100k and receptive field should be 328ms.

I'm thinking the receptive field is large enough now, but this still doesn't sound nearly as good as the DeepMind examples. Maybe increasing residual, dilation, and skip channels further would help. Or maybe it needs to train on multiple speakers instead of just one. Thoughts?

ibab commented 7 years ago

@mortont: Yeah, I think training on multiple speakers (and using the conditioning technique from the paper) might make a difference. We should also try leaving some of the silence between the samples, as the DeepMind results sound less "busy" than the ones we are generating. Regularization might also make a difference in how well we can train the network.

jyegerlehner commented 7 years ago

The paper did mention that training on multiple speakers improved each single speaker's quality.

Here's one with a 500 ms receptive field, as of about 40K training steps (speaker p280): https://soundcloud.com/user-731806733/tensorflow-wavenet-500-msec

I trained with learning_rate=0.001 up to about 28K steps, then switched to 0.0001. I think it's still improving, and I will switch to 0.00001 in a while.

{ "filter_width": 2, "sample_rate": 16000, "dilations": [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512], "residual_channels": 32, "dilation_channels":32, "quantization_channels": 256, "skip_channels": 1024, "use_biases": true }

ibab commented 7 years ago

@jyegerlehner: That's definitely the best sample I've heard so far. Maybe we can improve the result if we try processing the input a bit (e.g. apply compression).

adroit91 commented 7 years ago

With the 27 September code base, this is what we have got: a 5-second sample after 53,050 steps over the complete VCTK corpus with a 75,000 sample size. Learning rate fixed at 0.001 for now.

Parameters.JSON:

{
    "filter_width": 2,
    "sample_rate": 16000,
    "dilations": [1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
                  1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
                  1, 2],
    "residual_channels": 32,
    "dilation_channels": 32,
    "quantization_channels": 256,
    "skip_channels": 1024,
    "use_biases": true
}

mortont commented 7 years ago

@ibab I'll try training on multiple speakers like the paper described and see if that gives any improvement.

And @jyegerlehner, that's the best sample I've heard yet. Is it still at ~2.0-2.1 loss?

jyegerlehner commented 7 years ago

@mortont The loss is around 1.7-1.9.

@mortont, @adroit91: With regard to multiple speakers: I think it may be important to shuffle the speakers. The way it is now, we plow through all 400-some files from a single speaker before moving on. I'm guessing it forgets about previous speakers by the time it gets to the end of those 400 clips.

Also, when DeepMind did multiple speakers, I think they provided the speaker ID as an input:

..in a multi-speaker setting we can choose the speaker by feeding the speaker identity to the model as an extra input.

which I think might be important for getting the transfer learning they alluded to, since the speaker-specific qualities will be driven by the speaker embedding, as in the un-numbered equation at the bottom of page 4:

Interestingly, we found that training on many speakers made it better at modelling a single speaker than training on that speaker alone, suggesting a form of transfer learning.

They were clearly providing the speaker ID as an input when they trained on multiple speakers because they were able to select which speaker to generate the sample for in the samples under the heading "Knowing What To Say" on the blog post.
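
For reference, here is a minimal TensorFlow-style sketch of that global conditioning equation, where a learned projection of a speaker embedding is added inside each gated activation. All names and sizes are illustrative assumptions, not existing code in this repo:

import tensorflow as tf

dilation_channels = 32
embedding_size = 16
num_speakers = 109  # roughly the number of VCTK speakers; adjust as needed

speaker_id = tf.placeholder(tf.int32, [None])
embedding_table = tf.Variable(
    tf.truncated_normal([num_speakers, embedding_size], stddev=0.05))
h = tf.nn.embedding_lookup(embedding_table, speaker_id)  # [batch, embedding_size]

# Per-layer projections of the speaker embedding for the filter and gate paths.
v_filter = tf.Variable(tf.truncated_normal([embedding_size, dilation_channels], stddev=0.05))
v_gate = tf.Variable(tf.truncated_normal([embedding_size, dilation_channels], stddev=0.05))

def gated_activation(filter_out, gate_out, h):
    # filter_out, gate_out: [batch, time, dilation_channels] from the dilated convs.
    # z = tanh(W_f * x + V_f h) * sigmoid(W_g * x + V_g h)
    cond_f = tf.expand_dims(tf.matmul(h, v_filter), 1)  # broadcast over time
    cond_g = tf.expand_dims(tf.matmul(h, v_gate), 1)
    return tf.tanh(filter_out + cond_f) * tf.sigmoid(gate_out + cond_g)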

@ibab I still hold out hope for scalar input.

About comparison with DeepMind's results:

  1. I think they trained on paragraphs instead of single sentences, such that the learnt model has breaks between sentences. Ours will never learn that with VCTK, because it never sees gaps between sentences in the middle of a clip; in VCTK every clip is only a single sentence. You can hear those breaks in the DeepMind results.
  2. Their recorded speech presumably had more conversational inflection in the voices. By contrast, the VCTK speakers seem to be reading text in a monotone, which makes it a bit less interesting.
  3. We don't know how many GPU farms they threw at this. If it's anything like AlphaGo it's a lot, and I don't see any reason why they would hold back here. We might do everything right and not know it, just because what took them a week to train might take us years of training (assuming we're just training on 1 or 2 GPUs).

jyegerlehner commented 7 years ago

And does anyone have a clear idea of the "context stack" scheme they describe in section 2.6 of the paper? I'd be interested to know if they were using context stack(s) when generating the samples on their blog.

Zeta36 commented 7 years ago

Has somebody tried training the model with a music dataset? For example with this piano set: http://iwk.mdw.ac.at/goebl/mp3.html

I think it could be a good (and easy) way to check the model.

mortont commented 7 years ago

@jyegerlehner I should have clarified that by training on multiple speakers, I meant implementing what they refer to as "global conditioning", specifically on speaker identity (but could be extended in the future). I agree that shuffling input would help too.

As far as DeepMind's free-form speech generation, they mentioned in section 3.1 that they used the VCTK corpus for their experiments so I'm not sure how they would get paragraphs of speech unless they concatenated the samples or something. Doing that might not be a bad idea though, as a way to force gaps in the speech. I think the question boils down to how much preprocessing of the VCTK corpus they did, if any. You also have a good point about hardware, I think getting close to their results would be sufficient.

jyegerlehner commented 7 years ago

@mortont

As far as DeepMind's free-form speech generation, they mentioned in section 3.1 that they used the VCTK corpus for their experiments so I'm not sure how they would get paragraphs of speech unless they concatenated the samples or something.

Thanks! I completely missed that, obviously. I guess that kills that theory.

ibab commented 7 years ago

@adroit91's results above sound more similar to the DeepMind example in terms of speech to silence ratio. Maybe you've trained it without the silence trimming?

I've opened #104 for music-related discussion.

adroit91 commented 7 years ago

@ibab @jyegerlehner Good pointers about multiple speakers. I will try to randomize the order. I am at completely default settings. I had reduced the threshold for silence trimming earlier, but for this run I forgot to do that, so it's still at 0.3.

Also, I am looking at how to best enable multi-GPU training, as I have 4 Titan X cards at hand. I'll share code and results as soon as I have them, possibly as my first contribution to this project.

I kicked off a longer sample generation yesterday with a slightly later model and have 38 seconds out by now (not sure if the current fast generation with biases will be compatible with the older models). It sounds like how a child starts gurgling before speech! But there's pretty much no noise behind it anymore! generated_59800_27Sep.wav.zip

I'm also observing that after around 80,000 steps, the loss dropped to around 1.5 for a bit but grew back to around 2 again. I will also look at silence trimming based on the maximum (or maybe RMS) amplitude of the audio to make the threshold adaptable (see the sketch below). Also, I'll perhaps look at other ways of describing the loss.
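
A tiny sketch of what an adaptive threshold could look like (the 5% fraction is an arbitrary placeholder):

import numpy as np

def adaptive_silence_threshold(audio, fraction=0.05):
    # Scale the silence threshold with the clip's overall level, e.g. as a
    # fixed fraction of its RMS amplitude, instead of a hard-coded 0.3 cutoff.
    rms = np.sqrt(np.mean(audio ** 2))
    return fraction * rms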

1j01 commented 7 years ago

lalaalalalalalaaladlalalalalallalaooninadlalalalaoolalaooieinalalaalaaloosefoualalalalaaalalalalalalaalaallaallalalalalalalalallalalalallaalalalalanskalalalalaslalalalalaajuslalalalanasanisenalalalalalalalalalallalalalalalaalallalalalal...

nakosung commented 7 years ago

FYI, this is my generated output from a Korean corpus (single female speaker). (I haven't raised the receptive field size yet.)

https://soundcloud.com/nako-sung/67200a

Zeta36 commented 7 years ago

@nakosung Was that output file generated using the WaveNet model developed in this repo? It sounds pretty good to me. Did you use the master branch? What parameters did you use?

Nyrt commented 7 years ago

I'm having a really hard time getting comprehensible results out of this: I've tried the default settings and everything recommended here (added more layers, reduced the training rate, etc.), but training will be going smoothly one moment, and then the loss will explode and shoot back up to 5. I'm only getting what amounts to white noise or incoherent screeching, even after 1,000,000 steps. Any thoughts on what I might be doing wrong? This is the contents of wavenet_params.json:

{
    "filter_width": 2,
    "sample_rate": 16000,
    "dilations": [1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
                  1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
                  1, 2],
    "residual_channels": 32,
    "dilation_channels": 32,
    "quantization_channels": 256,
    "skip_channels": 1024,
    "use_biases": true
}

generated_samples.tar.gz

[edit] @nakosung Seconding Zeta here. How did you get those results? I haven't had any luck at all.

ibab commented 7 years ago

@Nyrt: The exploding loss is definitely something I've seen as well. I'm still not sure what's causing it, but the following things seem to help:

woodshop commented 7 years ago

@ibab Might not the exploding loss be due to not shuffling the audio?

jyegerlehner commented 7 years ago

@Nyrt what learning rate are you using?

[Edit] I see you said default values. I think the default learning rate 0.02 is too high. I suggest trying 0.001 initially.

I won't claim that the following is the best solution, but it seems to be working OK: initial lr = 0.001, then around 28K iterations drop to 0.0001, and around 50K drop to 0.00002 (a sketch of this schedule is below). There's never enough time to explore the hyperparameter space :).
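
In case someone wants to bake that annealing into the training script, here is the schedule above as a plain sketch (the function name is just illustrative, not an existing feature):

def learning_rate_at(step):
    # Piecewise-constant schedule from this thread, not a built-in feature.
    if step < 28000:
        return 1e-3
    elif step < 50000:
        return 1e-4
    else:
        return 2e-5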

My latest with this is here. I think it sounds pretty good, but generates too much silence. Fussing with the silence trimming threshold seems to be the challenge at the moment. Also, I think you have a very small receptive field (as determined by your dilations).

nakosung commented 7 years ago

@Zeta36 @Nyrt I've been playing with hyperparameters. The hyperparameters I used were the default settings plus a smaller learning rate. After I found a plateau on the loss curve, I dropped the learning rate by /5, /10 from 0.002. :)

By the way, I have a medium-sized network-brewing environment (~20 nodes, 2x GPU each), so playing around the hyperparameter space isn't so painful (with docker -H=remote-host gcr.io/tensorflow/tensorflow:0.10.0-gpu!).

In my experience, i.i.d. input is critical for stable performance, but the current implementation's input doesn't seem to be i.i.d. As @jyegerlehner mentioned above, shuffling would help performance.

This is the latest result (250 ms receptive field, loss ~2.5): https://soundcloud.com/nako-sung/wavenet-korean-corpus-female-receptive-250msloss-25

Nyrt commented 7 years ago

I did test it using a lower learning rate, with the same problem. Hmm. Currently I've changed the activation function and optimizer and am using regularization, but I think I set the regularization too high.

Of note: Given how much success people have been having with learning rate annealing, it might be worth it to build that in as a feature.