ibab / tensorflow-wavenet

A TensorFlow implementation of DeepMind's WaveNet paper
MIT License

Generating good audio samples #47

Open ibab opened 7 years ago

ibab commented 7 years ago

Let's discuss strategies for producing audio samples. When running over the entire dataset, I've so far only managed to reproduce recording noise and clicks.

Some ideas I've had to improve on this:

jyegerlehner commented 7 years ago

@Nyrt So are you training on the full corpus? See the discussion above about single-speaker vs. full-corpus training. I'd expect the error to jump up whenever it switches speakers. I think our good results so far have all been on single-speaker training sets. And I think your receptive field is small.

This is my latest result, which I think is the best I've got so far: about 88K steps, a 500 msec receptive field, a stepwise-decaying learning rate, on speaker 280. Loss is ~1.7-1.9.

https://soundcloud.com/user-731806733/tensorflow-wavenet-500-msec-88k-train-steps

I think we're getting closer to DeepMind's results.
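
(For reference, a stepwise learning-rate decay like the one mentioned above can be wired up in TF 1.x roughly as below; the step boundaries and rates are placeholders for illustration, not the exact schedule used for this sample.)

import tensorflow as tf

# Hypothetical schedule: drop the learning rate by 10x at 30k and 60k steps.
global_step = tf.Variable(0, trainable=False, name='global_step')
learning_rate = tf.train.piecewise_constant(
    global_step,
    boundaries=[30000, 60000],
    values=[1e-3, 1e-4, 1e-5])
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
# train_op = optimizer.minimize(loss, global_step=global_step)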

nakosung commented 7 years ago

@jyegerlehner Have you seen #98? @JesseYang suspects that loss around ~2.0 is due to error induced by zero padding.

rockyrmit commented 7 years ago

@jyegerlehner very encouraging results! I'm curious which result in the WaveNet paper you compared against, and how you compared: the DeepMind samples are clear voices speaking full sentences, while the result you posted is random sound without any meaning. Thanks!

rockyrmit commented 7 years ago

@jyegerlehner what's your email, btw? I'd love to learn more details. xinjie.yu@gmail.com

nakosung commented 7 years ago

@rockyrmit Please refer to #112.

jyegerlehner commented 7 years ago

@rockyrmit I was referring to the samples on the DeepMind blog page under the heading "Knowing What To Say", where the model just generates things that sound like babbling. Those were the only results I meant, not the ones where they do full TTS and condition on a speaker. I don't mean that we're close to reproducing all their results; it was just the easiest first step for us.

As far as learning all the details: I and others have posted our configuration in various threads here. I'm not using any special code. It's all in the open.

rockyrmit commented 7 years ago

thanks @jyegerlehner @nakosung !

an update on this topic: #112

neale commented 7 years ago

Hey guys, I guess this is the thread to start a conversation about music. I've been trying my hand at generating classical music with a 12GB dataset of famous composers (mostly piano). Here's what I got; I can upload the samples if anyone wants. SoundCloud Link

I made several models and tried different amounts of data; more seems to be better, and longer training times obviously pay off.

I noticed a couple of things, besides how functionally terrible the samples are. There's a huge amount of noise that doesn't seem to decrease with training time or data; that might be a failure to model what sounds like discord in classical music.

Also, these models don't seem to converge, strictly speaking. After 30K steps I still see the loss oscillating by more than 2.0 between steps, even with LR decay, so I'm not really sure how to evaluate the model.

Here's my wavenet_params.json

{
    "filter_width": 2,
    "sample_rate": 16000,
    "dilations": [1, 2, 4, 8, 16, 32, 64, 128, 256,
                  1, 2, 4, 8, 16, 32, 64, 128, 256,
                  1, 2, 4, 8, 16, 32, 64, 128, 256,
                  1, 2, 4, 8, 16],
    "residual_channels": 32,
    "dilation_channels": 32,
    "quantization_channels": 256,
    "skip_channels": 512,
    "use_biases": true,
    "scalar_input": false,
    "initial_filter_width": 32
}

That's with a learning rate of 0.001 and SGD. If anyone has any insight, I'd love to talk. I'm excited by this project, especially since it isn't finished yet.

Nyrt commented 7 years ago

Okay, finally some real progress! Using the new default wavenet params, only one speaker, and the momentum optimizer, I've gotten these results:

https://soundcloud.com/underwhelming-force/sets/wavenet-experimentation-iii/s-46gyC

The learning rate started at 0.001 and decreased by an order of magnitude each test. The tests ran for varying numbers of iterations in between; I just stopped whenever the loss appeared to plateau (or when I ran out of memory; those events.out files get huge!).

Time to expand the receptive field and see if results improve.

neale commented 7 years ago

Nice! Which speaker did you train on, @Nyrt? I've noticed some speakers perform better than others.

robinsloan commented 7 years ago

@neale, this is cool! It would be interesting, with this output as a baseline, to now take a bundle of pieces from the same composer -- doesn't have to be a lot -- and train the network on those alone, with the same settings, same procedure, etc. I'd be very curious to hear that output alongside what you've got.

My sense is that training this WaveNet implementation on a large, diverse corpus is going to be tricky until we have a method for "conditioning" and telling the network that A is supposed to sound like B and C, but not as much like Y and Z, etc. Otherwise it just tries to generalize across the entire breadth of what it's hearing, and that's a lot to ask.

Question for everyone: what's the math to compute the length of the receptive field for a given set of params/dilations? I know I should know this but… I do not 😬

jyegerlehner commented 7 years ago

@Nyrt those sound pretty good. What was your loss approximately when you generated those?

And regarding the optimizer: are you using the SGD/momentum optimizer for any particular reason? I haven't seen it learn as fast as Adam or RMSProp, but your results are hard to argue with.
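
(For anyone comparing, the optimizers being discussed are one-liners in TF 1.x; the hyperparameters below are placeholders, not anyone's reported settings.)

import tensorflow as tf

adam = tf.train.AdamOptimizer(learning_rate=1e-3)
rmsprop = tf.train.RMSPropOptimizer(learning_rate=1e-3, momentum=0.9)
sgd_momentum = tf.train.MomentumOptimizer(learning_rate=1e-3, momentum=0.9)
# train_op = sgd_momentum.minimize(loss)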

neale commented 7 years ago

@robinsloan I've reread the paper and realized that classical music is absolutely beyond the ability of the model. With a 300ms receptive field, multiple instruments probably do sound just like what I posted. A more homogeneous dataset is needed; I'm getting together as much solo piano as I can find.

Also, the receptive field size is just the size of the convolution window. In a regular CNN people usually use 3x3 or 5x5 two-dimensional receptive fields. We use 1D conv layers, and with the dilations the receptive field gets sparse: the filter is effectively stretched by [1, 2, 4, 8, ...] with portions of it zeroed out. There's no special math for that :) ; the math would be calculating the parameters introduced by each convolution.
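
(For a concrete picture of the dilation idea, here is a minimal TF 1.x sketch of a single dilated 1-D convolution; it is only an illustration, not this repo's causal_conv implementation, and the shapes are made up.)

import tensorflow as tf

# x: [batch, time, channels]. A width-2 filter with dilation_rate=[4]
# looks at samples t and t-4, so stacking layers with dilations
# 1, 2, 4, 8, ... grows the receptive field exponentially.
x = tf.placeholder(tf.float32, [None, 1000, 32])
filt = tf.get_variable('filt', shape=[2, 32, 32])
y = tf.nn.convolution(x, filt, padding='VALID', dilation_rate=[4])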

neale commented 7 years ago

Also, can someone edify me as to why anything but RMSProp is being used? I thought it would perform the best here.

robinsloan commented 7 years ago

@neale Er, I guess I mean, how do you know your receptive field is 300ms long? I get that it depends on sample rate and dilations, but I don't understand the arithmetic.

mortont commented 7 years ago

@robinsloan the receptive field length (in seconds) is just the sum of your dilations (the receptive field as a unit-less number of samples) divided by your sample rate (in samples per second). So for a wavenet_params.json that looks like this:

{
    "filter_width": 2,
    "sample_rate": 16000,
    "dilations": [1, 2, 4, 8, 16, 32, 64, 128, 256,
                  1, 2, 4, 8, 16, 32, 64, 128, 256,
                  1, 2, 4, 8, 16, 32, 64, 128, 256,
                  1, 2, 4, 8, 16],
    "residual_channels": 32,
    "dilation_channels": 32,
    "quantization_channels": 256,
    "skip_channels": 512,
    "use_biases": true,
    "scalar_input": false,
    "initial_filter_width": 32
}

your receptive field would be the sum of the dilations list (1564) divided by the sampling frequency (16000), giving a receptive field of ~98 ms.
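
As a minimal Python sketch of that arithmetic (the small extra term accounts for the filter width of the convolutions; the exact off-by-a-few details depend on the implementation):

def receptive_field_ms(dilations, sample_rate=16000, filter_width=2):
    # Each dilated layer with a width-2 filter extends the receptive field
    # by its dilation, so the sum of the dilations dominates the count.
    samples = (filter_width - 1) * sum(dilations) + filter_width
    return 1000.0 * samples / sample_rate

dilations = [1, 2, 4, 8, 16, 32, 64, 128, 256] * 3 + [1, 2, 4, 8, 16]
print(receptive_field_ms(dilations))  # ~97.9 ms for the config above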

Somewhat related: @neale, it may be worth trying a receptive field longer than 100 ms, since the best speech samples have been in the 300 ms+ range. The paper mentions fields >1 second for music generation, but I think we could get results better than static if we upped the dilations to 6 blocks of 512 or so.

neale commented 7 years ago

@mortont I would love to, but I can't even fit a full [1..512] stack because I only have a GTX 970 :(

Nyrt commented 7 years ago

@neale I believe it was speaker 266-- I can confirm this once I get back to my big machine.

@jyegerlehner The loss was still oscillating a bit, but the minimum it hit was something around 1.5-1.6. Most of the time it was closer to 1.7-1.8, peaking at 2.

The reason I was using the SGD/momentum optimizer is that it avoids an issue I was having early on where the loss would suddenly explode and the model would start generating white noise. The slower training wasn't really an issue because it was running overnight anyway. I haven't tried RMSProp yet, though.

neale commented 7 years ago

In the quest for ever-better audio, I made some samples that sound a lot better than anything I could get before.

{
    "filter_width": 2,
    "sample_rate": 16000,
    "dilations": [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1014, 2048, 4096,
                  1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096,
                  1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096,
                  1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096,
                  1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096,
                  1, 2, 4, 8, 16, 32, 64],

    "residual_channels": 32,
    "dilation_channels": 32,
    "quantization_channels": 256,
    "skip_channels": 512,
    "use_biases": true,
    "scalar_input": false,
    "initial_filter_width": 32
}

Unmentioned above: RMSProp and an initial LR of 0.001.

Soundcloud Link

I used a receptive field of about 2 s, on some 20 hours of solo piano that I scraped off YouTube. I trained this out to 100k steps, and samples are still generating. Unfortunately, I had to decrease the sample size to 32000 for lack of available memory.

I just grabbed an 8GB 1070, so in a few days I'll triple my model size and try it out.

nakosung commented 7 years ago

I just grabbed some P100s and have trained a large WaveNet. :) (Trained with the current default wavenet_params.json.)

https://soundcloud.com/nako-sung/test

Cortexelus commented 7 years ago

Nako, what was the dataset size (number of files, length of files)? How long did you train for?

nakosung commented 7 years ago

@Cortexelus 158 files, ~200 KB each. ~50k iterations with learning rate annealing. :)

Cortexelus commented 7 years ago

@nakosung at what sample rate?

fehiepsi commented 7 years ago

In the paper, the receptive field is about 300 ms and the sample rate is 16000. So the current default setting is good in my opinion; things like sample size, number of iterations, learning rate, and optimizer we can tweak through experiments to avoid overfitting and get better convergence. To generate good sound, we should consider local conditioning, e.g. on text (as mentioned in Section 3.1 of the paper).
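
(For reference, the conditioning described in the paper amounts to adding a projected conditioning signal inside the gated activation, roughly z = tanh(Wf*x + Vf*h) * sigmoid(Wg*x + Vg*h). Below is a rough TF 1.x sketch of that idea, not code from this repo; causal padding and the upsampling of local features are omitted to keep it short.)

import tensorflow as tf

def gated_unit(x, h, channels=32, dilation=2):
    # x: [batch, time, channels] audio features.
    # h: [batch, cond_channels] global conditioning (e.g. a speaker embedding);
    #    local conditioning on text would use a time-aligned h instead.
    wf = tf.layers.conv1d(x, channels, 2, dilation_rate=dilation, padding='same')
    wg = tf.layers.conv1d(x, channels, 2, dilation_rate=dilation, padding='same')
    vf = tf.expand_dims(tf.layers.dense(h, channels), 1)  # broadcast over time
    vg = tf.expand_dims(tf.layers.dense(h, channels), 1)
    return tf.tanh(wf + vf) * tf.sigmoid(wg + vg)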

nakosung commented 7 years ago

@Cortexelus the default setting (16 kHz).

New sample with silence: https://soundcloud.com/nako-sung/test-3-wav

nakosung commented 7 years ago

I've just added dropout and concat-ELU (https://github.com/ibab/tensorflow-wavenet/pull/184). I hope dropout will help the generator's quality. :)
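
(For anyone unfamiliar, concat-ELU applies ELU to the concatenation of x and -x, doubling the channel count; a minimal sketch of the idea, which may differ from the PR's exact code.)

import tensorflow as tf

def concat_elu(x):
    # Keeps information from both signs of the pre-activation,
    # at the cost of doubling the number of channels.
    return tf.nn.elu(tf.concat([x, -x], axis=-1))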

willjhenry commented 7 years ago

Hi all, I am using the default settings on an EC2 p2.xlarge and am getting 2.7 sec/step. I am pretty certain I saw a post on one of the issues describing 0.5 sec steps. Just wondering if anyone has some tips.

nakosung commented 7 years ago

@willjhenry In my case I got 3.5 sec/step using the KAIST Korean corpus (IBM PowerPC, P100).

Whytehorse commented 7 years ago

Is the goal here to produce random speech? If so, it sounds like nakosung nailed it. What about text-to-speech? I'd love to start using this in OpenAssistant for our TTS. Please help us; Festival is so poor... https://github.com/vavrek/openassistant

jyegerlehner commented 7 years ago

@Whytehorse

Generating random speech is a step on the way to generating speech conditioned on what you want it to say (i.e. generating non-random speech). Please see the WaveNet paper.

I don't think this project, at least in its current form, is a candidate for your TTS solution: 1) there's no local conditioning on the desired speech implemented in the repo yet, and 2) even if there were, it doesn't generate audio in real time; it takes M seconds to generate N seconds of audio, where M >> N.

jyegerlehner commented 7 years ago

@willjhenry @nakosung On the time per step: the code in the master of this repo produces a training batch by pasting together consecutive audio clips from the corpus until the length is at least sample_size=100000 samples. I think this is wrong, because it trains the net to predict a discontinuity wherever it transitions from one clip to the next. This is fixed in both the global-conditioning PR and koz4k's PR. Since most of the VCTK corpus clips are much shorter than 100000 samples, these branches also produce faster step times, as the number of samples in a training batch tends to be smaller than what you get with the ibab master.
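
(Roughly, the behaviour being described looks like the sketch below; this is only an illustration of the issue, not the repo's actual audio_reader code.)

import numpy as np

def batches_by_concatenation(clips, sample_size=100000):
    # Paste consecutive clips together until the buffer reaches sample_size.
    # The model then also has to predict the artificial jump at every clip
    # boundary, which inflates the loss.
    buffer = np.zeros(0, dtype=np.float32)
    for clip in clips:
        buffer = np.concatenate([buffer, clip])
        while len(buffer) >= sample_size:
            yield buffer[:sample_size]
            buffer = buffer[sample_size:]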

Having said all that, I was getting about 0.5-1.0 second steps on my branch, on Titan XP.

belevtsoff commented 7 years ago

@nakosung That's a really cool sample you've got there: https://soundcloud.com/nako-sung/test-3-wav. I noticed that your model produces very smooth-sounding vowels. I trained my model for 50k+ steps (current default parameters) and still get considerable tremor in the vowels: https://soundcloud.com/belevtsoff/wavenet_54k_audiobook. My training corpus is an audiobook with 1+ hours of clean speech. Also, I use a RandomShuffleQueue for feeding the input data. Can you think of a possible reason for this poor quality?

Whytehorse commented 7 years ago

Can't you just use Google or Siri to produce the corpus? Through an API you could send a word as text and get back audio; there's your training data. Eventually you could use longer and longer sentences until it can handle anything.

AlvinChen13 commented 7 years ago

@nakosung Have you trained on multiple GPUs across multiple nodes? The default training just runs on one GPU. Would you mind sharing how to train it on multiple nodes?

nakosung commented 7 years ago

@AlvinChen13 Multi-node training didn't scale well; I think the bottleneck is network bandwidth (although I didn't test ASGD). I switched to a Titan XP, which has more memory, so I stopped the multi-node/multi-GPU experiments.

AlvinChen13 commented 7 years ago

@nakosung Would you mind sharing your distributed code? Our lab has 8 nodes with 2 M40s each and 4 nodes with 2 K40s each, all connected to a 40G switch.

I believe multi-node training is necessary if we train on huge datasets with bigger receptive fields, so it is worth some effort to investigate. Google claimed that TF 1.0 can achieve a 58x performance improvement with 64 GPUs for Inception v3.
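
(For anyone who wants to experiment with multi-node training anyway, a minimal between-graph replication skeleton in TF 1.x looks roughly like this; the host names, ports, and job layout are placeholders, and none of this is code from the repo.)

import tensorflow as tf

# Hypothetical one-PS, two-worker cluster for illustration.
cluster = tf.train.ClusterSpec({
    'ps': ['node0:2222'],
    'worker': ['node1:2222', 'node2:2222'],
})
server = tf.train.Server(cluster, job_name='worker', task_index=0)

with tf.device(tf.train.replica_device_setter(cluster=cluster)):
    # Build the WaveNet graph and a train_op here.
    global_step = tf.train.get_or_create_global_step()

with tf.train.MonitoredTrainingSession(master=server.target,
                                       is_chief=True) as sess:
    pass  # run the training loop here, e.g. sess.run(train_op)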