ibab / tensorflow-wavenet

A TensorFlow implementation of DeepMind's WaveNet paper
MIT License

44.1K Sample Rate Strategies #124

Open · chrisnovello opened this issue 7 years ago

chrisnovello commented 7 years ago

Here is a 44.1 kHz sample rate clip, trained ~50k steps on VCTK speaker 280 (with a 100k sample size): https://soundcloud.com/paperkettle/wavenet-441k-sample-rate-vctk-speaker-280-50k-steps

Any suggestions for how to improve it? I'm noticing:

Settings I used were the ones @jyegerlehner posted here: https://github.com/ibab/tensorflow-wavenet/issues/47#issuecomment-250218850

{
"filter_width": 2,
"sample_rate": 44100,
"dilations": [1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
1, 2, 4, 8, 16, 32, 64, 128, 256, 512],
"residual_channels": 32,
"dilation_channels":32,
"quantization_channels": 256,
"skip_channels": 1024,
"use_biases": true
}
jyegerlehner commented 7 years ago

@paperkettle Thanks, interesting. What's your learning-rate history, and what silence trimming did you use?

jyegerlehner commented 7 years ago

this clip sounds like it has a repetitive short window

@paperkettle OK, one observation: the config you cited has a ~500 ms receptive field only at 16 kHz. At 44.1 kHz it would have a ~180 ms receptive window. The receptive field size is determined by the dilations (and filter width = 2), but that fixes the receptive field in terms of number of samples; to convert to wall-clock time you have to take the sample rate into account. I think that probably explains the "short repetitive window" aspect of the result.

Cranking up the dilations to get to 500 ms at 44.1 kHz would require a really deep and (memory-wise) large net. You could try that. I suspect that to reach larger receptive fields (in terms of samples) efficiently, we would need to implement context stacks as described in section 2.6 of the paper.
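
For a rough sense of scale (my own back-of-envelope arithmetic, not from the original comment): 500 ms at 44.1 kHz is 0.5 × 44100 ≈ 22,050 samples of context, while each [1, 2, ..., 512] stack contributes only 1,023 samples, so reaching that window would take roughly 22 such stacks (about 220 dilated layers) rather than the 8 stacks (80 layers) in the config above.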

jyegerlehner commented 7 years ago

@paperkettle Sorry to spam you with so many messages, but another observation: there's a command-line argument sample_size, defaulting to 100000, and the input gets chopped into chunks of that length during training if it is longer. That's a little over 2 seconds at a 44.1 kHz sample rate, and most audio clips are longer than that. I would try turning the chunking off completely by specifying --sample_size 0 on the command line.

shiba24 commented 7 years ago

Thank you for the issue. I am actually facing the same problem, though with a different dataset. So the dilations, filter width, and sampling frequency determine the receptive field in ms: is there a formula for that? Thank you in advance.

lemonzi commented 7 years ago

What's with the clicks? Any idea?

44.1 kHz is not that important for TTS (the energy above 8-10 kHz is very low anyway), but it's good that we benchmark it at that sample rate for when we want to train on music.

It would be good to see if we can achieve decent results at 10 bits as well (1024 levels seems reasonable, 16 bits is probably too much).
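
The bit depth here corresponds to the quantization_channels parameter in the config, which feeds the mu-law companding described in the paper. Below is a minimal NumPy sketch of that encoding (my own illustration, not the repo's TensorFlow implementation); passing 1024 instead of 256 would give the 10-bit variant suggested above.

import numpy as np

def mu_law_encode(audio, quantization_channels=256):
    # Mu-law companding followed by uniform quantization into integer bins.
    mu = quantization_channels - 1
    audio = np.clip(audio, -1.0, 1.0)
    magnitude = np.log1p(mu * np.abs(audio)) / np.log1p(mu)
    signal = np.sign(audio) * magnitude
    # Map [-1, 1] onto the integers {0, ..., mu}.
    return ((signal + 1) / 2 * mu + 0.5).astype(np.int32)

# 8-bit (256 levels) vs. the proposed 10-bit (1024 levels) encoding of a test tone.
tone = np.sin(np.linspace(0, 10, 1000))
encoded_8bit = mu_law_encode(tone)
encoded_10bit = mu_law_encode(tone, quantization_channels=1024)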


jyegerlehner commented 7 years ago

@shiba24

the dilations, filter width and sampling frequency determine the "receptive field" [ms]: Is there any formula for that?

You can work it out with pencil and paper from figure 3 of the paper to build an intuition for it. Or search for "compute_receptive_field_size" in this source file: https://github.com/jyegerlehner/tensorflow-wavenet/blob/skip-receptive-field/wavenet/model.py
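
For reference, a minimal sketch of that arithmetic (my own simplification, assuming filter_width = 2 for the causal layer as well; the repo's compute_receptive_field_size handles the exact edge cases):

def receptive_field_samples(dilations, filter_width=2):
    # Each dilated layer widens the context by (filter_width - 1) * dilation samples;
    # the initial causal convolution and the current sample add the final + filter_width.
    return (filter_width - 1) * sum(dilations) + filter_width

dilations = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512] * 8
samples = receptive_field_samples(dilations)  # 8186 samples for the config above
print(samples / 16000.0)   # ~0.51 s at 16 kHz
print(samples / 44100.0)   # ~0.19 s at 44.1 kHz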

chrisnovello commented 7 years ago

@jyegerlehner — spam away, it's very helpful to get feedback! The learning rate is .001 (I didn't change it over the course of training). Silence trimming is the default.

"I would try turning that off completely by specifying --sample_size 0 on the command-line."

Ah, good, yes, I'll try that. I'm still wrapping my head around how the receptive field, sample rate, sample size, and net behavior relate — I haven't known how to think about the chunking.

I knew I had cut the receptive field and figured that was related to the short-window sound (the current output reminds me of granular synths). As a result, I tried a training pass with a 275000 sample size (not looking at the formula, I just naively scaled the sample size along with my jump from 16000 to 44100). That was when I hit what I think were memory errors (running a GTX 1080, which has 8 GB), though it was late and I will test again and look more closely. It ran successfully at around a 125000 sample size.

Let's say I feed a single long audio file in with a sample_size 0 — won't I bump into the same issue?

If no, under what circumstances would one want to specify a sample size? Just to tune the network into looking at specific lengths of time in relation to the content one is getting it to learn?

Or asked differently, what is the design rationale for chopping up inputs? To be able to dump in a massive folder of varied data and ensure that the net looks at as many different files as possible?

Have some more abstract questions but going to give the paper & codebase some time this evening while I run some more tests before asking.

shiba24 commented 7 years ago

@jyegerlehner thank you very much! :)

nakosung commented 7 years ago

@paperkettle For RNNs, truncated BPTT is common due to GPU memory limits.

jyegerlehner commented 7 years ago

Let's say I feed a single long audio file in with a sample_size 0 — won't I bump into the same issue?

Yes probably. The sample_size "chunking" was put in for memory reasons.

want to specify a sample size

If you don't specify one at all, you get the default which is 100000 samples.

Or asked differently, what is the design rationale for chopping up inputs..? To be able to dump a massive folder of varied data and ensure that it looks at as many different files as possible?

I don't see how chunking helps performance. It probably hurts, because a larger fraction of the training data is seen while the input receptive field is not yet fully filled, and the net is learning to predict a discontinuity in the audio at wherever the first sample of the chunk happens to fall, since there are no preceding samples. So ideally I think we wouldn't chunk at all; you'd always be working on one long continuous stream of data. Except the memory required grows linearly with duration. So I'd make the sample size as large as your memory permits.

A couple of possibilities for overcoming the memory constraint: switch from float32 to float16, so every tensor uses exactly half as much memory. I tried a hack where I naively replaced all dtype=float32 with dtype=float16 and immediately got NaNs. I don't know if that's because fp16 is that much more numerically unstable, or because there's something else in the code I missed that was expecting float32. Also, I think choosing the sgd optimizer uses less memory than adam; it doesn't keep as many copies of the tensors, if I'm not mistaken. That could be another factor of two or so.
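
To make the optimizer point concrete, here is a rough TF 1.x-style sketch (the variable and loss are toy placeholders, not taken from this repo): plain SGD keeps no per-variable slot state, while Adam allocates two extra tensors per trainable variable.

import tensorflow as tf  # TF 1.x API, matching the era of this code

# Toy variable and loss, purely to illustrate optimizer memory overhead.
w = tf.Variable(tf.zeros([10]), name="w")
loss = tf.reduce_sum(tf.square(w - 1.0))

# Plain SGD: gradients only, no extra per-variable state.
sgd_train_op = tf.train.GradientDescentOptimizer(1e-3).minimize(loss)

# Adam: keeps two slot variables (first and second moment estimates) per
# trainable variable, i.e. two extra copies the size of every weight tensor.
adam_train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)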

nakosung commented 7 years ago

Randomly cropping the samples might help keep the truncation from introducing unnecessary side effects. Although we use the same training data every epoch, if the input sequences are cropped at random offsets, the unnecessary 'boundary effect' gets blurred out.
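
A minimal sketch of what that random cropping could look like (illustrative only; not how the repo's audio reader currently slices the input):

import numpy as np

def random_crop(audio, sample_size):
    # Pick a random window of sample_size samples so that chunk boundaries
    # land in different places each time the file is visited.
    if sample_size <= 0 or len(audio) <= sample_size:
        return audio
    start = np.random.randint(0, len(audio) - sample_size + 1)
    return audio[start:start + sample_size]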

ibab commented 7 years ago

Yeah, I'd recommend setting the batch size to 1 and choosing the sample size as large as possible. Cutting the samples to a fixed size is still useful, as it prevents us from crashing when one of the samples is particularly long. batch_size > 1 should come in handy if we want to use things like batch normalization.

chrisnovello commented 7 years ago

Here is another clip using those same settings, except

  • trained/generated using a 22.5k sample rate
  • trained on a dataset of my own voice as large as each individual VCTK entry

https://soundcloud.com/paperkettle/wavenet-babble-test-trained-a-neural-network-to-speak-with-my-voice

I do hear more quality in the top end than in the 16k clips I've generated, and the receptive field is large enough to generate language-like chunks (unlike the 44.1k clips I made, which felt stuck in granular synth / timestretch hiss / reverse reverb territory).

nakosung commented 7 years ago

Could you post the settings you used?


chrisnovello commented 7 years ago

~40k iterations with a sample size of 100000 and a learning rate of .001

{
"filter_width": 2,
"sample_rate": 22500,
"dilations": [1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
1, 2, 4, 8, 16, 32, 64, 128, 256, 512],
"residual_channels": 32,
"dilation_channels": 32,
"quantization_channels": 256,
"skip_channels": 1024,
"use_biases": true,
"scalar_input": false,
"initial_filter_width": 32
}

Zeta36 commented 7 years ago

@paperkettle, you said: "trained on a dataset of my own voice". How well does the model imitate your voice in that clip? And how did you prepare the dataset? Did you record yourself saying the same phrases as one of the VCTK speakers?

chrisnovello commented 7 years ago

@zeta36 The clip reproduces some of the character of my recorded dataset, for sure. It's glitchier and more otherworldly (I suspect a larger receptive field would help).

"Did you record yourself saying the same phrases as one of the VCTK speakers?" — yeah.

I re-recorded a vocal set from VCTK (toward future experiments of training on my voice plus the full dataset). I added some compression and a little de-essing. It was recorded on a $100 condenser mic in a living room (so yes, there is some room sound in my dataset — more than the VCTK, but not that much).

I found the VCTK texts awkward to read (I would never speak in the style they're written in), so the personality of my dataset is pretty announcer-like. I'm planning to sample my own writing and do more passes in the near future.

Zeta36 commented 7 years ago

@paperkettle, some people over in #112 are beginning to develop local conditioning. Your voice could be among the first used to test this feature. I recommend forking @alexbeloi's branch and giving it a try.

Regards.

adamalpi commented 7 years ago

Hello, I have tried several configurations with many datasets, and I've noticed a pattern when training on 44.1 kHz songs versus 16 kHz songs: with the same configuration, the loss differs by about 1. Any guess why that happens?

I have been training with a very low sample size of 16k or 32k, as my GPU runs out of memory with larger values.

Regards.