Open veqtor opened 7 years ago
I think that the loss explodes when reaching the next file mainly because the loss includes the first receptive field. I have been thinking about this behavior for a few days. Some supporting evidence:
It seems that I should run a test on this. Any suggested ideas for testing? Is it right that I should calculate the total number of dilations + 1 (for the first causal dilation), then slice the prediction by this number when calculating the loss?
Isn't the receptive field fed with audio from before the sample size?
In the definition of causal_conv, we pad zeros at the beginning. The first receptive field of the output is obtained by using these zeros to make predictions. Though we pad zeros at each layer, you can think (not exactly) of the whole network as one convolutional neural net with filter_width = receptive_field, where we pad receptive_field zeros at the beginning of the input.
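The idea above can be sketched as follows. This is a minimal NumPy illustration (the helper name is hypothetical, not the repo's actual loss code): drop the first receptive_field timesteps before computing the loss, since those predictions were made from zero padding only.

```python
import numpy as np

def trim_receptive_field(predictions, targets, receptive_field):
    """Drop the first `receptive_field` timesteps of both predictions and
    targets, so the loss ignores outputs that only saw zero padding."""
    return predictions[:, receptive_field:], targets[:, receptive_field:]

# Toy example: batch of 1, 10 timesteps, receptive field of 4.
preds = np.arange(10).reshape(1, 10)
targs = np.arange(10).reshape(1, 10)
p, t = trim_receptive_field(preds, targs, 4)
print(p.shape)  # (1, 6)
```

In the real network the same slicing would be applied to the logits and the one-hot targets along the time axis before the cross-entropy is averaged.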
Good news: by adapting the above idea, after 50 steps I get the loss down to the level the old behavior reached after 1000 steps, and the loss is still decreasing well. Right now, at 150 steps, the loss is around 3.0, which is pretty good. I will let it run for about 20000 steps, which is equivalent to 2 epochs because I am training on data from 2 speakers of the VCTK corpus (to test the global_condition implemented by @jyegerlehner with the option scalar_input = True). I hope this will point us in the right direction.
PR #168 shuffles the files during training.
Hi @jyegerlehner, can you give some information about the loss (and at what time step?) you expect to reach with the scalar_input option?
@fehiepsi The blue line on the bottom is the current scalar_input experiment.
It seems good! Right now some bugs appear when I generate using scalar input (even with the slow method). I will test the above idea again without scalar input while I figure out what causes the bug.
@jyegerlehner: Do you observe that a training step with scalar_input = True is 5 times slower than without scalar input? In my case, not using it gives me 0.5 s/step, while using scalar input takes me 2.5 s/step.
@fehiepsi: I'm not seeing that slowdown. I'm seeing < 1 sec per step using scalar input. Are you sure the only difference between the slow and fast training is scalar_input = True?
Yes, I observed it even before working on your global condition pull request. I noticed that when using scalar input, TensorFlow does not raise pool_size_limit_ as often as when not using it, so I guess that is the problem. You have a fast GPU, so I believe it is normal to see < 1 s/step with scalar input in your case. How about your training speed without scalar input? I am just curious; I have no intention of considering the difference in my case a bug. :)
Hi @fehiepsi
OK, I ran the experiment where I trained both with scalar_input=true and scalar_input=false and got the following results:
scalar_input=true: 0.55 sec/step
scalar_input=false: 0.56 sec/step
where the time-per-step is averaged over a screen-full of time/step values.
One thing that might be different is that I'm running a version of tensorflow pulled from the master nightly builds in the last few days (namely, tensorflow commit 7e94b1ee27cb4e009f8bbee1f230c7ca3adfccf3, neither 0.10 nor 0.11). I did notice there seemed to be a difference in behavior between it and the 0.10 version I had been running; the newer one seems faster. Also, one needs to wait a few minutes until the heap manager stops increasing the pool limit size.
@jyegerlehner: Thank you! Never mind about the speed; I updated TensorFlow from 0.11rc2 to 0.11 and got worse performance >"< (I'm not willing to try a nightly build because I just want things to stay stable). Anyway, don't mind it. :)
As I observed in both cases, scalar_input=False and scalar_input=True, the loss decreases faster if we do not use the first receptive field for calculating the loss (I wish I had been more disciplined about naming my training models so I could reproduce this; right now my folders are a mess, so I can't find the runs to plot a comparison). As you mentioned regarding #118, I have thought about it. I did not disable the trimming method (or decrease the trimming threshold...), but I think those are also good ways to make the WaveNet model recognize silence. In ByteNet, the authors also remove the first receptive field. Audio has a much larger receptive field than text, so the prediction for the first receptive field with 0 padding is worse (? not so sure), so we should not use it. What do you think?
In addition to removing the first receptive field from the loss, I made the following change to avoid mixing global conditions: use the condition len(buffer) > 10000, and reset the buffer to empty for each new audio file. We don't use the last chunk of an audio file with len(buffer) < 10000 (for efficiency and also to avoid mixing global conditions). We can of course use a number other than 10000.
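The per-file buffering described above could look roughly like this. This is a sketch with hypothetical names, not the repo's actual audio_reader code: the buffer is reset at each new file, so no training piece spans two files (and thus no piece mixes two global-condition ids), and any tail shorter than the threshold is dropped.

```python
MIN_PIECE = 10000  # the threshold mentioned above; any value works

def pieces_from_files(files, piece_len=MIN_PIECE):
    """Yield fixed-length training pieces, never crossing file boundaries."""
    for audio in files:            # audio: 1-D sequence of samples
        buffer = []                # reset for every new file
        for sample in audio:
            buffer.append(sample)
            if len(buffer) > piece_len:
                yield buffer[:piece_len]
                buffer = buffer[piece_len:]
        # leftover shorter than piece_len is discarded, so pieces from
        # different files (different global conditions) are never mixed
```

For example, a 25000-sample file with piece_len=10000 yields two pieces and drops the final ~5000 samples.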
@jyegerlehner wow, what rig gets you 0.55 sec/step, and what are the hyperparameters?
@veqtor
i7-5930K quad-channel ddr4, 1 x titan-xp
Another thing that makes my times faster is that I'm working in a branch where not every training batch item is exactly sample_size samples. Instead, it's usually the length of the file, which is usually less than 100000 (the default sample_size). That makes each step faster on average.
The 0.55 sec/step time corresponds to:
{
"filter_width": 2,
"sample_rate": 16000,
"dilations": [1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
1, 2, 4, 8, 16, 32, 64, 128, 256, 512],
"residual_channels": 32,
"dilation_channels": 32,
"quantization_channels": 256,
"skip_channels": 1024,
"use_biases": true,
"scalar_input": false [or could be true],
"initial_filter_width": 2,
"residual_postproc": false
}
and --gc_channels=32
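For reference, the receptive field implied by the hyperparameters above can be computed as follows. This is a sketch of the usual WaveNet receptive-field arithmetic (each width-2 dilated filter adds `dilation` samples, plus one for the current sample, plus the initial causal filter); the repo's own calculation may differ slightly.

```python
# Hyperparameters from the config posted above.
dilations = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512] * 6
filter_width = 2
initial_filter_width = 2

receptive_field = (filter_width - 1) * sum(dilations) + 1
receptive_field += initial_filter_width - 1
print(receptive_field)  # 6140 samples, i.e. ~0.38 s at 16 kHz
```

So with this config, the "first receptive field" that the discussion above proposes to exclude from the loss is about 6140 samples long.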
@jyegerlehner, I'm training on a Tesla K80, but it seems a little slow. The average time for one mini-batch is 2 seconds per step, and I'm using the default configuration of this WaveNet.
From nvidia-smi, it seems that GPU usage is already 100%:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.57                 Driver Version: 367.57                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 8108:00:00.0    Off  |                    0 |
| N/A   83C    P0    88W / 149W | 10952MiB / 11439MiB  |    100%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0    126311     C  python                                       10948MiB |
+-----------------------------------------------------------------------------+
Do you think this speed makes sense?
@weixsong, I recently tested on an AWS p2.xlarge instance. In fact it provides half of a Tesla K80, so technically it is a Tesla K40. :) 2 seconds per step seems OK. You will get similar speed on a GTX 1070. I guess the reason the Tesla is slower than expected is that it has a rather old architecture (Kepler in the Tesla vs. Pascal in the 1070).
Currently, when training on long audio files (music), training overfits after a while and the loss then explodes when reaching the next file. This can be seen in this screenshot from TensorBoard:
If we make the audio reader shuffle the segments used for training, we would avoid training moving into a dead end, and instead push it toward a universal solution that works for all the training material.
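The shuffling proposed above could be sketched like this. This is a minimal illustration with hypothetical names (not the repo's actual audio_reader): shuffle the file order each pass, cut each file into segments, then shuffle the segments too, so training never sees one long file's segments back to back.

```python
import random

def shuffled_segments(files, cut_into_segments, seed=None):
    """Return training segments interleaved across all files.

    files: sequence of file identifiers (or loaded audio arrays).
    cut_into_segments: callable mapping one file to a list of segments.
    """
    rng = random.Random(seed)
    files = list(files)
    rng.shuffle(files)                 # different file order each epoch
    segments = []
    for f in files:
        segments.extend(cut_into_segments(f))
    rng.shuffle(segments)              # interleave segments across files
    return segments
```

With this, consecutive training steps draw from different files, so the loss no longer spikes at every file boundary the way the screenshot shows.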