Open veqtor opened 7 years ago
I think that the loss explodes when reaching the next file mainly because the loss includes the first receptive field. I have been thinking about this behavior for a few days. Some supporting evidence:
It seems that I should run a test on this. Any suggested ideas for testing? Is it right that I should calculate the total number of dilations + 1 (for the first causal dilation), then slice the prediction by this number when calculating the loss?
Isn't the receptive field fed with audio from before the sample size?
In the definition of causal_conv, we pad zeros at the beginning. The first receptive field of the output is obtained by using these zeros to make predictions. Though we pad zeros at each layer, you can think (not exactly) of the whole network as one convolutional neural net with filter_width = receptive_field, where we pad receptive_field zeros at the beginning of the input.
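The idea above can be sketched as follows. This is a minimal NumPy illustration (the helper name is hypothetical, not the repo's actual loss code): drop the first receptive_field timesteps before computing the loss, since those predictions were made from zero padding only.

```python
import numpy as np

def trim_receptive_field(predictions, targets, receptive_field):
    """Drop the first `receptive_field` timesteps of both predictions and
    targets, so the loss ignores outputs that only saw zero padding."""
    return predictions[:, receptive_field:], targets[:, receptive_field:]

# Toy example: batch of 1, 10 timesteps, receptive field of 4.
preds = np.arange(10).reshape(1, 10)
targs = np.arange(10).reshape(1, 10)
p, t = trim_receptive_field(preds, targs, 4)
print(p.shape)  # (1, 6)
```

In the real network the same slicing would be applied to the logits and the one-hot targets along the time axis before the cross-entropy is averaged.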
Good news: by adapting the above idea, after 50 steps I get the loss down to the level the old behavior reached after 1000 steps, and the loss is still decreasing well. Right now, at 150 steps, the loss is around 3.0, which is pretty good. I will let it run for about 20000 steps, which is equivalent to 2 epochs because I am training on data from 2 speakers of the VCTK corpus (to test the global_condition implemented by @jyegerlehner with the option scalar_input = True). I hope this will point us in the right direction.
PR #168 shuffles the files during training.
Hi @jyegerlehner, can you give some information about the loss (and at what time step?) you expect to reach with the scalar_input option?
@fehiepsi The blue line on the bottom is the current scalar_input experiment.
It seems good! Right now some bugs appear when I generate using scalar input (even with the slow method). I will test the above idea again without scalar input while I figure out what causes the bug.
@jyegerlehner: Do you observe that a training step with scalar_input = True is 5 times slower than without scalar input? In my case, not using it gives me 0.5 s/step, while using scalar input takes me 2.5 s/step.
@fehiepsi: I'm not seeing that slowdown. I'm seeing < 1 sec per step using scalar input. Are you sure the only difference between the slow and fast training is scalar_input = True?
Yes, I observed it even before working on your global condition pull request. I noticed that when using scalar input, TensorFlow does not raise pool_size_limit_ as often as when not using it, so I guess that is the problem. You have a fast GPU, so I believe it is normal to see < 1 s/step with scalar input in your case. How about your training speed without scalar input? I am just curious; I have no intention of considering the difference in my case a bug. :)
Hi @fehiepsi
OK, I ran the experiment where I trained both with scalar_input=true and scalar_input=false and got the following results:
scalar_input=true: 0.55 sec/step
scalar_input=false: 0.56 sec/step
where the time-per-step is averaged over a screen-full of time/step values.
One thing that might be different is that I'm running a version of tensorflow pulled from the master nightly builds in the last few days (namely, tensorflow commit 7e94b1ee27cb4e009f8bbee1f230c7ca3adfccf3, neither 0.10 nor 0.11). I did notice there seemed to be a difference in behavior between it and the 0.10 version I had been running; the newer one seems faster. Also, one needs to wait a few minutes until the heap manager stops increasing the pool limit size.
@jyegerlehner: Thank you! Never mind about the speed; I updated TensorFlow from 0.11rc2 to 0.11 and got worse performance >"< (I'm not willing to try a nightly build because I just want things to stay stable). Anyway, don't mind it. :)
As I observed in both cases, scalar_input=False and scalar_input=True, the loss decreases faster if we do not use the first receptive field for calculating the loss (I wish I had been more disciplined about naming my training models so I could reproduce this; right now my folders are a mess, so I can't find the runs to plot a comparison). As you mentioned regarding #118, I have thought about it. I did not disable the trimming method (or decrease the trimming threshold...), but I think those are also good ways to make the WaveNet model recognize silence. In ByteNet, the authors also remove the first receptive field. Audio has a much larger receptive field than text, so the prediction for the first receptive field with 0 padding is worse (? not so sure), so we should not use it. What do you think?
In addition to removing the first receptive field from the loss, I made the following change to avoid mixing global conditions: use the condition len(buffer) > 10000, and reset the buffer to empty for each new audio file. We don't use the last chunk of an audio file with len(buffer) < 10000 (for efficiency and also to avoid mixing global conditions). We can of course use a number other than 10000.
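The per-file buffering described above could look roughly like this. This is a sketch with hypothetical names, not the repo's actual audio_reader code: the buffer is reset at each new file, so no training piece spans two files (and thus no piece mixes two global-condition ids), and any tail shorter than the threshold is dropped.

```python
MIN_PIECE = 10000  # the threshold mentioned above; any value works

def pieces_from_files(files, piece_len=MIN_PIECE):
    """Yield fixed-length training pieces, never crossing file boundaries."""
    for audio in files:            # audio: 1-D sequence of samples
        buffer = []                # reset for every new file
        for sample in audio:
            buffer.append(sample)
            if len(buffer) > piece_len:
                yield buffer[:piece_len]
                buffer = buffer[piece_len:]
        # leftover shorter than piece_len is discarded, so pieces from
        # different files (different global conditions) are never mixed
```

For example, a 25000-sample file with piece_len=10000 yields two pieces and drops the final ~5000 samples.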
@jyegerlehner wow, what rig gets you 0.55 sec/step, and what are the hyperparameters?
@veqtor
i7-5930K quad-channel ddr4, 1 x titan-xp
Another thing that makes my times faster is that I'm working in a branch where not every training batch item is exactly sample_size samples. Instead, it's usually the length of the file, which is usually less than 100000 (the default sample_size). That makes each step faster on average.
The 0.55 sec/step time corresponds to:
{
"filter_width": 2,
"sample_rate": 16000,
"dilations": [1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
1, 2, 4, 8, 16, 32, 64, 128, 256, 512],
"residual_channels": 32,
"dilation_channels": 32,
"quantization_channels": 256,
"skip_channels": 1024,
"use_biases": true,
"scalar_input": false [or could be true],
"initial_filter_width": 2,
"residual_postproc": false
}
and --gc_channels=32
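For reference, the receptive field implied by the hyperparameters above can be computed as follows. This is a sketch of the usual WaveNet receptive-field arithmetic (each width-2 dilated filter adds `dilation` samples, plus one for the current sample, plus the initial causal filter); the repo's own calculation may differ slightly.

```python
# Hyperparameters from the config posted above.
dilations = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512] * 6
filter_width = 2
initial_filter_width = 2

receptive_field = (filter_width - 1) * sum(dilations) + 1
receptive_field += initial_filter_width - 1
print(receptive_field)  # 6140 samples, i.e. ~0.38 s at 16 kHz
```

So with this config, the "first receptive field" that the discussion above proposes to exclude from the loss is about 6140 samples long.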
@jyegerlehner, I'm training on a Tesla K80, but it seems a little slow. The average time for one mini-batch is 2 seconds per step, and I'm using the default configuration of this WaveNet.
From nvidia-smi, it seems that GPU usage is already 100%:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.57                 Driver Version: 367.57                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 8108:00:00.0    Off  |                    0 |
| N/A   83C    P0    88W / 149W | 10952MiB / 11439MiB  |    100%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0    126311     C  python                                       10948MiB |
+-----------------------------------------------------------------------------+
Do you think this speed makes sense?
@weixsong, I recently tested on an AWS p2.xlarge instance. In fact it provides half of a Tesla K80, so technically it is a Tesla K40. :) 2 seconds per step seems OK. You will get similar speed on a GTX 1070. I guess the reason the Tesla is slower than expected is that it has a rather old architecture (Kepler in the Tesla vs. Pascal in the 1070).
Currently, when training on long audio files (music), training overfits after a while and the loss then explodes when reaching the next file. This can be seen in this screenshot from TensorBoard:
If we make the audio reader shuffle the segments used for training, we would avoid training moving into a dead end, and instead push it toward a universal solution that works for all the training material.
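The shuffling proposed above could be sketched like this. This is a minimal illustration with hypothetical names (not the repo's actual audio_reader): shuffle the file order each pass, cut each file into segments, then shuffle the segments too, so training never sees one long file's segments back to back.

```python
import random

def shuffled_segments(files, cut_into_segments, seed=None):
    """Return training segments interleaved across all files.

    files: sequence of file identifiers (or loaded audio arrays).
    cut_into_segments: callable mapping one file to a list of segments.
    """
    rng = random.Random(seed)
    files = list(files)
    rng.shuffle(files)                 # different file order each epoch
    segments = []
    for f in files:
        segments.extend(cut_into_segments(f))
    rng.shuffle(segments)              # interleave segments across files
    return segments
```

With this, consecutive training steps draw from different files, so the loss no longer spikes at every file boundary the way the screenshot shows.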