microsoft / CNTK

Microsoft Cognitive Toolkit (CNTK), an open source deep-learning toolkit
https://docs.microsoft.com/cognitive-toolkit/

DNN FrameMode #728

Closed: yeli7289 closed this issue 7 years ago

yeli7289 commented 8 years ago

I am trying to implement a speech-denoising regression framework with a DNN and an RNN, and I have discovered a strange thing. I use blockRandomize and set up the scp file with the length of each speech utterance. The following is my scp file (attached as an image).
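For reference, each line of an scp file in this style follows the HTK archive notation `logicalName=physicalPath[firstFrame,lastFrame]`; the entries below are made-up placeholders, not my actual data:

```
# hypothetical scp entries; the bracketed range gives inclusive frame indices
utt001=noisy/utt001.feats[0,468]
utt002=noisy/utt002.feats[0,521]
utt003=noisy/utt003.feats[0,333]
```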

When I set frameMode=false, the training error is much higher than when I set it to true. The following is a portion of my config file (attached as an image), along with the result I got (also an image).
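For reference, the reader portion of a config of this kind might look roughly as follows (the parameter names are those of HTKMLFReader; the stream names, paths, and dimensions are placeholders, not my exact setup):

```
reader = [
    readerType = "HTKMLFReader"
    readMethod = "blockRandomize"  # randomize in blocks rather than via a rolling window
    frameMode = false              # false keeps utterances intact (needed for RNNs)
    noisyFeatures = [
        dim = 257                  # placeholder feature dimension
        scpFile = "noisy.scp"
    ]
    cleanFeatures = [              # regression target: a second feature stream
        dim = 257
        scpFile = "clean.scp"
    ]
]
```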

If I simply change to frameMode=true, the result is as shown in the attached image.

One thing I don't understand is numMBsToShowResult: it is supposed to log every 2*200 = 400 frames. When frameMode=true, it does. But when I set frameMode to "false", the logged count doesn't even come close to 400. Am I misunderstanding a setting in CNTK?

Since I want to extend my model to an RNN, I will have to set frameMode=false, so this problem really bothers me.

Asking @frankseide @dongyu888 for help. Thanks a lot.

frankseide commented 8 years ago

First of all, it is common that frame mode converges faster--often much faster. The reason is that the frames within a minibatch are decorrelated, whereas in sequence mode, frames are very similar to their neighbors and hence redundant, and thus of much less value for a model update. But I am still surprised that it makes such a difference.

In the frame-mode version, it seems that every minibatch has only 2 samples. That is very small. It also explains the low speed, since small minibatches typically do not load the GPU fully. A better setting would be ~256 at the start, which you can grow later.
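In the config, minibatchSize accepts a per-epoch schedule written with colons, so "start small and grow later" can be expressed directly; a sketch (the particular values are just an example):

```
SGD = [
    # 256 samples in epoch 1, 512 in epoch 2, 1024 from epoch 3 onward
    minibatchSize = 256:512:1024
]
```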

In sequence mode, you still specify the number of samples, NOT the number of sequences. However, since utterance lengths are variable, this number cannot be fulfilled precisely by the CNTK reader. The rule is this: the reader packs whole utterances into a minibatch until adding the next one would exceed the requested number of samples, and an utterance that exceeds the limit all by itself still forms a minibatch of its own.
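To illustrate the packing with some made-up utterance lengths and a requested minibatch size of 600 samples:

```
# requested minibatchSize = 600 (samples, not sequences)
# utt A: 470 frames -> running total 470 (still under 600, keep packing)
# utt B: 320 frames -> 470 + 320 = 790 > 600, so the minibatch closes with utt A only
# utt C: 700 frames -> exceeds 600 all by itself, but still forms its own minibatch
```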

For you, since this is speech, typical utterances are at least a few seconds long, and at the usual 10 ms frame shift one second is 100 samples. So every single utterance will exceed the limit of 2 frames (no speech utterance is only 20 ms long). Thus, every minibatch will consist of exactly one utterance.

Both runs you showed have processed the same number of frames. We must now examine why one converged so well and the other has not. One reason is the correlation of samples within a sequence, as mentioned above. I do not see any other obvious problem. The learning rate looks right (although I strongly recommend specifying it as learningRatePerSample instead). I can only guess that an MB size of ~470 is simply too large at the beginning of training, especially since so many of the samples are correlated.

One thing you can try is to specify a momentum parameter (momentumAsTimeConstant = 2500), which may counteract large jumps of the model. The other thing to try would be to reduce the learning rate, e.g. by a factor of 10, for the sequence mode. But that would also slow down learning quite a bit.
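In config terms, those two suggestions would look like this (the momentum value is the one suggested above; the learning rate shown is only a placeholder for a tenth of whatever your current run uses):

```
SGD = [
    momentumAsTimeConstant = 2500     # momentum expressed as a time constant in samples
    learningRatesPerSample = 0.00001  # placeholder: your original rate divided by 10
]
```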

You could also run a contrast experiment with frame mode where you set the minibatch size to 470, and see whether it converges as well as it did with minibatch size 2. If not, then we know the problem is the MB size.
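That contrast run changes only two settings relative to the sequence-mode config (a sketch; everything else stays fixed):

```
reader = [
    frameMode = true      # frame-level randomization again
]
SGD = [
    minibatchSize = 470   # match the ~470-sample minibatches seen in sequence mode
]
```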

numMBsToShowResult is only a logging parameter. It does not affect training, only how often partial criterion values are printed. The second run shows 2*200 samples per increment, as expected.

yeli7289 commented 8 years ago

Thanks @frankseide for your patience and your reply. I used a minibatch size of 2 only because it better illustrates the problem I encountered; I originally used an MB size of 200 when running the RNN, got a much worse result than its DNN counterpart along with the same fast-convergence problem, and so started looking for the cause. Do you think it is possible that many of my clean utterances contain a short stretch of low-noise silence (probably a few frames), which gives a bad starting point and makes the model converge to a local minimum easily? What is the typical way to keep that from happening?

Another question: if I want to implement an RNN, my MB size should not be set to something like 470, so how should I deal with the reader issue? Thanks a lot.

zhouwangzw commented 7 years ago

The CNTK V2 API (C++ and Python) provides more control over training. Please try the latest CNTK release. If you still run into problems, please open a new issue.