kaldi-asr / kaldi

kaldi-asr/kaldi is the official location of the Kaldi project.
http://kaldi-asr.org

Tune nnet3 LSTMs #134

Closed danpovey closed 9 years ago

danpovey commented 9 years ago

I am creating an issue for this in case we can get wider help -- right now the most urgent thing for me is to get LSTMs tuned and working in the nnet3 setup. I think at this point the essential code is all written and it's a question of tuning the scripts (e.g. adding more layers). This is also the limiting factor on getting CTC working (I'm doing it in a private branch, but I can't really test it until we have some recurrent setup working well). What I'd like done is this: @vijayaditya and @pegahgh, can you please (fairly urgently) make sure that the essential pieces of your work so far are checked in, and provide pointers to them here in this thread? We can just check in the best configurations you have so far. After that I am hoping others such as @naxingyu and @nichongjia will be able to help tune it using their setups.

naxingyu commented 9 years ago

I'm happy to participate.

nichongjia commented 9 years ago

Glad to do this.

danpovey commented 9 years ago

Cool. Vijay is making sure that stuff is checked in. Vijay, please see if there is anything that Pegah did in the interim which needs to be checked in too. She got some improvements out of label delay, I think. Dan


danpovey commented 9 years ago

OK, Pegah has checked in her latest changes. Guys, you don't have to do this on AMI if it's not convenient-- just adapt the scripts to whatever setup is convenient for you. Right now I am debugging some code which enables us to truncate the BPTT and not apply it to the left-context frames. That should help both with speed and avoiding gradient explosion.

danpovey commented 9 years ago

Another note: it's possible that Pegah's change (chunk_width: 20 -> 10, chunk_left_context: 20 -> 10) does not correspond to what she ran, and we might have to change it back. This relates to the durations of the chunks of data that we train on. Dan

naxingyu commented 9 years ago

I have a hkust and a swbd setup running, both using Pegah's AMI setup. In the hkust setup, iteration 0 works fine, but the network explodes on iteration 1: the training log shows the objf jumping from -4.34102 to -6108.87. The swbd setup seems to be running OK. Now I'm starting to tune these two setups. Any suggestions?


vijayaditya commented 9 years ago

Hi Xingyu, this happens when there is a gradient explosion in one of the intermediate steps. Could you check what clipping threshold you are using? You could try a smaller clipping threshold:

nnet3-am-info |grep clipping
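
For example, using the model path that appears in the logs later in this thread (adjust to your own experiment directory):

nnet3-am-info exp/nnet3/lstm_ld5/1.mdl | grep clipping

The matching lines are the ClipGradientComponent entries, which report the clipping-threshold in effect and (as seen in progress.1.log below) the clipped-proportion.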

danpovey commented 9 years ago

Also, does the objf get worse within all of the individual jobs (train.1.XX.log), or just one of them? My instinct is that you could try decreasing the max-change values in the config and/or the learning rates. But as Vijay said, the gradient clipping threshold might make a difference too. Looking at the nnet3-info output in progress.XX.log might help show which parameters are exploding, if any -- check the stddevs. Dan
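
A quick, hedged way to scan those progress logs for blown-up parameters or heavy clipping (this assumes the usual exp/<dir>/log layout; adjust the path to your setup):

grep -E 'stddev|clipped-proportion' exp/nnet3/lstm_ld5/log/progress.*.log

Steadily growing stddev values, or a clipped-proportion close to 1.0, point at the components that are diverging.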


vijayaditya commented 9 years ago

@naxingyu one other thing, I just pushed in some changes where the default learning rates in the steps/nnet3/lstm/train.sh have been modified. Ensure that you are using the learning rates in this range.

naxingyu commented 9 years ago

Thanks Dan and Vijay. I checked out the latest commits, recompiled the binaries, reduced the clipping threshold from 10.0 to 5.0, and made sure the learning rates are in the modified range.

1. The hkust setup still explodes from the very beginning. When it explodes, it happens in all of the individual jobs. 'clipped-proportion' is above 0.9 on both the cell and recurrent components.

2. The swbd setup explodes later, with behaviour similar to the hkust setup.

I see Vijay committed the min-deriv-time option while I was composing this comment. I'll try that ASAP.


danpovey commented 9 years ago

Try smaller learning-rates and/or smaller max-change. Dan


naxingyu commented 9 years ago

OK. What 'max-change'? I didn't see an option for that...


danpovey commented 9 years ago

Can you show us an example log where the objf is very bad? I want to see if it starts from the beginning of the job, or in the middle of the job.


vijayaditya commented 9 years ago

@naxingyu When reducing the clipping threshold, monitor the clipped-proportion in progress*.log. If this is high, the objective function usually does not improve.

You can add the max-change parameters in steps/nnet3/lstm.sh using the ng_per_element_scale_options= and ng_affine_options= options.

danpovey commented 9 years ago

Regarding 'max-change': it may not be in the script, but the NaturalGradientAffineComponent supports the 'max-change-per-sample' option, whose default value is 0.075 (it would go in the 'component name=xxx...' line), and the NaturalGradientPerElementScaleComponent supports the 'max-change-per-minibatch' option, whose default value is 0.5.

Dan
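
Putting Vijay's and Dan's notes together, a minimal sketch of how these limits could be tightened via the script variables Vijay mentions; the values are illustrative (roughly half the defaults Dan quotes), and the exact attribute spelling should be checked against what make_configs.py splices into the 'component name=xxx ...' lines:

ng_affine_options="max-change-per-sample=0.04"
ng_per_element_scale_options="max-change-per-minibatch=0.25"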


naxingyu commented 9 years ago

Both. Here is a log where it happens within the first 10 minibatches.

# Running on g19
# Started at Wed Sep 16 10:16:21 CST 2015
# nnet3-train --print-interval=10 --update-per-minibatch=true "nnet3-am-copy --raw=true --learning-rate=0.000596534446947406 exp/nnet3/lstm_ld5/1.mdl -|" "ark:nnet3-copy-egs --left-context=40 --right-context=7 ark:exp/nnet3/lstm_ld5/egs/egs.3.ark ark:- | nnet3-shuffle-egs --buffer-size=5000 --srand=1 ark:- ark:-| nnet3-merge-egs --minibatch-size=100 --measure-output-frames=false ark:- ark:- |" exp/nnet3/lstm_ld5/2.1.raw 
nnet3-train --print-interval=10 --update-per-minibatch=true 'nnet3-am-copy --raw=true --learning-rate=0.000596534446947406 exp/nnet3/lstm_ld5/1.mdl -|' 'ark:nnet3-copy-egs --left-context=40 --right-context=7 ark:exp/nnet3/lstm_ld5/egs/egs.3.ark ark:- | nnet3-shuffle-egs --buffer-size=5000 --srand=1 ark:- ark:-| nnet3-merge-egs --minibatch-size=100 --measure-output-frames=false ark:- ark:- |' exp/nnet3/lstm_ld5/2.1.raw 
LOG (nnet3-train:IsComputeExclusive():cu-device.cc:246) CUDA setup operating under Compute Exclusive Mode.
LOG (nnet3-train:FinalizeActiveGpu():cu-device.cc:213) The active GPU is [0]: Tesla K20m    free:4705M, used:94M, total:4799M, free/total:0.980312 version 3.5
nnet3-am-copy --raw=true --learning-rate=0.000596534446947406 exp/nnet3/lstm_ld5/1.mdl - 
LOG (nnet3-am-copy:main():nnet3-am-copy.cc:96) Copied neural net from exp/nnet3/lstm_ld5/1.mdl to raw format as -
nnet3-shuffle-egs --buffer-size=5000 --srand=1 ark:- ark:- 
nnet3-copy-egs --left-context=40 --right-context=7 ark:exp/nnet3/lstm_ld5/egs/egs.3.ark ark:- 
nnet3-merge-egs --minibatch-size=100 --measure-output-frames=false ark:- ark:- 
LOG (nnet3-train:GetScalingFactor():nnet-simple-component.cc:1253) Limiting step size using scaling factor 0.366088, for component Lstm1_W_o-xr
WARNING (nnet3-train:Update():nnet-simple-component.cc:1923) Parameter change 1.79104 exceeds --max-change-per-minibatch=0.5 for this minibatch, for Lstm1_w_oc, scaling by factor 0.279167
LOG (nnet3-train:GetScalingFactor():nnet-simple-component.cc:1253) Limiting step size using scaling factor 0.525689, for component Lstm1_W_o-xr
WARNING (nnet3-train:Update():nnet-simple-component.cc:1923) Parameter change 1.41438 exceeds --max-change-per-minibatch=0.5 for this minibatch, for Lstm1_w_oc, scaling by factor 0.353512
LOG (nnet3-train:GetScalingFactor():nnet-simple-component.cc:1253) Limiting step size using scaling factor 0.549739, for component Lstm1_W_o-xr
WARNING (nnet3-train:Update():nnet-simple-component.cc:1923) Parameter change 1.84376 exceeds --max-change-per-minibatch=0.5 for this minibatch, for Lstm1_w_oc, scaling by factor 0.271185
LOG (nnet3-train:GetScalingFactor():nnet-simple-component.cc:1253) Limiting step size using scaling factor 0.529195, for component Lstm1_W_o-xr
WARNING (nnet3-train:Update():nnet-simple-component.cc:1923) Parameter change 1.7881 exceeds --max-change-per-minibatch=0.5 for this minibatch, for Lstm1_w_oc, scaling by factor 0.279627
LOG (nnet3-train:GetScalingFactor():nnet-simple-component.cc:1253) Limiting step size using scaling factor 0.545474, for component Lstm1_W_o-xr
WARNING (nnet3-train:Update():nnet-simple-component.cc:1923) Parameter change 1.82206 exceeds --max-change-per-minibatch=0.5 for this minibatch, for Lstm1_w_oc, scaling by factor 0.274415
LOG (nnet3-train:GetScalingFactor():nnet-simple-component.cc:1253) Limiting step size using scaling factor 0.497255, for component Lstm1_W_o-xr
WARNING (nnet3-train:Update():nnet-simple-component.cc:1923) Parameter change 1.92931 exceeds --max-change-per-minibatch=0.5 for this minibatch, for Lstm1_w_oc, scaling by factor 0.25916
LOG (nnet3-train:GetScalingFactor():nnet-simple-component.cc:1253) Limiting step size using scaling factor 0.499546, for component Lstm1_W_o-xr
WARNING (nnet3-train:Update():nnet-simple-component.cc:1923) Parameter change 1.36253 exceeds --max-change-per-minibatch=0.5 for this minibatch, for Lstm1_w_oc, scaling by factor 0.366964
LOG (nnet3-train:GetScalingFactor():nnet-simple-component.cc:1253) Limiting step size using scaling factor 0.525674, for component Lstm1_W_o-xr
WARNING (nnet3-train:Update():nnet-simple-component.cc:1923) Parameter change 1.80693 exceeds --max-change-per-minibatch=0.5 for this minibatch, for Lstm1_w_oc, scaling by factor 0.276712
LOG (nnet3-train:GetScalingFactor():nnet-simple-component.cc:1253) Limiting step size using scaling factor 0.528004, for component Lstm1_W_o-xr
WARNING (nnet3-train:Update():nnet-simple-component.cc:1923) Parameter change 1.40109 exceeds --max-change-per-minibatch=0.5 for this minibatch, for Lstm1_w_oc, scaling by factor 0.356866
LOG (nnet3-train:GetScalingFactor():nnet-simple-component.cc:1253) Limiting step size using scaling factor 0.504728, for component Lstm1_W_o-xr
WARNING (nnet3-train:Update():nnet-simple-component.cc:1923) Parameter change 2.65101 exceeds --max-change-per-minibatch=0.5 for this minibatch, for Lstm1_w_oc, scaling by factor 0.188608
LOG (nnet3-train:PrintStatsForThisPhase():nnet-training.cc:121) Average objective function for 'output' for minibatches 0-9 is -41.225 over 19451 frames.
WARNING (nnet3-train:Update():nnet-simple-component.cc:1923) Parameter change 1.06435 exceeds --max-change-per-minibatch=0.5 for this minibatch, for Lstm1_w_oc, scaling by factor 0.469769
LOG (nnet3-train:PrintStatsForThisPhase():nnet-training.cc:121) Average objective function for 'output' for minibatches 10-19 is -49.1079 over 19514 frames.
LOG (nnet3-train:PrintStatsForThisPhase():nnet-training.cc:121) Average objective function for 'output' for minibatches 20-29 is -53.304 over 19421 frames.
LOG (nnet3-train:PrintStatsForThisPhase():nnet-training.cc:121) Average objective function for 'output' for minibatches 30-39 is -46.8068 over 19406 frames.
LOG (nnet3-train:PrintStatsForThisPhase():nnet-training.cc:121) Average objective function for 'output' for minibatches 40-49 is -35.0044 over 19538 frames.
LOG (nnet3-train:PrintStatsForThisPhase():nnet-training.cc:121) Average objective function for 'output' for minibatches 50-59 is -32.7891 over 19423 frames.
LOG (nnet3-train:PrintStatsForThisPhase():nnet-training.cc:121) Average objective function for 'output' for minibatches 60-69 is -42.1689 over 19457 frames.
LOG (nnet3-train:PrintStatsForThisPhase():nnet-training.cc:121) Average objective function for 'output' for minibatches 70-79 is -34.7181 over 19393 frames.
LOG (nnet3-train:PrintStatsForThisPhase():nnet-training.cc:121) Average objective function for 'output' for minibatches 80-89 is -37.7776 over 19413 frames.
LOG (nnet3-train:PrintStatsForThisPhase():nnet-training.cc:121) Average objective function for 'output' for minibatches 90-99 is -44.3953 over 19418 frames.
LOG (nnet3-train:PrintStatsForThisPhase():nnet-training.cc:121) Average objective function for 'output' for minibatches 100-109 is -40.2744 over 19495 frames.
LOG (nnet3-train:PrintStatsForThisPhase():nnet-training.cc:121) Average objective function for 'output' for minibatches 110-119 is -31.8361 over 19481 frames.
LOG (nnet3-train:PrintStatsForThisPhase():nnet-training.cc:121) Average objective function for 'output' for minibatches 120-129 is -35.3526 over 19385 frames.
LOG (nnet3-train:PrintStatsForThisPhase():nnet-training.cc:121) Average objective function for 'output' for minibatches 130-139 is -36.5132 over 19468 frames.
LOG (nnet3-copy-egs:main():nnet3-copy-egs.cc:339) Read 20029 neural-network training examples, wrote 20029
LOG (nnet3-train:PrintStatsForThisPhase():nnet-training.cc:121) Average objective function for 'output' for minibatches 140-149 is -33.5093 over 19444 frames.
LOG (nnet3-train:PrintStatsForThisPhase():nnet-training.cc:121) Average objective function for 'output' for minibatches 150-159 is -30.1704 over 19348 frames.
LOG (nnet3-train:PrintStatsForThisPhase():nnet-training.cc:121) Average objective function for 'output' for minibatches 160-169 is -29.4314 over 19450 frames.
LOG (nnet3-train:PrintStatsForThisPhase():nnet-training.cc:121) Average objective function for 'output' for minibatches 170-179 is -25.2375 over 19356 frames.
LOG (nnet3-train:PrintStatsForThisPhase():nnet-training.cc:121) Average objective function for 'output' for minibatches 180-189 is -34.1938 over 19466 frames.
LOG (nnet3-shuffle-egs:main():nnet3-shuffle-egs.cc:103) Shuffled order of 20112 neural-network training examples using a buffer (partial randomization)
LOG (nnet3-merge-egs:main():nnet3-merge-egs.cc:121) Merged 20029 egs to 201.
LOG (nnet3-train:PrintStatsForThisPhase():nnet-training.cc:121) Average objective function for 'output' for minibatches 190-199 is -30.8311 over 19354 frames.
LOG (nnet3-train:PrintTotalStats():nnet-training.cc:129) Overall average objective function for 'output' is -37.2068 over 389248 frames.
LOG (nnet3-train:PrintProfile():cu-device.cc:415) -----
[cudevice profile]
CopyRows    1.52818s
MulColsVec  1.58382s
CuMatrix::SetZero   1.64581s
AddDiagMatMat   2.01596s
AddVec  2.12538s
CopyFromVec<float>  2.17151s
CuVectorBase::ApplyFloor    2.25274s
Set 2.77716s
CuVector::SetZero   3.61297s
MulElements 3.72628s
AddMatVec   3.76281s
AddMat  4.11245s
CuVector::Resize    4.63484s
CuMatrixBase::CopyFromMat(from other CuMatrixBase)  9.23624s
AddMatMat   27.4815s
Total GPU time: 82.4288s (may involve some double-counting)
-----
LOG (nnet3-train:PrintMemoryUsage():cu-allocator.cc:126) Memory usage: 789164880 bytes currently allocated (max: 807610216); 16182772 currently in use by user (max: 538426172); 978/538361 calls to Malloc* resulted in CUDA calls.
LOG (nnet3-train:PrintMemoryUsage():cu-allocator.cc:133) Time taken in cudaMallocPitch=0.210049, in cudaMalloc=0.0545287, in cudaFree=0.00750351, in this->MallocPitch()=1.14464
LOG (nnet3-train:PrintMemoryUsage():cu-device.cc:388) Memory used (according to the device): 891871232 bytes.
LOG (nnet3-train:main():nnet3-train.cc:86) Wrote model to exp/nnet3/lstm_ld5/2.1.raw
# Accounting: time=103 threads=1
# Finished at Wed Sep 16 10:18:04 CST 2015 with status 0

danpovey commented 9 years ago

OK, but there are two possibilities here. It could have diverged already within the first few minibatches, or it could have been that the model was already bad. What does the compute_prob_train.1.log look like, i.e. what is the probability there?

Dan
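
For reference, a hedged one-liner to pull that number out of the diagnostic logs (assuming the standard exp/<dir>/log layout):

grep 'Overall log-likelihood' exp/nnet3/lstm_ld5/log/compute_prob_train.*.log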


vijayaditya commented 9 years ago

@naxingyu Could you submit a pull request for your hkust and swbd recipes? You could update the pull request as you progress, and it would be easier for everyone to keep track of your local setup.

BTW, in your current setup the problem is due to --left-context=40. This was caused by a bug in make_configs.py introduced in a commit from Pegah. I have removed it in the latest commit. Please update and rerun.

naxingyu commented 9 years ago

compute_prob_train.1.log looks like this:

Overall log-likelihood for 'output' is -43.8004 per frame, over 4000 frames.

Already bad... And progress.1.log looks like this:

component name=Lstm1_c type=ClipGradientComponent, dim=1024, norm-based-clipping=true, clipping-threshold=5, clipped-proportion=0.924855
component name=Lstm1_r type=ClipGradientComponent, dim=256, norm-based-clipping=true, clipping-threshold=5, clipped-proportion=0.742215

OK, I checked train.0.1.log and train.0.2.log: job 1 exploded and job 2 ran fine. So the issue is that 0.1.mdl is already bad before averaging. I have checked out Vijay's latest commit and will update tomorrow.


danpovey commented 9 years ago

I followed up with this text by email but it didn't show up here -- very strange. I'll bring it up with GitHub support staff.

"OK, there must be a script bug then. Because the way the script is supposed to work, on iteration zero (and any iteration when you just added a new layer), it should take the job with the best likelihood. Obviously it is not doing that. Perhaps the script is not able to parse the likelihood because nnet3 produces output different from nnet2. @vpeddinti, please look at this urgently!"

vijayaditya commented 9 years ago

I verified that the script is selecting the model, but in @naxingyu's case all the models would have diverged. This is very probable since he was using --left-context=40 without optimization.min-deriv-time. This essentially means that the model was getting updated with a gradient that had been back-propagated through 50-60 time steps (depending on frames_per_eg), so the gradient would have a high probability of exploding/vanishing.

In the updated recipe, optimization.min-deriv-time is used to restrict back-prop to 20 time steps, so he might not see any of these problems. Further, the left-context has also been reduced to 20 (it was previously set to 40 due to a bug).
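
As a concrete sketch of what this means (the time indexing of the egs is my assumption; the option syntax is the one used later in this thread): with chunk_width=20 and chunk_left_context=20, the supervised frames of each example sit at t = 0 .. 19 and the left-context frames at t = -20 .. -1. Passing

nnet3-train --optimization.min-deriv-time=0 ...

means derivatives are only evaluated for t >= 0, so the left-context frames still provide forward context for the recurrence but no longer deepen the back-propagation -- the truncated-BPTT behaviour Dan described earlier in the thread.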

danpovey commented 9 years ago

Vijay: @naxingyu said in this thread, "OK I checked train.0.1.log and train.0.2.log, job1 exploded and job2 ran fine. So issue is that 0.1.mdl is already bad before averaging.". But in this case it should have chosen the model from train.0.2.log. @vpeddinti: I notice that the script contains a reference to num_jobs_nnet, which seems to be an undefined variable. Dan


naxingyu commented 9 years ago

Thanks. Now I am using Vijay's updated scripts with reduced lrates (0.0002 and 0.00002). Options are "--optimization.min-deriv-time=0" and "--left-context=20 --right-context=7". iter0 and iter1 work fine, with prob_train of -3.98025 and -3.45653. The network explodes at iter2, but only on job train.2.1. After that, 3.1.raw is copied into 3.mdl (as shown in select.2.log), which is not right. I tested the selecting perl command and found that it looks for the "log-prob-per-frame=(\S+)" expression, which nnet3 does not print as a to-be-parsed-by-a-script variable (nnet2 did). So the script always selects the first net regardless of "log-prob-per-frame".
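
A minimal sketch of the kind of fix meant here, assuming the selection should instead parse the objective-function line that nnet3-train actually prints (visible in the train log pasted above); the log path is illustrative and the final regex belongs in the pull request:

perl -ne 'print "$1\n" if m/Overall average objective function for .output. is (\S+) over/;' exp/nnet3/lstm_ld5/log/train.2.1.log

Run over each train.ITER.JOB.log, the job with the largest (least negative) value is the one that should be kept.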


danpovey commented 9 years ago

If you can fix the script yourself, please do via a pull request.

Dan


danpovey commented 9 years ago

BTW, to keep you abreast of the latest developments.

Dan


naxingyu commented 9 years ago

OK. My plan is to fix this selecting issue first, then tune 'max-change' options as you suggested. After that, try the scaling/l2-regularizing stuff.

Best, Xingyu

On 09/17/2015 10:25 AM, Daniel Povey wrote:

BTW, to keep you abreast of the latest developments.

  • Vijay has been experimenting with a version of the code supporting momentum, in a pull request that I created. So far it doesn't seem to help our standard setup, but when I increase the learning rate a lot it does seem to prevent instability (but in those situations the likelihood still degrades, just not nearly as badly).
  • Right now I'm increasingly convinced that the problem is that the parameters are getting too large and the sigmoids are getting oversaturated. My proposed solution is to decrease the parameters slightly (e.g. by a factor of 0.95) on each iteration -- at least in early epochs. An option could be added to nnet3-copy or nnet3-am-copy to scale the parameters using ScaleNnet(). I know this is a bit odd, but it can be viewed as almost equivalent to applying l2 regularization. Do you have time to try this? I'm convinced it will help us get to a better place in parameter space.

Dan
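
For the second bullet above, a hedged sketch of what this could look like at the script level, assuming a new --scale option were added to nnet3-am-copy as Dan suggests (the flag name is hypothetical; internally it would just call ScaleNnet(0.95, ...) on the model's parameters):

nnet3-am-copy --scale=0.95 exp/nnet3/lstm_ld5/2.mdl exp/nnet3/lstm_ld5/2_scaled.mdl

Applied between iterations during the early epochs, this is roughly equivalent to the mild l2 regularization Dan mentions.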


danpovey commented 9 years ago

Thanks! Actually I think Vijay was spending some time tuning the max-change, but I'm not aware that he saw any improvement. In any case, he can let us know what the issue was. So it might be better if you just go straight to the scaling thing. If you make the pull request in https://github.com/kaldi-asr/kaldi/pull/143 which makes momentum=0.9 the default, I'm fairly confident that you will not get the catastrophic divergence that you are seeing. Dan


danpovey commented 9 years ago

... I mean, if you make that pull request your starting point.

On Wed, Sep 16, 2015 at 10:34 PM, Daniel Povey dpovey@gmail.com wrote:

Thanks! Actually I think Vijay was spending some time tuning the max-change, but I'm not aware that he saw any improvement. In any case, he can let us know what the issue was. So it might be better if you just go straight to the scaling thing. If you make the pull request in https://github.com/kaldi-asr/kaldi/pull/143 which makes momentum=0.9 the default, I'm fairly confident that you will not get the catastrophic divergence that you are seeing. Dan

On Wed, Sep 16, 2015 at 10:31 PM, Xingyu Na notifications@github.com wrote:

OK. My plan is to fix this selecting issue first, then tune 'max-change' options as you suggested. After that, try the scaling/l2-regularizing stuff.

Best, Xingyu

On 09/17/2015 10:25 AM, Daniel Povey wrote:

BTW, to keep you abreast of the latest developments.

  • Vijay has been experimenting with a version of the code supporting momentum, in a pull request that I created. So far it doesn't seem to help our standard setup, but when I increase the learning rate a lot it does seem to prevent instability (but in those situations the likelihood still degrades, just not nearly as badly).
  • Right now I'm increasingly convinced that the problem is that the parameters are getting too large and the sigmoids are getting oversaturated. My proposed solution is to decrease the parameters slightly (e.g. by a factor of 0.95) on each iteration-- at least, in early epochs. An option could be added to nnet3-copy or nnet3-am-copy to scale the parameters using ScaleNnet(). I know this a bit odd, but it can be viewed as almost equivalent to applying l2 regularization. Do you have time to try this? I'm convinced it will help us get to a better place in parameter space.

Dan

On Wed, Sep 16, 2015 at 10:21 PM, Daniel Povey dpovey@gmail.com wrote:

If you can fix the script yourself, please do via a pull request.

Dan

On Wed, Sep 16, 2015 at 10:18 PM, Xingyu Na <notifications@github.com

wrote:

Thanks. Now I use Vijay's updated scripts with reduced lrates (0.0002 and 0.00002). Options are "--optimization.min-deriv-time=0" and "--left-context=20 --right-context=7". iter0 and iter1 works fine, with prob_train of -3.98025 and -3.45653. The network explodes at iter2, but only on job train.2.1. After that, 3.1.raw is copied into 3.mdl (as shown in select.2.log), which is not right. I tested the selecting perl command and found that it was looking for "log-prob-per-frame=(\S+)" expression which is not a "to-be-parsed-by-a-script" variable in nnet3 (was in nnet2). So the script always select the first net regardless of "log-prob-per-frame".

On 09/17/2015 04:01 AM, Vijayaditya Peddinti wrote:

I verified that the script is selecting the model, but in @naxingyu https://github.com/naxingyu 's case all the models would have diverged. This is very probable as he was using --left-context 40
without optimization.min-deriv-time . This essentially means that

the model was getting updated with a gradient which has been back-propagated 50-60 time steps (depending on the frames_per_eg). Hence this gradient would have a high probability of explosion/vanishing.

In the updated recipe |optimization.min-deriv-time | is used to restrict back-prop to 20 time steps. So he might not see any of these problems. Further the left-context is also reduced to 20 (it was previously taken as 40 due to a bug).

— Reply to this email directly or view it on GitHub

https://github.com/kaldi-asr/kaldi/issues/134#issuecomment-140870071.

— Reply to this email directly or view it on GitHub < https://github.com/kaldi-asr/kaldi/issues/134#issuecomment-140944421>.

— Reply to this email directly or view it on GitHub https://github.com/kaldi-asr/kaldi/issues/134#issuecomment-140945311.

— Reply to this email directly or view it on GitHub https://github.com/kaldi-asr/kaldi/issues/134#issuecomment-140946364.

danpovey commented 9 years ago

I just merged that pull request (but made 0.0 the default for momentum). Likely if you add --momentum 0.8 to the command line it will stabilize the update without affecting convergence. Dan
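
[Editor's note: to illustrate why momentum can tame an occasional exploding gradient, here is a tiny pure-Python sketch. It is not the nnet3 trainer's code; in particular, the exact way nnet3 rescales the effective learning rate under momentum may differ from this.]

```python
# With the step scaled by (1 - momentum), each gradient's total contribution
# over time is unchanged, but a single huge gradient is applied gradually
# over roughly 1/(1 - momentum) minibatches instead of in one jump.
momentum, lr = 0.8, 0.0002
grads = [1.0] * 5 + [1000.0] + [1.0] * 5   # one outlier "exploding" gradient

v = 0.0
plain_steps, momentum_steps = [], []
for g in grads:
    plain_steps.append(lr * g)                       # plain SGD step size
    v = momentum * v + g                             # running gradient sum
    momentum_steps.append(lr * (1.0 - momentum) * v) # smoothed step size

print(max(plain_steps))     # 0.2   -- the outlier lands in a single update
print(max(momentum_steps))  # ~0.04 -- the same mass, spread over several updates
```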


naxingyu commented 9 years ago

Got it! Thanks.


vijayaditya commented 9 years ago

Was offline for a few hours; back to implementing the nnet-am-copy change with scaling. Using lower max-change options didn't make much of a difference (objf of -3.56 vs -3.54).