kaldi-asr / kaldi

kaldi-asr/kaldi is the official location of the Kaldi project.
http://kaldi-asr.org

Cholesky decomposition failed when training plda #3328

Open czy97 opened 5 years ago

czy97 commented 5 years ago

Hello. When I train PLDA using some features I extracted, this error occurs. The tail of the log is shown below:

LOG (ivector-compute-plda[5.5.0~3-327d]:EstimateFromStats():plda.cc:511) Trace of within-class variance is 146608
LOG (ivector-compute-plda[5.5.0~3-327d]:EstimateFromStats():plda.cc:512) Trace of between-class variance is 209565
LOG (ivector-compute-plda[5.5.0~3-327d]:Estimate():plda.cc:529) Plda estimation iteration 5 of 10
LOG (ivector-compute-plda[5.5.0~3-327d]:EstimateFromStats():plda.cc:511) Trace of within-class variance is 140.852
LOG (ivector-compute-plda[5.5.0~3-327d]:EstimateFromStats():plda.cc:512) Trace of between-class variance is 197.8
LOG (ivector-compute-plda[5.5.0~3-327d]:Estimate():plda.cc:529) Plda estimation iteration 6 of 10
LOG (ivector-compute-plda[5.5.0~3-327d]:EstimateFromStats():plda.cc:511) Trace of within-class variance is 105.141
LOG (ivector-compute-plda[5.5.0~3-327d]:EstimateFromStats():plda.cc:512) Trace of between-class variance is 157.448
LOG (ivector-compute-plda[5.5.0~3-327d]:Estimate():plda.cc:529) Plda estimation iteration 7 of 10
LOG (ivector-compute-plda[5.5.0~3-327d]:EstimateFromStats():plda.cc:511) Trace of within-class variance is 1117.95
LOG (ivector-compute-plda[5.5.0~3-327d]:EstimateFromStats():plda.cc:512) Trace of between-class variance is 5858.14
LOG (ivector-compute-plda[5.5.0~3-327d]:Estimate():plda.cc:529) Plda estimation iteration 8 of 10
LOG (ivector-compute-plda[5.5.0~3-327d]:EstimateFromStats():plda.cc:511) Trace of within-class variance is 140.276
LOG (ivector-compute-plda[5.5.0~3-327d]:EstimateFromStats():plda.cc:512) Trace of between-class variance is 395.243
LOG (ivector-compute-plda[5.5.0~3-327d]:Estimate():plda.cc:529) Plda estimation iteration 9 of 10
LOG (ivector-compute-plda[5.5.0~3-327d]:EstimateFromStats():plda.cc:511) Trace of within-class variance is 12926.4
LOG (ivector-compute-plda[5.5.0~3-327d]:EstimateFromStats():plda.cc:512) Trace of between-class variance is 12056.3
LOG (ivector-compute-plda[5.5.0~3-327d]:GetOutput():plda.cc:540) Norm of mean of iVector distribution is 0.745405
WARNING (ivector-compute-plda[5.5.0~3-327d]:Cholesky():tp-matrix.cc:110) Cholesky decomposition failed. Maybe matrix is not positive definite. Throwing error
Cholesky decomposition failed.
# Accounting: begin_time=1558008300

When I add some very small random noise to my features, as you suggested for the same problem in the LDA computation, the error is still there:
LOG (ivector-compute-plda[5.5.0~3-327d]:Estimate():plda.cc:529) Plda estimation iteration 7 of 10
LOG (ivector-compute-plda[5.5.0~3-327d]:EstimateFromStats():plda.cc:511) Trace of within-class variance is 101755
LOG (ivector-compute-plda[5.5.0~3-327d]:EstimateFromStats():plda.cc:512) Trace of between-class variance is 141957
LOG (ivector-compute-plda[5.5.0~3-327d]:Estimate():plda.cc:529) Plda estimation iteration 8 of 10
LOG (ivector-compute-plda[5.5.0~3-327d]:EstimateFromStats():plda.cc:511) Trace of within-class variance is 3136.84
LOG (ivector-compute-plda[5.5.0~3-327d]:EstimateFromStats():plda.cc:512) Trace of between-class variance is 17124.3
LOG (ivector-compute-plda[5.5.0~3-327d]:Estimate():plda.cc:529) Plda estimation iteration 9 of 10
LOG (ivector-compute-plda[5.5.0~3-327d]:EstimateFromStats():plda.cc:511) Trace of within-class variance is 1.78053e+08
LOG (ivector-compute-plda[5.5.0~3-327d]:EstimateFromStats():plda.cc:512) Trace of between-class variance is 1.22304e+08
LOG (ivector-compute-plda[5.5.0~3-327d]:GetOutput():plda.cc:540) Norm of mean of iVector distribution is 0.745405
WARNING (ivector-compute-plda[5.5.0~3-327d]:Cholesky():tp-matrix.cc:110) Cholesky decomposition failed. Maybe matrix is not positive definite. Throwing error
Cholesky decomposition failed.
# Accounting: begin_time=1558007075

danpovey commented 5 years ago

Are you sure your features aren't limited to a subspace of the space they live in? Or maybe you have fewer features than the PLDA dimension?

czy97 commented 5 years ago

Thanks for your reply. My feature dimension is 256, and more than 1 million feature vectors are used to train the PLDA. Also, what do you mean by features being limited to a subspace of the space they live in?

danpovey commented 5 years ago

I mean something like the feature values sum to one, so the covariance would not be full rank. Or one feature is always zero, something like that.
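
To make the two cases above concrete, here is a minimal sketch in plain Python. It builds a covariance matrix from toy data and attempts a textbook Cholesky factorisation. The function names, the toy data and the tolerance are assumptions made for illustration; this is not Kaldi's implementation, and Kaldi's own checks may differ.

```python
import math


def covariance(rows):
    """Sample covariance matrix (divide by n) of a list of equal-length rows."""
    n = len(rows)
    d = len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    return [
        [sum((r[i] - mean[i]) * (r[j] - mean[j]) for r in rows) / n
         for j in range(d)]
        for i in range(d)
    ]


def cholesky_ok(a, tol=1e-10):
    """Return True if a textbook Cholesky factorisation of `a` succeeds.

    A pivot is treated as a failure when it is not larger than `tol` times
    the largest diagonal entry (an arbitrary tolerance chosen for this sketch).
    """
    d = len(a)
    scale = max(a[i][i] for i in range(d))
    if scale <= 0:
        return False
    lower = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = a[i][j] - sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                if s <= tol * scale:
                    return False  # matrix is not (numerically) positive definite
                lower[i][i] = math.sqrt(s)
            else:
                lower[i][j] = s / lower[j][j]
    return True


# Toy data, case 1: every row sums to one, so the covariance loses one rank.
sum_to_one = [[0.2, 0.3, 0.5], [0.6, 0.1, 0.3], [0.1, 0.8, 0.1], [0.4, 0.4, 0.2]]
# Toy data, case 2: the last feature is always zero.
zero_column = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0], [2.0, 0.0, 0.0]]
# Toy data with no such constraint.
full_rank = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 1.0]]

print(cholesky_ok(covariance(sum_to_one)))   # expected: False
print(cholesky_ok(covariance(zero_column)))  # expected: False
print(cholesky_ok(covariance(full_rank)))    # expected: True
```

In both constrained cases the last pivot collapses to (numerically) zero, which is the situation in which a Cholesky routine reports that the matrix is not positive definite.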

czy97 commented 5 years ago

OK, thanks. You can see that the traces of the within-class and between-class variances become very large at iteration 9 of 10. If I set --num-em-iters to 9 instead of 10, the error disappears (the within/between-class variance is small after 9 iterations). When I check logs from normal PLDA training runs, the within/between-class variance always stays at a relatively low value (around one hundred). Does this mean that the EM algorithm is not converging well? Is it my data's fault, or is there some other cause?
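
The jumps described above can be spotted mechanically. The following is a small sketch that pulls the trace values out of a log and flags large jumps between consecutive entries. The sample lines are copied from the first log in this thread; the function names and the `max_ratio` threshold of 5 are arbitrary assumptions for illustration, not a Kaldi convention.

```python
import re

# Matches the trace lines printed by EstimateFromStats() in the logs above.
TRACE_RE = re.compile(r"Trace of (within|between)-class variance is ([0-9.eE+-]+)")

# Within-class trace lines copied from the first log in this thread.
SAMPLE_LOG = """\
LOG (ivector-compute-plda[5.5.0~3-327d]:EstimateFromStats():plda.cc:511) Trace of within-class variance is 140.852
LOG (ivector-compute-plda[5.5.0~3-327d]:EstimateFromStats():plda.cc:511) Trace of within-class variance is 105.141
LOG (ivector-compute-plda[5.5.0~3-327d]:EstimateFromStats():plda.cc:511) Trace of within-class variance is 1117.95
LOG (ivector-compute-plda[5.5.0~3-327d]:EstimateFromStats():plda.cc:511) Trace of within-class variance is 140.276
LOG (ivector-compute-plda[5.5.0~3-327d]:EstimateFromStats():plda.cc:511) Trace of within-class variance is 12926.4
"""


def parse_traces(log_text):
    """Collect trace values in order of appearance, keyed by 'within'/'between'."""
    traces = {"within": [], "between": []}
    for match in TRACE_RE.finditer(log_text):
        traces[match.group(1)].append(float(match.group(2)))
    return traces


def unstable_steps(values, max_ratio=5.0):
    """Return positions i where values[i] and values[i-1] differ by more than
    a factor of max_ratio in either direction.

    Positions index the list of parsed values, not EM iteration numbers.
    """
    flagged = []
    for i in range(1, len(values)):
        prev, cur = values[i - 1], values[i]
        if prev <= 0 or cur <= 0:
            flagged.append(i)
            continue
        ratio = cur / prev
        if ratio > max_ratio or ratio < 1.0 / max_ratio:
            flagged.append(i)
    return flagged


within = parse_traces(SAMPLE_LOG)["within"]
print(within)
print(unstable_steps(within))
```

A run whose traces stay roughly constant would produce an empty list, while the sample above is flagged at several consecutive positions.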

danpovey commented 5 years ago

I'll try to look into it at some point.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

shayxurui commented 3 years ago

When I run the aishell2 recipe script (run.sh) from Kaldi's egs, I get an error when it reaches steps/online/nnet2/train_ivector_extractor.sh, and the log file says "Cholesky decomposition failed. Maybe matrix is not positive definite." I did not modify any data or code.

danpovey commented 3 years ago

Sometimes that error is harmless. In any case it's quite generic; I would need to see more info (e.g. more of the log).

shayxurui commented 3 years ago

I am very sorry. I re-ran the code and the log file has been overwritten. I then hit a new error, still in the train_ivector_extractor.sh script; the log says: expected token "", got instead "BLAS"

danpovey commented 3 years ago

You should learn to paste as text. My guess is that you probably ran out of memory; that part can use up a great deal of memory. You could pass --num-processes 1 to train_ivector_extractor.sh, which should help, and/or reduce --num-jobs too if you are using run.pl.

shayxurui commented 3 years ago

Thank you for your reply. I wanted to paste the full log, but I cannot copy text out of the virtual machine because my company restricts it.

danpovey commented 3 years ago

Likely something in your system is printing 'BLAS' to stdout every time a shell is created, e.g. in one of your .xxxrc files. Either that or (somehow) when a certain BLAS library gets loaded it prints BLAS.
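
One way to test the first hypothesis is to start a shell that runs a command printing nothing and see whether anything still reaches stdout. The sketch below does that with Python's `subprocess` module. The two probe commands are assumptions about how a shell might be started on a given system (they are not taken from Kaldi's scripts), so adjust them to match your setup.

```python
import subprocess

# Hypothetical probes: two common ways a shell gets started. `true` prints
# nothing, so any stdout seen here comes from the shell's startup files.
PROBES = {
    "login shell": ["bash", "-lc", "true"],
    "interactive shell": ["bash", "-ic", "true"],
}


def stdout_of(cmd, timeout=10):
    """Run `cmd` with stdin closed and return whatever it wrote to stdout."""
    result = subprocess.run(
        cmd,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        timeout=timeout,
    )
    return result.stdout.decode(errors="replace")


def report():
    """Print, for each probe, whether shell startup wrote anything to stdout."""
    for name, cmd in PROBES.items():
        try:
            out = stdout_of(cmd)
        except (OSError, subprocess.TimeoutExpired) as err:
            print(f"{name}: could not run {cmd!r}: {err}")
            continue
        if out:
            print(f"{name}: startup printed {out!r} to stdout")
        else:
            print(f"{name}: clean (nothing on stdout)")


if __name__ == "__main__":
    report()
```

If a probe reports stray output such as 'BLAS', the startup files read by that kind of shell are the place to look. A clean result does not rule out the second possibility, a library printing when it is loaded.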

shayxurui commented 3 years ago

When I removed the i-vector-related parts and trained again, there was no error. It seems that training aishell2 does not strictly require i-vectors.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale by a bot solely because it has not had recent activity. Please add any comment (simply 'ping' is enough) to prevent the issue from being closed for 60 more days if you believe it should be kept open.