kaldi-asr / kaldi

kaldi-asr/kaldi is the official location of the Kaldi project.
http://kaldi-asr.org

except Error when extract ivectors #2870

Closed YihengJiang closed 4 years ago

YihengJiang commented 5 years ago

When I run "sid/extract_ivectors.sh", I hit a problem that never occurred before. I tried to solve it, but nothing seems to work. Eventually I found that some enrollment utterance files fail in the "gmm-gselect" command line (which is invoked inside "sid/extract_ivectors.sh"). This suggests those files are somehow wrong, but I can't tell what is wrong with them. There is no problem with any other files, including the test set and the PLDA training set. I am posting the error output from running on one of the failing enrollment files. I hope you can help, professor. Thank you!

.....(something running normally)


ASSERTION_FAILED (gmm-gselect[5.4.185~1-2fa7]:GaussianSelection():diag-gmm.cc:828) : '!output->back().empty()' 

[ Stack-Trace: ]
gmm-gselect() [0x5c52ba]
kaldi::MessageLogger::HandleMessage(kaldi::LogMessageEnvelope const&, char const*)
kaldi::MessageLogger::~MessageLogger()
kaldi::KaldiAssertFailure_(char const*, char const*, int, char const*)
kaldi::DiagGmm::GaussianSelection(kaldi::MatrixBase<float> const&, int, std::vector<std::vector<int, std::allocator<int> >, std::allocator<std::vector<int, std::allocator<int> > > >*) const
main
__libc_start_main
_start

WARNING (fgmm-global-gselect-to-post[5.4.185~1-2fa7]:main():fgmm-global-gselect-to-post.cc:84) No gselect information for utterance 03665_dvlsm
LOG (fgmm-global-gselect-to-post[5.4.185~1-2fa7]:main():fgmm-global-gselect-to-post.cc:148) Done 0 files; 1 had errors.
LOG (fgmm-global-gselect-to-post[5.4.185~1-2fa7]:main():fgmm-global-gselect-to-post.cc:149) Overall loglike per frame is -nan with -nan entries per frame,  over 0 frames
LOG (scale-post[5.4.185~1-2fa7]:main():scale-post.cc:79) Done 0 posteriors;  0 had no scales.
LOG (ivector-extract[5.4.185~1-2fa7]:ComputeDerivedVars():ivector-extractor.cc:204) Done.

.....(something running normally)

YihengJiang commented 5 years ago

Note: I have checked the file, and there is no problem after CMVN and VAD; the file I checked still has 26000+ frames remaining.

danpovey commented 5 years ago

That code hasn't been changed since 2013 so it's kind of surprising that there would be a problem now.
In diag-gmm.cc, can you insert at line 841: KALDI_ASSERT(loglikes.Sum() == loglikes.Sum()); which is a check for NaN?
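The x == x trick works because NaN is the only floating-point value that compares unequal to itself, and a NaN anywhere in the vector propagates through the sum. The same check in Python, just to illustrate the idea:

```python
def has_nan(values):
    # NaN propagates through addition, and NaN != NaN is the only
    # case where a float is unequal to itself, so this detects any
    # NaN in the input without scanning element by element.
    total = sum(values)
    return total != total

print(has_nan([1.0, 2.0, 3.0]))            # False: ordinary floats
print(has_nan([1.0, float("nan"), 3.0]))   # True: NaN poisons the sum
```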

YihengJiang commented 5 years ago

> That code hasn't been changed since 2013 so it's kind of surprising that there would be a problem now. At diag-gmm.cc can you introduce at line 841: KALDI_ASSERT(loglikes.Sum() == loglikes.Sum()); which is a check for NaN?

Thanks for your response. The NaN value is caused by only one file (the file that fails), which I have been using to test this error. In fact, there is no NaN when I run normally on my data set. The main problem is

ASSERTION_FAILED (gmm-gselect[5.4.185~1-2fa7]:GaussianSelection():diag-gmm.cc:828) : '!output->back().empty()'

I think maybe I can add some noise and then see what happens, but I am not sure it will work.
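For what it's worth, adding a tiny amount of noise is exactly what dithering does, and Kaldi's feature binaries expose it as the --dither option (on by default for MFCC extraction, as far as I know). A rough sketch of the idea only, not Kaldi's implementation:

```python
import random

def dither(samples, amount=1.0, seed=0):
    # Add tiny Gaussian noise so stretches of exact digital silence
    # (all-zero samples) no longer produce zero frame energies, which
    # would otherwise hit log(0) downstream. Fixed seed for repeatability.
    rng = random.Random(seed)
    return [s + amount * rng.gauss(0.0, 1.0) for s in samples]

silence = [0.0] * 8
dithered = dither(silence)
print(any(s != 0.0 for s in dithered))  # True: zeros have been perturbed
```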

danpovey commented 5 years ago

Try to figure out where the NaN creeps in, e.g. is it in the MFCC computation somehow? We might need to fix that somehow. But I would have thought it would already avoid taking the log of zero.
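One common way a NaN can creep into feature computation (my illustration of the general mechanism, not necessarily what happens inside Kaldi): the log of a zero energy gives -inf, and subtracting two -inf values, e.g. during mean normalization, gives NaN. Flooring before the log avoids both:

```python
import math

EPS = 1.1921e-07  # illustrative floor, roughly float32 machine epsilon

def safe_log(x, floor=EPS):
    # Floor the argument so log() never sees zero (or a tiny negative
    # value produced by rounding error) and always returns a finite float.
    return math.log(max(x, floor))

print(math.isfinite(safe_log(0.0)))    # True: floored to log(EPS), finite
print(float("-inf") - float("-inf"))   # nan: how an unfloored -inf becomes NaN
```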


kkm000 commented 5 years ago

@YihengJiang, just following up, did you happen to spot where the NaN was coming from? It would be helpful. Looks like we have an edge case somewhere.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

tryitforever commented 4 years ago

@kkm000 If I want to find where the NaN is coming from, do I just need to search for "NaN" or a similar string in the ark files?

WARNING (nnet3-train[5.5]:ReorthogonalizeRt1():natural-gradient-online.cc:248) Cholesky or Invert() failed while re-orthogonalizing R_t. Re-orthogonalizing on CPU.
ASSERTION_FAILED (nnet3-train[5.5]:HouseBackward():qr.cc:124) Assertion failed: (KALDI_ISFINITE(sigma) && "Tridiagonalizing matrix that is too large or has NaNs.")
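Regarding the question of grepping the archives: feature .ark files are binary by default, so a plain grep won't find the string. Two hedged options: dump to text first with something like `copy-feats ark:feats.ark ark,t:- | grep -i nan`, or load the matrices into Python (e.g. with a reader such as kaldi_io, named here as an assumption) and scan numerically. The scanning logic itself is simple:

```python
import math

def frames_with_nan(feats):
    # feats: one utterance as a list of frames, each frame a list of
    # floats. Returns indices of frames containing NaN or infinity.
    bad = []
    for i, frame in enumerate(feats):
        if any(not math.isfinite(v) for v in frame):
            bad.append(i)
    return bad

feats = [[0.1, 0.2], [float("nan"), 0.4], [0.5, float("inf")]]
print(frames_with_nan(feats))  # [1, 2]
```

Knowing which frames are bad narrows the search to a specific region of the recording.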

tryitforever commented 4 years ago

@kkm000 I hit this problem when training an x-vector DNN by running run_xvector_1a.sh

train.0.2.log

kkm000 commented 4 years ago

@tryitforever, a thing that stands out is the absent Git revision in the log. Usually it looks like `WARNING (nnet3-train[5.5~g1234567]:...`; you have `WARNING (nnet3-train[5.5]:...`. The `~g` part is missing. How did you build your Kaldi? What is the platform (Mac, or which Linux)? Please give the configure line (it's in a comment at the beginning of the generated `src/kaldi.mk`), and the output of `uname -a`.

I never trained an x-vector model myself, @danpovey, need your help. The log name is train.0.2, so it happened on the first iteration. A few Cholesky failures are thrown and caught in Cholesky() during training, but the part leading to the final rapid unplanned disassembly is

WARNING (nnet3-train[5.5]:ReorthogonalizeRt1():natural-gradient-online.cc:248) Cholesky or Invert() failed while re-orthogonalizing R_t. Re-orthogonalizing on CPU.
WARNING (nnet3-train[5.5]:ReorthogonalizeRt1():natural-gradient-online.cc:241) Cholesky out of expected range, reorthogonalizing with Gram-Schmidt
ERROR (nnet3-train[5.5]:Cholesky():tp-matrix.cc:110) Cholesky decomposition failed. Maybe matrix is not positive definite.

[ Stack-Trace: ]
/home/liumin/kaldi-master/src/lib/libkaldi-base.so(kaldi::MessageLogger::LogMessage() const+0x82c) [0x7f376da1d2ca]
nnet3-train(kaldi::MessageLogger::LogAndThrow::operator=(kaldi::MessageLogger const&)+0x21) [0x40884b]
/home/liumin/kaldi-master/src/lib/libkaldi-matrix.so(kaldi::TpMatrix<float>::Cholesky(kaldi::SpMatrix<float> const&)+0x1b1) [0x7f376dc83c73]
/home/liumin/kaldi-master/src/lib/libkaldi-nnet3.so(kaldi::nnet3::OnlineNaturalGradient::ReorthogonalizeRt1(kaldi::VectorBase<float> const&, float, kaldi::CuMatrixBase<float>*, kaldi::CuMatrixBase<float>*, kaldi::CuMatrixBase<float>*)+0x3a4) [0x7f376f113cc4]
/home/liumin/kaldi-master/src/lib/libkaldi-nnet3.so(kaldi::nnet3::OnlineNaturalGradient::PreconditionDirectionsInternal(float, float, bool, kaldi::Vector<float> const&, kaldi::CuMatrixBase<float>*, kaldi::CuMatrixBase<float>*)+0xfb2) [0x7f376f1151de]
/home/liumin/kaldi-master/src/lib/libkaldi-nnet3.so(kaldi::nnet3::OnlineNaturalGradient::PreconditionDirections(kaldi::CuMatrixBase<float>*, float*)+0x1c2) [0x7f376f115f34]
/home/liumin/kaldi-master/src/lib/libkaldi-nnet3.so(kaldi::nnet3::NaturalGradientAffineComponent::Update(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, kaldi::CuMatrixBase<float> const&, kaldi::CuMatrixBase<float> const&)+0x214) [0x7f376f0cf556]
/home/liumin/kaldi-master/src/lib/libkaldi-nnet3.so(kaldi::nnet3::AffineComponent::Backprop(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, kaldi::nnet3::ComponentPrecomputedIndexes const*, kaldi::CuMatrixBase<float> const&, kaldi::CuMatrixBase<float> const&, kaldi::CuMatrixBase<float> const&, void*, kaldi::nnet3::Component*, kaldi::CuMatrixBase<float>*) const+0xa2) [0x7f376f0ccc0c]
/home/liumin/kaldi-master/src/lib/libkaldi-nnet3.so(kaldi::nnet3::NnetComputer::ExecuteCommand()+0x87c) [0x7f376f16348a]
/home/liumin/kaldi-master/src/lib/libkaldi-nnet3.so(kaldi::nnet3::NnetComputer::Run()+0x18a) [0x7f376f1641fe]
/home/liumin/kaldi-master/src/lib/libkaldi-nnet3.so(kaldi::nnet3::NnetTrainer::TrainInternal(kaldi::nnet3::NnetExample const&, kaldi::nnet3::NnetComputation const&)+0x76) [0x7f376f18a80e]
/home/liumin/kaldi-master/src/lib/libkaldi-nnet3.so(kaldi::nnet3::NnetTrainer::Train(kaldi::nnet3::NnetExample const&)+0x17b) [0x7f376f18ac29]
nnet3-train(main+0x5f6) [0x407b4c]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7f376ceb8830]
nnet3-train(_start+0x29) [0x407489]

WARNING (nnet3-train[5.5]:ReorthogonalizeRt1():natural-gradient-online.cc:248) Cholesky or Invert() failed while re-orthogonalizing R_t. Re-orthogonalizing on CPU.
ASSERTION_FAILED (nnet3-train[5.5]:HouseBackward():qr.cc:124) Assertion failed: (KALDI_ISFINITE(sigma) && "Tridiagonalizing matrix that is too large or has NaNs.")

[ Stack-Trace: ]
/home/liumin/kaldi-master/src/lib/libkaldi-base.so(kaldi::MessageLogger::LogMessage() const+0x82c) [0x7f376da1d2ca]
/home/liumin/kaldi-master/src/lib/libkaldi-base.so(kaldi::KaldiAssertFailure_(char const*, char const*, int, char const*)+0x6c) [0x7f376da1dd38]
/home/liumin/kaldi-master/src/lib/libkaldi-matrix.so(void kaldi::HouseBackward<float>(int, float const*, float*, float*)+0x131) [0x7f376dc88487]
/home/liumin/kaldi-master/src/lib/libkaldi-matrix.so(kaldi::SpMatrix<float>::Tridiagonalize(kaldi::MatrixBase<float>*)+0x147) [0x7f376dc887db]
/home/liumin/kaldi-master/src/lib/libkaldi-matrix.so(kaldi::SpMatrix<float>::Eig(kaldi::VectorBase<float>*, kaldi::MatrixBase<float>*) const+0xa7) [0x7f376dc8a057]
/home/liumin/kaldi-master/src/lib/libkaldi-nnet3.so(kaldi::nnet3::OnlineNaturalGradient::PreconditionDirectionsInternal(float, float, bool, kaldi::Vector<float> const&, kaldi::CuMatrixBase<float>*, kaldi::CuMatrixBase<float>*)+0x9e7) [0x7f376f114c13]
/home/liumin/kaldi-master/src/lib/libkaldi-nnet3.so(kaldi::nnet3::OnlineNaturalGradient::PreconditionDirections(kaldi::CuMatrixBase<float>*, float*)+0x1c2) [0x7f376f115f34]
/home/liumin/kaldi-master/src/lib/libkaldi-nnet3.so(kaldi::nnet3::NaturalGradientAffineComponent::Update(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, kaldi::CuMatrixBase<float> const&, kaldi::CuMatrixBase<float> const&)+0x214) [0x7f376f0cf556]
. . . .
kkm000 commented 4 years ago

BTW, it's a different error in a different place than originally reported.

tryitforever commented 4 years ago

OK, thanks for your reply! @kkm000 @danpovey I work on Ubuntu; the output of `uname -a` is "Linux DEV 4.4.0-112-generic #135-Ubuntu". I attached src/kaldi.mk and my run shell script. shell.zip

I've been trying to run the Voxceleb egs. I downloaded the data sets from Oxford University's website and the pretrained model from http://kaldi-asr.org/. Now I'm retraining the DNN with this data set; the aim is to simplify the network model and improve recognition efficiency.

Now the problem is that there was an error when running train_raw_dnn.py at stage 8. I sent you the screenshot and attached the corresponding log files. I hope to get your advice. Thank you very much!

err train.0.1.log train.0.2.log train.0.3.log

tryitforever commented 4 years ago

Oh, the configure version is 11; I downloaded the zip from GitHub on May 20.

danpovey commented 4 years ago

Infinity was generated; this usually has to do with some kind of instability. It's probably something to do with the nnet topology, but that's hard to say without a lot more detail, and I don't have much time for that.


kkm000 commented 4 years ago

> I downloaded the zip from github on May 20.

Aha, that explains why there is not a Git revision number.

As for the numeric instability, I cannot help either. Look at the logs: you are hitting the max-change harder and harder on every iteration, until the scale factor reaches a whopping 10^{-19}. It is normal to hit this limit on the first iterations, but the value of 1/scale looks too large to me, and it grows steadily. This indicates fast divergence. Try the usual tricks, such as reducing the LR and then playing with the momentum. Relu-batchnorm layer spaces should not have cliffs or ravines, but I have no intuition for this 3-layer contraption of hyperembedding, statpooling and then projecting back to the same dimensionality. Try running the first iteration with 1/5 the LR, with the momentum either 0 or doubled.
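For readers unfamiliar with max-change: Kaldi scales an update down whenever its norm exceeds the component's max-change limit, so a scale factor near 10^{-19} means the raw update was astronomically large. A simplified sketch of that clipping rule (my paraphrase of the idea, not the actual nnet3 code; `max_change=0.75` is just an illustrative default):

```python
import math

def max_change_scale(update, max_change=0.75):
    # Scale factor applied to a parameter update so that its 2-norm
    # does not exceed max_change; 1.0 means the update passes through
    # unchanged. A tiny returned scale signals a huge raw update.
    norm = math.sqrt(sum(u * u for u in update))
    if norm <= max_change:
        return 1.0
    return max_change / norm

print(max_change_scale([0.1, 0.2]))   # 1.0: small update is left alone
print(max_change_scale([3.0, 4.0]))   # 0.15: norm 5.0 clipped down to 0.75
```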

But you may have faulty features, too.

Oh, one more thing: I've seen 3 GPUs in one of the logs. If you have fewer than 8 GPUs in this machine, change --use-gpu from 'true' to 'wait', or training will fail later when you reach 4 jobs.

I suggest you post a question to the kaldi-help group if you don't make progress. Someone who has experience with this network/model may give you much better advice.

I am closing this issue for now. If you believe that your issue has not been addressed, please feel free to ping me, and I'll reopen it. @-mention me for a faster response!

tryitforever commented 4 years ago

Yes, I have only 3 GPUs. Do I need to change the number of nj? @kkm000

kkm000 commented 4 years ago

I told you what to change: --use-gpu from 'true' to 'wait'. After you solve your current problem and start ramping up from 3 jobs, the 4th job will either abort for lack of an available GPU (with 'true') or wait until a GPU becomes available (with 'wait').

Please use the kaldi-help list. You'll really get more help there from people who have run this recipe or ran into a similar problem. Instructions are here. If you cannot subscribe on the web, send an empty e-mail with subject line 'Kaldi' to kaldi-help+subscribe@googlegroups.com, then reply to the automatically sent e-mail you'll get in a couple of minutes.