srvk / eesen

The official repository of the Eesen project
http://arxiv.org/abs/1507.08240
Apache License 2.0

Token Accuracy Drops Obj(log[Pzx])=nan #87

Open behnamasefi opened 8 years ago

behnamasefi commented 8 years ago

Hi all,

We have an issue when training on more than 9000 hours of our speech data. We use the train_ctc_parallel.sh recipe with num_seq=10 and frame_num_limit=12500. However, after about 183 hours of data have been processed, the token accuracy drops drastically and Obj(log[Pzx]) = nan.

Does anyone have any pointers to solve this? Thanks

    VLOG1 After 201933 sequences (163.131Hr): Obj(log[Pzx]) = -41.6014 TokenAcc = 59.2475%
    VLOG1 After 202939 sequences (163.96Hr): Obj(log[Pzx]) = -42.6817 TokenAcc = 58.5093%
    VLOG1 After 203946 sequences (164.793Hr): Obj(log[Pzx]) = -42.561 TokenAcc = 59.4519%
    VLOG1 After 204951 sequences (165.623Hr): Obj(log[Pzx]) = -43.8504 TokenAcc = 57.6078%
    VLOG1 After 205951 sequences (166.449Hr): Obj(log[Pzx]) = -1e+27 TokenAcc = 58.6802%
    VLOG1 After 206958 sequences (167.246Hr): Obj(log[Pzx]) = -40.0838 TokenAcc = 59.9717%
    VLOG1 After 207965 sequences (168.075Hr): Obj(log[Pzx]) = -41.7963 TokenAcc = 60.6595%
    VLOG1 After 208967 sequences (168.907Hr): Obj(log[Pzx]) = -43.781 TokenAcc = 58.9324%
    VLOG1 After 209976 sequences (169.669Hr): Obj(log[Pzx]) = -37.7623 TokenAcc = 60.3672%
    VLOG1 After 210979 sequences (170.439Hr): Obj(log[Pzx]) = -39.8231 TokenAcc = 58.9923%
    VLOG1 After 211979 sequences (171.212Hr): Obj(log[Pzx]) = -1e+27 TokenAcc = 58.5249%
    VLOG1 After 212985 sequences (172.007Hr): Obj(log[Pzx]) = -39.8087 TokenAcc = 61.0912%
    VLOG1 After 213992 sequences (172.773Hr): Obj(log[Pzx]) = -39.9797 TokenAcc = 59.2739%
    VLOG1 After 214996 sequences (173.55Hr): Obj(log[Pzx]) = -9.96016e+26 TokenAcc = 59.5201%
    VLOG1 After 216005 sequences (174.351Hr): Obj(log[Pzx]) = -38.7279 TokenAcc = 61.0642%
    VLOG1 After 217009 sequences (175.156Hr): Obj(log[Pzx]) = -40.5695 TokenAcc = 60.692%
    VLOG1 After 218017 sequences (175.937Hr): Obj(log[Pzx]) = -38.693 TokenAcc = 59.3732%
    VLOG1 After 219017 sequences (176.733Hr): Obj(log[Pzx]) = -40.4078 TokenAcc = 60.4888%
    VLOG1 After 220017 sequences (177.531Hr): Obj(log[Pzx]) = -40.1273 TokenAcc = 60.5241%
    VLOG1 After 221017 sequences (178.363Hr): Obj(log[Pzx]) = -43.3316 TokenAcc = 59.4678%
    VLOG1 After 222020 sequences (179.173Hr): Obj(log[Pzx]) = -42.5057 TokenAcc = 59.6424%
    VLOG1 After 223023 sequences (180.003Hr): Obj(log[Pzx]) = -40.4493 TokenAcc = 60.753%
    VLOG1 After 224025 sequences (180.791Hr): Obj(log[Pzx]) = -40.4711 TokenAcc = 59.7552%
    VLOG1 After 225032 sequences (181.603Hr): Obj(log[Pzx]) = -39.235 TokenAcc = 60.5423%
    VLOG1 After 226035 sequences (182.43Hr): Obj(log[Pzx]) = -42.8734 TokenAcc = 59.7315%
    VLOG1 After 227041 sequences (183.209Hr): Obj(log[Pzx]) = -38.7436 TokenAcc = 60.1154%
    VLOG1 After 228042 sequences (183.989Hr): Obj(log[Pzx]) = nan TokenAcc = 36.4706%
    VLOG1 After 229048 sequences (184.82Hr): Obj(log[Pzx]) = nan TokenAcc = 2.26617%
    VLOG1 After 230053 sequences (185.631Hr): Obj(log[Pzx]) = nan TokenAcc = 2.17376%
    VLOG1 After 231054 sequences (186.464Hr): Obj(log[Pzx]) = nan TokenAcc = 2.23086%
    VLOG1 After 232055 sequences (187.262Hr): Obj(log[Pzx]) = nan TokenAcc = 2.24668%
    VLOG1 After 233062 sequences (188.122Hr): Obj(log[Pzx]) = nan TokenAcc = 2.20169%
    VLOG1 After 234063 sequences (188.93Hr): Obj(log[Pzx]) = nan TokenAcc = 2.36458%

fmetze commented 8 years ago

Hi, you have abnormal Objs even before the failure (-1e+27, …), which indicates exploding gradients or similar problems. Are you using projection layers? They are more sensitive. If not, you could try reducing other parameters, such as the clipping values that are written into the definition of the neural network. If it is always the same utterances causing the problem, removing those utterances can help (but it is not really a solution). Hope this helps. Best, F.
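
A minimal sketch of how one might locate those clipping values, assuming the network definition is a text proto file under the experiment directory (the path below is an assumption, not an Eesen-specific guarantee; adjust it to your setup):

    # Sketch: look for clipping-related settings in the generated network definition.
    # The path is an assumption -- substitute your own experiment directory.
    grep -in "clip" exp/train_ctc/nnet.proto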


behnamasefi commented 8 years ago

Thanks for your response. We don't use projection layers in the NNet architecture. However, I will try reducing the clipping values and train again. Meanwhile, how can we know which utterance (or utterances) is being processed at training time? Are there any logs for this?

Best,

Behnam.

yajiemiao commented 8 years ago

If it still fails, try reducing --frame-num-limit, as it affects the deltas of the parameters.
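
For reference, a hedged sketch of what a lower limit might look like when invoking the parallel training script; the --frame-num-limit option is the one discussed in this thread, while the other option names and the paths are assumptions taken from a typical recipe:

    # Sketch: re-run CTC training with a reduced per-minibatch frame limit.
    # Paths and the exact option set are assumptions -- follow your own recipe.
    steps/train_ctc_parallel.sh --num-sequence 10 --frame-num-limit 8000 \
      data/train_tr data/train_cv exp/train_ctc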

behnamasefi commented 8 years ago

Thanks for your comments. I started a new model training with smaller parameter values and everything seems OK.

double22a commented 8 years ago

@behnamasefi Does "smaller parameter values" mean frame_num_limit?

zhangjiulong commented 8 years ago

@behnamasefi I ran into the same problem, but reducing the --frame-num-limit value did not solve it.

fmetze commented 8 years ago

There is no easy way to log the utterance id at training time, but the utterances are processed in the order in which they appear in the feature file, so you can remove lines 204951 to 205951 and see if the problem goes away. It is not a great solution, but we’ve used it in the past to find utterances where the transcriptions were completely off (for example, something that had more symbols than frames and thus could not be aligned).
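
A minimal sketch of that workaround, assuming the training order is given by the sorted list the recipe writes out (the exp/train_ctc/train.scp path is an assumption; the line range is the one quoted above):

    # Sketch: drop the suspect range of entries from the training list, keep them
    # aside for inspection, and re-run. The path is an assumption.
    cp exp/train_ctc/train.scp exp/train_ctc/train.scp.bak
    sed -n '204951,205951p' exp/train_ctc/train.scp.bak > suspect_utts.scp
    sed '204951,205951d' exp/train_ctc/train.scp.bak > exp/train_ctc/train.scp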


bmilde commented 7 years ago

> It is not a great solution, but we’ve used it in the past to find utterances where the transcriptions were completely off (for example, something that had more symbols than frames and thus could not be aligned).

Isn't checking "symbols count (+blanks) > frames count" something that could easily be automated during training? If there is no such check in the code yet, I could look into adding one.

I'm experiencing similar problems with a 1000h German corpus, though there doesn't always seem to be a definitive pattern to it: sometimes training runs through with the occasional Obj(log[Pzx]) = -1e+30 and never hits nan in the end.

fmetze commented 7 years ago

Yes, we are using the following check in a variant of train_ctc_parallel.sh. It is quite simple.

    if $sort_by_len; then
      td=$(mktemp -d)
      # frame count per training utterance
      feat-to-len scp:$data_tr/feats.scp ark,t:- | awk '{print $2}' > $td/len.tmp || exit 1;
      # paste frame count, feats.scp entry and label sequence; sort by length; keep only
      # utterances whose frame count ($1) exceeds 3x the number of fields on the line
      # (a conservative frames-vs-labels check), skipping duplicate label sequences
      gzip -cd $dir/labels.tr.gz | paste -d" " $td/len.tmp $data_tr/feats.scp - | sort -gk 1 | \
        awk '{out=""; for (i=5;i<=NF;i++) {out=out" "$i}; if (!(out in done) && $1 > 3*NF) {done[out]=1; print $2 " " $3}}' > $dir/train.scp
      rm -rf $td
      # cross-validation list: just sort by frame count
      feat-to-len scp:$data_cv/feats.scp ark,t:- | awk '{print $2}' | \
        paste -d " " $data_cv/feats.scp - | sort -k3 -n - | awk '{print $1 " " $2}' > $dir/cv.scp || exit 1;
    else

Karel Vesely made some changes to improve robustness (in https://github.com/vesis84/eesen). I have not yet had the time to look into these changes, but they might help resolve your problems as well. Do you want to look into this, maybe? We also recently checked some code into the master branch that allows you to see the alignment you get (on training/validation data) - maybe this will also help diagnose such problems?

bmilde commented 7 years ago

Thanks! Ultimately I've added this to train-ctc-parallel.cc, after the check for too-long sequences ("has too many frames; ignoring:"):

        // Check that we have enough frames to align them to the targets
        // (CTC interleaves blanks, so the augmented label sequence has 2*targets+1 symbols)
        if (mat.NumRows() <= targets.size()*2+1) {
          KALDI_WARN << utt << ", has too few frames to align to its targets; ignoring: no. frames "
                     << mat.NumRows() << " <= 2 * no. targets + 1 = " << targets.size()*2+1;
          continue;
        }

I guess it would be better to have this there, as it makes the training less error-prone (and offending sequences then also end up in the log).

Edit: I corrected the code above. Apparently, blank labels are added later, when the gradients are computed, and these were unaccounted for in my check (in ctc-loss.cc, Eval or EvalParallel). Also, to be on the safe side, I'm now filtering out the edge case where both sequences have the same length. I hope the check is correct now!
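
For large corpora it can also be useful to run the same kind of check offline, before training starts; a sketch using the feats.scp/labels.tr.gz layout quoted earlier in this thread (all paths are assumptions):

    # Sketch: list utterances with too few frames for a CTC alignment with blanks,
    # i.e. frames <= 2 * num_labels + 1. Paths are assumptions.
    feat-to-len scp:data/train/feats.scp ark,t:- | sort > frame_counts.txt
    gzip -cd exp/train_ctc/labels.tr.gz | awk '{print $1, NF-1}' | sort > label_counts.txt
    join frame_counts.txt label_counts.txt | awk '$2 <= 2*$3+1 {print $1}' > too_short_utts.txt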

ericbolo commented 7 years ago

This is related to the accuracy drop. I was getting nan values and abnormally large values for Obj(log[Pzx]) while training a model on Tedlium (v1).

Reducing the number of layers to 4 and the frame limit to 10000 fixed the initial instability and yielded decent results after 18 epochs: WER = 23% on the test set.

Maybe this is useful to someone.