srvk / eesen

The official repository of the Eesen project
http://arxiv.org/abs/1507.08240
Apache License 2.0

WER vs. Token Accuracy #123

Closed branislavpopovic closed 7 years ago

branislavpopovic commented 7 years ago

Hello. We have obtained more than 90% (token) accuracy on our validation set. If I understood correctly, tokens are basically phonemes, plus disambiguation symbols and blanks (please correct me if I'm wrong)? On the other hand, when we try to decode, we get a WER of more than 15%. If we try to increase our language model weight, we get more than 50%. We are using a dictionary with more than 120,000 words and a trigram language model (for the Serbian language), with the default parameters. With such high token accuracy, we were expecting a much better WER (on some other systems we have obtained higher word recognition accuracy even with a phoneme error rate of less than 60%). Can you please tell us how that is possible?

Thank you in advance.

riebling commented 7 years ago

Those statistics seem pretty good for Eesen. I just checked the training token accuracy for the tedlium experiment: it went as high as 99.4%, yet the best WER achievable was 15.8% on the test set (with an LM weight of 0.7, but that varies with language and experiment, and usually a range of LM weights is tried).
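
Roughly speaking, the decoder picks the word sequence with the best weighted combination of the acoustic and language model scores, something like

    score(W) = ac_weight * log p_ac(X | W) + lm_weight * log P_lm(W)

so (up to pruning and insertion penalties) only the ratio of the two weights really matters, and the best ratio has to be found empirically. In Eesen the acoustic scores are prior-normalized CTC posteriors rather than conventional likelihoods, which is presumably why the useful range of weights looks so different from a hybrid Kaldi setup.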

branislavpopovic commented 7 years ago

Yes, I know. But I am still confused by such a large difference between WER and token accuracy. I am aware of the fact that we cannot compare them directly, but we usually get a much better WER than PER, not the opposite. 0.7 was the best choice for Eesen (ACWT 0.7, LMWT 1.0; any other value for LMWT gives us worse results). We had somewhat better accuracy in Kaldi with an ACWT of about 0.08 and an LMWT between 10 and 15.

riebling commented 7 years ago

I'll wait on a more expert explanation, curious now, as well!

fmetze commented 7 years ago

Did you check the beams? How fast are the respective systems? Does the Eesen system maybe prune away too many alternatives?

Another thing to try is the blank scale. You can scale down the blank symbol a bit, which gives you more phones (compared to blank). This can sometimes improve results if you have a high deletion rate. Not sure if this is the case here?

F.


branislavpopovic commented 7 years ago

Actually, we have too many substitutions.

beam=17.0 lattice_beam=8.0 max_active=5000

lstm_layer_num=4, lstm_cell_dim=320
TRAIN ACCURACY 92.8732%, VALID ACCURACY 89.2381%
%WER 16.96 [ 26911 / 158653, 2300 ins, 4759 del, 19852 sub ]

lstm_layer_num=4, lstm_cell_dim=1024
TRAIN ACCURACY 94.1744%, VALID ACCURACY 89.4255%
%WER 17.56 [ 27853 / 158653, 2493 ins, 4667 del, 20693 sub ]
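
(For reference, these %WER figures are just the error counts over the number of reference words, e.g. for the first system:

    %WER = (ins + del + sub) / N_ref = (2300 + 4759 + 19852) / 158653 = 26911 / 158653 ≈ 16.96%

so roughly three quarters of our errors are substitutions.)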

Decoding time is about the same as in Kaldi. Most of those substitutions should be resolved by increasing the language model weight, but whenever we tried to increase that value (the default value is 1), we obtained a higher error rate.

How can we adjust the blank scale?

branislavpopovic commented 7 years ago

These are the results for beam=30, lattice_beam=15, lstm_layer_num=4, lstm_cell_dim=320:

%WER 16.41 [ 26041 / 158653, 2532 ins, 4587 del, 18922 sub ]

There was no significant improvement.

riebling commented 7 years ago

How can we adjust the blank scale?

This might be from code that hasn't yet been merged into the base branch of Eesen. As a preview, to do this in a hard-coded way, modify class-prior.cc around line 57, adding a line so that it looks like this:

  tmp_priors(0) *= blank_scale;  // where blank_scale defaults to 1.0
  double sum = tmp_priors.Sum();
  tmp_priors.Scale(1.0 / sum);

Or, of course, you could do it in a more elegant(?!) Kaldi-esque way by adding a variable to class-prior.h and setting it as an argument to net-output-extract in decode-ctc-lat.sh; this will be available in the next version of Eesen.
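
For illustration only, here is a small, self-contained sketch of that prior-scaling idea, with the scale taken from a command-line flag. The flag name, variable names, and toy prior values are made up for this example and are not the actual Eesen API:

    // Standalone toy illustration of blank-prior scaling (not Eesen code).
    // Index 0 is assumed to be the <blk> token, as in the snippet above.
    #include <cstdlib>
    #include <cstring>
    #include <iostream>
    #include <numeric>
    #include <vector>

    int main(int argc, char** argv) {
      double blank_scale = 1.0;  // default: leave the priors untouched
      for (int i = 1; i < argc; ++i) {
        if (std::strncmp(argv[i], "--blank-scale=", 14) == 0)
          blank_scale = std::atof(argv[i] + 14);
      }

      // Made-up priors: blank first, then three phone tokens.
      std::vector<double> priors = {0.70, 0.10, 0.12, 0.08};

      priors[0] *= blank_scale;  // same operation as tmp_priors(0) *= blank_scale;
      double sum = std::accumulate(priors.begin(), priors.end(), 0.0);
      for (double& p : priors) p /= sum;  // renormalize so the priors sum to 1

      for (double p : priors) std::cout << p << " ";
      std::cout << std::endl;
      return 0;
    }

It does exactly what the hard-coded snippet above does; the only difference is that the value comes in as an option instead of being fixed in the source.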

riebling commented 7 years ago

(And in case you didn't notice: the linked class-prior.h file actually contains the modifications necessary for the more elegant way, so that you can specify the blank scale on the command line from a script.)

branislavpopovic commented 7 years ago

Thank you. I will try that as soon as I can.

vince62s commented 7 years ago

Hi, what LM are you using? Is this the heavily pruned trigram LM that fits in the FST graph?

Also, just a quick question regarding the tedlium results. It seems v1 is based on Tedlium release 1 (110 hours of audio) and v2-30ms is based on release 2 (210 hours, even though the script still says 110 hours). Can you confirm? Did you try release 2 with both 10ms and 30ms frames?

Also, did you guys try some rescoring with a bigger LM?

Many questions, sorry. Vincent

fmetze commented 7 years ago

I think (this is from memory) v2-30ms is 200h+ and 30ms, yes. We decoded with the regular LM, which worked fine. We also compiled the large Cantab LM into a search graph and were able to decode directly with that, and saw small gains, but nothing too dramatic; it was more of a proof of concept (with CI phones, the search graph stays small, even with a large LM). If I remember correctly, there was a small advantage in going to 30ms training vs 10ms, yes. Let me know if you want me to dig up the old experiments.

branislavpopovic commented 7 years ago

Here are the results for different blank scale values.

lstm_layer_num=4, lstm_cell_dim=320
blank_scale=1 wer_7: %WER 16.96 [ 26911 / 158653, 2300 ins, 4759 del, 19852 sub ]
blank_scale=0.75 wer_7: %WER 16.79 [ 26640 / 158653, 2085 ins, 5061 del, 19494 sub ]
blank_scale=0.5 wer_8: %WER 16.65 [ 26417 / 158653, 2161 ins, 4905 del, 19351 sub ]
blank_scale=0.25 wer_8: %WER 16.77 [ 26613 / 158653, 1807 ins, 5708 del, 19098 sub ]
blank_scale=0.1 wer_9: %WER 17.75 [ 28160 / 158653, 1907 ins, 6405 del, 19848 sub ]

As you can see, the number of insertions improved a little bit, but we also had an increased number of deleted words.

About the LM: Kneser-Ney, pruning threshold 0.0000001; ngram 1=121197, ngram 2=1279389, ngram 3=357721

vince62s commented 7 years ago

@branislavpopovic Are you on tedlium1 or tedlium2 (i.e. 100 hours or 200+ hours)? Also, your LM is very small; you may need at least 4M n-grams to get better results.

@fmetze The results file in v2-30 is not updated then, right? Does anyone have the actual results for this run? Or maybe it was only run at the phone level?

branislavpopovic commented 7 years ago

@vince62s Those results are not for Tedlium, but for the Serbian language database (mobile speech + books + journalistic corpus). We have tried using some other language models with or without pruning, and that one was the best.

vince62s commented 7 years ago

Sorry, I misread the beginning of the thread.

branislavpopovic commented 7 years ago

Serbian is a highly inflected language, and our results are a consequence of a large number of substitutions. In the meantime, we have calculated the character error rate, and it was about 1.75% (short words, and also some longer words that differ in no more than one or two characters), which answers our question, so I will close this issue. Blank scale also helps. Thank you.
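
To make the WER/CER gap concrete, here is a small, self-contained illustration (not Eesen code; the Serbian sentence pair is made up): a single wrong inflectional ending counts as a whole-word substitution but only as one or two character errors.

    // Standalone illustration: the same Levenshtein distance computed over
    // words and over characters. Two words differ only in their endings.
    #include <algorithm>
    #include <iostream>
    #include <sstream>
    #include <string>
    #include <vector>

    // Edit distance over an arbitrary token sequence.
    template <typename T>
    int EditDistance(const std::vector<T>& ref, const std::vector<T>& hyp) {
      std::vector<std::vector<int>> d(ref.size() + 1,
                                      std::vector<int>(hyp.size() + 1, 0));
      for (size_t i = 0; i <= ref.size(); ++i) d[i][0] = static_cast<int>(i);
      for (size_t j = 0; j <= hyp.size(); ++j) d[0][j] = static_cast<int>(j);
      for (size_t i = 1; i <= ref.size(); ++i)
        for (size_t j = 1; j <= hyp.size(); ++j)
          d[i][j] = std::min({d[i - 1][j] + 1,  // deletion
                              d[i][j - 1] + 1,  // insertion
                              d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1] ? 1 : 0)});
      return d[ref.size()][hyp.size()];
    }

    std::vector<std::string> SplitWords(const std::string& s) {
      std::istringstream iss(s);
      std::vector<std::string> words;
      std::string w;
      while (iss >> w) words.push_back(w);
      return words;
    }

    int main() {
      std::string ref = "idemo u grad sutra ujutru";   // made-up reference
      std::string hyp = "idemo u gradu sutra ujutro";  // made-up hypothesis

      std::vector<std::string> rw = SplitWords(ref), hw = SplitWords(hyp);
      std::vector<char> rc(ref.begin(), ref.end()), hc(hyp.begin(), hyp.end());

      // Word level: 2 substitutions out of 5 words -> 40% WER.
      std::cout << "word errors: " << EditDistance(rw, hw)
                << " / " << rw.size() << std::endl;
      // Character level: 2 edits out of 25 characters -> 8% CER.
      std::cout << "char errors: " << EditDistance(rc, hc)
                << " / " << rc.size() << std::endl;
      return 0;
    }

Two wrong endings in a five-word sentence already give 40% WER but only about 8% CER; with mostly correct words and the occasional wrong ending, a 1.75% CER alongside a 16-17% WER is entirely plausible.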