efosler opened this issue 6 years ago
I'm mulling over different solutions for this. Input welcome, particularly from @ramonsanabria and/or @fmetze
Thanks all, I have been following, but did not actively participate.
I don't recall all the names of the flags or how things are implemented (I was mostly working on my own branch, different from Ramon's, and ended up using a mix of the two), but I would really appreciate a cleaned-up and tested version of this, and I'll be happy to help with that going forward.
Everything that's been said here and in #194 is true: we need priors during decoding, they are generated from phone counts (as discussed in the original Eesen paper), and they should be written to a file during training. Sub-sampling and stacking should be independent of each other, but often one would use "3" for both of them. Sub-sampling is good because it makes training faster and improves quality. We observed slight improvements by creating three copies of the data, with offsets 0, 1, and 2; this is what the on-the-fly augmentation does.
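For concreteness, here is a minimal numpy sketch of that idea (not the Eesen code; `feats`, `n`, and `offset` are illustrative names): stacking with factor 3 plus sub-sampling at offsets 0, 1, and 2 yields the three copies described above.

```python
import numpy as np

def stack_and_subsample(feats, n=3, offset=0):
    """Stack n consecutive frames, then keep every n-th stacked frame."""
    # Pad at the end so the last frames still have n-1 successors to stack.
    padded = np.pad(feats, ((0, n - 1), (0, 0)), mode="edge")
    # Concatenate n consecutive frames into one wider frame.
    stacked = np.concatenate(
        [padded[i:i + feats.shape[0]] for i in range(n)], axis=1)
    # Keep every n-th stacked frame, starting at the given offset.
    return stacked[offset::n]

feats = np.random.randn(100, 40)  # e.g. 100 frames of 40-dim filterbanks
copies = [stack_and_subsample(feats, n=3, offset=k) for k in range(3)]
# Three sub-sampled views of the same utterance, each roughly 1/3 the length.
```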
I think we can get rid of --roll for simplicity (it did not do much either way), but it seems that the sub-sampling is not handled correctly in the main branch tf_test? Bummer. If sub-sampling does not happen, the LSTM will be all off, and the results can be really bad. You can easily find out what is happening by looking at the lengths of the matrices being handled. I think we want to allow sub-sampling, and for now it may be enough to simply always decode with sub-sampling "N" and offset "0".
We did tests combining multiple decodings with sub-sampling and different offset factors using ROVER, and these gave good gains. We then tested score averaging (i.e., averaging the output of the CTC acoustic models before sticking it into the search; we tried logits and posteriors, I think, without significant difference), which gave somewhat smaller gains, but we did still gain a little bit. Agreed that intuitively it is not clear why it should work, but we have the same acoustic model applied to features that are essentially also the same, so averaging does something useful, and maybe it helps to smooth the scores a little bit, which may help with beam search during decoding.
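The score-averaging variant might look like the following hedged sketch: run the same model on each offset copy and average the per-frame outputs before handing them to the search (the softmax helper and the stream shapes are assumptions; the actual Eesen code differs).

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def average_streams(streams, space="logits"):
    """Average a list of (T, num_labels) output streams frame-wise."""
    # Offset copies can differ by a frame in length; truncate to the shortest.
    t = min(s.shape[0] for s in streams)
    if space == "posteriors":
        streams = [softmax(s) for s in streams]
    return np.mean([s[:t] for s in streams], axis=0)
```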
I hope this is helpful? Let me know what you find - I should be able to help more going forward.
Florian
On Aug 19, 2018, at 7:02 PM, Eric Fosler-Lussier notifications@github.com wrote:

> I'm mulling over different solutions for this. Input welcome, particularly from @ramonsanabria and/or @fmetze.
>
> - The simplest solution is to print out a warning in tf_test if subsampling > 1 (--roll should also be discouraged). The current arguments do support stacking but not subsampling, but can get you into trouble if you don't know what you're doing (as I didn't).
> - More extreme is not allowing subsampling > 1 in test.
> - Most involved would be to create an explicit combination scheme argument, which defaults to "use first" (which, in combination with --roll, would give you a random shift), but could also access the "average" scheme.
>
> It's just not clear to me that averaging makes sense under CTC (unlike regular frame-based systems), so whether the averaging technique should even be preserved is not clear.
Last night's run, using only the SWB grammar and giving the forward pass --subsampled_utt 1 (which basically selects the first copy it sees), resulted in 20.0 WER, which is the best I've seen on this pipeline. So turning off averaging really did help.
@fmetze were you suggesting that there was another method you used for averaging that did work (outside the code base) or that you used this particular code and got reasonable results? What is in the code looked reasonable to me as I worked through it.
Given that there have been widely varying experiences with subsampling combination, I think the best thing is to do the right thing: add another flag to the decode portion and make averaging an option (but not on by default). I can code that today; hoping to put this bit to rest, clean up the code a bit, and then submit a pull request for the baseline recipe. If I'm feeling feisty I might even add comments so that the next sojourner has some signposts... :-)
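Something like the following argparse sketch would do it (the flag name and choices are hypothetical, not the actual Eesen arguments):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--combine", choices=["first", "average"], default="first",
    help="how to combine the sub-sampled streams at decode time: "
         "take the first copy (default) or average the outputs")
```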
Side note: I just realized some of my questions (e.g. the role of nnet.py) arose because I had broken stuff out of decode_ctc_lat_tf.sh and put it directly in run_ctc_phn.sh when I started this, and forgot that I had done so.
Sorry for disappearing from this thread. We had (still have) an evaluation going on.
Regarding averaging vs. taking one frame: in CharRNN decoding we experimented with exactly the same thing, and taking one frame works better than averaging. However, @fmetze is right in the sense that we found that performing ROVER with different decoding strategies helped (I don't remember on which dataset or experiments). But yes, @efosler is right: I think the best way to go is having average as a flag rather than the default.
We can clearly remove --roll; it's something that never worked for me.
Thank you again for this, Eric. This is great!
Augmented streams (which have stacked frames shifted) create n copies of the input. At test time the logit streams are averaged together. However, this is buggy under CTC, as the blank label can dominate the other labels in the averaged stream. Documented more fully under #193.
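A tiny numeric illustration (assumed values) of the failure mode: the blank is large at almost every frame, while the non-blank spikes land on different frames in each shifted stream, so averaging washes the spikes out.

```python
import numpy as np

# Two shifted posterior streams over the labels [blank, "a"]:
s0 = np.array([[0.9, 0.1], [0.2, 0.8], [0.9, 0.1]])  # "a" spikes at frame 1
s1 = np.array([[0.9, 0.1], [0.9, 0.1], [0.2, 0.8]])  # "a" spikes at frame 2
# Each stream on its own decodes "a"; the average decodes only blank.
avg = (s0 + s1) / 2
print(avg.argmax(axis=1))  # [0 0 0] -> the blank dominates, "a" is lost
```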
Proposed fix: change the test code to not create shifted copies after stacking. I have also proposed changes to training to allow dumping of the logit stream during the cv pass; it will output the first encountered stream instead of averaging. Discussion welcome.
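A minimal sketch of that dump (the file layout and names are assumptions, not the actual Eesen dump format): write the first encountered stream for each utterance rather than the average.

```python
import os
import numpy as np

def dump_logits(utt_id, logit_streams, out_dir):
    """Write the first encountered logit stream for one utterance."""
    os.makedirs(out_dir, exist_ok=True)
    # "Use first": take the first sub-sampled copy rather than averaging.
    np.save(os.path.join(out_dir, f"{utt_id}.npy"), logit_streams[0])
```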