Closed: lawlict closed this issue 8 months ago
I think you are referring to a paper from Google. I was confused by that statement in the paper too. There are two choices: either \hat{W} carries gradient, in which case the MWER expression reduces to exactly zero (the path probabilities sum to one, so the expression is identically zero and so is its gradient); or \hat{W} is detached from the gradient, in which case subtracting it has no effect on the gradient of the resulting expression. So either way the subtraction should make no difference. If you compute the gradients "manually", for example on a lattice or on a collection of n-best paths (e.g. see my thesis), a "minus average of word error" term does actually appear in the computation; but if you are using some kind of autograd framework this is implicit and you don't have to do it yourself.
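To make this concrete, here is a small stdlib-only sketch (not k2's actual implementation) of an MWER-style objective over n-best paths: path probabilities come from a softmax over hypothetical path scores, and the gradient is checked by central finite differences. It shows both cases above: when \hat{W} is recomputed from the current probabilities (gradient flows through it), the loss is identically zero; when \hat{W} is held fixed (the "detached" case), the gradient equals that of the plain expected WER, so the subtraction changes nothing.

```python
import math

def softmax(scores):
    """Normalize path scores into probabilities."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def expected_wer(scores, wers, baseline=0.0):
    """Sum_i p_i * (W_i - baseline): the MWER-style objective."""
    p = softmax(scores)
    return sum(pi * (wi - baseline) for pi, wi in zip(p, wers))

def grad_fd(f, x, eps=1e-5):
    """Central finite-difference gradient of f at x."""
    g = []
    for i in range(len(x)):
        xp, xm = x[:], x[:]
        xp[i] += eps
        xm[i] -= eps
        g.append((f(xp) - f(xm)) / (2 * eps))
    return g

# Hypothetical path scores and per-path word errors for illustration.
scores = [1.2, -0.3, 0.5, 0.0]
wers = [2.0, 5.0, 1.0, 3.0]

# Case 1: baseline recomputed inside the loss, i.e. gradient flows
# through \hat{W}. Since the probabilities sum to one, the loss is
# identically zero, and so is its gradient.
def loss_with_grad_through_baseline(s):
    p = softmax(s)
    w_bar = sum(pi * wi for pi, wi in zip(p, wers))
    return sum(pi * (wi - w_bar) for pi, wi in zip(p, wers))

g_flow = grad_fd(loss_with_grad_through_baseline, scores)

# Case 2: baseline fixed at the current point ("detached"). Its
# gradient matches the plain expected WER with no baseline at all.
w_bar0 = expected_wer(scores, wers)
g_detached = grad_fd(lambda s: expected_wer(s, wers, baseline=w_bar0), scores)
g_plain = grad_fd(lambda s: expected_wer(s, wers), scores)

print(max(abs(g) for g in g_flow))                               # ~0
print(max(abs(a - b) for a, b in zip(g_detached, g_plain)))      # ~0
```

So under autograd, subtracting a detached average WER is a no-op for the gradient, which is consistent with k2's `mwer_loss` omitting it.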
@danpovey I get it. Thanks for your answer and the great work!
Hi, I'm curious why mwer_loss in k2 doesn't subtract the average WER, which the paper emphasizes.
The code is here: https://github.com/k2-fsa/k2/blob/master/k2/python/k2/mwer_loss.py#L117
So does it really matter to subtract the average WER? (I guess not?)
Looking forward to your kind response.