k2-fsa / k2

FSA/FST algorithms, differentiable, with PyTorch compatibility.
https://k2-fsa.github.io/k2
Apache License 2.0

Question about the implementation of MWER #1260

Closed. lawlict closed this issue 8 months ago.

lawlict commented 8 months ago

Hi, I'm curious why mwer_loss in k2 doesn't subtract the average WER, which is emphasized in the paper: [screenshot of the MWER equation from the paper]
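Reconstructing that equation (notation approximately as in the paper, where \hat{P} is the probability renormalized over the N-best hypotheses and W(y_i, y^*) counts the word errors of hypothesis y_i against the reference y^*):

```latex
% MWER objective (reconstruction of the screenshot): the expected word error
% over the N-best list, with the average WER \hat{W} subtracted as a baseline.
\mathcal{L}_{\mathrm{MWER}}
  = \sum_{i=1}^{N} \hat{P}(y_i \mid x)\,\bigl( W(y_i, y^{*}) - \hat{W} \bigr),
\qquad
\hat{W} = \sum_{i=1}^{N} \hat{P}(y_i \mid x)\, W(y_i, y^{*})
```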

And the code is: https://github.com/k2-fsa/k2/blob/master/k2/python/k2/mwer_loss.py#L117 [screenshot of the linked lines]

So does it really matter to subtract the average WER? (I guess not?)

Looking forward to your kind response.

danpovey commented 8 months ago

I think you are referring to a paper from Google; I was also confused by that statement in the paper. There are two choices. Either \hat{W} carries gradient, in which case the whole MWER expression reduces to exactly zero (the probabilities sum to one over the n-best list, so the two terms cancel) and so does its gradient; or \hat{W} is detached, in which case the subtracted term is effectively a constant and contributes nothing to the gradient of the resulting expression. So either way, subtracting the average should have no effect.

If you compute the gradients "manually", for example on a lattice, or on a collection of n-best paths (e.g. see my thesis), there is actually a "minus average of word error" term that appears in the computation; but if you are using some kind of autograd framework this is implicit and you don't have to do it yourself.
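A quick way to see all three points with autograd, as a minimal sketch (this is not the actual k2 code, which operates on lattices; the four entries below just stand in for an n-best list with made-up word-error counts):

```python
import torch

torch.manual_seed(0)
logits = torch.randn(4, requires_grad=True)   # path scores (to be softmaxed)
wers = torch.tensor([3.0, 1.0, 0.0, 2.0])     # hypothetical word-error counts

probs = logits.softmax(dim=0)
avg_wer = (probs * wers).sum()                # \hat{W}: the expected WER

# Case 1: no baseline subtracted (schematically, the version without
# the "- \hat{W}" term, as in the code linked above).
loss_plain = (probs * wers).sum()
(g_plain,) = torch.autograd.grad(loss_plain, logits, retain_graph=True)

# Case 2: \hat{W} subtracted but detached (no gradient flows through it).
loss_detached = (probs * (wers - avg_wer.detach())).sum()
(g_detached,) = torch.autograd.grad(loss_detached, logits, retain_graph=True)
print(torch.allclose(g_plain, g_detached))    # True: identical gradients

# Case 3: \hat{W} subtracted *with* gradient: since the probabilities sum
# to one, the two terms cancel and the loss is identically zero.
loss_with_grad = (probs * (wers - avg_wer)).sum()
print(float(loss_with_grad))                  # ~0.0 (up to float error)

# The "minus average of word error" term does appear in the *manual*
# gradient of case 1:  dL/dz_k = p_k * (w_k - \hat{W}).
manual = probs.detach() * (wers - avg_wer.detach())
print(torch.allclose(g_plain, manual))        # True
```

The last check is the point about the "manual" computation: the baseline term shows up in the closed-form gradient even though the autograd loss never subtracts it.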

lawlict commented 8 months ago

@danpovey I get it. Thanks for your answer and the great work!