My notes about equation (4), with details for anyone who wants to dig deeper into entropy, cross-entropy, and KL divergence.
Note: you can do this in both the conditional and the non-conditional case, but the paper uses the conditional case (since it's MT), so let's stick with that. In the conditional setting, p(x) becomes either the joint p(t, s) or the conditional p(t|s), and q(x) becomes either the joint q(t, s) or the conditional q(t|s).
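For quick reference, here is a small sketch (my notation, not copied from the paper) of how entropy, cross-entropy, and KL divergence relate in this conditional setting:

```latex
% Conditional entropy, cross-entropy, and KL divergence for the MT case,
% with true distribution p(t|s) and model distribution q(t|s).
\begin{align}
  H(T \mid S)   &= -\sum_{s,t} p(s,t)\, \log p(t \mid s) \\
  H_q(T \mid S) &= -\sum_{s,t} p(s,t)\, \log q(t \mid s) \\
  \mathrm{KL}\big(p(T \mid S)\,\|\,q(T \mid S)\big)
                &= H_q(T \mid S) - H(T \mid S) \;\ge\; 0
\end{align}
```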
Note on equation 1: it looks a bit non-intuitive at first since it mixes the joint P(x, y) with the conditional Q(y|x). It is correct, though; the proof for the conditional case of entropy is here, and it is straightforward to convert it to cross-entropy: https://en.wikipedia.org/wiki/Conditional_entropy#Motivation
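Concretely, the joint shows up only because conditional entropy is an expectation of per-source entropies; the cross-entropy version just swaps the distribution inside the log. A sketch of the steps (same motivation as the Wikipedia section linked above):

```latex
% Why the joint p(x,y) appears as the weight: conditional entropy is an
% expectation of per-x entropies.
\begin{align}
  H(Y \mid X) &= \sum_x p(x)\, H(Y \mid X = x)
               = -\sum_x p(x) \sum_y p(y \mid x)\, \log p(y \mid x) \\
              &= -\sum_{x,y} p(x, y)\, \log p(y \mid x).
\end{align}
% Replacing \log p(y|x) with \log q(y|x) inside the sum gives the
% conditional cross-entropy that mixes P(x,y) and Q(y|x).
```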
Note on equation 2: this approximation is indeed standard in non-conditional LM evaluation. AFAIK it is a Monte Carlo (MC) estimate under the true data distribution (in layman's terms, MC means that if I cannot model the probability of something but I can draw samples from it, I can use those samples to compute estimates).
In principle this is correct (even for the conditional case); in practice it needs a lot of samples to work, which is usually fine for LM evaluation. MT test sets might be too small for this approximation to be reliable.
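To make the MC estimate concrete, here is a minimal Python sketch. It assumes a hypothetical `log_q(target, source)` callable returning the model's sentence-level log-probability; nothing here is from the paper's code.

```python
import math
from typing import Callable, Sequence, Tuple


def mc_cross_entropy(
    pairs: Sequence[Tuple[str, str]],
    log_q: Callable[[str, str], float],
) -> float:
    """Monte Carlo estimate of H_q(T|S): the average negative
    log-probability the model q assigns to each reference target t
    given its source s (in nats per sentence)."""
    return -sum(log_q(t, s) for s, t in pairs) / len(pairs)


if __name__ == "__main__":
    # Toy stand-in for a trained MT model's sentence log-probability.
    toy_log_q = lambda t, s: math.log(0.25)  # pretend q(t | s) = 0.25 for every pair
    test_set = [("ein Haus", "a house"), ("ein Baum", "a tree")]
    print(mc_cross_entropy(test_set, toy_log_q))  # ~1.386 nats per sentence
```

Normalizing per token instead of per sentence would give the familiar nats/bits-per-token number; either way, the estimate is only as good as the test set's coverage of the true distribution.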
Copying my take on the sample-size question from Twitter (https://twitter.com/sjmielke/status/1269756886435446788):
"I think MT test sets are large enough---which you can get a feel for by bootstrapping and seeing what happens to results, see Appendix---the bigger issue is that the underlying distribution the test set is "sampled" from is not at the true language distribution :( Driving that point home I think is the large variance we see when LMing supposed translations of the Bible in our 2019 paper: https://twitter.com/sjmielke/status/1138812623447871488
Link - https://arxiv.org/abs/2005.02354
Abstract: The performance of neural machine translation systems is commonly evaluated in terms of BLEU. However, due to its reliance on target language properties and generation, the BLEU metric does not allow an assessment of which translation directions are more difficult to model. In this paper, we propose cross-mutual information (XMI): an asymmetric information-theoretic metric of machine translation difficulty that exploits the probabilistic nature of most neural machine translation models. XMI allows us to better evaluate the difficulty of translating text into the target language while controlling for the difficulty of the target-side generation component independent of the translation task. We then present the first systematic and controlled study of cross-lingual translation difficulties using modern neural translation systems.
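For context, my rough reading of how XMI is assembled from these cross-entropies (check the paper for the exact definition): it mirrors mutual information I(S; T) = H(T) - H(T|S), with the true entropies replaced by cross-entropies under a target-side language model q_LM and the translation model q_MT:

```latex
% Rough sketch of the idea (see the paper for the exact definition):
% how much knowing the source S helps predict the target T, measured
% with cross-entropies under two models rather than true entropies.
\begin{equation}
  \mathrm{XMI}(S \rightarrow T)
    = H_{q_\mathrm{LM}}(T) - H_{q_\mathrm{MT}}(T \mid S)
\end{equation}
```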