masakhane-io / masakhane-reading-group

Agile reading group that works

[04/06/2020] 5:15PM GMT+1 : It’s Easier to Translate out of English than into it: Measuring Neural Translation Difficulty by Cross-Mutual Information #6

Closed: keleog closed this issue 4 years ago

keleog commented 4 years ago

Link - https://arxiv.org/abs/2005.02354

Abstract: The performance of neural machine translation systems is commonly evaluated in terms of BLEU. However, due to its reliance on target language properties and generation, the BLEU metric does not allow an assessment of which translation directions are more difficult to model. In this paper, we propose cross-mutual information (XMI): an asymmetric information-theoretic metric of machine translation difficulty that exploits the probabilistic nature of most neural machine translation models. XMI allows us to better evaluate the difficulty of translating text into the target language while controlling for the difficulty of the target-side generation component independent of the translation task. We then present the first systematic and controlled study of cross-lingual translation difficulties using modern neural translation systems.
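For quick reference, this is the XMI definition as I understand it from the paper (my own notation, so worth double-checking against the PDF): the cross-entropy a target-side language model assigns to the references, minus the cross-entropy the translation model assigns given the source.

```latex
% Sketch of the XMI definition as I understand it (notation mine, not
% copied from the paper): target-side LM cross-entropy minus the
% translation model's conditional cross-entropy.
\mathrm{XMI}(S \rightarrow T) \;=\; H_{q_{\mathrm{LM}}}(T) \;-\; H_{q_{\mathrm{MT}}}(T \mid S)
```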

hadyelsahar commented 4 years ago

My notes on equation (4), with details in case someone wants to dig deeper into entropy, cross-entropy, and KL divergence.

[image: notes on entropy, cross-entropy, and KL divergence]
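In case the image above doesn't load: the standard identity the notes build on is that the cross-entropy of a model decomposes into the entropy of the data plus a KL term:

```latex
% Cross-entropy of a model q against the true distribution p
% splits into the entropy of p plus the KL divergence from p to q:
H(p, q) \;=\; -\sum_x p(x)\,\log q(x)
       \;=\; \underbrace{-\sum_x p(x)\,\log p(x)}_{H(p)}
       \;+\; \underbrace{\sum_x p(x)\,\log \frac{p(x)}{q(x)}}_{\mathrm{KL}(p\,\|\,q)}
```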

Note that you can do this in both the conditional and the non-conditional case, but the paper uses the conditional case (since this is MT), so let's stick to that. Everything here has to be read in the conditional setting: p(x) becomes the joint p(t, s) or the conditional p(t|s), and q(x) becomes the joint q(t, s) or the conditional q(t|s).

Note on equation (1):

[image: equation (1) from the paper]

It seems a bit non-intuitive since it mixes the joint P(x, y) with the conditional Q(y|x). It is correct, though: the link below gives the proof for the conditional case of entropy, and it is straightforward to convert it to cross-entropy: https://en.wikipedia.org/wiki/Conditional_entropy#Motivation
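Spelling out that mix for the conditional case (my own writing, so verify against equation (1) in the paper): the expectation is taken under the joint p, but the model is only ever queried through its conditional.

```latex
% Conditional cross-entropy: expectation taken under the joint p(s, t),
% while the model is scored through its conditional q(t | s).
% (Written out from the discussion above; check against eq. (1) in the paper.)
H_q(T \mid S) \;=\; -\sum_{s,\,t} p(s, t)\,\log q(t \mid s)
```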

Note on equation (2): this approximation is indeed well known from (non-conditional) LM evaluation. As far as I know it is a Monte Carlo estimate of the true data distribution (in layman's terms, Monte Carlo means that if I cannot model the probability of something but I can draw samples from it, I can use those samples to compute estimates).

In principle this is correct (even for the conditional case); in practice it needs a lot of samples to work well (which is the case in LM evaluation). MT test sets might be too small for this approximation to be reliable.
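Concretely, the Monte Carlo estimate I have in mind looks like this (my notation, a sketch rather than a quote of equation (2)):

```latex
% Monte Carlo estimate of the conditional cross-entropy: treat the N test
% pairs (s^(i), t^(i)) as samples from p(s, t) and average the negative
% log-probability the model assigns to them.
H_q(T \mid S) \;\approx\; -\frac{1}{N} \sum_{i=1}^{N} \log q\!\left(t^{(i)} \mid s^{(i)}\right)
```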

sjmielke commented 4 years ago

Copying my take on the samples from Twitter (https://twitter.com/sjmielke/status/1269756886435446788):

"I think MT test sets are large enough---which you can get a feel for by bootstrapping and seeing what happens to results, see Appendix---the bigger issue is that the underlying distribution the test set is "sampled" from is not at the true language distribution :( Driving that point home I think is the large variance we see when LMing supposed translations of the Bible in our 2019 paper: https://twitter.com/sjmielke/status/1138812623447871488