Dear mT5 authors, when I compare XLM-R-Large against mT5-Base, the mT5 model with the same number of parameters as (even slightly more than) XLM-R-Large, I see much better results for XLM-R. Do you agree that XLM-R is the better cross-lingual model? Thanks.

Hi Rabeeh, XLM-R is the same size as mT5-Large's encoder, and mT5 does not really make use of its decoder on sentence labeling tasks (i.e. the only kind of task XLM-R is applicable to, since it's an encoder-only model). Another way of looking at it is that the FLOPs required to run XLM-R on a given input sequence and produce a classification are roughly the same as the compute required for mT5-Large. So, a fairer comparison is between XLM-R and mT5-Large. However, mT5 was pre-trained on 1/6 as many tokens as XLM-R, so it's not really a fair comparison along that axis either. Ultimately, mT5-Large performs better in some cases and worse in others compared to XLM-R, probably due to the pre-training data more than anything else. You can always try both and see which works better on your task; and if you are applying the model to a generative task, you need to use mT5 anyway, because XLM-R can't generate anything.
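If it helps, here is a minimal sketch of what "try both" looks like in practice, assuming you are using the Hugging Face `transformers` checkpoints `xlm-roberta-large` and `google/mt5-large` (the 2-class setup and the "classify sentiment:" prompt format are illustrative, not from this thread; both models would need fine-tuning on your task before the outputs mean anything):

```python
import torch
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    MT5ForConditionalGeneration,
)

text = "This is a cross-lingual example sentence."

# XLM-R: encoder-only, so classification comes from a head on top of the encoder.
# The head is randomly initialized here and must be fine-tuned on your task.
xlmr_tok = AutoTokenizer.from_pretrained("xlm-roberta-large")
xlmr = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-large", num_labels=2  # hypothetical 2-class task
)
with torch.no_grad():
    logits = xlmr(**xlmr_tok(text, return_tensors="pt")).logits
print(logits.argmax(-1))

# mT5: encoder-decoder, so even classification is cast as text generation
# (after fine-tuning, the model emits a label string such as "positive").
mt5_tok = AutoTokenizer.from_pretrained("google/mt5-large")
mt5 = MT5ForConditionalGeneration.from_pretrained("google/mt5-large")
inputs = mt5_tok("classify sentiment: " + text, return_tensors="pt")
with torch.no_grad():
    out = mt5.generate(**inputs, max_new_tokens=5)
print(mt5_tok.decode(out[0], skip_special_tokens=True))
```

The same `generate` path is what you would use for genuinely generative tasks, which is the case where XLM-R is not an option at all.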