Dear mT5 authors, when I compare XLM-R-Large against mT5-Base, the mT5 model with the same number of parameters as (even slightly more than) XLM-R-Large, I see much better results for XLM-R. Do you agree that XLM-R is the better cross-lingual model? Thanks.

Hi Rabeeh, XLM-R is the same size as mT5-Large's encoder, and mT5 does not really make use of its decoder on sentence labeling tasks (i.e. the only kind of task XLM-R is applicable to, since it's an encoder-only model). Another way of looking at it is that the FLOPs required to run XLM-R on a given input sequence and produce a classification are roughly the same as the compute required for mT5-Large. So, a fairer comparison is between XLM-R and mT5-Large. However, mT5 was pre-trained on 1/6 as many tokens as XLM-R, so it's not really a fair comparison along that axis either. Ultimately, mT5-Large performs better in some cases and worse in others compared to XLM-R, probably due to the pre-training data more than anything else. You can always try both and see which works better on your task; and if you are applying the model to a generative task, you need to use mT5 anyway, because XLM-R can't generate anything.
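If it helps, here is a minimal sketch of what "try both" looks like in practice, assuming you are using the Hugging Face `transformers` checkpoints `xlm-roberta-large` and `google/mt5-large` (the 2-class setup and the "classify sentiment:" prompt format are illustrative, not from this thread; both models would need fine-tuning on your task before the outputs mean anything):

```python
import torch
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    MT5ForConditionalGeneration,
)

text = "This is a cross-lingual example sentence."

# XLM-R: encoder-only, so classification comes from a head on top of the encoder.
# The head is randomly initialized here and must be fine-tuned on your task.
xlmr_tok = AutoTokenizer.from_pretrained("xlm-roberta-large")
xlmr = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-large", num_labels=2  # hypothetical 2-class task
)
with torch.no_grad():
    logits = xlmr(**xlmr_tok(text, return_tensors="pt")).logits
print(logits.argmax(-1))

# mT5: encoder-decoder, so even classification is cast as text generation
# (after fine-tuning, the model emits a label string such as "positive").
mt5_tok = AutoTokenizer.from_pretrained("google/mt5-large")
mt5 = MT5ForConditionalGeneration.from_pretrained("google/mt5-large")
inputs = mt5_tok("classify sentiment: " + text, return_tensors="pt")
with torch.no_grad():
    out = mt5.generate(**inputs, max_new_tokens=5)
print(mt5_tok.decode(out[0], skip_special_tokens=True))
```

The same `generate` path is what you would use for genuinely generative tasks, which is the case where XLM-R is not an option at all.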