amineabdaoui opened 2 years ago

Hi,
Any idea why XLMR results on UDPOS are so bad for Japanese, Chinese and Yoruba?
Thanks
Hi Amine, part of the reason may be the non-Latin script for Japanese and Chinese, and the relatively little pre-training data available for Yoruba.
Hi Sebastian,
You are right; this is probably part of the reason.
Regarding Yoruba, it seems this language was not included in the pre-training data of XLMR (it is not present in the list of languages of the CC-100 corpus), while it was included in the mBERT pre-training data.
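A quick way to see the effect of missing pre-training coverage is to compare how the two tokenizers segment a short Yoruba sentence. This is only a sketch, assuming the Hugging Face `transformers` library and the hub model names `xlm-roberta-base` and `bert-base-multilingual-cased`; a language that the model barely saw in pre-training typically gets fragmented into many small subword pieces:

```python
# Compare subword fragmentation of a Yoruba sentence under XLM-R and mBERT.
# The example sentence is illustrative only.
from transformers import AutoTokenizer

sentence = "Mo fẹ́ràn láti ka ìwé ní èdè Yorùbá."  # roughly: "I like to read books in Yoruba."

for name in ["xlm-roberta-base", "bert-base-multilingual-cased"]:
    tok = AutoTokenizer.from_pretrained(name)
    pieces = tok.tokenize(sentence)
    print(f"{name}: {len(pieces)} subwords")
    print(pieces)
```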
But cross-lingual transfer from English to Japanese and to Chinese is much better on the remaining tasks. The drop seems to be significant only on token-level tasks (NER and POS).
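One hedged illustration of why token-level tasks can be hit harder: POS and NER need one prediction per word, so each word-level label has to be aligned to the subword pieces the tokenizer produces, and segmentation mismatches hurt these tasks more directly than sentence-level ones. A minimal sketch of the usual `word_ids()` alignment, assuming `transformers` and using a made-up Japanese example with illustrative UPOS tags:

```python
# Sketch of word-to-subword label alignment for a token-level task like UDPOS.
# Words and tags below are illustrative only.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("xlm-roberta-base")

words = ["私", "は", "本", "を", "読む"]            # "I read a book"
labels = ["PRON", "ADP", "NOUN", "ADP", "VERB"]     # word-level UPOS tags

enc = tok(words, is_split_into_words=True)
aligned = []
for word_idx in enc.word_ids():
    # Special tokens (<s>, </s>) get no label; subwords inherit their word's tag.
    aligned.append("-" if word_idx is None else labels[word_idx])

print(tok.convert_ids_to_tokens(enc["input_ids"]))
print(aligned)
```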