Closed mrm8488 closed 1 year ago
Can I work on this issue? And can you point me to where should I learn more about this?
Some more info:
Weights can - according to this tweet - be found here:
https://dl.fbaipublicfiles.com/fairseq/xlmv/xlmv.base.tar.gz
Hi guys,
I adapted the RoBERTa conversion script and the model conversion was successful:
https://gist.github.com/stefan-it/def0e13c872e992aa54dff2768ec5da4
It outputs:
torch.Size([1, 11, 901629]) torch.Size([1, 11, 901629])
max_absolute_diff = 7.62939453125e-06
Do both models output the same tensors? 🔥
Saving model to /media/stefan/89914e9b-0644-4f79-8e65-a8c5245df168/xlmv/exported-working
Configuration saved in /media/stefan/89914e9b-0644-4f79-8e65-a8c5245df168/xlmv/exported-working/config.json
Model weights saved in /media/stefan/89914e9b-0644-4f79-8e65-a8c5245df168/xlmv/exported-working/pytorch_model.bin
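The check in the gist follows the usual pattern of Transformers conversion scripts: run the same input through both models and compare the logits. Schematically (a sketch with stand-in random tensors and a reduced last dimension; the real script compares tensors of shape `[1, 11, 901629]`):

```python
import torch

# Stand-in tensors: the real script feeds the same input ids through the
# original fairseq model and the converted Transformers model and compares
# the resulting logits (last dimension reduced here for illustration).
our_output = torch.randn(1, 11, 1024)
their_output = our_output + 1e-6 * torch.randn(1, 11, 1024)

max_absolute_diff = torch.max(torch.abs(our_output - their_output)).item()
print(f"max_absolute_diff = {max_absolute_diff}")
success = torch.allclose(our_output, their_output, atol=1e-3)
print("Do both models output the same tensors?", "🔥" if success else "💩")
```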
@jalajk24 , sorry, I've overlooked your comment.
Here's an explanation of what I did so far:

* Adapted the RoBERTa conversion script (from the `fairseq` repo...)
* `roberta.args` is replaced by `roberta.cfg` in newer `fairseq` versions
* `roberta_sent_encoder.layernorm_embedding` must be used instead of the old `roberta_sent_encoder.emb_layer_norm`
* Checked that both models (the original `fairseq` model and the converted model in Transformers) output the same tensor for a given input sequence -> model conversion was successful.

The next steps would be on the tokenizer part:

* Load the original tokenizer from `fairseq` and tokenize some input sentence
* Load the XLM-R tokenizer with the new XLM-V sentencepiece vocab and tokenize the same input sentence

Cool @stefan-it! So, maybe we can create a model card and push the model (and tokenizer) to the hub (under the Meta AI org). WDYT?
@mrm8488 Sounds good! I will perform some tokenizer experiments and then I can upload the model -> maybe @patrickvonplaten can invite me to the Meta AI organization on the model hub (for a short time period), when the model is ready to be... tested on downstream tasks :hugs:
Hey @stefan-it,
For sure! Invited you :-)
Thanks @patrickvonplaten !
I wrote a script that compares the XLM-V tokenizer and the HF tokenizer (which is basically an `XLMRobertaTokenizer` using the provided `sentencepiece.bpe.model`):
https://gist.github.com/stefan-it/14295d37880bfb6329fe1db9d3e6a14c
It uses the WikiANN NER dataset that contains 176 languages, tokenizes each training sentence and compares the output of the original XLM-V tokenizer and the HF one. Some differences can be seen in the GIST mentioned above, e.g.:
Mismatch for ar sentence:
أبى أيوب الأنصارى .
XLM-V ids: [0, 6, 482745, 6, 529250, 478338, 382485, 6, 5, 2]
HF ids: [0, 6, 482745, 6, 529250, 478338, 382485, 6, 5, 6, 2]
------------------------------------------------------------------------------------------
Mismatch for az sentence:
O , nəinki Çexiyada , eləcə də bütün dünyada antifaşist ədəbiyyatının ən görkəmli nümayəndələrindən biridir .
XLM-V ids: [0, 122, 6, 4, 78808, 2376, 4377, 25427, 6, 4, 17739, 523, 1174, 14374, 214304, 162, 4193, 3386, 1358, 1105, 1221, 89755, 345, 1825, 63822, 19671, 8914, 280, 214304, 499, 162, 381, 6, 5, 2]
HF ids: [0, 122, 6, 4, 78808, 2376, 4377, 25427, 6, 4, 17739, 523, 1174, 14374, 162, 214304, 4193, 3386, 1358, 1105, 1221, 89755, 345, 1825, 63822, 19671, 8914, 280, 214304, 499, 162, 381, 6, 5, 2]
------------------------------------------------------------------------------------------
Mismatch for az sentence:
Filmin bəstəkarı Roberto Rossellininin qardaşı Renzo Rossellinidir .
XLM-V ids: [0, 70066, 93154, 309, 77404, 862785, 1639, 43, 49187, 872558, 862785, 43, 14803, 6, 5, 2]
HF ids: [0, 70066, 93154, 309, 77404, 862785, 43, 1639, 49187, 872558, 862785, 43, 14803, 6, 5, 2]
------------------------------------------------------------------------------------------
Mismatch for be sentence:
некаторыя аленяводы з верхняй Калымы ўжо качавалі на чукоцкіх землях .
XLM-V ids: [0, 212747, 187222, 187276, 231515, 186902, 245172, 186910, 191873, 187211, 186906, 190574, 202645, 197768, 186882, 190562, 187180, 217232, 212793, 6, 5, 2]
HF ids: [0, 212747, 187222, 187276, 231515, 186902, 245172, 186910, 191873, 187211, 186906, 190574, 217400, 192302, 186882, 190562, 187180, 217232, 212793, 6, 5, 2]
------------------------------------------------------------------------------------------
Mismatch for bn sentence:
আব্রাআম দ্য মোয়াভ্র্
XLM-V ids: [0, 450078, 447452, 391401, 383767, 442939, 388008, 392002, 500283, 388127, 2]
HF ids: [0, 450078, 447452, 391401, 383767, 442939, 388008, 392002, 500283, 388127, 6, 2]
------------------------------------------------------------------------------------------
Mismatch for ckb sentence:
شەڕی ناوخۆییی لیبیا ( ٢٠١١ )
XLM-V ids: [0, 448384, 3, 382407, 424947, 383163, 395213, 390588, 382407, 481417, 18, 430460, 396007, 1057, 2]
HF ids: [0, 448384, 3, 382407, 424947, 383163, 395213, 382407, 390588, 481417, 18, 430460, 396007, 1057, 2]
------------------------------------------------------------------------------------------
Mismatch for el sentence:
το λιμάνι του Μαρσασλόκκκ ήταν Φοινικική αποικία .
XLM-V ids: [0, 51, 33074, 54, 20175, 4103, 2207, 21516, 180155, 2263, 702, 1764, 179092, 1457, 127312, 1100, 6, 5, 2]
HF ids: [0, 51, 33074, 54, 20175, 4103, 2207, 21516, 2263, 180155, 702, 1764, 179092, 1457, 127312, 1100, 6, 5, 2]
------------------------------------------------------------------------------------------
Mismatch for eu sentence:
Þjóðólfur úr Hvini
XLM-V ids: [0, 576603, 584875, 704, 7755, 272, 110340, 2]
HF ids: [0, 576603, 584875, 704, 7755, 272, 110340, 6, 2]
------------------------------------------------------------------------------------------
Mismatch for fi sentence:
ohjaus British Wind Energy Association
XLM-V ids: [0, 18196, 82236, 60938, 48570, 71969, 2]
HF ids: [0, 18196, 82236, 60938, 48570, 71969, 6, 2]
------------------------------------------------------------------------------------------
Mismatch for fr sentence:
***************************** '' Charles de Bourbon-Siciles ''
XLM-V ids: [0, 541, 519880, 736484, 519880, 3426, 17736, 59, 648141, 13, 238, 676633, 11, 3426, 2]
HF ids: [0, 541, 736484, 519880, 519880, 3426, 17736, 59, 648141, 13, 238, 676633, 11, 3426, 2]
------------------------------------------------------------------------------------------
Mismatch for hr sentence:
*KKK Varteks ( Varaždin )
XLM-V ids: [0, 541, 13108, 379, 2056, 11962, 18, 794202, 1057, 2]
HF ids: [0, 541, 379, 13108, 2056, 11962, 18, 794202, 1057, 2]
------------------------------------------------------------------------------------------
Mismatch for ja sentence:
漳 州 訛 り 、 ' ' ' 泉 ' ' ' は 泉 州 訛 り を 表 す ) ] ]
XLM-V ids: [0, 6, 381875, 6, 284214, 6, 371882, 6, 283722, 6, 283381, 536, 536, 536, 6, 287298, 536, 536, 536, 6, 283385, 6, 287298, 6, 284214, 6, 371882, 6, 283722, 6, 283391, 6, 284061, 6, 284248, 1057, 6305, 6305, 2]
HF ids: [0, 6, 381875, 6, 284214, 6, 371882, 6, 283722, 6, 283381, 536, 536, 536, 6, 287298, 536, 536, 536, 6, 283385, 6, 287298, 6, 284214, 6, 371882, 6, 283722, 6, 283391, 6, 284061, 6, 284248, 1057, 6305, 6305, 6, 2]
------------------------------------------------------------------------------------------
Mismatch for km sentence:
' '' ក្រមង៉ុយ '' 'គឺជាកវីម្នាក់ដែលមិនសរសេរនូវកំណាព្យកាព្យឃ្លោងដែលលោកច្រៀងនោះ ឡើយ ។ ស្នាដៃរបស់លោកដែលគង់វង្សមកដល់សព្វថ្ងៃនេះកើតមានឡើងដោយការអញ្ជើញ ភ្នំពេញ ហើយធ្វើការកត់ត្រាទុក ។
XLM-V ids: [0, 536, 3426, 6, 436488, 414054, 470537, 406071, 3426, 536, 417648, 388584, 417615, 398401, 383964, 386188, 484094, 413545, 430365, 392709, 443000, 401931, 443000, 513438, 424986, 383964, 383825, 6, 470313, 392431, 445340, 383824, 6, 527700, 384224, 383825, 383964, 6, 486458, 486640, 6, 454853, 6, 504066, 459752, 423127, 386428, 410408, 385471, 383363, 510944, 394566, 386849, 388469, 383363, 384712, 398013, 438262, 423820, 383824, 2]
HF ids: [0, 536, 3426, 6, 436488, 414054, 470537, 406071, 3426, 536, 417648, 388584, 417615, 398401, 383964, 386188, 484094, 413545, 430365, 392709, 443000, 401931, 443000, 513438, 424986, 383964, 383825, 6, 470313, 392431, 445340, 383824, 6, 527700, 384224, 383825, 383964, 6, 486458, 486640, 6, 454853, 6, 504066, 459752, 423127, 386428, 410408, 385471, 383363, 510944, 394566, 386849, 388469, 383363, 384712, 398013, 438262, 423820, 383824, 6, 2]
------------------------------------------------------------------------------------------
Mismatch for ko sentence:
북쪽으로는 사바 구 , 서쪽으로는 소피아 구 , 남서쪽으로는 알라오트라망고로 구 , 남쪽으로는 아치나나나 구와 접한다 .
XLM-V ids: [0, 460610, 402460, 383267, 384648, 384084, 6, 4, 464357, 402460, 383973, 408125, 384084, 6, 4, 384737, 497040, 402460, 384068, 382873, 383469, 420080, 387243, 382503, 382498, 384084, 6, 4, 445962, 402460, 383309, 383375, 459065, 382738, 384084, 382541, 390528, 383229, 6, 5, 2]
HF ids: [0, 460610, 402460, 383267, 384648, 384084, 6, 4, 464357, 402460, 383973, 408125, 384084, 6, 4, 384737, 497040, 402460, 384068, 382873, 383469, 420080, 387243, 382503, 382498, 384084, 6, 4, 445962, 402460, 383309, 383375, 382738, 459065, 384084, 382541, 390528, 383229, 6, 5, 2]
------------------------------------------------------------------------------------------
Mismatch for lv sentence:
Eiropas autoceļš E77
XLM-V ids: [0, 3477, 121549, 619, 181, 6697, 2]
HF ids: [0, 3477, 121549, 619, 181, 6697, 6, 2]
------------------------------------------------------------------------------------------
Mismatch for mk sentence:
Поретко , на пример во делови од Пиринска Македонија и Егејска Македонија некои од горните женски облеки – ‘’’саите’’’ се кроеле од домашно ткаено платно во сина боја .
XLM-V ids: [0, 186970, 192733, 187180, 6, 4, 186882, 188182, 186930, 201221, 186939, 221926, 187217, 187685, 186883, 248608, 211453, 187685, 193651, 186939, 240530, 198728, 186987, 187184, 186991, 39, 14464, 42, 187373, 186961, 11099, 42, 186894, 203637, 197766, 186939, 210461, 6, 189541, 188031, 212555, 186930, 194795, 199817, 6, 5, 2]
HF ids: [0, 186970, 192733, 187180, 6, 4, 186882, 188182, 186930, 201221, 186939, 221926, 187217, 187685, 186883, 248608, 211453, 187685, 193651, 186939, 240530, 198728, 186987, 187184, 186991, 39, 14464, 42, 187373, 186961, 42, 11099, 186894, 203637, 197766, 186939, 210461, 6, 189541, 188031, 212555, 186930, 194795, 199817, 6, 5, 2]
------------------------------------------------------------------------------------------
Mismatch for ml sentence:
അനു എലിസബത്ത് ജോസ്
XLM-V ids: [0, 397569, 385011, 528343, 388795, 385776, 481383, 2]
HF ids: [0, 397569, 385011, 528343, 388795, 385776, 481383, 6, 2]
------------------------------------------------------------------------------------------
Mismatch for ms sentence:
███ Sidang Kemuncak Asia Timur
XLM-V ids: [0, 6, 369908, 377468, 593458, 3944, 664695, 8451, 551742, 2]
HF ids: [0, 6, 377468, 369908, 593458, 3944, 664695, 8451, 551742, 2]
------------------------------------------------------------------------------------------
Mismatch for no sentence:
De siste tre semestre var han i Grenoble i Frankrike , der mye av fritiden ble tilbrakt i Les2alpes og LaGrave .
XLM-V ids: [0, 447, 550187, 17752, 611647, 246, 25684, 28, 657552, 28, 557692, 6, 4, 2860, 549299, 15446, 617530, 117029, 664714, 28, 17112, 430, 460, 10083, 6995, 1079, 29815, 383, 6, 5, 2]
HF ids: [0, 447, 550187, 17752, 611647, 246, 25684, 28, 657552, 28, 557692, 6, 4, 2860, 549299, 15446, 617530, 117029, 664714, 28, 17112, 430, 460, 10083, 6995, 1079, 597, 573563, 6, 5, 2]
------------------------------------------------------------------------------------------
Mismatch for or sentence:
ଲେଉଟାଣି ଜୋହାନ୍ ଅଗଷ୍ଟସ ଆର୍ଫୱେଡ଼ସନ୍
XLM-V ids: [0, 6, 387665, 391689, 393963, 403921, 393333, 392380, 395060, 388377, 522433, 387310, 6, 476299, 398439, 432754, 392919, 424507, 2]
HF ids: [0, 6, 387665, 391689, 393963, 403921, 393333, 392380, 395060, 388377, 522433, 387310, 6, 476299, 398439, 432754, 392919, 424507, 6, 2]
------------------------------------------------------------------------------------------
Mismatch for sh sentence:
Kefej ( kralj Tegeje )
XLM-V ids: [0, 3944, 12705, 18, 793761, 96767, 382, 1057, 2]
HF ids: [0, 3944, 12705, 18, 793761, 96767, 382, 1057, 6, 2]
------------------------------------------------------------------------------------------
Mismatch for sl sentence:
__________10__________ Eugenio Siena Alfa Romeo
XLM-V ids: [0, 272238, 1741, 666448, 12002, 848378, 836660, 26591, 72466, 2]
HF ids: [0, 272238, 1741, 12002, 666448, 848378, 836660, 26591, 72466, 2]
------------------------------------------------------------------------------------------
Mismatch for sr sentence:
Прерасподела доходка , Економски факултет Београд USJF - Preraspodela dohotka.ppt
XLM-V ids: [0, 188107, 189047, 187172, 192298, 190169, 186948, 6, 4, 228329, 186887, 192995, 190449, 15373, 662660, 20, 1182, 120, 793095, 567795, 656994, 90130, 5, 457258, 2]
HF ids: [0, 188107, 189047, 187172, 192298, 190169, 186948, 6, 4, 228329, 186887, 192995, 190449, 15373, 662660, 20, 1182, 120, 793095, 567795, 656994, 90130, 5, 457258, 6, 2]
------------------------------------------------------------------------------------------
Mismatch for te sentence:
దారిమార్పు ఇండియన్ ఇన్స్టిట్యూట్ ఆఫ్ టెక్నాలజీ మద్రాస్
XLM-V ids: [0, 436137, 464065, 387183, 460474, 400919, 520935, 493353, 384438, 397587, 466836, 385426, 480198, 383019, 2]
HF ids: [0, 436137, 464065, 387183, 460474, 400919, 520935, 493353, 384438, 397587, 466836, 385426, 480198, 383019, 6, 2]
------------------------------------------------------------------------------------------
Mismatch for ur sentence:
جاوید شیخ - جاوید
XLM-V ids: [0, 408290, 389645, 20, 408290, 2]
HF ids: [0, 408290, 389645, 20, 408290, 6, 2]
------------------------------------------------------------------------------------------
Mismatch for uz sentence:
Dastlab Oltin Oʻrdattt asosiy siyosiy markazi hisoblangan .
XLM-V ids: [0, 61568, 14, 3181, 586435, 43, 122, 1476, 47569, 211172, 14, 15966, 43523, 22564, 42030, 7050, 6, 5, 2]
HF ids: [0, 61568, 14, 3181, 586435, 43, 122, 1476, 47569, 14, 211172, 15966, 43523, 22564, 42030, 7050, 6, 5, 2]
------------------------------------------------------------------------------------------
Mismatch for zh-yue sentence:
R E D I R E C T # 巴 菲 特
XLM-V ids: [0, 266, 181, 205, 168, 266, 181, 232, 157, 524, 335519, 6, 286994, 6, 283738, 2]
HF ids: [0, 266, 181, 205, 168, 266, 181, 232, 157, 524, 335519, 6, 286994, 6, 283738, 6, 2]
------------------------------------------------------------------------------------------
Can we tolerate these mismatches? :thinking:
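Most of the diffs above are either a transposed id pair or an extra trailing `▁` (id 6) before `</s>` on the HF side. A quick way to locate where each pair diverges (a small helper I'm sketching here; it is not part of the gist above):

```python
def first_mismatch(xlmv_ids, hf_ids):
    """Return the index of the first differing id, or None if both lists match."""
    if xlmv_ids == hf_ids:
        return None
    for i, (a, b) in enumerate(zip(xlmv_ids, hf_ids)):
        if a != b:
            return i
    # One sequence is a prefix of the other (e.g. an extra trailing id before </s>)
    return min(len(xlmv_ids), len(hf_ids))

# The ar example above: the HF tokenizer emits an extra id 6 before </s> (id 2)
first_mismatch(
    [0, 6, 482745, 6, 529250, 478338, 382485, 6, 5, 2],
    [0, 6, 482745, 6, 529250, 478338, 382485, 6, 5, 6, 2],
)  # → 9
```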
Model is up now on the model hub:
https://huggingface.co/stefan-it/xlm-v-base
-> I would like to conduct some experiments on downstream tasks (mainly NER) to measure performance.
Maybe @mrm8488 also wants to fine-tune some models, so that we can try to reproduce some of the paper results :)
After some experiments I can transfer the model to the Meta AI organization. The MLM performance is really good, so the model should work:
In [3]: unmasker("Paris is the <mask> of France.")
Out[3]:
[{'score': 0.9286897778511047,
'token': 133852,
'token_str': 'capital',
'sequence': 'Paris is the capital of France.'},
{'score': 0.018073994666337967,
'token': 46562,
'token_str': 'Capital',
'sequence': 'Paris is the Capital of France.'},
{'score': 0.013238662853837013,
'token': 8696,
'token_str': 'centre',
'sequence': 'Paris is the centre of France.'},
{'score': 0.010450296103954315,
'token': 550136,
'token_str': 'heart',
'sequence': 'Paris is the heart of France.'},
{'score': 0.005028395913541317,
'token': 60041,
'token_str': 'center',
'sequence': 'Paris is the center of France.'}]
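For completeness, the `unmasker` above can be built with the standard fill-mask pipeline (a sketch; I'm using the final `facebook/xlm-v-base` hub id mentioned later in this thread, and the checkpoint download is large):

```python
from transformers import pipeline

# Fill-mask pipeline over the converted XLM-V checkpoint.
unmasker = pipeline("fill-mask", model="facebook/xlm-v-base")
preds = unmasker("Paris is the <mask> of France.")
print(preds[0]["sequence"])
```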
Thank you so much @stefan-it. Ofc, I will try to reproduce some of the reported results.
I've replicated the MasakhaNER v1 results from the paper:
I fine-tuned 5 models (with different seeds) on the English WikiANN (Rahimi split) and evaluated them on MasakhaNER v1. Note: `DATE` entities do not exist in WikiANN, so they were replaced with `O` for zero-shot evaluation. I averaged the F1-score over the 5 models to get the final score. Models were fine-tuned with a sequence length of 512 (the paper uses 128; I noticed this only after the fine-tuning experiments), but all other hyper-parameters are the same as in the XLM-V paper: batch size 32, learning rate 2e-05, 10 epochs.
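The `DATE` -> `O` remapping amounts to a one-line filter over the BIO tags; a sketch (the function name is my own, not from the training script):

```python
# Map MasakhaNER's DATE spans to "O", since WikiANN has no DATE class
# and a model fine-tuned on WikiANN can therefore never predict it.
def drop_date_entities(bio_tags):
    return ["O" if tag.endswith("-DATE") else tag for tag in bio_tags]

drop_date_entities(["B-PER", "I-PER", "O", "B-DATE", "I-DATE", "B-LOC"])
# → ["B-PER", "I-PER", "O", "O", "O", "B-LOC"]
```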
Putting it all together (see Table 11 in XLM-V paper):
Model | amh | hau | ibo | kin | lug | luo | pcm | swa | wol | yor | Avg. |
---|---|---|---|---|---|---|---|---|---|---|---|
XLM-R (Paper) | 25.1 | 43.5 | 11.6 | 9.4 | 9.5 | 8.4 | 36.8 | 48.9 | 5.3 | 10.0 | 20.9 |
XLM-R (Reproduced) | 27.1 | 42.4 | 14.2 | 12.4 | 14.3 | 10.0 | 40.6 | 50.2 | 6.3 | 11.5 | 22.9 |
XLM-V (Paper) | 20.6 | 35.9 | 45.9 | 25.0 | 48.7 | 10.4 | 38.2 | 44.0 | 16.7 | 35.8 | 32.1 |
XLM-V (Reproduced) | 25.3 | 45.7 | 55.6 | 33.2 | 56.1 | 16.5 | 40.7 | 50.8 | 26.3 | 47.2 | 39.7 |
Performance diff between XLM-R and XLM-V on MasakhaNER in the paper is 11.2%. The reproduced experiments give a performance diff of 16.8%.
So I think these experiments show that the model is working and achieves great results on MasakhaNER v1!
I will set up a repository for all these results and conduct more experiments on WikiANN (the second NER downstream task mentioned in the paper).
@patrickvonplaten Do you think the model is then ready to be moved to the Meta AI org? I've also written an initial model card.
Here's the comparison on WikiANN zero-shot (see Table 10 in the XLM-V paper):
Model | ro | gu | pa | lt | az | uk | pl | qu | hu | fi | et | tr | kk | zh | my | yo | sw |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
XLM-R (Paper) | 73.5 | 62.9 | 53.6 | 72.7 | 61.0 | 72.4 | 77.5 | 60.4 | 75.8 | 74.4 | 71.2 | 75.4 | 42.2 | 25.3 | 48.9 | 33.6 | 66.3 |
XLM-R (Reproduced) | 73.8 | 65.5 | 50.6 | 74.3 | 64.0 | 76.5 | 78.4 | 60.8 | 77.7 | 75.9 | 73.0 | 76.4 | 45.2 | 29.8 | 52.3 | 37.6 | 67.0 |
XLM-V (Paper) | 73.8 | 66.4 | 48.7 | 75.6 | 66.7 | 65.7 | 79.5 | 70.0 | 79.5 | 78.7 | 75.0 | 77.3 | 50.4 | 30.2 | 61.5 | 54.2 | 72.4 |
XLM-V (Reproduced) | 77.2 | 65.4 | 53.6 | 74.9 | 66.0 | 69.4 | 79.8 | 66.9 | 79.0 | 77.9 | 76.2 | 76.8 | 48.5 | 28.1 | 58.4 | 62.6 | 71.6 |
Model | th | ko | ka | ja | ru | bg | es | pt | it | fr | fa | ur | mr | hi | bn | el | de |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
XLM-R (Paper) | 5.2 | 49.4 | 65.4 | 21.0 | 63.1 | 76.1 | 70.2 | 77.0 | 76.9 | 76.5 | 44.6 | 51.4 | 61.5 | 67.2 | 69.0 | 73.8 | 74.4 |
XLM-R (Reproduced) | 4.7 | 49.4 | 67.5 | 21.9 | 65.2 | 77.5 | 76.7 | 79.0 | 77.7 | 77.9 | 49.0 | 55.1 | 61.3 | 67.8 | 69.6 | 74.1 | 75.4 |
XLM-V (Paper) | 3.3 | 53.0 | 69.5 | 22.4 | 68.1 | 79.8 | 74.5 | 80.5 | 78.7 | 77.6 | 50.6 | 48.9 | 59.8 | 67.3 | 72.6 | 76.7 | 76.8 |
XLM-V (Reproduced) | 2.6 | 51.6 | 71.2 | 20.6 | 67.8 | 79.4 | 76.2 | 79.9 | 79.5 | 77.5 | 51.7 | 51.5 | 61.9 | 69.2 | 73.2 | 75.9 | 77.1 |
Model | en | nl | af | te | ta | ml | eu | tl | ms | jv | id | vi | he | ar | Avg. |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
XLM-R (Paper) | 83.0 | 80.0 | 75.8 | 49.2 | 56.3 | 61.9 | 57.2 | 69.8 | 68.3 | 59.4 | 48.6 | 67.7 | 53.2 | 43.8 | 61.3 |
XLM-R (Reproduced) | 83.4 | 80.8 | 75.8 | 49.3 | 56.8 | 62.2 | 59.1 | 72.2 | 62.3 | 58.3 | 50.0 | 67.9 | 52.6 | 47.8 | 62.6 |
XLM-V (Paper) | 83.4 | 81.4 | 78.3 | 51.8 | 54.9 | 63.1 | 67.1 | 75.6 | 70.0 | 67.5 | 52.6 | 67.1 | 60.1 | 45.8 | 64.7 |
XLM-V (Reproduced) | 84.1 | 81.3 | 78.9 | 50.9 | 55.9 | 63.0 | 65.7 | 75.9 | 70.8 | 64.8 | 53.9 | 69.6 | 61.1 | 47.2 | 65.0 |
Diff. between XLM-V and XLM-R in the paper: (64.7 - 61.3) = 3.4%. Diff. between reproduced XLM-V and XLM-R: (65.0 - 62.6) = 2.4%.
Same conclusion: the converted/integrated XLM-V works great :hugs:
Great job @stefan-it !!! 🔥
Thanks @mrm8488 !
The repo is btw up here: https://github.com/stefan-it/xlm-v-experiments :)
Thanks a lot for your contribution @stefan-it 🙏
Just transferred the checkpoint to the appropriate organization: https://huggingface.co/facebook/xlm-v-base
However, I feel like it could be beneficial to have a separate model_doc for XLM-V (similar to how we did this for T5v1.1 etc.).
Do you mind opening a PR for that?
Thanks! Closing this issue as the model is now available: https://huggingface.co/docs/transformers/main/en/model_doc/xlm-v.
Amazing work @stefan-it - thanks a lot!
Amazing @stefan-it. Should I add some fine-tuning metrics @patrickvonplaten, as done for other models? I fine-tuned it on XNLI: https://huggingface.co/mrm8488/xlm-v-base-finetuned-xglue-xnli
Model description
XLM-V: Overcoming the Vocabulary Bottleneck in Multilingual Masked Language Models
Large multilingual language models typically rely on a single vocabulary shared across 100+ languages. As these models have increased in parameter count and depth, vocabulary size has remained largely unchanged. This vocabulary bottleneck limits the representational capabilities of multilingual models like XLM-R. In this paper, we introduce a new approach for scaling to very large multilingual vocabularies by de-emphasizing token sharing between languages with little lexical overlap and assigning vocabulary capacity to achieve sufficient coverage for each individual language. Tokenizations using our vocabulary are typically more semantically meaningful and shorter compared to XLM-R. Leveraging this improved vocabulary, we train XLM-V, a multilingual language model with a one million token vocabulary. XLM-V outperforms XLM-R on every task we tested on ranging from natural language inference (XNLI), question answering (MLQA, XQuAD, TyDiQA), and named entity recognition (WikiAnn) to low-resource tasks (Americas NLI, MasakhaNER).
Should work like XLM-RoBERTa.
Open source status
Provide useful links for the implementation
No response