Closed mrm8488 closed 1 year ago
Can I work on this issue? And can you point me to where should I learn more about this?
Some more info:
Weights can - according to this tweet - be found here:
https://dl.fbaipublicfiles.com/fairseq/xlmv/xlmv.base.tar.gz
Hi guys,
I adapted the RoBERTa conversion script and the model conversion was successful:
https://gist.github.com/stefan-it/def0e13c872e992aa54dff2768ec5da4
It outputs:
torch.Size([1, 11, 901629]) torch.Size([1, 11, 901629])
max_absolute_diff = 7.62939453125e-06
Do both models output the same tensors? 🔥
Saving model to /media/stefan/89914e9b-0644-4f79-8e65-a8c5245df168/xlmv/exported-working
Configuration saved in /media/stefan/89914e9b-0644-4f79-8e65-a8c5245df168/xlmv/exported-working/config.json
Model weights saved in /media/stefan/89914e9b-0644-4f79-8e65-a8c5245df168/xlmv/exported-working/pytorch_model.bin
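The check in the gist follows the usual pattern of Transformers conversion scripts: run the same input through both models and compare the logits. Schematically (a sketch with stand-in random tensors and a reduced last dimension; the real script compares tensors of shape `[1, 11, 901629]`):

```python
import torch

# Stand-in tensors: the real script feeds the same input ids through the
# original fairseq model and the converted Transformers model and compares
# the resulting logits (last dimension reduced here for illustration).
our_output = torch.randn(1, 11, 1024)
their_output = our_output + 1e-6 * torch.randn(1, 11, 1024)

max_absolute_diff = torch.max(torch.abs(our_output - their_output)).item()
print(f"max_absolute_diff = {max_absolute_diff}")
success = torch.allclose(our_output, their_output, atol=1e-3)
print("Do both models output the same tensors?", "🔥" if success else "💩")
```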
@jalajk24 , sorry, I've overlooked your comment.
Here's an explanation of what I did so far:

* Adapted the RoBERTa conversion script (from the `fairseq` repo...)
* `roberta.args` is replaced by `roberta.cfg` in newer `fairseq` versions
* `roberta_sent_encoder.layernorm_embedding` must be used instead of the old `roberta_sent_encoder.emb_layer_norm`
* Checked that both models (the original `fairseq` model and the converted model in Transformers) output the same tensor for a given input sequence -> model conversion was successful.

The next steps would be on the tokenizer part:

* Load the original tokenizer from `fairseq` and tokenize some input sentence
* Load the XLM-R tokenizer with the new XLM-V sentencepiece vocab and tokenize the same input sentence

Cool @stefan-it! So, maybe we can create a model card and push the model (and tokenizer) to the hub (under the Meta AI org). WDYT?
@mrm8488 Sounds good! I will perform some tokenizer experiments and then I can upload the model -> maybe @patrickvonplaten can invite me to the Meta AI organization on the model hub (for a short time period), when the model is ready to be... tested on downstream tasks :hugs:
Hey @stefan-it,
For sure! Invited you :-)
Thanks @patrickvonplaten !
I wrote a script that compares the XLM-V tokenizer and the HF tokenizer (which is basically an `XLMRobertaTokenizer` using the provided `sentencepiece.bpe.model`):
https://gist.github.com/stefan-it/14295d37880bfb6329fe1db9d3e6a14c
It uses the WikiANN NER dataset that contains 176 languages, tokenizes each training sentence and compares the output of the original XLM-V tokenizer and the HF one. Some differences can be seen in the GIST mentioned above, e.g.:
Mismatch for ar sentence:
أبى أيوب الأنصارى .
XLM-V ids: [0, 6, 482745, 6, 529250, 478338, 382485, 6, 5, 2]
HF ids: [0, 6, 482745, 6, 529250, 478338, 382485, 6, 5, 6, 2]
------------------------------------------------------------------------------------------
Mismatch for az sentence:
O , nəinki Çexiyada , eləcə də bütün dünyada antifaşist ədəbiyyatının ən görkəmli nümayəndələrindən biridir .
XLM-V ids: [0, 122, 6, 4, 78808, 2376, 4377, 25427, 6, 4, 17739, 523, 1174, 14374, 214304, 162, 4193, 3386, 1358, 1105, 1221, 89755, 345, 1825, 63822, 19671, 8914, 280, 214304, 499, 162, 381, 6, 5, 2]
HF ids: [0, 122, 6, 4, 78808, 2376, 4377, 25427, 6, 4, 17739, 523, 1174, 14374, 162, 214304, 4193, 3386, 1358, 1105, 1221, 89755, 345, 1825, 63822, 19671, 8914, 280, 214304, 499, 162, 381, 6, 5, 2]
------------------------------------------------------------------------------------------
Mismatch for az sentence:
Filmin bəstəkarı Roberto Rossellininin qardaşı Renzo Rossellinidir .
XLM-V ids: [0, 70066, 93154, 309, 77404, 862785, 1639, 43, 49187, 872558, 862785, 43, 14803, 6, 5, 2]
HF ids: [0, 70066, 93154, 309, 77404, 862785, 43, 1639, 49187, 872558, 862785, 43, 14803, 6, 5, 2]
------------------------------------------------------------------------------------------
Mismatch for be sentence:
некаторыя аленяводы з верхняй Калымы ўжо качавалі на чукоцкіх землях .
XLM-V ids: [0, 212747, 187222, 187276, 231515, 186902, 245172, 186910, 191873, 187211, 186906, 190574, 202645, 197768, 186882, 190562, 187180, 217232, 212793, 6, 5, 2]
HF ids: [0, 212747, 187222, 187276, 231515, 186902, 245172, 186910, 191873, 187211, 186906, 190574, 217400, 192302, 186882, 190562, 187180, 217232, 212793, 6, 5, 2]
------------------------------------------------------------------------------------------
Mismatch for bn sentence:
আব্রাআম দ্য মোয়াভ্র্
XLM-V ids: [0, 450078, 447452, 391401, 383767, 442939, 388008, 392002, 500283, 388127, 2]
HF ids: [0, 450078, 447452, 391401, 383767, 442939, 388008, 392002, 500283, 388127, 6, 2]
------------------------------------------------------------------------------------------
Mismatch for ckb sentence:
شەڕی ناوخۆییی لیبیا ( ٢٠١١ )
XLM-V ids: [0, 448384, 3, 382407, 424947, 383163, 395213, 390588, 382407, 481417, 18, 430460, 396007, 1057, 2]
HF ids: [0, 448384, 3, 382407, 424947, 383163, 395213, 382407, 390588, 481417, 18, 430460, 396007, 1057, 2]
------------------------------------------------------------------------------------------
Mismatch for el sentence:
το λιμάνι του Μαρσασλόκκκ ήταν Φοινικική αποικία .
XLM-V ids: [0, 51, 33074, 54, 20175, 4103, 2207, 21516, 180155, 2263, 702, 1764, 179092, 1457, 127312, 1100, 6, 5, 2]
HF ids: [0, 51, 33074, 54, 20175, 4103, 2207, 21516, 2263, 180155, 702, 1764, 179092, 1457, 127312, 1100, 6, 5, 2]
------------------------------------------------------------------------------------------
Mismatch for eu sentence:
Þjóðólfur úr Hvini
XLM-V ids: [0, 576603, 584875, 704, 7755, 272, 110340, 2]
HF ids: [0, 576603, 584875, 704, 7755, 272, 110340, 6, 2]
------------------------------------------------------------------------------------------
Mismatch for fi sentence:
ohjaus British Wind Energy Association
XLM-V ids: [0, 18196, 82236, 60938, 48570, 71969, 2]
HF ids: [0, 18196, 82236, 60938, 48570, 71969, 6, 2]
------------------------------------------------------------------------------------------
Mismatch for fr sentence:
***************************** '' Charles de Bourbon-Siciles ''
XLM-V ids: [0, 541, 519880, 736484, 519880, 3426, 17736, 59, 648141, 13, 238, 676633, 11, 3426, 2]
HF ids: [0, 541, 736484, 519880, 519880, 3426, 17736, 59, 648141, 13, 238, 676633, 11, 3426, 2]
------------------------------------------------------------------------------------------
Mismatch for hr sentence:
*KKK Varteks ( Varaždin )
XLM-V ids: [0, 541, 13108, 379, 2056, 11962, 18, 794202, 1057, 2]
HF ids: [0, 541, 379, 13108, 2056, 11962, 18, 794202, 1057, 2]
------------------------------------------------------------------------------------------
Mismatch for ja sentence:
漳 州 訛 り 、 ' ' ' 泉 ' ' ' は 泉 州 訛 り を 表 す ) ] ]
XLM-V ids: [0, 6, 381875, 6, 284214, 6, 371882, 6, 283722, 6, 283381, 536, 536, 536, 6, 287298, 536, 536, 536, 6, 283385, 6, 287298, 6, 284214, 6, 371882, 6, 283722, 6, 283391, 6, 284061, 6, 284248, 1057, 6305, 6305, 2]
HF ids: [0, 6, 381875, 6, 284214, 6, 371882, 6, 283722, 6, 283381, 536, 536, 536, 6, 287298, 536, 536, 536, 6, 283385, 6, 287298, 6, 284214, 6, 371882, 6, 283722, 6, 283391, 6, 284061, 6, 284248, 1057, 6305, 6305, 6, 2]
------------------------------------------------------------------------------------------
Mismatch for km sentence:
' '' ក្រមង៉ុយ '' 'គឺជាកវីម្នាក់ដែលមិនសរសេរនូវកំណាព្យកាព្យឃ្លោងដែលលោកច្រៀងនោះ ឡើយ ។ ស្នាដៃរបស់លោកដែលគង់វង្សមកដល់សព្វថ្ងៃនេះកើតមានឡើងដោយការអញ្ជើញ ភ្នំពេញ ហើយធ្វើការកត់ត្រាទុក ។
XLM-V ids: [0, 536, 3426, 6, 436488, 414054, 470537, 406071, 3426, 536, 417648, 388584, 417615, 398401, 383964, 386188, 484094, 413545, 430365, 392709, 443000, 401931, 443000, 513438, 424986, 383964, 383825, 6, 470313, 392431, 445340, 383824, 6, 527700, 384224, 383825, 383964, 6, 486458, 486640, 6, 454853, 6, 504066, 459752, 423127, 386428, 410408, 385471, 383363, 510944, 394566, 386849, 388469, 383363, 384712, 398013, 438262, 423820, 383824, 2]
HF ids: [0, 536, 3426, 6, 436488, 414054, 470537, 406071, 3426, 536, 417648, 388584, 417615, 398401, 383964, 386188, 484094, 413545, 430365, 392709, 443000, 401931, 443000, 513438, 424986, 383964, 383825, 6, 470313, 392431, 445340, 383824, 6, 527700, 384224, 383825, 383964, 6, 486458, 486640, 6, 454853, 6, 504066, 459752, 423127, 386428, 410408, 385471, 383363, 510944, 394566, 386849, 388469, 383363, 384712, 398013, 438262, 423820, 383824, 6, 2]
------------------------------------------------------------------------------------------
Mismatch for ko sentence:
북쪽으로는 사바 구 , 서쪽으로는 소피아 구 , 남서쪽으로는 알라오트라망고로 구 , 남쪽으로는 아치나나나 구와 접한다 .
XLM-V ids: [0, 460610, 402460, 383267, 384648, 384084, 6, 4, 464357, 402460, 383973, 408125, 384084, 6, 4, 384737, 497040, 402460, 384068, 382873, 383469, 420080, 387243, 382503, 382498, 384084, 6, 4, 445962, 402460, 383309, 383375, 459065, 382738, 384084, 382541, 390528, 383229, 6, 5, 2]
HF ids: [0, 460610, 402460, 383267, 384648, 384084, 6, 4, 464357, 402460, 383973, 408125, 384084, 6, 4, 384737, 497040, 402460, 384068, 382873, 383469, 420080, 387243, 382503, 382498, 384084, 6, 4, 445962, 402460, 383309, 383375, 382738, 459065, 384084, 382541, 390528, 383229, 6, 5, 2]
------------------------------------------------------------------------------------------
Mismatch for lv sentence:
Eiropas autoceļš E77
XLM-V ids: [0, 3477, 121549, 619, 181, 6697, 2]
HF ids: [0, 3477, 121549, 619, 181, 6697, 6, 2]
------------------------------------------------------------------------------------------
Mismatch for mk sentence:
Поретко , на пример во делови од Пиринска Македонија и Егејска Македонија некои од горните женски облеки – ‘’’саите’’’ се кроеле од домашно ткаено платно во сина боја .
XLM-V ids: [0, 186970, 192733, 187180, 6, 4, 186882, 188182, 186930, 201221, 186939, 221926, 187217, 187685, 186883, 248608, 211453, 187685, 193651, 186939, 240530, 198728, 186987, 187184, 186991, 39, 14464, 42, 187373, 186961, 11099, 42, 186894, 203637, 197766, 186939, 210461, 6, 189541, 188031, 212555, 186930, 194795, 199817, 6, 5, 2]
HF ids: [0, 186970, 192733, 187180, 6, 4, 186882, 188182, 186930, 201221, 186939, 221926, 187217, 187685, 186883, 248608, 211453, 187685, 193651, 186939, 240530, 198728, 186987, 187184, 186991, 39, 14464, 42, 187373, 186961, 42, 11099, 186894, 203637, 197766, 186939, 210461, 6, 189541, 188031, 212555, 186930, 194795, 199817, 6, 5, 2]
------------------------------------------------------------------------------------------
Mismatch for ml sentence:
അനു എലിസബത്ത് ജോസ്
XLM-V ids: [0, 397569, 385011, 528343, 388795, 385776, 481383, 2]
HF ids: [0, 397569, 385011, 528343, 388795, 385776, 481383, 6, 2]
------------------------------------------------------------------------------------------
Mismatch for ms sentence:
███ Sidang Kemuncak Asia Timur
XLM-V ids: [0, 6, 369908, 377468, 593458, 3944, 664695, 8451, 551742, 2]
HF ids: [0, 6, 377468, 369908, 593458, 3944, 664695, 8451, 551742, 2]
------------------------------------------------------------------------------------------
Mismatch for no sentence:
De siste tre semestre var han i Grenoble i Frankrike , der mye av fritiden ble tilbrakt i Les2alpes og LaGrave .
XLM-V ids: [0, 447, 550187, 17752, 611647, 246, 25684, 28, 657552, 28, 557692, 6, 4, 2860, 549299, 15446, 617530, 117029, 664714, 28, 17112, 430, 460, 10083, 6995, 1079, 29815, 383, 6, 5, 2]
HF ids: [0, 447, 550187, 17752, 611647, 246, 25684, 28, 657552, 28, 557692, 6, 4, 2860, 549299, 15446, 617530, 117029, 664714, 28, 17112, 430, 460, 10083, 6995, 1079, 597, 573563, 6, 5, 2]
------------------------------------------------------------------------------------------
Mismatch for or sentence:
ଲେଉଟାଣି ଜୋହାନ୍ ଅଗଷ୍ଟସ ଆର୍ଫୱେଡ଼ସନ୍
XLM-V ids: [0, 6, 387665, 391689, 393963, 403921, 393333, 392380, 395060, 388377, 522433, 387310, 6, 476299, 398439, 432754, 392919, 424507, 2]
HF ids: [0, 6, 387665, 391689, 393963, 403921, 393333, 392380, 395060, 388377, 522433, 387310, 6, 476299, 398439, 432754, 392919, 424507, 6, 2]
------------------------------------------------------------------------------------------
Mismatch for sh sentence:
Kefej ( kralj Tegeje )
XLM-V ids: [0, 3944, 12705, 18, 793761, 96767, 382, 1057, 2]
HF ids: [0, 3944, 12705, 18, 793761, 96767, 382, 1057, 6, 2]
------------------------------------------------------------------------------------------
Mismatch for sl sentence:
__________10__________ Eugenio Siena Alfa Romeo
XLM-V ids: [0, 272238, 1741, 666448, 12002, 848378, 836660, 26591, 72466, 2]
HF ids: [0, 272238, 1741, 12002, 666448, 848378, 836660, 26591, 72466, 2]
------------------------------------------------------------------------------------------
Mismatch for sr sentence:
Прерасподела доходка , Економски факултет Београд USJF - Preraspodela dohotka.ppt
XLM-V ids: [0, 188107, 189047, 187172, 192298, 190169, 186948, 6, 4, 228329, 186887, 192995, 190449, 15373, 662660, 20, 1182, 120, 793095, 567795, 656994, 90130, 5, 457258, 2]
HF ids: [0, 188107, 189047, 187172, 192298, 190169, 186948, 6, 4, 228329, 186887, 192995, 190449, 15373, 662660, 20, 1182, 120, 793095, 567795, 656994, 90130, 5, 457258, 6, 2]
------------------------------------------------------------------------------------------
Mismatch for te sentence:
దారిమార్పు ఇండియన్ ఇన్స్టిట్యూట్ ఆఫ్ టెక్నాలజీ మద్రాస్
XLM-V ids: [0, 436137, 464065, 387183, 460474, 400919, 520935, 493353, 384438, 397587, 466836, 385426, 480198, 383019, 2]
HF ids: [0, 436137, 464065, 387183, 460474, 400919, 520935, 493353, 384438, 397587, 466836, 385426, 480198, 383019, 6, 2]
------------------------------------------------------------------------------------------
Mismatch for ur sentence:
جاوید شیخ - جاوید
XLM-V ids: [0, 408290, 389645, 20, 408290, 2]
HF ids: [0, 408290, 389645, 20, 408290, 6, 2]
------------------------------------------------------------------------------------------
Mismatch for uz sentence:
Dastlab Oltin Oʻrdattt asosiy siyosiy markazi hisoblangan .
XLM-V ids: [0, 61568, 14, 3181, 586435, 43, 122, 1476, 47569, 211172, 14, 15966, 43523, 22564, 42030, 7050, 6, 5, 2]
HF ids: [0, 61568, 14, 3181, 586435, 43, 122, 1476, 47569, 14, 211172, 15966, 43523, 22564, 42030, 7050, 6, 5, 2]
------------------------------------------------------------------------------------------
Mismatch for zh-yue sentence:
R E D I R E C T # 巴 菲 特
XLM-V ids: [0, 266, 181, 205, 168, 266, 181, 232, 157, 524, 335519, 6, 286994, 6, 283738, 2]
HF ids: [0, 266, 181, 205, 168, 266, 181, 232, 157, 524, 335519, 6, 286994, 6, 283738, 6, 2]
------------------------------------------------------------------------------------------
Can we tolerate these mismatches? :thinking:
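Most of the diffs above are either a transposed id pair or an extra trailing `▁` (id 6) before `</s>` on the HF side. A quick way to locate where each pair diverges (a small helper I'm sketching here; it is not part of the gist above):

```python
def first_mismatch(xlmv_ids, hf_ids):
    """Return the index of the first differing id, or None if both lists match."""
    if xlmv_ids == hf_ids:
        return None
    for i, (a, b) in enumerate(zip(xlmv_ids, hf_ids)):
        if a != b:
            return i
    # One sequence is a prefix of the other (e.g. an extra trailing id before </s>)
    return min(len(xlmv_ids), len(hf_ids))

# The ar example above: the HF tokenizer emits an extra id 6 before </s> (id 2)
first_mismatch(
    [0, 6, 482745, 6, 529250, 478338, 382485, 6, 5, 2],
    [0, 6, 482745, 6, 529250, 478338, 382485, 6, 5, 6, 2],
)  # → 9
```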
Model is up now on the model hub:
https://huggingface.co/stefan-it/xlm-v-base
-> I would like to conduct some experiments on downstream tasks (mainly NER) to measure performance.
Maybe @mrm8488 also wants to fine-tune some models, so that we can try to reproduce some of the paper results :)
After some experiments I can transfer the model to the Meta AI organization. The MLM performance is really good, so the model should work:
In [3]: unmasker("Paris is the <mask> of France.")
Out[3]:
[{'score': 0.9286897778511047,
'token': 133852,
'token_str': 'capital',
'sequence': 'Paris is the capital of France.'},
{'score': 0.018073994666337967,
'token': 46562,
'token_str': 'Capital',
'sequence': 'Paris is the Capital of France.'},
{'score': 0.013238662853837013,
'token': 8696,
'token_str': 'centre',
'sequence': 'Paris is the centre of France.'},
{'score': 0.010450296103954315,
'token': 550136,
'token_str': 'heart',
'sequence': 'Paris is the heart of France.'},
{'score': 0.005028395913541317,
'token': 60041,
'token_str': 'center',
'sequence': 'Paris is the center of France.'}]
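For completeness, the `unmasker` above can be built with the standard fill-mask pipeline (a sketch; I'm using the final `facebook/xlm-v-base` hub id mentioned later in this thread, and the checkpoint download is large):

```python
from transformers import pipeline

# Fill-mask pipeline over the converted XLM-V checkpoint.
unmasker = pipeline("fill-mask", model="facebook/xlm-v-base")
preds = unmasker("Paris is the <mask> of France.")
print(preds[0]["sequence"])
```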
Thank you so much @stefan-it. Ofc, I will try to reproduce some of the reported results.
I've replicated the MasakhaNER v1 results from the paper:
I fine-tuned 5 models (with different seeds) on the English WikiANN (Rahimi split) and evaluated them on MasakhaNER v1. Note: `DATE` entities do not exist in WikiANN, so they were replaced with `O` for zero-shot evaluation. I averaged the F1-score over the 5 models to get the final score. Models were fine-tuned with a sequence length of 512 (the paper uses 128; I noticed this only after the fine-tuning experiments), but all other hyper-parameters are the same as in the XLM-V paper: batch size 32, learning rate 2e-05, 10 epochs.
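The `DATE` -> `O` remapping amounts to a one-line filter over the BIO tags; a sketch (the function name is my own, not from the training script):

```python
# Map MasakhaNER's DATE spans to "O", since WikiANN has no DATE class
# and a model fine-tuned on WikiANN can therefore never predict it.
def drop_date_entities(bio_tags):
    return ["O" if tag.endswith("-DATE") else tag for tag in bio_tags]

drop_date_entities(["B-PER", "I-PER", "O", "B-DATE", "I-DATE", "B-LOC"])
# → ["B-PER", "I-PER", "O", "O", "O", "B-LOC"]
```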
Putting it all together (see Table 11 in XLM-V paper):
Model | amh | hau | ibo | kin | lug | luo | pcm | swa | wol | yor | Avg. |
---|---|---|---|---|---|---|---|---|---|---|---|
XLM-R (Paper) | 25.1 | 43.5 | 11.6 | 9.4 | 9.5 | 8.4 | 36.8 | 48.9 | 5.3 | 10.0 | 20.9 |
XLM-R (Reproduced) | 27.1 | 42.4 | 14.2 | 12.4 | 14.3 | 10.0 | 40.6 | 50.2 | 6.3 | 11.5 | 22.9 |
XLM-V (Paper) | 20.6 | 35.9 | 45.9 | 25.0 | 48.7 | 10.4 | 38.2 | 44.0 | 16.7 | 35.8 | 32.1 |
XLM-V (Reproduced) | 25.3 | 45.7 | 55.6 | 33.2 | 56.1 | 16.5 | 40.7 | 50.8 | 26.3 | 47.2 | 39.7 |
Performance diff between XLM-R and XLM-V on MasakhaNER in the paper is 11.2%. The reproduced experiments give a performance diff of 16.8%.
So I think these experiments show that the model is working and achieves great results on MasakhaNER v1!
I will set up a repository for all these results and conduct more experiments on WikiANN (the second NER downstream task mentioned in the paper).
@patrickvonplaten Do you think the model is then ready to be moved to the Meta AI org? I've also written an initial model card.
Here's the comparison on WikiANN zero-shot (see Table 10 in the XLM-V paper):
Model | ro | gu | pa | lt | az | uk | pl | qu | hu | fi | et | tr | kk | zh | my | yo | sw |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
XLM-R (Paper) | 73.5 | 62.9 | 53.6 | 72.7 | 61.0 | 72.4 | 77.5 | 60.4 | 75.8 | 74.4 | 71.2 | 75.4 | 42.2 | 25.3 | 48.9 | 33.6 | 66.3 |
XLM-R (Reproduced) | 73.8 | 65.5 | 50.6 | 74.3 | 64.0 | 76.5 | 78.4 | 60.8 | 77.7 | 75.9 | 73.0 | 76.4 | 45.2 | 29.8 | 52.3 | 37.6 | 67.0 |
XLM-V (Paper) | 73.8 | 66.4 | 48.7 | 75.6 | 66.7 | 65.7 | 79.5 | 70.0 | 79.5 | 78.7 | 75.0 | 77.3 | 50.4 | 30.2 | 61.5 | 54.2 | 72.4 |
XLM-V (Reproduced) | 77.2 | 65.4 | 53.6 | 74.9 | 66.0 | 69.4 | 79.8 | 66.9 | 79.0 | 77.9 | 76.2 | 76.8 | 48.5 | 28.1 | 58.4 | 62.6 | 71.6 |
Model | th | ko | ka | ja | ru | bg | es | pt | it | fr | fa | ur | mr | hi | bn | el | de |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
XLM-R (Paper) | 5.2 | 49.4 | 65.4 | 21.0 | 63.1 | 76.1 | 70.2 | 77.0 | 76.9 | 76.5 | 44.6 | 51.4 | 61.5 | 67.2 | 69.0 | 73.8 | 74.4 |
XLM-R (Reproduced) | 4.7 | 49.4 | 67.5 | 21.9 | 65.2 | 77.5 | 76.7 | 79.0 | 77.7 | 77.9 | 49.0 | 55.1 | 61.3 | 67.8 | 69.6 | 74.1 | 75.4 |
XLM-V (Paper) | 3.3 | 53.0 | 69.5 | 22.4 | 68.1 | 79.8 | 74.5 | 80.5 | 78.7 | 77.6 | 50.6 | 48.9 | 59.8 | 67.3 | 72.6 | 76.7 | 76.8 |
XLM-V (Reproduced) | 2.6 | 51.6 | 71.2 | 20.6 | 67.8 | 79.4 | 76.2 | 79.9 | 79.5 | 77.5 | 51.7 | 51.5 | 61.9 | 69.2 | 73.2 | 75.9 | 77.1 |
Model | en | nl | af | te | ta | ml | eu | tl | ms | jv | id | vi | he | ar | Avg. |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
XLM-R (Paper) | 83.0 | 80.0 | 75.8 | 49.2 | 56.3 | 61.9 | 57.2 | 69.8 | 68.3 | 59.4 | 48.6 | 67.7 | 53.2 | 43.8 | 61.3 |
XLM-R (Reproduced) | 83.4 | 80.8 | 75.8 | 49.3 | 56.8 | 62.2 | 59.1 | 72.2 | 62.3 | 58.3 | 50.0 | 67.9 | 52.6 | 47.8 | 62.6 |
XLM-V (Paper) | 83.4 | 81.4 | 78.3 | 51.8 | 54.9 | 63.1 | 67.1 | 75.6 | 70.0 | 67.5 | 52.6 | 67.1 | 60.1 | 45.8 | 64.7 |
XLM-V (Reproduced) | 84.1 | 81.3 | 78.9 | 50.9 | 55.9 | 63.0 | 65.7 | 75.9 | 70.8 | 64.8 | 53.9 | 69.6 | 61.1 | 47.2 | 65.0 |
Diff. between XLM-V and XLM-R in the paper: (64.7 - 61.3) = 3.4%. Diff. between reproduced XLM-V and XLM-R: (65.0 - 62.6) = 2.4%.
Same conclusion: the converted/integrated XLM-V works great :hugs:
Great job @stefan-it !!! 🔥
Thanks @mrm8488 !
The repo is btw up here: https://github.com/stefan-it/xlm-v-experiments :)
Thanks a lot for your contribution @stefan-it 🙏
Just transferred the checkpoint to the appropriate organization: https://huggingface.co/facebook/xlm-v-base
However, I feel like it could be beneficial to have a separate model_doc for XLM-V (similar to how we did this for T5v1.1 etc.).
Do you mind opening a PR for that?
Thanks! Closing this issue as the model is now available: https://huggingface.co/docs/transformers/main/en/model_doc/xlm-v.
Amazing work @stefan-it - thanks a lot!
Amazing @stefan-it. Should I add some fine-tuning metrics @patrickvonplaten, as done for other models? I fine-tuned it on XNLI: https://huggingface.co/mrm8488/xlm-v-base-finetuned-xglue-xnli
Model description
XLM-V: Overcoming the Vocabulary Bottleneck in Multilingual Masked Language Models
Large multilingual language models typically rely on a single vocabulary shared across 100+ languages. As these models have increased in parameter count and depth, vocabulary size has remained largely unchanged. This vocabulary bottleneck limits the representational capabilities of multilingual models like XLM-R. In this paper, we introduce a new approach for scaling to very large multilingual vocabularies by de-emphasizing token sharing between languages with little lexical overlap and assigning vocabulary capacity to achieve sufficient coverage for each individual language. Tokenizations using our vocabulary are typically more semantically meaningful and shorter compared to XLM-R. Leveraging this improved vocabulary, we train XLM-V, a multilingual language model with a one million token vocabulary. XLM-V outperforms XLM-R on every task we tested on ranging from natural language inference (XNLI), question answering (MLQA, XQuAD, TyDiQA), and named entity recognition (WikiAnn) to low-resource tasks (Americas NLI, MasakhaNER).
Should work like XLM-RoBERTa.
Open source status
Provide useful links for the implementation
No response