cc @aconneau 😬
Is there any update on the XLM-R model?
Let me know if you need some help porting the XLM-R models to HF.
I think that's maybe not the correct way, but I adjusted the `convert_roberta_original_pytorch_checkpoint_to_pytorch.py` script to convert the fairseq model into a transformers-compatible model file. I used the sentencepiece BPE loader and adjusted the vocab size. Then I used the CamemBERT model class to perform some evaluations on NER. But the result is not really good (I tried to replicate CoNLL-2003 for English). So I guess it is not as simple as this first attempt 😅
Gist for the conversion script is here.
The CamemBERT model configuration looks pretty much the same as XLM-R large?!
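For context, a minimal sketch of how such a converted checkpoint could be loaded with the CamemBERT classes (the directory name is made up and assumes the conversion wrote `pytorch_model.bin`, `config.json` and the sentencepiece model there):

```python
from transformers import CamembertModel, CamembertTokenizer

# Hypothetical output directory of the adapted conversion script.
model_dir = "./xlmr-large-converted"

tokenizer = CamembertTokenizer.from_pretrained(model_dir)
model = CamembertModel.from_pretrained(model_dir)

input_ids = tokenizer.encode("Hello world!", return_tensors="pt")
outputs = model(input_ids)
print(outputs[0].shape)  # (batch_size, sequence_length, hidden_size)
```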
Hi @stefan-it, do you have any update for your attempt?
The final models have been released today 😍
https://github.com/pytorch/fairseq/tree/master/examples/xlmr
So I'm going to try the conversion with these models tomorrow/in the next days :)
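For reference, the released checkpoints can be loaded directly through fairseq, following the fairseq XLM-R README (the local path below is a placeholder for wherever the downloaded archive was unpacked):

```python
from fairseq.models.roberta import XLMRModel

# The archive unpacks to a directory containing model.pt and the sentencepiece model.
xlmr = XLMRModel.from_pretrained("/path/to/xlmr.large", checkpoint_file="model.pt")
xlmr.eval()

tokens = xlmr.encode("Hello world!")
features = xlmr.extract_features(tokens)
print(features.shape)  # (1, number_of_tokens, 1024) for the large model
```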
I think the model conversion is done correctly. But: the `CamembertTokenizer` implementation can't be used, because it adds some special tokens. I had to modify the tokenizer to match the output of the fairseq tokenization/`.encode()` method :) I'll report back some results on NER later.
Update: I could achieve 90.41% on CoNLL-2003 (English); the paper reports 92.74 (using Flair).
Update 2: Using the `run_ner.py` example (incl. some hours of tokenization debugging...): 96.22 (dev) and 91.91 (test).
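A quick way to check for the special-token mismatch described above (a sketch, not the poster's code; it assumes a local fairseq checkpoint and a transformers version that already ships `XLMRobertaTokenizer`):

```python
from fairseq.models.roberta import XLMRModel
from transformers import XLMRobertaTokenizer

fairseq_model = XLMRModel.from_pretrained("/path/to/xlmr.large", checkpoint_file="model.pt")
hf_tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-large")

sentence = "Hello world!"
fairseq_ids = fairseq_model.encode(sentence).tolist()  # fairseq adds its own BOS/EOS
hf_ids = hf_tokenizer.encode(sentence)                 # transformers adds <s> ... </s>

print(fairseq_ids)
print(hf_ids)
# If the special-token handling matches, the two lists should be identical.
```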
Btw I was using the XLM-R v0 checkpoints in a project I'm working on and the v0 checkpoints worked slightly better than the checkpoints added today. Is it possible to also add the older checkpoints?
I think it would be best to offer both checkpoint versions! In my opinion, the ideal case is that, as with other models in Transformers, you can select which version of the XLM-R checkpoints to use, e.g.
> from transformers import XLMRModel
> base_model = XLMRModel.from_pretrained('xlmr-base') # 250M parameters
> large_model = XLMRModel.from_pretrained('xlmr-large') # 560M parameters
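For reference, the identifiers that eventually shipped in transformers use the XLMRoberta* classes; as far as I know only the final checkpoints (not v0) were published on the model hub:

```python
from transformers import XLMRobertaModel

base_model = XLMRobertaModel.from_pretrained("xlm-roberta-base")    # ~250M parameters
large_model = XLMRobertaModel.from_pretrained("xlm-roberta-large")  # ~560M parameters
```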
Btw, using XLM-R I ran into this issue: Batch size affecting output (#2401).
This is really annoying and makes it hard to use the model.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
@ricardorei Did you happen to successfully use the XLM-R model?
I'm trying to see how this model can be used as a pretrained starting point for NMT tasks. I tried the raw version from the Facebook XLM repo and ran into multiple OOM issues.
The best suggestion I've got so far is to try the smaller (base) version of the fairseq XLM-R on a p3dn.24xlarge instance, or the Google TPU PyTorch route.
Thanks!
@mohammedayub44
I am using the base model, which runs well on a 12GB GPU with a batch size of 8. Depending on your implementation and task, you can run even bigger batches (16 or 24, for example).
And I am also using the version directly from fairseq, because you can load the v0 checkpoint.
I could never figure out the variability in my predictions with different batch sizes. Probably some floating-point precision issues going on under the hood. It doesn't change overall performance, but it is annoying...
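For anyone who wants to reproduce the batch-size effect, a small check along these lines might help (a sketch assuming a transformers version with the XLM-R classes and the callable tokenizer API):

```python
import torch
from transformers import XLMRobertaModel, XLMRobertaTokenizer

tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")
model = XLMRobertaModel.from_pretrained("xlm-roberta-base").eval()

single = tokenizer("A short sentence.", return_tensors="pt")
batch = tokenizer(
    ["A short sentence.", "A much longer sentence that forces padding of the first one."],
    return_tensors="pt",
    padding=True,
)

with torch.no_grad():
    out_single = model(**single)[0][0]                       # (seq_len, hidden_size)
    out_batch = model(**batch)[0][0][: out_single.shape[0]]  # same tokens, computed in a batch

# With an attention mask the results should be equal in exact arithmetic;
# in practice there is usually a small floating-point difference.
print(torch.max(torch.abs(out_single - out_batch)))
```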
BTW, I am using the TF variant from https://huggingface.co/jplu/tf-xlm-roberta-base and https://huggingface.co/jplu/tf-xlm-roberta-large. I have successfully fine-tuned even the large model on a 16GB GPU, and it performed substantially better than the base model (on Czech Q&A).
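For reference, a sketch of loading those community TF checkpoints (it assumes TensorFlow is installed and a transformers version that includes the TF XLM-R classes):

```python
from transformers import TFXLMRobertaModel, XLMRobertaTokenizer

# Tokenizer from the canonical XLM-R vocab; weights from the community TF checkpoint.
tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")
model = TFXLMRobertaModel.from_pretrained("jplu/tf-xlm-roberta-base")

inputs = tokenizer("Ahoj světe!", return_tensors="tf")
outputs = model(inputs)
print(outputs[0].shape)  # (1, sequence_length, hidden_size)
```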
@ricardorei Thanks for the confirmation. I'm okay with the v0 checkpoints; I just need to check whether the model can be fine-tuned for NMT. I'm guessing you're fine-tuning for classification tasks.
Could you share the preprocessing and training commands you are using? That would be easier than digging deep into every fairseq hyperparameter.
Thanks !
@foxik Is the TF variant more suitable for fine-tuning? Did you carry out any particular preprocessing steps for fine-tuning? If you can share them, I can map the same for the NMT task.
Thanks !
@mohammedayub44 Yes, I was using it for classification/regression. In your case, you need both the encoder and the decoder, which would take a lot more space. I would suggest that you share parameters between your encoder and decoder.
I know that, with the right hyperparameters, you can achieve good results by sharing the parameters between your encoder and decoder -> A Simple and Effective Approach to Automatic Post-Editing with Transfer Learning
The hyperparameters I am using are very simple: I freeze the encoder for 1 epoch while fine-tuning the classification head, and then I fine-tune the entire model. My classification head has a learning rate of 0.00003, while XLM-R has 0.00001. The optimizer is a standard Adam. This combination of gradual unfreezing with discriminative learning rates works well for my task.
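A minimal sketch of that recipe (gradual unfreezing plus discriminative learning rates); the classification head and class count below are placeholders, not the poster's code:

```python
import torch
from torch import nn
from transformers import XLMRobertaModel

encoder = XLMRobertaModel.from_pretrained("xlm-roberta-base")
head = nn.Linear(encoder.config.hidden_size, 2)  # hypothetical 2-class classification head

# Discriminative learning rates: a smaller rate for the pretrained encoder.
optimizer = torch.optim.Adam([
    {"params": encoder.parameters(), "lr": 1e-5},  # XLM-R: 0.00001
    {"params": head.parameters(), "lr": 3e-5},     # classification head: 0.00003
])

# Epoch 1: freeze the encoder and train only the head.
for p in encoder.parameters():
    p.requires_grad = False

# ... train for one epoch ...

# From epoch 2 on: unfreeze the encoder and fine-tune the whole model.
for p in encoder.parameters():
    p.requires_grad = True
```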
@ricardorei Thanks for sharing the paper. Some interesting results there. Any hints on how I can set up both the encoder and decoder of XLM-R and share the parameters using the HuggingFace library? I could only find LM fine-tuning examples and a notebook; nothing on NMT-based fine-tuning.
🌟New model addition
Model description
Yesterday, Facebook open-sourced its new NLG model called XLM-R (XLM-RoBERTa), described on arXiv. This model uses self-supervised training techniques to achieve state-of-the-art performance in cross-lingual understanding, a task in which a model is trained in one language and then used with other languages without additional training data. The model improves upon previous multilingual approaches by incorporating more training data and languages, including so-called low-resource languages, which lack extensive labeled and unlabeled data sets.
Open Source status
Additional context
Facebook makes these two statements about the new model in their blog post (quoted in the model description above).