huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

[XLM-R] by Facebook AI Research #1769

Closed TheEdoardo93 closed 4 years ago

TheEdoardo93 commented 4 years ago

🌟 New model addition

Model description

Yesterday, Facebook AI Research open-sourced its new cross-lingual language model, XLM-R (XLM-RoBERTa), with the accompanying paper on arXiv. The model uses self-supervised training techniques to achieve state-of-the-art performance in cross-lingual understanding, a task in which a model is trained in one language and then used with other languages without additional training data. According to the announcement, it improves upon previous multilingual approaches by incorporating more training data and languages, including so-called low-resource languages, which lack extensive labeled and unlabeled data sets.

Open Source status

Additional context

Facebook's blog post says the following about this new model:

XLM-R represents an important step toward our vision of providing the best possible experience on our platforms for everyone, regardless of what language they speak

We hope to improve the performance of multilingual models created by the research community, particularly systems that use self-supervised training methods to better understand low-resource languages.

XLM-R has been trained on 2.5TB of data across 100 languages, filtered from Common Crawl.

julien-c commented 4 years ago

cc @aconneau 😬

ricardorei commented 4 years ago

is there any update in the XLM-R model?

ngoyal2707 commented 4 years ago

Let me know if you need some help in porting the XLM-R models to HF.

stefan-it commented 4 years ago

I think this may not be the correct way, but I adjusted the convert_roberta_original_pytorch_checkpoint_to_pytorch.py script to convert the fairseq model into a Transformers-compatible model file. I used the sentencepiece BPE loader and adjusted the vocab size.

Then I used the CamemBERT model class to perform some evaluations on NER. But the results are not really good (I tried to replicate the CoNLL-2003 results for English).

So I guess it is not as simple as this first attempt 😅


Gist for the conversion script is here.

The CamemBERT model configuration looks pretty much the same as XLM-R large?!
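
For context, a minimal sketch (not the gist's actual code, and attribute names may differ across fairseq versions) of loading the fairseq checkpoint and reading off the values the Transformers config and tokenizer have to match; the checkpoint path is a placeholder:

```python
# Minimal sketch: load the fairseq XLM-R checkpoint and print the values the
# converted Transformers config and sentencepiece vocab size have to match.
# The path is a placeholder; attribute names follow 2019-era fairseq.
from fairseq.models.roberta import XLMRModel

xlmr = XLMRModel.from_pretrained('/path/to/xlmr.large', checkpoint_file='model.pt')
xlmr.eval()

# Vocab size including special tokens; the converted embedding matrix and the
# sentencepiece-based tokenizer both have to agree with this number.
print(len(xlmr.task.source_dictionary))

# Quick sanity check that encoding round-trips through sentencepiece.
tokens = xlmr.encode('Hello world!')
print(tokens)
print(xlmr.decode(tokens))
```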

CZWin32768 commented 4 years ago

Hi @stefan-it, do you have any update for your attempt?

stefan-it commented 4 years ago

The final models have been released today 😍

https://github.com/pytorch/fairseq/tree/master/examples/xlmr

So I'm going to try the conversion with these models tomorrow/in the next days :)
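
For anyone who wants a quick sanity check before conversion, the released checkpoints can be loaded straight from the fairseq repo linked above, following the example in the fairseq README:

```python
# Load the released XLM-R checkpoints via torch.hub (as shown in the fairseq
# README); useful as a reference model before/after conversion.
import torch

xlmr = torch.hub.load('pytorch/fairseq', 'xlmr.large')
xlmr.eval()

tokens = xlmr.encode('Hello world!')        # sentencepiece ids incl. <s>/</s>
features = xlmr.extract_features(tokens)    # shape: (1, seq_len, 1024)
print(tokens)
print(features.shape)
```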

stefan-it commented 4 years ago

I think the model conversion is done correctly. But: the CamembertTokenizer implementation can't be used as-is, because it adds extra special tokens, so the token ids no longer line up. I had to modify the tokenizer to match the output of fairseq's tokenization/.encode() method :) I'll report back with some NER results later.

Update: I could achieve 90.41% on CoNLL-2003 (English); the paper reports 92.74 (using Flair).
Update 2: Using the run_ner.py example (incl. some hours of tokenization debugging...): 96.22 (dev) and 91.91 (test).
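
A rough illustration of the kind of check involved (a sketch using the XLMRobertaTokenizer class that eventually landed in the library, not the actual debugging code; the fairseq path is a placeholder): compare fairseq's .encode() output with the Hugging Face tokenizer and make sure the ids, including the special tokens, line up.

```python
# Sketch of the tokenization check described above.
from fairseq.models.roberta import XLMRModel
from transformers import XLMRobertaTokenizer

xlmr = XLMRModel.from_pretrained('/path/to/xlmr.base', checkpoint_file='model.pt')
tokenizer = XLMRobertaTokenizer.from_pretrained('xlm-roberta-base')

sentence = "Hello world!"
fairseq_ids = xlmr.encode(sentence).tolist()  # starts with <s> (0), ends with </s> (2)
hf_ids = tokenizer.encode(sentence)           # should produce the same ids
print(fairseq_ids)
print(hf_ids)
assert fairseq_ids == hf_ids, "special-token handling differs"
```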

ricardorei commented 4 years ago

Btw I was using the XLM-R v0 checkpoints in a project I'm working on and the v0 checkpoints worked slightly better than the checkpoints added today. Is it possible to also add the older checkpoints?

TheEdoardo93 commented 4 years ago

I think the best solution is to offer both checkpoint versions! In my opinion, the ideal case is that, as with other models in Transformers, you can select which version of the XLM-R checkpoints to use, e.g.

> from transformers import XLMRModel
> base_model = XLMRModel.from_pretrained('xlmr-base') # 250M parameters
> large_model = XLMRModel.from_pretrained('xlmr-large') # 560M parameters
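
For reference, the class and checkpoint names that eventually shipped in Transformers differ slightly; a sketch with the released identifiers:

```python
# Class and checkpoint identifiers as released in transformers.
from transformers import XLMRobertaModel

base_model = XLMRobertaModel.from_pretrained('xlm-roberta-base')    # 12-layer base model
large_model = XLMRobertaModel.from_pretrained('xlm-roberta-large')  # 24-layer large model
```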


ricardorei commented 4 years ago

Btw, using XLM-R I ran into this issue: Batch size affecting output (#2401).

This is really annoying and makes it hard to use the model.
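
A rough way to see the effect (a sketch, not the exact code from #2401): run the same sentence alone and inside a padded batch and compare the resulting vectors.

```python
# Sketch reproducing the batch-size effect discussed in #2401: the same
# sentence encoded alone vs. inside a padded batch gives slightly different
# vectors; small numerical differences typically show up here.
import torch
from transformers import XLMRobertaModel, XLMRobertaTokenizer

tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")
model = XLMRobertaModel.from_pretrained("xlm-roberta-base").eval()

with torch.no_grad():
    alone = model(**tokenizer(["A short sentence."], return_tensors="pt"))
    batched = model(**tokenizer(
        ["A short sentence.", "A much longer second sentence that forces padding of the first one."],
        padding=True, return_tensors="pt"))

# Compare the <s> embedding of the first sentence in both settings.
diff = (alone.last_hidden_state[0, 0] - batched.last_hidden_state[0, 0]).abs().max()
print(diff)
```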

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

mohammedayub44 commented 4 years ago

@ricardorei Did you happen to successfully use the XLM-R model?

I'm trying to see how this model can be used as a pretraining step for NMT tasks. I tried the raw version from Facebook's XLM repo and ran into multiple OOM issues.

The best suggestion I've gotten so far is to try the smaller fairseq XLM-R version (base) on a p3dn.24xlarge instance, or to go the Google TPU PyTorch route.

Thanks !

ricardorei commented 4 years ago

@mohammedayub44

I am using the base model, which runs well on a 12GB GPU with a batch size of 8. Depending on your implementation and task you can run even bigger batches (16 or 24, for example).

And I am also using the version directly from Fairseq, because you can load the v0 checkpoint.

I could never figure out the variability in my predictions with different batch sizes. Probably some floating-point precision issue going on under the hood. It doesn't change overall performance, but it is annoying...
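
In case it helps, a minimal sketch of how that looks (the path is a placeholder pointing at a downloaded v0 base checkpoint directory; not my actual project code):

```python
# Minimal sketch: load a locally downloaded v0 base checkpoint directly
# through fairseq, as described above, and extract sentence features.
import torch
from fairseq.models.roberta import XLMRModel

xlmr = XLMRModel.from_pretrained('/path/to/xlmr.base.v0', checkpoint_file='model.pt')
xlmr.eval()

with torch.no_grad():
    tokens = xlmr.encode('The base model fits comfortably on a 12GB GPU.')
    features = xlmr.extract_features(tokens)   # shape: (1, seq_len, 768)
print(features.shape)
```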

foxik commented 4 years ago

BTW, I am using the TF variant from https://huggingface.co/jplu/tf-xlm-roberta-base and https://huggingface.co/jplu/tf-xlm-roberta-large. I have successfully fine-tuned even the large model on a 16GB GPU, and it performed substantially better than the base model (on Czech Q&A).
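
A small sketch of loading those community TF checkpoints (requires TensorFlow; the tokenizer is taken from the canonical xlm-roberta-base repo, on the assumption that the jplu checkpoints use the same sentencepiece vocabulary):

```python
# Sketch of loading the community TF checkpoints mentioned above.
from transformers import TFXLMRobertaModel, XLMRobertaTokenizer

tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")
model = TFXLMRobertaModel.from_pretrained("jplu/tf-xlm-roberta-base")

inputs = tokenizer("Ahoj světe!", return_tensors="tf")
outputs = model(inputs)
print(outputs.last_hidden_state.shape)   # (1, seq_len, 768)
```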

mohammedayub44 commented 4 years ago

@ricardorei Thanks for the confirmation. I'm okay with the v0 checkpoints; I just need to check whether the model can be fine-tuned for NMT. I'm guessing you're fine-tuning for classification tasks.

If you could share the preprocessing and training commands you are using, it would be easier than digging deep into every fairseq hyperparameter.

Thanks !

mohammedayub44 commented 4 years ago

@foxik Is the TF variant more suitable for fine-tuning? Were there any particular preprocessing steps you carried out? If you can share them, I can map the same onto the NMT task.

Thanks !

ricardorei commented 4 years ago

@mohammedayub44 Yes, I was using it for classification/regression. In your case, you need both the encoder and decoder parts, which would take a lot more space. I would suggest that you share parameters between your encoder and decoder.

I know that, with the right hyperparameters, you can achieve good results by sharing the parameters between your encoder and decoder -> A Simple and Effective Approach to Automatic Post-Editing with Transfer Learning

The hyperparameters I am using are very simple: I freeze the encoder for 1 epoch while fine-tuning the classification head, and then I fine-tune the entire model. My classification head has a learning rate of 0.00003 while XLM-R has 0.00001. The optimizer is standard Adam. This combination of gradual unfreezing and discriminative learning rates works well for my task.
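
A minimal sketch of that schedule (assuming xlm-roberta-base, a toy regression head, and dummy data; this is not my exact training code):

```python
# Gradual unfreezing + discriminative learning rates as described above.
import torch
from transformers import XLMRobertaModel, XLMRobertaTokenizer

tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")
encoder = XLMRobertaModel.from_pretrained("xlm-roberta-base")
head = torch.nn.Linear(encoder.config.hidden_size, 1)   # toy regression head

# Discriminative learning rates: 3e-5 for the head, 1e-5 for XLM-R.
optimizer = torch.optim.Adam([
    {"params": head.parameters(), "lr": 3e-5},
    {"params": encoder.parameters(), "lr": 1e-5},
])

texts = ["a toy example", "another toy example"]
targets = torch.tensor([[0.2], [0.8]])                  # dummy regression targets
batch = tokenizer(texts, padding=True, return_tensors="pt")

for epoch in range(3):
    # Epoch 0: encoder frozen, only the head is updated; afterwards unfreeze.
    for p in encoder.parameters():
        p.requires_grad = epoch > 0

    pooled = encoder(**batch).last_hidden_state[:, 0]   # <s> token representation
    loss = torch.nn.functional.mse_loss(head(pooled), targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print(epoch, loss.item())
```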

mohammedayub44 commented 4 years ago

@ricardorei Thanks for sharing the paper. Some interesting results there. Any hints on how I can set up both the encoder and decoder from XLM-R and share their parameters using the HuggingFace library? I could only find LM fine-tuning examples and a notebook file, nothing on NMT-based fine-tuning.
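
In case it's useful, a hedged sketch of one way to get a shared-parameter XLM-R encoder-decoder with the EncoderDecoderModel class that Transformers later added (whether this fits the NMT setup you need is an assumption):

```python
# Sketch: warm-start a seq2seq model from XLM-R and tie the encoder and
# decoder weights so their parameters are shared, as suggested above.
from transformers import EncoderDecoderModel

model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "xlm-roberta-base", "xlm-roberta-base", tie_encoder_decoder=True
)

# parameters() yields each shared tensor once, so this confirms that tying
# roughly halved the size compared to an untied encoder-decoder.
total = sum(p.numel() for p in model.parameters())
print(f"{total / 1e6:.0f}M parameters")
```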