huggingface / transformers


Vocab size difference between tokenizer and config for XLMR. #12668

Closed: erip closed this issue 3 years ago

erip commented 3 years ago

Who can help

@LysandreJik maybe?

Information

Model I am using (Bert, XLNet ...): XLM-RoBERTa

To reproduce

Steps to reproduce the behavior:

>>> from transformers.models.xlm_roberta import XLMRobertaConfig
>>> XLMRobertaConfig().vocab_size
30522
>>> from transformers import AutoTokenizer
>>> AutoTokenizer.from_pretrained('xlm-roberta-base').vocab_size
250002

Expected behavior

I expect the vocab sizes to be the same.

LysandreJik commented 3 years ago

Hello! If you want the configuration and the tokenizer to match a given checkpoint, you should load both from the same checkpoint:

>>> from transformers import XLMRobertaConfig
>>> XLMRobertaConfig.from_pretrained('xlm-roberta-base').vocab_size
250002
>>> from transformers import AutoTokenizer
>>> AutoTokenizer.from_pretrained('xlm-roberta-base').vocab_size
250002
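For instance, to double-check that the two stay in sync, you can compare them after loading (a minimal sketch; using AutoConfig here is just one way to do it):

>>> from transformers import AutoConfig, AutoTokenizer
>>> config = AutoConfig.from_pretrained('xlm-roberta-base')
>>> tokenizer = AutoTokenizer.from_pretrained('xlm-roberta-base')
>>> config.vocab_size == tokenizer.vocab_size
True
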
erip commented 3 years ago

Thanks, @LysandreJik. I guess fundamentally my question isn't just "how do I get the expected vocab size", but also "why is the default size wrong"? The vocab of size 30522 comes from BERT; XLM-R has no configuration in which that vocab size is used. Why doesn't the default config reflect the configuration used in the paper?

LysandreJik commented 3 years ago

The issue is that the configuration of this model is a simple wrapper over RoBERTa's, since the model is essentially a copy of RoBERTa.

I do agree that this is misleading, however, as it provides the wrong defaults. We should make the two configurations independent and provide the correct defaults for XLM-R.
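
For context, here is a simplified sketch of how the configuration classes relate (not the exact library source), showing how BERT's default leaks through:

# Simplified sketch, not the actual transformers source.
# XLMRobertaConfig inherits everything from RobertaConfig, which in turn
# reuses BERT-style defaults such as vocab_size=30522.
class BertConfig:
    def __init__(self, vocab_size=30522, **kwargs):
        self.vocab_size = vocab_size

class RobertaConfig(BertConfig):
    model_type = "roberta"

class XLMRobertaConfig(RobertaConfig):
    model_type = "xlm-roberta"

print(XLMRobertaConfig().vocab_size)  # 30522 -- the BERT default, not XLM-R's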

Would you like to open a PR to propose a fix for this?
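
Concretely, the fix could give XLMRobertaConfig its own defaults rather than inheriting RoBERTa's, roughly along these lines (an illustrative sketch, not an actual patch; 250002 matches the released xlm-roberta-base checkpoint):

# Illustrative sketch of a decoupled configuration; not the actual patch.
from transformers import PretrainedConfig

class XLMRobertaConfig(PretrainedConfig):
    model_type = "xlm-roberta"

    def __init__(self, vocab_size=250002, **kwargs):
        super().__init__(**kwargs)
        self.vocab_size = vocab_size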

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.