huggingface / transformers


Vocab size difference between tokenizer and config for XLMR. #12668

Closed: erip closed this issue 3 years ago

erip commented 3 years ago

Who can help

@LysandreJik maybe?

Information

Model I am using (Bert, XLNet ...): XLM-RoBERTa

To reproduce

Steps to reproduce the behavior:

>>> from transformers.models.xlm_roberta import XLMRobertaConfig
>>> XLMRobertaConfig().vocab_size
30522
>>> from transformers import AutoTokenizer
>>> AutoTokenizer.from_pretrained('xlm-roberta-base').vocab_size
250002

Expected behavior

I expect the vocab sizes to be the same.

LysandreJik commented 3 years ago

Hello! If you want the configuration and the tokenizer to match a given checkpoint, you should load both from the same checkpoint:

>>> from transformers import XLMRobertaConfig
>>> XLMRobertaConfig.from_pretrained('xlm-roberta-base').vocab_size
250002
>>> from transformers import AutoTokenizer
>>> AutoTokenizer.from_pretrained('xlm-roberta-base').vocab_size
250002
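For instance, to double-check that the two stay in sync, you can compare them after loading (a minimal sketch; using AutoConfig here is just one way to do it):

>>> from transformers import AutoConfig, AutoTokenizer
>>> config = AutoConfig.from_pretrained('xlm-roberta-base')
>>> tokenizer = AutoTokenizer.from_pretrained('xlm-roberta-base')
>>> config.vocab_size == tokenizer.vocab_size
True
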
erip commented 3 years ago

Thanks, @LysandreJik. I guess fundamentally my question isn't just "how do I get the expected vocab size", but also "why is the default size wrong"? The vocab of size 30522 comes from BERT; XLM-R has no configuration in which that vocab size is used. Why doesn't the default config reflect the configuration used in the paper?

LysandreJik commented 3 years ago

The issue is that the configuration of this model is a simple wrapper over RoBERTa's, since the model is essentially a copy of RoBERTa.

I do agree that this is misleading, however, as it provides the wrong defaults. We should make the two configurations independent and provide the correct defaults for XLM-R.
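
For context, here is a simplified sketch of how the configuration classes relate (not the exact library source), showing how BERT's default leaks through:

# Simplified sketch, not the actual transformers source.
# XLMRobertaConfig inherits everything from RobertaConfig, which in turn
# reuses BERT-style defaults such as vocab_size=30522.
class BertConfig:
    def __init__(self, vocab_size=30522, **kwargs):
        self.vocab_size = vocab_size

class RobertaConfig(BertConfig):
    model_type = "roberta"

class XLMRobertaConfig(RobertaConfig):
    model_type = "xlm-roberta"

print(XLMRobertaConfig().vocab_size)  # 30522 -- the BERT default, not XLM-R's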

Would you like to open a PR to propose a fix for this?
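
Concretely, the fix could give XLMRobertaConfig its own defaults rather than inheriting RoBERTa's, roughly along these lines (an illustrative sketch, not an actual patch; 250002 matches the released xlm-roberta-base checkpoint):

# Illustrative sketch of a decoupled configuration; not the actual patch.
from transformers import PretrainedConfig

class XLMRobertaConfig(PretrainedConfig):
    model_type = "xlm-roberta"

    def __init__(self, vocab_size=250002, **kwargs):
        super().__init__(**kwargs)
        self.vocab_size = vocab_size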

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.