Closed: erip closed this issue 3 years ago
Hello! If you want the configuration and tokenizer to match the same checkpoint, you should load them from the same checkpoint:
>>> from transformers import XLMRobertaConfig
>>> XLMRobertaConfig.from_pretrained('xlm-roberta-base').vocab_size
250002
>>> from transformers import AutoTokenizer
>>> AutoTokenizer.from_pretrained('xlm-roberta-base').vocab_size
250002
Thanks, @LysandreJik. I guess fundamentally my question isn't just "how do I get the expected vocab size", but also "why is the default size wrong?" The vocabulary of size 30522 is BERT's; XLM-R has no configuration in which that vocab size is used. Why doesn't the default config reflect the configuration used in the paper?
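For illustration, a freshly constructed XLMRobertaConfig simply inherits the BERT-style default vocabulary size (outputs as observed with transformers 4.8.2; later releases may change these defaults):
>>> from transformers import BertConfig, XLMRobertaConfig
>>> BertConfig().vocab_size
30522
>>> XLMRobertaConfig().vocab_size
30522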
The issue is that the configuration of this model is a simple wrapper over RoBERTa's, since the model is basically a copy of that one.
I do agree that this is misleading, however, as it exposes the wrong defaults. We should make the two configurations independent and provide the correct defaults for XLM-R.
Would you like to open a PR to propose a fix for this?
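For reference, here is a rough sketch of the kind of change such a PR could make: giving XLMRobertaConfig its own defaults rather than inheriting BERT's through RobertaConfig. This is only an illustration; the default values below are taken from the xlm-roberta-base checkpoint, and the actual class layout chosen in transformers may well differ:

from transformers import RobertaConfig

class XLMRobertaConfig(RobertaConfig):
    model_type = "xlm-roberta"

    def __init__(self, vocab_size=250002, max_position_embeddings=514,
                 type_vocab_size=1, layer_norm_eps=1e-5, **kwargs):
        # Override the BERT-derived defaults with values matching the
        # published xlm-roberta-base checkpoint; everything else is
        # forwarded to RobertaConfig unchanged.
        super().__init__(
            vocab_size=vocab_size,
            max_position_embeddings=max_position_embeddings,
            type_vocab_size=type_vocab_size,
            layer_norm_eps=layer_norm_eps,
            **kwargs)

With defaults like these, XLMRobertaConfig().vocab_size would return 250002 out of the box.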
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Environment info
transformers version: 4.8.2
Who can help
@LysandreJik maybe?
Information
Model I am using (Bert, XLNet ...): XLM-RoBERTa
To reproduce
Steps to reproduce the behavior:
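The original reproduction code does not appear above; a minimal sketch of the comparison described in this issue (outputs as of transformers 4.8.2) would be:
>>> from transformers import XLMRobertaConfig, AutoTokenizer
>>> XLMRobertaConfig().vocab_size  # default config, no checkpoint loaded
30522
>>> AutoTokenizer.from_pretrained('xlm-roberta-base').vocab_size
250002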
Expected behavior
I expect the vocab sizes to be the same.