huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Bug?/Question? Vocab of RoBERTa different from GPT2 #13971

Closed: KaiserWhoLearns closed this issue 3 years ago

KaiserWhoLearns commented 3 years ago

Problem

Model I am using: GPT-2, RoBERTa

The problem arises when I run:

from transformers import GPT2Tokenizer, RobertaTokenizer

print(RobertaTokenizer.from_pretrained('roberta-large').vocab_size)
# Output: 50265
print(GPT2Tokenizer.from_pretrained('gpt2').vocab_size)
# Output: 50257

Expected behavior

Since RoBERTa and GPT-2 share a vocabulary, are they supposed to have the same vocab_size? I'm not sure whether this is a question or a bug, so I'm filing it here. If this is intended, may I ask where the difference comes from?

patil-suraj commented 3 years ago

Hi there! GPT-2 and RoBERTa both use byte-level Byte-Pair Encoding (BPE) for tokenization, but they are different tokenizers, trained on different datasets and with different vocab_size values. So this is not a bug: they share the tokenization method but are essentially different tokenizers.

Also, please use the forum for such questions. Thank you :)
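[Editor's note] A quick way to see the difference for yourself is to compare the two vocabularies directly. The sketch below only assumes the standard get_vocab() API; which exact tokens show up in the difference (RoBERTa's special tokens and any filler entries) may vary with the tokenizer files, so treat the comments as expectations rather than guarantees.

from transformers import GPT2Tokenizer, RobertaTokenizer

gpt2_tok = GPT2Tokenizer.from_pretrained("gpt2")
roberta_tok = RobertaTokenizer.from_pretrained("roberta-large")

# get_vocab() returns a dict mapping token strings to integer ids.
gpt2_vocab = set(gpt2_tok.get_vocab())
roberta_vocab = set(roberta_tok.get_vocab())

print(len(gpt2_vocab), len(roberta_vocab))  # 50257, 50265 (as reported above)

# Inspect the tokens that only one of the two tokenizers knows about;
# for RoBERTa these are expected to be its special tokens (<s>, <pad>,
# </s>, <unk>, <mask>) plus possibly a few filler entries.
print(sorted(roberta_vocab - gpt2_vocab))
print(sorted(gpt2_vocab - roberta_vocab))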

KaiserWhoLearns commented 3 years ago

I see, thank you very much! I was a bit confused because I had seen papers like these (https://aclanthology.org/2020.tacl-1.18.pdf, https://aclanthology.org/2020.emnlp-main.344.pdf) use GPT-2 and RoBERTa on the assumption that they share the vocabulary.
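[Editor's note] One way to reconcile the two views is to tokenize the same string with both tokenizers: if the underlying byte-level BPE merges are shared, the token strings should come out the same even though the integer ids (and the special tokens added around them) differ. A minimal sketch, assuming only the standard tokenize()/encode() API:

from transformers import GPT2Tokenizer, RobertaTokenizer

gpt2_tok = GPT2Tokenizer.from_pretrained("gpt2")
roberta_tok = RobertaTokenizer.from_pretrained("roberta-large")

text = "Hello world!"

# Token strings: identical if the two tokenizers share their BPE merges.
print(gpt2_tok.tokenize(text))
print(roberta_tok.tokenize(text))

# Token ids: differ, because the vocab files map the same strings to
# different integers and RoBERTa wraps the sequence in <s> ... </s> by default.
print(gpt2_tok.encode(text))
print(roberta_tok.encode(text))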