huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Bug?/Question? Vocab of RoBERTa different from GPT2 #13971

Closed: KaiserWhoLearns closed this issue 3 years ago

KaiserWhoLearns commented 3 years ago

Problem

Model I am using: GPT-2, RoBERTa

The problem arises when I run:

from transformers import GPT2Tokenizer, RobertaTokenizer

print(RobertaTokenizer.from_pretrained('roberta-large').vocab_size)
# Output: 50265
print(GPT2Tokenizer.from_pretrained('gpt2').vocab_size)
# Output: 50257

Expected behavior

Since RoBERTa and GPT-2 share a vocabulary, are they supposed to have the same vocab_size? I'm not sure whether this is a question or a bug, so I'm filing it here. If this is intended, may I ask where the difference comes from?

patil-suraj commented 3 years ago

Hi there! GPT-2 and RoBERTa both use byte-level Byte-Pair Encoding (BPE) for tokenization, but they are different tokenizers, trained on different datasets and with different vocab_size values. So this is not a bug: they share the tokenization method but are essentially different tokenizers.

Also, please use the forum for such questions. Thank you :)
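[Editor's note] A quick way to see the difference for yourself is to compare the two vocabularies directly. The sketch below only assumes the standard get_vocab() API; which exact tokens show up in the difference (RoBERTa's special tokens and any filler entries) may vary with the tokenizer files, so treat the comments as expectations rather than guarantees.

from transformers import GPT2Tokenizer, RobertaTokenizer

gpt2_tok = GPT2Tokenizer.from_pretrained("gpt2")
roberta_tok = RobertaTokenizer.from_pretrained("roberta-large")

# get_vocab() returns a dict mapping token strings to integer ids.
gpt2_vocab = set(gpt2_tok.get_vocab())
roberta_vocab = set(roberta_tok.get_vocab())

print(len(gpt2_vocab), len(roberta_vocab))  # 50257, 50265 (as reported above)

# Inspect the tokens that only one of the two tokenizers knows about;
# for RoBERTa these are expected to be its special tokens (<s>, <pad>,
# </s>, <unk>, <mask>) plus possibly a few filler entries.
print(sorted(roberta_vocab - gpt2_vocab))
print(sorted(gpt2_vocab - roberta_vocab))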

KaiserWhoLearns commented 3 years ago

I see, thank you very much! I was a bit confused because I had seen papers like these (https://aclanthology.org/2020.tacl-1.18.pdf, https://aclanthology.org/2020.emnlp-main.344.pdf) use GPT-2 and RoBERTa on the assumption that they share the vocabulary.
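[Editor's note] One way to reconcile the two views is to tokenize the same string with both tokenizers: if the underlying byte-level BPE merges are shared, the token strings should come out the same even though the integer ids (and the special tokens added around them) differ. A minimal sketch, assuming only the standard tokenize()/encode() API:

from transformers import GPT2Tokenizer, RobertaTokenizer

gpt2_tok = GPT2Tokenizer.from_pretrained("gpt2")
roberta_tok = RobertaTokenizer.from_pretrained("roberta-large")

text = "Hello world!"

# Token strings: identical if the two tokenizers share their BPE merges.
print(gpt2_tok.tokenize(text))
print(roberta_tok.tokenize(text))

# Token ids: differ, because the vocab files map the same strings to
# different integers and RoBERTa wraps the sequence in <s> ... </s> by default.
print(gpt2_tok.encode(text))
print(roberta_tok.encode(text))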