microsoft / DeBERTa

The implementation of DeBERTa
MIT License

Embedding layer vocab size does not match tokenizer length #103

Open kingbone9 opened 2 years ago

kingbone9 commented 2 years ago

When I run this code, it shows that the length of the tokenizer is 128001:

from transformers import AutoModelForSequenceClassification, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('microsoft/deberta-v3-large')
len(tokenizer)

But when I load the model, the vocab size of the embedding layer is 128100:

DebertaV2Model(
  (embeddings): DebertaV2Embeddings(
    (word_embeddings): Embedding(128100, 1024, padding_idx=0)
    (LayerNorm): LayerNorm((1024,), eps=1e-07, elementwise_affine=True)
    (dropout): StableDropout()
  )
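
Loading the checkpoint and printing both numbers side by side shows the mismatch directly (a minimal sketch; any DeBERTa-v3 model class gives the same result):

from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = 'microsoft/deberta-v3-large'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

print(len(tokenizer))                               # 128001 -> tokenizer length
print(model.config.vocab_size)                      # 128100 -> value from config.json
print(model.get_input_embeddings().num_embeddings)  # 128100 -> rows of word_embeddings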
stefan-it commented 1 year ago

Hi @kingbone9,

this is a really good question. The original SPM vocab has a size of 128,000:

import sentencepiece as spm

vocab_file = "./spm.model"
sp_model = spm.SentencePieceProcessor()
sp_model.Load(vocab_file)
print(sp_model.vocab_size())

returns 128000.
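
The Hugging Face tokenizer length (128001) is one more than the raw SPM vocab; as far as I can tell that single extra id is the added [MASK] token, which you can check with something like this (untested sketch):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('microsoft/deberta-v3-large')
print(tokenizer.mask_token_id)                    # 128000, appended after the 128,000 SPM pieces
print(tokenizer.convert_ids_to_tokens([128000]))  # ['[MASK]']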

The DeBERTa config uses 128100, so 100 extra ids are reserved on top of the SPM vocab. I've also seen this behaviour when pretraining my own model: my tokenizer had a vocab size of 32,000 and I set vocab_size to 32,000 in the configuration file, but pretraining crashed with strange CUDA errors. After setting it to 32,100 in the configuration file, it worked.
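
The crash itself is most likely just an out-of-range lookup into a too-small embedding table, which on GPU tends to surface as an opaque device-side assert rather than a clean IndexError. A minimal sketch of that failure mode (hypothetical sizes, not DeBERTa-specific code):

import torch
import torch.nn as nn

# embedding table sized to the raw SPM vocab only
emb = nn.Embedding(32000, 768)

# any id >= 32000 (e.g. an added special token) falls outside the table
ids = torch.tensor([[5, 31999, 32000]])
emb(ids)  # IndexError on CPU; on CUDA this shows up as "device-side assert triggered"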

I still need to investigate where exactly these +100 extra tokens come from in the code. I will report back when I've found it!