microsoft / DeBERTa

The implementation of DeBERTa
MIT License

Embedding layer vocab size does not match tokenizer length #103

Open kingbone9 opened 2 years ago

kingbone9 commented 2 years ago

When I run this code, it shows that the length of the tokenizer is 128001:

from transformers import AutoModelForSequenceClassification, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('microsoft/deberta-v3-large')
len(tokenizer)

But when I load the model, the vocab size of the embedding layer is 128100:

DebertaV2Model(
  (embeddings): DebertaV2Embeddings(
    (word_embeddings): Embedding(128100, 1024, padding_idx=0)
    (LayerNorm): LayerNorm((1024,), eps=1e-07, elementwise_affine=True)
    (dropout): StableDropout()
  )
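
Loading the checkpoint and printing both numbers side by side shows the mismatch directly (a minimal sketch; any DeBERTa-v3 model class gives the same result):

from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = 'microsoft/deberta-v3-large'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

print(len(tokenizer))                               # 128001 -> tokenizer length
print(model.config.vocab_size)                      # 128100 -> value from config.json
print(model.get_input_embeddings().num_embeddings)  # 128100 -> rows of word_embeddings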
stefan-it commented 1 year ago

Hi @kingbone9,

this is a really good question. The original SPM vocab has a size of 128,000:

import sentencepiece as spm

vocab_file = "./spm.model"
sp_model = spm.SentencePieceProcessor()
sp_model.Load(vocab_file)
print(sp_model.vocab_size())

returns 128000.
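
The Hugging Face tokenizer length (128001) is one more than the raw SPM vocab; as far as I can tell that single extra id is the added [MASK] token, which you can check with something like this (untested sketch):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('microsoft/deberta-v3-large')
print(tokenizer.mask_token_id)                    # 128000, appended after the 128,000 SPM pieces
print(tokenizer.convert_ids_to_tokens([128000]))  # ['[MASK]']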

The DeBERTa config uses 128100, so 100 extra ids are reserved on top of the SPM vocab. I've also seen this behaviour when pretraining my own model: my tokenizer had a vocab size of 32,000 and I set vocab_size to 32,000 in the configuration file, but pretraining crashed with strange CUDA errors. After setting it to 32,100 in the configuration file, it worked.
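
The crash itself is most likely just an out-of-range lookup into a too-small embedding table, which on GPU tends to surface as an opaque device-side assert rather than a clean IndexError. A minimal sketch of that failure mode (hypothetical sizes, not DeBERTa-specific code):

import torch
import torch.nn as nn

# embedding table sized to the raw SPM vocab only
emb = nn.Embedding(32000, 768)

# any id >= 32000 (e.g. an added special token) falls outside the table
ids = torch.tensor([[5, 31999, 32000]])
emb(ids)  # IndexError on CPU; on CUDA this shows up as "device-side assert triggered"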

I still need to investigate where exactly these +100 extra tokens come from in the code. I will report back when I've found it!