kingbone9 opened this issue 2 years ago
Hi @kingbone9,
this is a really good question. The original SPM vocab has a size of 128,000:
```python
import sentencepiece as spm

vocab_file = "./spm.model"
sp_model = spm.SentencePieceProcessor()
sp_model.Load(vocab_file)
print(sp_model.vocab_size())
```

returns `128000`.
The DeBERTa config uses `128100`, so 100 extra tokens are added. I've also seen this behaviour when pretraining my own model: the SPM model had a vocab size of 32,000 and I defined 32,000 as `vocab_size` in the configuration file, but pretraining crashed with strange CUDA errors. After using 32,100 in the configuration file, it works.
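A minimal sketch of what presumably goes wrong there (my interpretation, not verified against the DeBERTa code): if the embedding table is sized to the raw SPM vocab but the batch contains token ids at or beyond that size (e.g. an added special token), the lookup is out of range. On CPU this raises a clear `IndexError`; on GPU it typically surfaces as an opaque device-side assert, i.e. a "strange CUDA error":

```python
import torch
import torch.nn as nn

# Hypothetical repro: embedding sized to the raw SPM vocab (32,000),
# while the input contains an id >= 32,000 (e.g. an added special token).
emb = nn.Embedding(num_embeddings=32_000, embedding_dim=768)
ids = torch.tensor([[5, 17, 32_050]])  # 32_050 does not fit into the table

try:
    emb(ids)
except IndexError as e:
    # On CPU: "index out of range in self"; on CUDA the same situation
    # tends to show up as a device-side assert instead.
    print("out-of-range token id:", e)
```

Sizing the table to 32,100 leaves room for such ids, which would explain why the larger `vocab_size` makes the crash go away.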
I still need to investigate where exactly these +100 extra tokens come from in the code. I will report back when I find it!
When I run this code, it shows that the length of the tokenizer is 128001.
But when I load the model parameters, the embedding layer's vocab_size is 128100.
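For reference, here is a small sketch that reproduces both numbers side by side. I'm assuming the `microsoft/deberta-v3-base` checkpoint here; adjust it to whichever model you are loading:

```python
from transformers import AutoConfig, AutoModel, AutoTokenizer

name = "microsoft/deberta-v3-base"  # assumed checkpoint; other DeBERTa-v3 variants should behave the same way

tokenizer = AutoTokenizer.from_pretrained(name)
config = AutoConfig.from_pretrained(name)
model = AutoModel.from_pretrained(name)

print(len(tokenizer))                                # tokenizer length, e.g. 128001 (SPM vocab plus added special token)
print(config.vocab_size)                             # 128100 in the released config
print(model.get_input_embeddings().weight.shape[0])  # embedding rows follow config.vocab_size, not the tokenizer
```

So the mismatch is between the tokenizer (driven by the SPM model plus added tokens) and the embedding matrix (driven by `config.vocab_size`); as far as I can tell, the extra rows are simply never addressed by any token id the tokenizer produces.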