microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
https://aka.ms/GeneralAI
MIT License

mMiniLM embedding layer and tokenizer have different sizes. #313

Closed thomas-happify closed 3 years ago

thomas-happify commented 3 years ago

Describe the bug
Model I am using (UniLM, MiniLM, LayoutLM ...): mMiniLM

The mMiniLM embedding layer and the tokenizer have different sizes.

To Reproduce
Steps to reproduce the behavior:

from transformers import (
    AutoModel,
    XLMRobertaTokenizer,
)

tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")
minilm = AutoModel.from_pretrained("microsoft/Multilingual-MiniLM-L12-H384")

# mMiniLM's word-embedding matrix is larger than the tokenizer vocabulary.
print(minilm.embeddings.word_embeddings, len(tokenizer))

# XLM-R base itself matches the tokenizer size.
xlmr = AutoModel.from_pretrained("xlm-roberta-base")
print(xlmr.embeddings.word_embeddings, len(tokenizer))

Embedding(250037, 384, padding_idx=0) 250002
Embedding(250002, 768, padding_idx=1) 250002

Expected behavior
Shouldn't the embedding vocab_size equal the tokenizer size?

wenhui0924 commented 3 years ago

Hi @thomas-happify, there are some unused tokens (ids 250002-250036) in mMiniLM's vocab; tokens 0-250001 are the same as in XLM-R. There are two ways to fix the issue: 1) refer to our fine-tuning example code on XNLI (it is not based on AutoModel in Transformers, so you may need to modify your code), or 2) remove the unused embeddings (ids 250002-250036) from the mMiniLM checkpoint before loading the model.
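
For illustration, method 2 can be sketched roughly as follows: load the checkpoint, then truncate the word-embedding matrix to the tokenizer's vocabulary size. This is only a rough sketch, not the repo's example code; transformers' resize_token_embeddings(len(tokenizer)) is a shorter alternative.

import torch
from transformers import AutoModel, XLMRobertaTokenizer

tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")
minilm = AutoModel.from_pretrained("microsoft/Multilingual-MiniLM-L12-H384")

# Drop the trailing unused rows (ids 250002-250036) so the embedding
# matrix matches the XLM-R tokenizer size (250002).
old_emb = minilm.embeddings.word_embeddings
new_emb = torch.nn.Embedding(
    len(tokenizer), old_emb.embedding_dim, padding_idx=old_emb.padding_idx
)
with torch.no_grad():
    new_emb.weight.copy_(old_emb.weight[: len(tokenizer)])
minilm.embeddings.word_embeddings = new_emb
minilm.config.vocab_size = len(tokenizer)

print(minilm.embeddings.word_embeddings, len(tokenizer))
# Embedding(250002, 384, padding_idx=0) 250002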

Thanks

thomas-happify commented 3 years ago

@WenhuiWang0824 Thanks a lot! Do you mind explaining why exactly mMiniLM had extra tokens? I just want to understand thoroughly.

Thanks!