microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
https://aka.ms/GeneralAI
MIT License

mMiniLM embedding layer and tokenizer have different sizes. #313

Closed thomas-happify closed 3 years ago

thomas-happify commented 3 years ago

Describe the bug
Model I am using (UniLM, MiniLM, LayoutLM ...): mMiniLM

The mMiniLM embedding layer and the tokenizer have different sizes.

To Reproduce
Steps to reproduce the behavior:

from transformers import (
    AutoModel,
    XLMRobertaTokenizer,
)

tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")
minilm = AutoModel.from_pretrained("microsoft/Multilingual-MiniLM-L12-H384")

# mMiniLM's word-embedding matrix is larger than the tokenizer vocabulary.
print(minilm.embeddings.word_embeddings, len(tokenizer))

# XLM-R base itself matches the tokenizer size.
xlmr = AutoModel.from_pretrained("xlm-roberta-base")
print(xlmr.embeddings.word_embeddings, len(tokenizer))

Embedding(250037, 384, padding_idx=0) 250002
Embedding(250002, 768, padding_idx=1) 250002

Expected behavior
Shouldn't the embedding vocab_size equal the tokenizer size?

wenhui0924 commented 3 years ago

Hi @thomas-happify, there are some unused tokens (ids 250002-250036) in mMiniLM's vocab; tokens 0-250001 are the same as in XLM-R. There are two ways to fix the issue: 1) refer to our fine-tuning example code on XNLI (it is not based on AutoModel in Transformers, so you may need to modify your code), or 2) remove the unused embeddings (ids 250002-250036) from the mMiniLM checkpoint before loading the model.
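
For illustration, method 2 can be sketched roughly as follows: load the checkpoint, then truncate the word-embedding matrix to the tokenizer's vocabulary size. This is only a rough sketch, not the repo's example code; transformers' resize_token_embeddings(len(tokenizer)) is a shorter alternative.

import torch
from transformers import AutoModel, XLMRobertaTokenizer

tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")
minilm = AutoModel.from_pretrained("microsoft/Multilingual-MiniLM-L12-H384")

# Drop the trailing unused rows (ids 250002-250036) so the embedding
# matrix matches the XLM-R tokenizer size (250002).
old_emb = minilm.embeddings.word_embeddings
new_emb = torch.nn.Embedding(
    len(tokenizer), old_emb.embedding_dim, padding_idx=old_emb.padding_idx
)
with torch.no_grad():
    new_emb.weight.copy_(old_emb.weight[: len(tokenizer)])
minilm.embeddings.word_embeddings = new_emb
minilm.config.vocab_size = len(tokenizer)

print(minilm.embeddings.word_embeddings, len(tokenizer))
# Embedding(250002, 384, padding_idx=0) 250002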

Thanks

thomas-happify commented 3 years ago

@WenhuiWang0824 Thanks a lot! Do you mind explaining why exactly mMiniLM had extra tokens? I just want to understand thoroughly.

Thanks!