microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
https://aka.ms/GeneralAI

Should mMiniLMv2 be paired with the tokenizer of mMiniLMv1? #1493

Open · wencan opened this issue 3 months ago

wencan commented 3 months ago

I downloaded mMiniLMv2. The compressed package contains only the model file, with no tokenizer files. However, judging from the shape of the embedding matrix, it seems that mMiniLMv2 and mMiniLMv1 may use the same tokenizer.

Like this:

from transformers import XLMRobertaTokenizer

# Reuse the tokenizer shipped with mMiniLMv1 (Multilingual-MiniLM-L12-H384)
tokenizer = XLMRobertaTokenizer.from_pretrained("microsoft/Multilingual-MiniLM-L12-H384")
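
One way to sanity-check this is to compare the row count of the checkpoint's word-embedding matrix against the vocabulary size of the XLM-R tokenizer used by mMiniLMv1. A minimal sketch, assuming the archive contains a plain PyTorch state dict; the file name and embedding key below are hypothetical, so inspect state_dict.keys() if they differ:

import torch
from transformers import XLMRobertaTokenizer

# Hypothetical path: the model file extracted from the mMiniLMv2 archive.
state_dict = torch.load("mMiniLMv2-L12-H384.pt", map_location="cpu")

# Locate the word-embedding weight; the exact key name is an assumption
# and may differ depending on how the checkpoint was exported.
emb_key = next(k for k in state_dict if "word_embeddings.weight" in k)
num_rows = state_dict[emb_key].shape[0]

tokenizer = XLMRobertaTokenizer.from_pretrained("microsoft/Multilingual-MiniLM-L12-H384")
print(num_rows, len(tokenizer))  # matching sizes suggest the tokenizers are compatible

The Hugging Face XLM-R tokenizer has 250,002 entries, so if the embedding matrix has roughly that many rows, pairing mMiniLMv2 with the mMiniLMv1 (XLM-R) tokenizer is plausible.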