microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
https://aka.ms/GeneralAI
MIT License

LayoutXLM model Special tokens have been added in the vocabulary. #625

Open oguz-akkas-deepsee opened 2 years ago

oguz-akkas-deepsee commented 2 years ago

Model I am using: LayoutXLM

I am trying to fine-tune the model on question-answer pairs. With "microsoft/layoutlmv2-base-uncased" there is no problem with the model or data. When I switch to "microsoft/layoutxlm-base" I get the warning "Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained", and tokenization then fails with "ValueError: Id not recognized" at this call:

```python
encoding = tokenizer(
    questions,
    words,
    boxes,
    max_length=max_length,
    padding="max_length",
    truncation=True,
)
```

If I add `add_special_tokens=False` to that call, I instead get "ValueError: 250005 is not in list".

oguz-akkas-deepsee commented 2 years ago

When I use "tokenizer = LayoutXLMTokenizer.from_pretrained(model_checkpoint_xml)" instead of "tokenizer = AutoTokenizer.from_pretrained(model_checkpoint_xml)", I don't see the above error. However, now I can't get sequence_ids, since it raises "ValueError: sequence_ids() is not available when using Python-based tokenizers". When I use LayoutLMv2Tokenizer, it does give me sequence_ids. Not sure what I am missing here.
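That error is a general limitation of the slow (Python-based) tokenizer classes: `sequence_ids()` is only exposed by the Rust-backed "fast" tokenizers. As a rough workaround when only a slow tokenizer is available, the question/context mapping can be reconstructed from the special-token positions. A minimal sketch in plain Python (the helper name and token strings are illustrative, not part of transformers), assuming the XLM-R-style pair layout `<s> A </s></s> B </s>` that LayoutXLM inherits:

```python
def sequence_ids_from_pairs(tokens, sep_token="</s>", cls_token="<s>"):
    """Reconstruct per-token sequence ids for an XLM-R-style pair encoding.

    Returns None for special tokens, 0 for tokens of the first sequence
    (the question), and 1 for tokens of the second sequence (the words).
    """
    ids = []
    seps_seen = 0  # tokens before the first </s> belong to sequence 0
    for tok in tokens:
        if tok in (cls_token, sep_token):
            ids.append(None)  # special tokens carry no sequence id
            if tok == sep_token:
                seps_seen += 1
        else:
            ids.append(0 if seps_seen == 0 else 1)
    return ids

# Toy tokenized pair: "<s> what date </s></s> Invoice 2021 </s>"
tokens = ["<s>", "what", "date", "</s>", "</s>", "Invoice", "2021", "</s>"]
print(sequence_ids_from_pairs(tokens))
# → [None, 0, 0, None, None, 1, 1, None]
```

The same mask is what fast tokenizers return from `encoding.sequence_ids()`, so downstream QA preprocessing that only needs to tell question tokens from context tokens can use it as a drop-in substitute.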