microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
https://aka.ms/GeneralAI
MIT License

LayoutXLM model Special tokens have been added in the vocabulary. #625

Open oguz-akkas-deepsee opened 2 years ago

oguz-akkas-deepsee commented 2 years ago

Model I am using: LayoutXLM

I am trying to fine-tune the model on question-answer pairs. With "microsoft/layoutlmv2-base-uncased" there is no problem with the model or data. When I switch to "microsoft/layoutxlm-base" I get the warning "Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained", and tokenization then fails with "ValueError: Id not recognized" at this call:

```python
encoding = tokenizer(
    questions,
    words,
    boxes,
    max_length=max_length,
    padding="max_length",
    truncation=True,
)
```

If I add `add_special_tokens=False` to that call, I instead get "ValueError: 250005 is not in list".

oguz-akkas-deepsee commented 2 years ago

When I use "tokenizer = LayoutXLMTokenizer.from_pretrained(model_checkpoint_xml)" instead of "tokenizer = AutoTokenizer.from_pretrained(model_checkpoint_xml)", I don't see the above error. However, now I can't get sequence_ids, since it raises "ValueError: sequence_ids() is not available when using Python-based tokenizers". When I use LayoutLMv2Tokenizer, it does give me sequence_ids. Not sure what I am missing here.
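That error is a general limitation of the slow (Python-based) tokenizer classes: `sequence_ids()` is only exposed by the Rust-backed "fast" tokenizers. As a rough workaround when only a slow tokenizer is available, the question/context mapping can be reconstructed from the special-token positions. A minimal sketch in plain Python (the helper name and token strings are illustrative, not part of transformers), assuming the XLM-R-style pair layout `<s> A </s></s> B </s>` that LayoutXLM inherits:

```python
def sequence_ids_from_pairs(tokens, sep_token="</s>", cls_token="<s>"):
    """Reconstruct per-token sequence ids for an XLM-R-style pair encoding.

    Returns None for special tokens, 0 for tokens of the first sequence
    (the question), and 1 for tokens of the second sequence (the words).
    """
    ids = []
    seps_seen = 0  # tokens before the first </s> belong to sequence 0
    for tok in tokens:
        if tok in (cls_token, sep_token):
            ids.append(None)  # special tokens carry no sequence id
            if tok == sep_token:
                seps_seen += 1
        else:
            ids.append(0 if seps_seen == 0 else 1)
    return ids

# Toy tokenized pair: "<s> what date </s></s> Invoice 2021 </s>"
tokens = ["<s>", "what", "date", "</s>", "</s>", "Invoice", "2021", "</s>"]
print(sequence_ids_from_pairs(tokens))
# → [None, 0, 0, None, None, 1, 1, None]
```

The same mask is what fast tokenizers return from `encoding.sequence_ids()`, so downstream QA preprocessing that only needs to tell question tokens from context tokens can use it as a drop-in substitute.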