oguz-akkas-deepsee opened this issue 2 years ago
When I use `tokenizer = LayoutXLMTokenizer.from_pretrained(model_checkpoint_xml)` instead of `tokenizer = AutoTokenizer.from_pretrained(model_checkpoint_xml)`, I don't see the error described below. However, I now can't get `sequence_ids`, since it says `ValueError: sequence_ids() is not available when using Python-based tokenizers`. When I use `LayoutLMv2Tokenizer`, it gives me `sequence_ids`. Not sure what I am missing here.
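For context on what `sequence_ids()` provides (only the fast, Rust-backed tokenizers implement it): it maps each position in an encoding to the input sequence it came from, with `None` for special tokens. A minimal pure-Python sketch of that mapping, using made-up tokens rather than a real tokenizer:

```python
def toy_sequence_ids(tokens, special_tokens, sep_token):
    """Mimic sequence_ids(): None for special tokens, 0 for tokens
    before the first separator (the question in a QA pair), 1 for
    tokens after it (the document words)."""
    ids = []
    seq = 0
    for tok in tokens:
        if tok in special_tokens:
            ids.append(None)
        else:
            ids.append(seq)
        if tok == sep_token:
            seq = 1  # everything after the first </s> belongs to sequence 1
    return ids

# A toy question/context pair (token strings are illustrative only).
tokens = ["<s>", "what", "is", "this", "</s>", "invoice", "total", "</s>"]
print(toy_sequence_ids(tokens, {"<s>", "</s>"}, "</s>"))
# [None, 0, 0, 0, None, 1, 1, None]
```

If a fast variant is available in your `transformers` version, `LayoutXLMTokenizerFast` (it needs the `tokenizers` and `sentencepiece` packages installed) should expose the real `sequence_ids()`, whereas the slow `LayoutXLMTokenizer` raises the `ValueError` quoted above.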
Describe Model I am using: LayoutXLM
I am trying to fine-tune the model with question-answer pairs. When I use `microsoft/layoutlmv2-base-uncased`, there is no problem with the model or the data. When I switch to `microsoft/layoutxlm-base`, I get the warning "Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained". Once I start tokenization, I get `ValueError: Id not recognized` at this point:

```python
encoding = tokenizer(
    questions,
    words,
    boxes,
    max_length=max_length,
    padding="max_length",
    truncation=True,
)
```

If I add `add_special_tokens=False`, as in:
```python
encoding = tokenizer(
    questions,
    words,
    boxes,
    max_length=max_length,
    padding="max_length",
    truncation=True,
    add_special_tokens=False,
)
```
I get `ValueError: 250005 is not in list`.
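Worth noting: "250005 is not in list" is the standard message Python's `list.index` raises, which suggests (this is an assumption, not confirmed in the traceback above) that some post-processing step searches the encoded ids for a special-token id after `add_special_tokens=False` has removed it. A minimal reproduction of the error's shape, with hypothetical ids:

```python
# Hypothetical token ids; 250005 stands in for the special-token id
# that a post-processing step expects to find in the encoding.
input_ids = [101, 2054, 2003, 102]  # no separator token present
sep_id = 250005

try:
    # Any code doing input_ids.index(sep_id) hits this once the
    # separator has been stripped from the sequence.
    position = input_ids.index(sep_id)
except ValueError as exc:
    print(exc)  # 250005 is not in list
```

That would explain why the error only appears with `add_special_tokens=False`: with special tokens kept, the separator id is present and `index` succeeds.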