Open huschi opened 3 years ago
Have you considered the 512-token limit while working on the classification task?
Hi, I also want to use LayoutLM for custom document classification into 20 categories. Could you point me to the right approach, or to a notebook to start with?
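To make the 512-token question concrete: one common workaround (not something the thread confirms was used) is to split a long document into windows that each fit the limit once [CLS] and [SEP] are added. The sketch below assumes you already have sub-word tokens and their boxes aligned one-to-one; `chunk_tokens_and_boxes` is a hypothetical helper name, not a transformers function.

```python
from typing import List, Tuple

Box = List[int]

# Boxes conventionally used for the special tokens in the LayoutLM examples.
CLS_BOX: Box = [0, 0, 0, 0]
SEP_BOX: Box = [1000, 1000, 1000, 1000]


def chunk_tokens_and_boxes(
    tokens: List[str], boxes: List[Box], max_len: int = 512
) -> List[Tuple[List[str], List[Box]]]:
    """Split aligned token/box lists into windows of at most max_len items.

    Hypothetical helper: each window already includes [CLS] and [SEP],
    so it holds at most max_len - 2 document tokens.
    """
    if len(tokens) != len(boxes):
        raise ValueError("tokens and boxes must be aligned one-to-one")
    if max_len < 3:
        raise ValueError("max_len must leave room for [CLS], [SEP] and one token")

    body = max_len - 2  # room left after the two special tokens
    chunks = []
    for start in range(0, len(tokens), body):
        window_tokens = ["[CLS]"] + tokens[start:start + body] + ["[SEP]"]
        window_boxes = [CLS_BOX] + boxes[start:start + body] + [SEP_BOX]
        chunks.append((window_tokens, window_boxes))
    return chunks
```

How to combine the per-window predictions into one document label (for example averaging the logits, or simply truncating to the first window) is a separate design choice that would need to be validated on your data.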
Thanks
Please go through this snippet, as LayoutLM is now available in transformers:
from transformers import LayoutLMTokenizer, LayoutLMForSequenceClassification
import torch

tokenizer = LayoutLMTokenizer.from_pretrained('microsoft/layoutlm-base-uncased')
# for more than two categories, also pass num_labels=<number of categories> here
model = LayoutLMForSequenceClassification.from_pretrained('microsoft/layoutlm-base-uncased')

words = ["Hello", "world"]
# one box per word, already scaled to the 0-1000 range
normalized_word_boxes = [637, 773, 693, 782], [698, 773, 733, 782]

# repeat each word's box for every sub-word token the word is split into
token_boxes = []
for word, box in zip(words, normalized_word_boxes):
    word_tokens = tokenizer.tokenize(word)
    token_boxes.extend([box] * len(word_tokens))

# add the boxes for the [CLS] and [SEP] tokens
token_boxes = [[0, 0, 0, 0]] + token_boxes + [[1000, 1000, 1000, 1000]]

encoding = tokenizer(' '.join(words), return_tensors="pt")
input_ids = encoding["input_ids"]
attention_mask = encoding["attention_mask"]
token_type_ids = encoding["token_type_ids"]
bbox = torch.tensor([token_boxes])
sequence_label = torch.tensor([1])

outputs = model(input_ids=input_ids, bbox=bbox, attention_mask=attention_mask, token_type_ids=token_type_ids, labels=sequence_label)

loss = outputs.loss
logits = outputs.logits
Refer to the docs: https://huggingface.co/transformers/model_doc/layoutlm.html
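The snippet starts from boxes that are already normalized. LayoutLM expects coordinates on a 0-1000 scale, so pixel boxes from an OCR engine need to be rescaled by the page size first. A minimal sketch, assuming boxes are given as `[x0, y0, x1, y1]` in pixels; `normalize_box` is a hypothetical helper name, and the clamping is an added safety measure rather than part of the documented recipe:

```python
from typing import List


def normalize_box(box: List[float], page_width: float, page_height: float) -> List[int]:
    """Rescale a pixel box [x0, y0, x1, y1] to the 0-1000 range."""
    if page_width <= 0 or page_height <= 0:
        raise ValueError("page dimensions must be positive")

    def scale(value: float, size: float) -> int:
        # Clamp so slightly out-of-page OCR boxes stay within 0..1000.
        return max(0, min(1000, int(1000 * value / size)))

    x0, y0, x1, y1 = box
    return [
        scale(x0, page_width),
        scale(y0, page_height),
        scale(x1, page_width),
        scale(y1, page_height),
    ]
```

The page width and height would come from the image itself (for example `image.size` in Pillow) or from the page-level bounding box in the OCR output.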
Model: LayoutLM
I fine-tuned the model on my own dataset, which includes several categories of forms. I used Tesseract to create the hOCR XML files.
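For readers following along, here is a rough sketch of how words and pixel boxes could be pulled out of a Tesseract hOCR file using only the standard library. It assumes the usual hOCR layout, where each word is a `span` with class `ocrx_word` and a `title` attribute containing `bbox x0 y0 x1 y1`; `parse_hocr_words` is a hypothetical helper and not necessarily how the poster did it.

```python
import re
from html.parser import HTMLParser
from typing import List, Tuple

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class _HocrWordParser(HTMLParser):
    """Collects the text and bbox of every ocrx_word span."""

    def __init__(self) -> None:
        super().__init__()
        self.words: List[str] = []
        self.boxes: List[List[int]] = []
        self._box = None   # bbox of the word currently being read
        self._text = []    # text fragments of that word
        self._depth = 0    # nesting depth of tags inside the word span

    def handle_starttag(self, tag, attrs):
        if self._box is not None:
            self._depth += 1  # e.g. <strong> or <em> inside a word
            return
        attr = dict(attrs)
        if tag == "span" and "ocrx_word" in (attr.get("class") or "").split():
            match = BBOX_RE.search(attr.get("title") or "")
            if match:
                self._box = [int(v) for v in match.groups()]
                self._text = []
                self._depth = 0

    def handle_data(self, data):
        if self._box is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._box is None:
            return
        if self._depth > 0:
            self._depth -= 1
            return
        word = "".join(self._text).strip()
        if word:  # skip empty words
            self.words.append(word)
            self.boxes.append(self._box)
        self._box = None


def parse_hocr_words(hocr: str) -> Tuple[List[str], List[List[int]]]:
    """Return (words, pixel_boxes) found in an hOCR document string."""
    parser = _HocrWordParser()
    parser.feed(hocr)
    parser.close()
    return parser.words, parser.boxes
```

The boxes returned here are in pixels, so they would still need to be scaled to the 0-1000 range before being passed to the model.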
First test: 3 categories, 100 examples per category => 0.933..% accuracy
Training set: 80% / Testing set: 15% / Validation set: 5%
Second test: 3 categories, 200 examples per category => 0.7..% accuracy
Training set: 80% / Testing set: 15% / Validation set: 5%
Third test: 3 categories (same categories as the first test), 3000 examples per category => 0.33..% accuracy
The model didn't improve during training; each epoch had the same accuracy.
Training set: 80% / Testing set: 15% / Validation set: 5%
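For reference, if the reported 0.33.. is a fraction rather than a percentage, it is about what a classifier scores on three equally sized categories when it outputs the same class for every document. A quick sanity check along these lines may help tell a collapsed model from a data problem. Both helper names below are hypothetical, and the balanced test split is an assumption based on the equal per-category counts above.

```python
from collections import Counter
from typing import Hashable, Sequence


def majority_baseline_accuracy(labels: Sequence[Hashable]) -> float:
    """Accuracy obtained by always predicting the most frequent label."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


def predicts_single_class(predictions: Sequence[Hashable]) -> bool:
    """True if every prediction is the same class (a collapsed model)."""
    return len(set(predictions)) == 1


# Assumed setup: 15% of 3 x 3000 documents, split evenly across categories.
test_labels = [0] * 450 + [1] * 450 + [2] * 450
baseline = majority_baseline_accuracy(test_labels)
```

If the model's test accuracy equals this baseline and `predicts_single_class` returns `True` on its predictions, the model is outputting a constant class and has not learned from the inputs.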