
LayoutLM document classification #310

Open · huschi opened this issue 3 years ago

huschi commented 3 years ago


Model: LayoutLM

I fine-tuned the model with my own dataset, which includes several categories of forms. I used Tesseract to create the hOCR XMLs.
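For reference, a minimal sketch of producing hOCR output with Tesseract from Python, assuming the tesseract CLI is installed; the file names page.png and page are placeholders, not from the original setup:

```python
import subprocess

# Placeholder paths: one form image in, hOCR out.
# `tesseract <image> <output_base> hocr` writes <output_base>.hocr.
subprocess.run(["tesseract", "page.png", "page", "hocr"], check=True)
```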

First test: 3 categories, 100 examples per category => accuracy ≈ 0.933. Split: 80% training / 15% testing / 5% validation.

Second test: 3 categories, 200 examples per category => accuracy ≈ 0.7. Split: 80% training / 15% testing / 5% validation.

Third test: 3 categories (same categories as the first test), 3000 examples per category => accuracy ≈ 0.33 (chance level for three classes). The model didn't improve during training; every epoch had the same accuracy. Split: 80% training / 15% testing / 5% validation.

python run_classification.py --data_dir data_folder \
    --model_type layoutlm \
    --model_name_or_path model_folder \
    --output_dir output \
    --do_lower_case \
    --max_seq_length 512 \
    --do_train \
    --do_eval \
    --num_train_epochs 40.0 \
    --logging_steps 5000 \
    --save_steps 5000 \
    --per_gpu_train_batch_size 2 \
    --per_gpu_eval_batch_size 2 \
    --evaluate_during_training \
    --fp16 \
    --overwrite_output_dir
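As a side note, a minimal sketch of the 80% / 15% / 5% split described above, assuming the documents and their category labels are available as Python lists (the variable names and dummy data are illustrative only):

```python
from sklearn.model_selection import train_test_split

# Dummy stand-ins for the form documents and their 3 category labels.
documents = [f"doc_{i}.hocr" for i in range(300)]
labels = [i % 3 for i in range(300)]

# 80% training, then split the remaining 20% into 15% testing and 5% validation.
train_docs, rest_docs, train_labels, rest_labels = train_test_split(
    documents, labels, test_size=0.20, stratify=labels, random_state=42
)
test_docs, val_docs, test_labels, val_labels = train_test_split(
    rest_docs, rest_labels, test_size=0.25, stratify=rest_labels, random_state=42
)
# 0.25 of the remaining 20% corresponds to the 5% validation share.
```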

knitemblazor commented 3 years ago

Have you considered the 512-token limit while working on the classification task?
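For anyone unsure how to check this, a minimal sketch that counts wordpiece tokens per document against the 512 limit; the words list is a placeholder for the words extracted from one hOCR file:

```python
from transformers import LayoutLMTokenizer

tokenizer = LayoutLMTokenizer.from_pretrained("microsoft/layoutlm-base-uncased")

# Placeholder: in practice these come from the hOCR output of one document.
words = ["Invoice", "number", "12345", "Total", "amount", "99.00"]

# Count wordpiece tokens, leaving room for the [CLS] and [SEP] special tokens.
n_tokens = sum(len(tokenizer.tokenize(w)) for w in words)
if n_tokens + 2 > 512:
    print(f"{n_tokens} tokens: everything beyond max_seq_length=512 will be truncated")
```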

m-ali-awan commented 3 years ago

Hi, I also want to use LayoutLM for custom document classification with 20 categories. Could you kindly point me to the right approach, or to any helpful notebook to start with?

Thanks

knitemblazor commented 3 years ago

Kindly go through this snippet, as LayoutLM is now available in transformers:

from transformers import LayoutLMTokenizer, LayoutLMForSequenceClassification
import torch

tokenizer = LayoutLMTokenizer.from_pretrained('microsoft/layoutlm-base-uncased')
model = LayoutLMForSequenceClassification.from_pretrained('microsoft/layoutlm-base-uncased')

words = ["Hello", "world"]
normalized_word_boxes = [637, 773, 693, 782], [698, 773, 733, 782]

# Repeat each word's box for every sub-word token it produces
token_boxes = []
for word, box in zip(words, normalized_word_boxes):
    word_tokens = tokenizer.tokenize(word)
    token_boxes.extend([box] * len(word_tokens))

# Add bounding boxes of cls + sep tokens
token_boxes = [[0, 0, 0, 0]] + token_boxes + [[1000, 1000, 1000, 1000]]

encoding = tokenizer(' '.join(words), return_tensors="pt")
input_ids = encoding["input_ids"]
attention_mask = encoding["attention_mask"]
token_type_ids = encoding["token_type_ids"]
bbox = torch.tensor([token_boxes])
sequence_label = torch.tensor([1])  # dummy document-class label

outputs = model(input_ids=input_ids, bbox=bbox, attention_mask=attention_mask,
                token_type_ids=token_type_ids, labels=sequence_label)

loss = outputs.loss
logits = outputs.logits

Refer to the documentation: https://huggingface.co/transformers/model_doc/layoutlm.html
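To adapt the snippet above to the 20-category case asked about earlier, the classification head can be resized when loading the checkpoint; a minimal sketch, where num_labels=20 is the only assumption beyond the code above:

```python
from transformers import LayoutLMForSequenceClassification

# Load the base checkpoint with a freshly initialised 20-way classification head.
model = LayoutLMForSequenceClassification.from_pretrained(
    "microsoft/layoutlm-base-uncased", num_labels=20
)

# After fine-tuning, the predicted category for a document is the argmax over the logits,
# e.g. predicted_class = outputs.logits.argmax(dim=-1).item()
```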