microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
https://aka.ms/GeneralAI
MIT License

Layoutlm not classifying bottom half of documents #935

Open DoubtfulCoder opened 1 year ago

DoubtfulCoder commented 1 year ago

Describe Model I am using (UniLM, MiniLM, LayoutLM ...): LayoutLM

I am trying to use LayoutLM for resume parsing. I've labeled and trained on over 100 resumes and am currently reaching an F1 score of around 0.55 and an accuracy of around 85%. However, when I run inference, many documents have large portions of text (clustered at the bottom of the page) left unclassified. The resumes I've run inference on are in a similar format to those trained on and should have similar bounding-box locations. Why is LayoutLM not classifying them? If it's overfitting, what can I do about it?

Example (blurred for personal info): [image attached]

wolfshow commented 1 year ago

@DoubtfulCoder, that's not overfitting. LayoutLM processes the document in a window of 512 tokens. If your document is longer than 512 tokens, you need to split the page into multiple samples for the model to process, both for training and testing.
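
Below is a minimal sketch of that splitting step, assuming OCR has already produced parallel `words` and `boxes` lists for one page and that the Hugging Face `LayoutLMTokenizerFast` is used; the 128-word stride and the helper name are illustrative, not part of the original answer.

```python
# Sketch: split one long page into overlapping chunks that each fit in
# LayoutLM's 512-token window, keeping words and boxes aligned.
from transformers import LayoutLMTokenizerFast

tokenizer = LayoutLMTokenizerFast.from_pretrained("microsoft/layoutlm-base-uncased")

MAX_TOKENS = 510   # 512 minus [CLS] and [SEP]
STRIDE = 128       # overlap (in words) between consecutive windows

def split_into_windows(words, boxes):
    """Split one page into overlapping (word_window, box_window) chunks."""
    windows = []
    start = 0
    while start < len(words):
        end, n_tokens = start, 0
        # grow the window word by word until the wordpiece budget is used up
        while end < len(words):
            n_word_tokens = len(tokenizer.tokenize(words[end]))
            if n_tokens + n_word_tokens > MAX_TOKENS:
                break
            n_tokens += n_word_tokens
            end += 1
        if end == start:           # a single word exceeds the budget; force progress
            end = start + 1
        windows.append((words[start:end], boxes[start:end]))
        if end >= len(words):
            break
        # step back by the stride so neighbouring windows overlap
        start = max(end - STRIDE, start + 1)
    return windows
```

Each `(word_window, box_window)` pair then becomes its own training or inference sample.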

DoubtfulCoder commented 1 year ago

Thanks for your help @wolfshow. By tokens, do you mean just words, or do they refer to something else as well?

How can I handle this length limit in training and in inference? I've seen mentions of a sliding-window approach that moves 128 characters at a time. Can you provide some example code?

davelza95 commented 1 year ago

I have a similar issue. I tried changing seq_max_length, but I got a CUDA error during training.

I also tried setting max_position_embeddings = 1024 and resizing the bboxes to (1024 + 196 + 1, 4) after tokenizing them, but this hasn't worked either.

Note: why 196 + 1?

Can someone help me, please?
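
For reference, this kind of CUDA error is typically a device-side assert from position ids running past the 512 position embeddings in the pretrained checkpoint. A minimal sketch of what enlarging the position space involves, assuming the Hugging Face transformers LayoutLM classes (the 1024 value mirrors the attempt above; the extra rows are randomly initialised, so they still need fine-tuning):

```python
# Sketch only: load LayoutLM with a larger position space. The pretrained
# checkpoint only has 512 position embeddings, so the resized table is
# re-initialised and must be learned during fine-tuning.
from transformers import LayoutLMConfig, LayoutLMForTokenClassification

config = LayoutLMConfig.from_pretrained(
    "microsoft/layoutlm-base-uncased",
    max_position_embeddings=1024,
    num_labels=13,                   # replace with your own label count
)
model = LayoutLMForTokenClassification.from_pretrained(
    "microsoft/layoutlm-base-uncased",
    config=config,
    ignore_mismatched_sizes=True,    # accept the resized position-embedding matrix
)
```

The sliding-window split suggested above avoids touching the position embeddings at all, which is why it is usually the simpler route.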

davelza95 commented 1 year ago

> Thanks for your help @wolfshow. By tokens, do you mean just words, or do they refer to something else as well?
>
> How can I handle this length limit in training and in inference? I've seen mentions of a sliding-window approach that moves 128 characters at a time. Can you provide some example code?

Hi! Did you fix it?

DoubtfulCoder commented 1 year ago

> > Thanks for your help @wolfshow. By tokens, do you mean just words, or do they refer to something else as well? How can I handle this length limit in training and in inference? I've seen mentions of a sliding-window approach that moves 128 characters at a time. Can you provide some example code?
>
> Hi! Did you fix it?

Hi, I did not try increasing max_position_embeddings; I just used a sliding-window approach. Basically, if the number of words is greater than 315, slide 300-word windows in steps of 100 (0-300, 100-400, etc.) and then aggregate the predictions.
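
A minimal sketch of that windowing-plus-aggregation, assuming a hypothetical predict_window(words, boxes) helper that runs the fine-tuned model on one chunk and returns one label per word; the 315-word threshold and 100-word step follow the comment above, and the majority vote is one possible aggregation rule.

```python
# Sketch: slide 300-word windows in steps of 100 over a long page, collect a
# label vote for every original word index, and keep the majority label.
# `predict_window` is a hypothetical helper returning one label per word.
from collections import Counter, defaultdict

WINDOW, STEP = 300, 100

def sliding_window_predict(words, boxes, predict_window):
    if len(words) <= 315:
        starts = [0]                          # short page: a single window suffices
    else:
        starts = list(range(0, len(words) - WINDOW + STEP, STEP))
    votes = defaultdict(list)                 # word index -> labels from each window
    for start in starts:
        end = min(start + WINDOW, len(words))
        labels = predict_window(words[start:end], boxes[start:end])
        for offset, label in enumerate(labels):
            votes[start + offset].append(label)
    # majority vote across every window that saw the word
    return [Counter(votes[i]).most_common(1)[0][0] for i in range(len(words))]
```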