microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
https://aka.ms/GeneralAI

Inference results of fine-tuned LayoutLM model differ depending on word/input id and box order #1098

Open rahelbeloch opened 1 year ago

rahelbeloch commented 1 year ago

Hi, I have fine-tuned LayoutLM (v1) on my own invoice data. After 4 epochs the model reaches pretty good performance:

(screenshot: evaluation metrics, 2023-05-23)

When using it for inference, though, I get different outputs depending on the order of the input_ids and bbox tensors in the encoding. The differences I observe mostly depend on how words labeled other are ordered relative to words with any label except other. I have tested three different orderings (sketched in the code below):

  1. all semantically interesting words/boxes first (i.e. boxes expected to be classified with any label except other), then all other boxes
  2. random order of other versus non-other words/boxes
  3. words/boxes ordered by box position, top left to bottom right
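
For reference, the three orderings can be produced roughly like this (a minimal sketch, assuming `words`, `boxes`, `labels` are parallel Python lists and `"O"` is the other label; the helper names are illustrative):

```python
import random

def order_semantic_first(words, boxes, labels):
    # 1. non-"other" words/boxes first, then all "other" ones
    pairs = sorted(zip(words, boxes, labels), key=lambda t: t[2] == "O")
    return list(zip(*pairs))

def order_random(words, boxes, labels, seed=0):
    # 2. random order of "other" versus non-"other" words/boxes
    pairs = list(zip(words, boxes, labels))
    random.Random(seed).shuffle(pairs)
    return list(zip(*pairs))

def order_reading(words, boxes, labels):
    # 3. top-left to bottom-right, sorting on (y0, x0) of each box [x0, y0, x1, y1]
    pairs = sorted(zip(words, boxes, labels), key=lambda t: (t[1][1], t[1][0]))
    return list(zip(*pairs))
```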

When I run inference, the model yields the following predictions (I only visualized boxes with non-other labels; a rough inference sketch follows the results):

  1. First non-other boxes/words, then other boxes/words

    ner_output_1
  2. Random order

    ner_output_2
  3. Top left to bottom right order

    ner_output_3
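
For context, inference is run roughly like this (a minimal sketch; `model_dir` stands for the fine-tuned checkpoint, and `words`/`boxes` are already in one of the orderings above, with boxes normalized to the 0-1000 scale):

```python
import torch
from transformers import LayoutLMTokenizerFast, LayoutLMForTokenClassification

model_dir = "path/to/finetuned-layoutlm"  # illustrative placeholder
tokenizer = LayoutLMTokenizerFast.from_pretrained(model_dir)
model = LayoutLMForTokenClassification.from_pretrained(model_dir)
model.eval()

# The LayoutLM (v1) tokenizer does not take boxes, so each word's box is
# repeated for all of its sub-word tokens.
encoding = tokenizer(words, is_split_into_words=True, return_tensors="pt",
                     truncation=True, padding="max_length", max_length=512)
word_ids = encoding.word_ids(0)
token_boxes = [[0, 0, 0, 0] if i is None else boxes[i] for i in word_ids]
bbox = torch.tensor([token_boxes])

with torch.no_grad():
    logits = model(input_ids=encoding["input_ids"],
                   attention_mask=encoding["attention_mask"],
                   bbox=bbox).logits
predictions = logits.argmax(-1).squeeze().tolist()
```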

Case 1 matches the ground truth best. However, the differences between the cases are not what I expected: I would expect the same result in all three cases, i.e. the output should be independent of how the words/boxes are ordered in the inference encoding.

If the word/box order is relevant, what is the correct order for training and for inference?

Do you think shuffling the word order for each training sample would help make the inference results order-independent?
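
If that makes sense, I would try something like the following per-sample shuffle (a minimal sketch, assuming each training example is a dict of parallel "words"/"boxes"/"labels" lists; the keys are illustrative). My understanding is that LayoutLM keeps BERT's 1-D position embeddings alongside the 2-D box embeddings, so the sequence order is visible to the model:

```python
import random

def shuffle_example(example, rng=random.Random(42)):
    # Shuffle words, boxes, and labels jointly so they stay aligned.
    order = list(range(len(example["words"])))
    rng.shuffle(order)
    return {
        "words":  [example["words"][i]  for i in order],
        "boxes":  [example["boxes"][i]  for i in order],
        "labels": [example["labels"][i] for i in order],
    }
```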

If useful, I can provide the encoding and fine-tuned model.

suresh1505 commented 1 year ago

Have you worked on object detection (LayoutLMv3)? I am not able to get the input_ids, bbox, and attention_mask tensors in the encoding.
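
For reference, the v3 encoding is usually produced with LayoutLMv3Processor (a minimal sketch, assuming `page.png` is a document image and the words/boxes come from your own OCR, with boxes normalized to the 0-1000 scale; the values below are illustrative):

```python
from PIL import Image
from transformers import LayoutLMv3Processor

# apply_ocr=False because words/boxes are supplied from our own OCR here
processor = LayoutLMv3Processor.from_pretrained("microsoft/layoutlmv3-base",
                                                apply_ocr=False)

image = Image.open("page.png").convert("RGB")
words = ["Invoice", "No.", "12345"]
boxes = [[70, 40, 180, 60], [185, 40, 220, 60], [225, 40, 300, 60]]

encoding = processor(image, words, boxes=boxes, return_tensors="pt")
print(encoding.keys())  # input_ids, attention_mask, bbox, pixel_values
```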