microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
https://aka.ms/GeneralAI

Inference results of fine-tuned LayoutLM model differ depending on word/input id and box order #1098

Open rahelbeloch opened 1 year ago

rahelbeloch commented 1 year ago

Hi, I have fine-tuned LayoutLM (v1) on my own invoice data. After 4 epochs, the model reaches pretty good performance.

[Screenshot: evaluation metrics after fine-tuning]

When using it for inference, though, I get different outputs depending on how the input_ids and bbox tensors in the encoding are ordered. The difference I observe mostly depends on the order of words labeled other relative to words with any label except other. There are three different orderings I have tested (see the sketch after the list):

  1. first all semantically interesting words/boxes (i.e. boxes that are expected to be classified with any label except other), then all other boxes
  2. a random order of other versus non-other words/boxes
  3. words/boxes ordered by box position, top left to bottom right

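Roughly, the comparison looks like this (a simplified sketch, not my exact code; the checkpoint path is a placeholder, and since the LayoutLM v1 tokenizer is not box-aware, the bbox tensor is assembled manually):

```python
import torch
from transformers import LayoutLMTokenizer, LayoutLMForTokenClassification

MODEL_DIR = "path/to/finetuned-layoutlm"  # placeholder for the fine-tuned checkpoint
tokenizer = LayoutLMTokenizer.from_pretrained(MODEL_DIR)
model = LayoutLMForTokenClassification.from_pretrained(MODEL_DIR)
model.eval()

def encode(words, boxes, max_len=512):
    """Build input_ids/bbox/attention_mask for one ordering of words and their
    0-1000 normalized boxes, repeating each word box for every sub-token."""
    token_ids, token_boxes = [], []
    for word, box in zip(words, boxes):
        ids = tokenizer.encode(word, add_special_tokens=False)
        token_ids.extend(ids)
        token_boxes.extend([box] * len(ids))
    token_ids = [tokenizer.cls_token_id] + token_ids[: max_len - 2] + [tokenizer.sep_token_id]
    token_boxes = [[0, 0, 0, 0]] + token_boxes[: max_len - 2] + [[1000, 1000, 1000, 1000]]
    return {
        "input_ids": torch.tensor([token_ids]),
        "bbox": torch.tensor([token_boxes]),
        "attention_mask": torch.ones(1, len(token_ids), dtype=torch.long),
    }

@torch.no_grad()
def predict(words, boxes):
    """Return the predicted label id for every token of one ordering."""
    logits = model(**encode(words, boxes)).logits  # (1, seq_len, num_labels)
    return logits.argmax(-1).squeeze(0).tolist()

# The three orderings feed the same words/boxes, just permuted; ideally
# predict() would assign the same label to each word in all three cases.
```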
When I run inference, the model yields the following predictions (I only visualized the boxes with non-other labels):

  1. First non-other boxes/words, then other boxes/words

     [ner_output_1]
  2. Random order

     [ner_output_2]
  3. Top left to bottom right order

     [ner_output_3]

Case 1 matches the ground truth best. However, the difference in results between the cases is not what I expected: I expected the same predictions in all three cases, i.e. results that are independent of how the words/boxes are ordered in the encoding used for inference.

If the word/box order is relevant, what is the correct order for training and for inference?

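For reference, the case 3 order can be produced with a rough row-then-column sort along these lines (a simplified sketch; the row tolerance is an arbitrary assumption about the layout, with boxes given as 0-1000 normalized [x0, y0, x1, y1]):

```python
def reading_order(words, boxes, row_tolerance=10):
    """Order words/boxes roughly top-left to bottom-right: boxes whose vertical
    centers lie within `row_tolerance` are grouped into one line, and each line
    is then sorted left to right. The tolerance value is an arbitrary choice."""
    def y_center(box):
        return (box[1] + box[3]) / 2

    items = sorted(zip(words, boxes), key=lambda wb: (y_center(wb[1]), wb[1][0]))
    if not items:
        return [], []
    lines, current = [], [items[0]]
    for word, box in items[1:]:
        if abs(y_center(box) - y_center(current[-1][1])) <= row_tolerance:
            current.append((word, box))
        else:
            lines.append(current)
            current = [(word, box)]
    lines.append(current)

    ordered = [wb for line in lines for wb in sorted(line, key=lambda wb: wb[1][0])]
    return [w for w, _ in ordered], [b for _, b in ordered]
```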
Do you think shuffling the word order for each training sample would be beneficial for getting order-independent inference results?

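Concretely, I mean an augmentation along these lines (a minimal sketch, not something the LayoutLM paper prescribes; it just jointly permutes the words, boxes, and labels of one sample):

```python
import random

def shuffle_sample(words, boxes, labels, seed=None):
    """Jointly permute the words, boxes, and labels of one training sample so
    the model cannot latch onto a fixed serialization order."""
    rng = random.Random(seed)
    order = list(range(len(words)))
    rng.shuffle(order)
    return ([words[i] for i in order],
            [boxes[i] for i in order],
            [labels[i] for i in order])
```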
If useful, I can provide the encoding and fine-tuned model.

suresh1505 commented 1 year ago

Have you worked on object detection (LayoutLMv3)? I am not able to get the input_ids, bbox, and attention_mask tensors in the encoding.
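
For reference, this is the kind of encoding I am trying to obtain, roughly following the Hugging Face LayoutLMv3 processor usage (a minimal sketch; the public base checkpoint, image path, words, and 0-1000 normalized boxes below are placeholders):

```python
from PIL import Image
from transformers import AutoProcessor

# apply_ocr=False means words and boxes are supplied by the caller instead of
# being produced by the built-in OCR step.
processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=False)

image = Image.open("invoice.png").convert("RGB")  # placeholder image
words = ["Invoice", "Total", "42.00"]             # placeholder words
boxes = [[80, 40, 220, 70], [60, 700, 140, 730], [600, 700, 700, 730]]

encoding = processor(image, words, boxes=boxes, return_tensors="pt")
print(encoding.keys())  # input_ids, attention_mask, bbox, pixel_values
```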