microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
https://aka.ms/GeneralAI
MIT License

LayoutLMv3 on FUNSD using ground-truth entity groupings #793

Closed ThomasDelteil closed 2 years ago

ThomasDelteil commented 2 years ago

I was curious about the LayoutLMv3 results on FUNSD, which are much higher than the previous SOTA. At this line https://github.com/microsoft/unilm/blob/master/layoutlmv3/layoutlmft/data/funsd.py#L131, is LayoutLMv3 trained/evaluated on FUNSD using the ground-truth entity groupings for its line segment embeddings, given that it is iterating through the GT items?
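To make the question concrete, here is a simplified sketch (not the repo's exact code) of the pattern I mean, using the FUNSD annotation format:

```python
import json

def load_funsd_page(json_path):
    # Simplified sketch: each word reuses the box of its parent ground-truth entity,
    # so the GT grouping is visible to the model through the 2D position embeddings.
    with open(json_path) as f:
        annotation = json.load(f)

    tokens, bboxes, labels = [], [], []
    for entity in annotation["form"]:        # iterating the ground-truth entity groupings
        for word in entity["words"]:
            if not word["text"].strip():
                continue
            tokens.append(word["text"])
            bboxes.append(entity["box"])     # entity-level box, not the word's own box
            labels.append(entity["label"])   # "question", "answer", "header" or "other"
    return tokens, bboxes, labels
```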

Would that make the task an entity labeling task (labeling of already grouped entities) rather than an (arguably harder) entity extraction task (grouping of words + labeling)? In the paper it does seem to say that it only does semantic labeling of entities:

We focus on semantic entity labeling task on the FUNSD dataset to assign each semantic entity a label among “question”, “answer”, “header” or “other”.

but then in the table it compares LayoutLMv3 entity labeling results to entity extraction results from prior work so I'm a bit confused.

Is it the same for CORD? Are the word groupings coming from the OCR engine or from the ground-truth annotations?

Thanks!

wolfshow commented 2 years ago

@ThomasDelteil Thanks for the question! LayoutLMv3 still considers the FUNSD problem as a token-level labeling task. But it takes advantage of the textlines/segments output from the OCR engine, where 2D positions of the whole segments are used in the pre-training and fine-tuning tasks. We use segment 2D position embeddings for all downstream tasks.

ThomasDelteil commented 2 years ago

Thanks @wolfshow for your answer. Can you point me to where in the code the word and segment information is taken from the OCR engine? From looking at this file: https://github.com/microsoft/unilm/blob/master/layoutlmv3/layoutlmft/data/funsd.py it seems that the word and segment information is loaded from the FUNSD dataset.

Unfortunately the FUNSD dataset format doesn't give you OCR-engine-like inputs. For example, you can have multiple entities on the same line, or entities that span multiple lines. If you wanted to use the same style as during pre-training, you could run an OCR engine in parallel with loading the annotations, post-process the OCR engine's segment outputs to match them to the words from the annotations, and use these segments for your labeling task. As it stands, from what I can read in the code, it is not a real token-level labeling task, as the entity groupings are leaking through the segment embeddings. If this code is what was used to produce the results in the paper, the comparison in Table 1 might not really be a fair one.
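A rough sketch of the post-processing I have in mind (the helper names are made up, and matching annotation words to OCR lines by box overlap is just one possible heuristic):

```python
def box_overlap(word_box, line_box):
    # Intersection area between a word box and an OCR line box; boxes are (x0, y0, x1, y1).
    x0 = max(word_box[0], line_box[0])
    y0 = max(word_box[1], line_box[1])
    x1 = min(word_box[2], line_box[2])
    y1 = min(word_box[3], line_box[3])
    return max(0, x1 - x0) * max(0, y1 - y0)

def segment_boxes_from_ocr(word_boxes, ocr_line_boxes):
    # For each annotation word, pick the OCR line whose box overlaps it most, and use that
    # line's box as the segment-level 2D position instead of the ground-truth entity box.
    return [max(ocr_line_boxes, key=lambda line: box_overlap(box, line)) for box in word_boxes]
```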

See this picture of the data to understand what I mean. You can see that the rectangles representing the "lines" or "segments" used from the inputs are not OCR engine lines but the actual entities.

wolfshow commented 2 years ago

In the pre-training, we get these segments from the OCR engines to obtain the 2D position embeddings. For downstream tasks, we simply used whatever segments were given by the dataset. We do this to avoid the messy reading-order problem during pre-training. Meanwhile, some previous work in Table 1 has already used segment 2D embeddings, which we have mentioned in Section 2.1.

HYPJUDY commented 2 years ago

Thanks @ThomasDelteil for the question and @wolfshow for the clarification!

TL;DR: LayoutLMv3 uses segment-level layout positions and conducts semantic entity labeling on FUNSD. We have pointed out in our paper that segment-level and word-level works are not directly comparable.

Detailed explanation: The FUNSD paper gives a definition of three tasks on FUNSD:

Form understanding: We decompose the FoUn challenge into three tasks, namely word grouping, semantic-entity labeling, and entity linking.
• Word grouping is the task of aggregating words that belong to the same semantic entity.
• Semantic entity labeling is the task of assigning to each semantic entity a label from a set of four predefined categories: question, answer, header or other.
• Entity linking is the task of predicting the relations between semantic entities.

LayoutLMv3 follows some previous works (e.g., LayoutLMv2, StructuralLM) to conduct the semantic entity labeling task, which is about "assigning to each semantic entity a label" (labeling of already grouped entities).

The FUNSD dataset is suitable for a variety of tasks, where we focus on semantic entity labeling in this paper. (LayoutLMv2)
The FUNSD dataset is suitable for a variety of tasks, where we just fine-tuning StructuralLM on semantic entity labeling. (StructuralLM)
We focus on semantic entity labeling task on the FUNSD dataset to assign each semantic entity a label among “question”, “answer”, “header” or “other”. (LayoutLMv3)

In LayoutLMv3 paper Section 3.3, we point out the potential performance gap between segment-level and word-level layout positions:

Note that LayoutLMv3 and StructuralLM use segment-level layout positions, while the other works use word-level layout positions. The use of segment-level positions may benefit the semantic entity labeling task on FUNSD [25], so the two types of work are not directly comparable.

ThomasDelteil commented 2 years ago

Thanks @wolfshow and @HYPJUDY for the detailed answers. I agree that your approach follows the entity labeling task definition from the original FUNSD dataset paper. It would indeed have made comparisons easier if all prior work using FUNSD had stuck to that original definition as well. Quick question: in LayoutLMv2, I don't think the segment-level / grouped-entity information is used other than for the 1D ordering of the words, making the semantic labeling task effectively a 1D grouping task + labeling. Is my understanding correct? Or do you know if the entity group information is used in some post-processing, e.g. to average labels across entities?
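To make the distinction concrete, here is how I picture that setup (a sketch under my own assumptions about the FUNSD annotation format, not LayoutLMv2's actual preprocessing):

```python
def serialize_with_entity_order(annotation):
    # My reading of LayoutLMv2's setup: the GT grouping only fixes the 1D reading order
    # (words are emitted entity by entity), but each word keeps its own word-level box,
    # unlike the segment-level variant where every word reuses its entity's box.
    tokens, bboxes = [], []
    for entity in annotation["form"]:
        for word in entity["words"]:
            if word["text"].strip():
                tokens.append(word["text"])
                bboxes.append(word["box"])   # word-level box, not entity["box"]
    return tokens, bboxes
```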

HYPJUDY commented 2 years ago

Your understanding is correct. To my knowledge, only LayoutLMv3 and StructuralLM use segment-level layout positions that take advantage of grouping. I came to this conclusion by reading their papers, so it may not be entirely correct either.

wandering-walrus commented 2 years ago

But it takes advantage of the textlines/segments output from the OCR engine, where 2D positions of the whole segments are used in the pre-training and fine-tuning tasks

@wolfshow Thank you for this discussion. It's been very helpful. I am specifically wondering about the part I have quoted. Is there any discussion of the OCR engine that was used during pre-training to obtain segment positions instead of word-level positions? Is this a special OCR engine or a model trained for segment extraction? It's a little unclear to me how much is being hand-waved on this point, or whether I'm just missing something. Thanks!

HYPJUDY commented 2 years ago

Hi @wandering-walrus, please see https://github.com/microsoft/unilm/issues/838.

philmas commented 6 months ago

Thanks @wolfshow for your answer. Can you point me to where in the code the word and segment information is taken from the OCR engine? […] As it stands, from what I can read in the code, it is not a real token-level labeling task, as the entity groupings are leaking through the segment embeddings.

@ThomasDelteil @HYPJUDY @wolfshow Quick question for clarification, which might be helpful to others as well. In this case FUNSD is labelled per segment. But what if another dataset is word-labelled? That is, a detected segment/line contains words of which some have label A and some have label B. Would I then use the original segment position, or would I need to split the segment apart?