Closed · ThomasDelteil closed this issue 2 years ago.
@ThomasDelteil Thanks for the question! LayoutLMv3 still considers the FUNSD problem as a token-level labeling task. But it takes advantage of the textlines/segments output from the OCR engine, where 2D positions of the whole segments are used in the pre-training and fine-tuning tasks. We use segment 2D position embeddings for all downstream tasks.
Thanks @wolfshow for your answer. Can you point me in the code where the word and segment information is taken from the OCR engine? From looking at this file: https://github.com/microsoft/unilm/blob/master/layoutlmv3/layoutlmft/data/funsd.py it seems that the word and segment information are loaded from the FUNSD dataset.
Unfortunately the FUNSD dataset format doesn't give you OCR-engine-like inputs. For example, you can have multiple entities on the same line, or entities that span multiple lines. If you wanted to use the same style as during pre-training, you could run an OCR engine in parallel with loading the annotations, then post-process the OCR segment outputs to match them to the annotation words and use these segments for your labeling task. As it stands, from what I can read in the code, it is not a real token-level labeling task, because the entity groupings leak through the segment embeddings. If this code is what was used to produce the results in the paper, the comparison in Table 1 might not really be a fair one.
See this picture of the data to understand what I mean. You can see that the rectangles representing the "lines" or "segments" used from the inputs are not OCR engine lines but the actual entities.
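For anyone who wants to try the OCR-in-parallel route described above, here is a minimal sketch of how OCR line segments could be matched back to the annotation words by box overlap. The inputs (`words`, `ocr_lines`) and the greedy overlap matching are my own assumptions for illustration; a real pipeline would likely need fuzzier matching.

```python
# Minimal sketch (not code from this repo): attach each annotation word to the
# OCR line whose box overlaps it the most, and use that line's box as the
# word's segment box. `words` are annotation words with a "box" [x0, y0, x1, y1];
# `ocr_lines` are OCR line segments with a "box" in the same coordinates.

def overlap_area(a, b):
    """Intersection area of two [x0, y0, x1, y1] boxes."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(w, 0) * max(h, 0)

def attach_ocr_segments(words, ocr_lines):
    """Return one segment box per annotation word, taken from the best-matching
    OCR line; words that no OCR line covers fall back to their own box."""
    segment_boxes = []
    for word in words:
        best = max(ocr_lines,
                   key=lambda line: overlap_area(word["box"], line["box"]),
                   default=None)
        if best is not None and overlap_area(word["box"], best["box"]) > 0:
            segment_boxes.append(best["box"])
        else:
            segment_boxes.append(word["box"])  # no overlapping OCR line found
    return segment_boxes
```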
In pre-training, we get these segments from the OCR engine to obtain the 2D position embeddings. For downstream tasks, we just use whatever segments the dataset provides. We do this to avoid the messy reading-order problem during pre-training. Meanwhile, some previous work listed in Table 1 has already used segment 2D embeddings, as we have mentioned in Section 2.1.
Thanks @ThomasDelteil for the question and @wolfshow for the clarification!
TL;DR: LayoutLMv3 uses segment-level layout positions and conducts semantic entity labeling on FUNSD. We have pointed out in our paper that segment-level and word-level works are not directly comparable.
Detailed explanation: The FUNSD paper gives a definition of three tasks on FUNSD:
> Form understanding: We decompose the FoUn challenge into three tasks, namely word grouping, semantic-entity labeling, and entity linking.
> - Word grouping is the task of aggregating words that belong to the same semantic entity.
> - Semantic entity labeling is the task of assigning to each semantic entity a label from a set of four predefined categories: question, answer, header or other.
> - Entity linking is the task of predicting the relations between semantic entities.
LayoutLMv3 follows some previous works (e.g., LayoutLMv2, StructuralLM) to conduct the semantic entity labeling task, which is about "assigning to each semantic entity a label" (labeling of already grouped entities).
- "The FUNSD dataset is suitable for a variety of tasks, where we focus on semantic entity labeling in this paper." (LayoutLMv2)
- "The FUNSD dataset is suitable for a variety of tasks, where we just fine-tuning StructuralLM on semantic entity labeling." (StructuralLM)
- "We focus on semantic entity labeling task on the FUNSD dataset to assign each semantic entity a label among “question”, “answer”, “header” or “other”." (LayoutLMv3)
In LayoutLMv3 paper Section 3.3, we point out the potential performance gap between segment-level and word-level layout positions:
> Note that LayoutLMv3 and StructuralLM use segment-level layout positions, while the other works use word-level layout positions. The use of segment-level positions may benefit the semantic entity labeling task on FUNSD [25], so the two types of work are not directly comparable.
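To make the distinction concrete, here is a tiny illustrative example (my own, not taken from either paper) of the two formulations:

```python
# Illustrative toy example (mine, not from the papers) of the two formulations.

# Semantic entity labeling: the word groups (entities) are given as input, and
# only the category of each group has to be predicted.
entity_labeling_input = [
    {"words": ["Date", ":"], "label": "question"},
    {"words": ["03/14/1994"], "label": "answer"},
]

# Entity extraction: only the flat word sequence is given; the model has to
# recover both the grouping and the categories, e.g. as BIO tags over words.
entity_extraction_input = ["Date", ":", "03/14/1994"]
entity_extraction_target = ["B-QUESTION", "I-QUESTION", "B-ANSWER"]
```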
Thanks @wolfshow and @HYPJUDY for the detailed answers. I agree that your approach follows the entity labeling task definition from the original FUNSD dataset paper. It would indeed have made comparisons easier if all prior work using FUNSD had stuck to that original definition as well. Quick question: in LayoutLMv2, I don't think the segment-level / grouped entity information is used other than for the 1D ordering of the words, making the semantic labeling task effectively a 1D grouping task + labeling. Is my understanding correct? Or do you know whether the entity group information is used in some post-processing to average labels across entities?
Your understanding is correct. To my knowledge, only LayoutLMv3 and StructuralLM use segment-level layout positions that take advantage of grouping. I came to this conclusion by reading their papers, so it may not be entirely correct either.
> But it takes advantage of the textlines/segments output from the OCR engine, where 2D positions of the whole segments are used in the pre-training and fine-tuning tasks
@wolfshow Thank you for this discussion. It's been very helpful. I am specifically wondering about the part I have quoted. Is there any discussion of the OCR engine that was used during pre-training to obtain segment positions instead of word-level positions? Is this a special OCR engine or a model trained for segment extraction? It's a little unclear how much is being glossed over here, or whether I'm just missing something. Thanks!
Hi @wandering-walrus, please see https://github.com/microsoft/unilm/issues/838.
> Thanks @wolfshow for your answer. Can you point me in the code where the word and segment information is taken from the OCR engine? From looking at this file: https://github.com/microsoft/unilm/blob/master/layoutlmv3/layoutlmft/data/funsd.py it seems that the word and segment information are loaded from the FUNSD dataset.
- Loading the data from the FUNSD dataset: https://github.com/microsoft/unilm/blob/master/layoutlmv3/layoutlmft/data/funsd.py#L75
- Reading each entity iteratively: https://github.com/microsoft/unilm/blob/master/layoutlmv3/layoutlmft/data/funsd.py#L110
- Creating the token-level labels, indeed: https://github.com/microsoft/unilm/blob/master/layoutlmv3/layoutlmft/data/funsd.py#L123
- Creating the segment information from all the words of that entity (that's where the segment label is leaked; see the sketch after this list): https://github.com/microsoft/unilm/blob/master/layoutlmv3/layoutlmft/data/funsd.py#L131
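Condensed into a few lines, this is roughly what the linked code appears to do (a paraphrase for illustration, not a verbatim copy; names like `load_funsd_example` are mine): the union box over each ground-truth entity's words becomes the segment box for every token of that entity, which is how the grouping leaks into the layout input.

```python
# Rough paraphrase of the FUNSD loading logic linked above (not a verbatim copy):
# for each ground-truth entity, every word gets a token-level label, and the
# union of the entity's word boxes is used as that word's segment box. Because
# the union is taken over the ground-truth entity, the entity grouping is
# visible to the model through the segment 2D positions.

def load_funsd_example(form):
    """`form` is the parsed FUNSD annotation for one page: a list of entities,
    each with a "label" and a list of "words" with "text" and "box"."""
    tokens, labels, bboxes = [], [], []
    for entity in form:
        words = [w for w in entity["words"] if w["text"].strip()]
        if not words:
            continue
        # Union box over the ground-truth entity's words -> the "segment" box.
        seg_box = [
            min(w["box"][0] for w in words), min(w["box"][1] for w in words),
            max(w["box"][2] for w in words), max(w["box"][3] for w in words),
        ]
        for i, w in enumerate(words):
            tokens.append(w["text"])
            if entity["label"] == "other":
                labels.append("O")
            else:
                prefix = "B-" if i == 0 else "I-"
                labels.append(prefix + entity["label"].upper())
            bboxes.append(seg_box)  # same box for every word of the entity
    return tokens, labels, bboxes
```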
@ThomasDelteil @HYPJUDY @wolfshow
Quick question for clarification, which might be helpful to others as well. In this case FUNSD is labelled per segment. But what if another dataset is word-labelled, i.e. a detected segment/line has words of which some are label A and some are label B? Would I then use the original segment position, or would I need to split them apart?
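For reference, here is an illustrative sketch of the "split them apart" option mentioned in the question above (my own sketch, not an official recommendation): break the segment into contiguous runs of words sharing the same label and compute one box per run.

```python
# Illustrative sketch of the "split them apart" option (not an official
# recommendation): break a segment into contiguous runs of words that share the
# same label, and compute one segment box per run.

def split_segment_by_label(words):
    """`words` is a list of dicts with "box" [x0, y0, x1, y1] and "label".
    Returns a list of (label, box) pairs, one per contiguous same-label run."""
    runs, current = [], []
    for word in words:
        if current and word["label"] != current[-1]["label"]:
            runs.append(current)
            current = []
        current.append(word)
    if current:
        runs.append(current)

    result = []
    for run in runs:
        box = [
            min(w["box"][0] for w in run), min(w["box"][1] for w in run),
            max(w["box"][2] for w in run), max(w["box"][3] for w in run),
        ]
        result.append((run[0]["label"], box))
    return result
```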
I was curious about the LayoutLMv3 results on FUNSD, which are much higher than the previous SOTA. At this line https://github.com/microsoft/unilm/blob/master/layoutlmv3/layoutlmft/data/funsd.py#L131, is LayoutLMv3 trained/evaluated on FUNSD using the ground-truth entity grouping for its line segment embeddings, since it iterates through the GT items?
Would that make the task an entity labeling task (labeling of already grouped entities) rather than an (arguably harder) entity extraction task (grouping of words + labeling)? The paper does seem to say that it only does semantic labeling of entities, but then in the table it compares LayoutLMv3 entity labeling results to entity extraction results from prior work, so I'm a bit confused.
Is it the same for CORD? Are the word groupings coming from the OCR engine or from the ground-truth annotations?
Thanks!