jpWang / LiLT

Official PyTorch implementation of LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding (ACL 2022)
MIT License

Word or segment position embeddings? #28

Closed NielsRogge closed 1 year ago

NielsRogge commented 1 year ago

Hi @jpWang,

I had a question related to LiLT; namely whether or not you're leveraging bounding boxes per word or per segment when fine-tuning on FUNSD. The LayoutLMv3 authors saw a great boost in performance when employing the same bounding box coordinates for a set of words that make up a "segment", like an address on an invoice. They use the OCR engine to identify segments in a document, and then give the same bounding box coordinates to all the words that make up that segment (an idea which was introduced in StructuralLM).

LayoutLMv1 and v2 both use "word position embeddings," which means that each individual word has its own bounding box coordinates.

Does LiLT achieve 88% F1 on FUNSD with word position embeddings? Looking at this file, it seems word position embeddings are used.

jpWang commented 1 year ago

Hi, LiLT uses segment position embeddings. item["box"] at https://github.com/jpWang/LiLT/blob/main/LiLTfinetune/data/datasets/funsd.py#L106 means segment-level box.
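For readers unfamiliar with the distinction, here is a minimal sketch of what segment-level boxes mean in practice. The data structure and field names are hypothetical for illustration, not the repository's actual FUNSD loader: with word position embeddings each word keeps its own OCR box, while with segment position embeddings every word in a segment (e.g. one form field) shares the segment's bounding box.

```python
# Hypothetical example data, not LiLT's actual data format.
segments = [
    {
        "words": ["Date:", "03/01/2022"],
        "word_boxes": [[50, 40, 105, 60], [115, 40, 210, 60]],
        "box": [50, 40, 210, 60],  # segment-level box covering both words
    },
]

def to_features(segments, segment_level=True):
    """Flatten segments into parallel (token, box) lists."""
    tokens, boxes = [], []
    for seg in segments:
        for word, wbox in zip(seg["words"], seg["word_boxes"]):
            tokens.append(word)
            # segment-level: every word gets the segment's box;
            # word-level: every word keeps its own box
            boxes.append(seg["box"] if segment_level else wbox)
    return tokens, boxes

tokens, boxes = to_features(segments, segment_level=True)
print(boxes)  # both words carry the segment box [50, 40, 210, 60]
```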

NielsRogge commented 1 year ago

Thanks a lot!

Ok that's actually an important point, because for instance LiLT-roberta-en-base obtains 88% F1 score on FUNSD with segment position embeddings, but only 80% with word position embeddings.

The metrics reported in the paper for LayoutLM and LayoutLMv2 are also with word position embeddings, so it's actually not a fair comparison (as those models also perform equally well with segment position embeddings).

Would it be possible to mention this in the paper?

jpWang commented 1 year ago

In the paper we have mentioned that LiLT uses segment position embeddings, in Section 2.1.2 (Layout Embedding).

Also, LiLT pretrained with word position embeddings will perform better than LiLT pretrained with segment position embeddings when fine-tuning on data that only has word position embeddings.

I have also noticed that StructuralLM and LayoutLMv3 don't compare against the results of LayoutLMv1/v2 with segment position embeddings.

sumanth9977 commented 8 months ago

Can we use the model commercially, either for free or at a cost?

felixvor commented 5 months ago

We are using a different commercial OCR engine and have several levels of bounding box information available. We are curious which level is most similar to LiLT's pre-training data. The paper does not go into much detail about the size of the bounding boxes and their corresponding text strings. What exactly does a "segment" box mean? Is it a line? A paragraph? An even larger text block?

@NielsRogge You said that using segment boxes yields higher performance scores; do you have these experiments documented somewhere so we can compare your boxes with our data preparation? As far as I can see, your LiLT example notebooks only show the experiments for word-level bounding boxes, is that right? It would be nice to verify that the boxes we use for training match the shape recommended by LiLT pre-training, as in your 88% F1 experiment.

NielsRogge commented 5 months ago

Yes, I compared using the same notebook on https://huggingface.co/datasets/nielsr/funsd vs. https://huggingface.co/datasets/nielsr/funsd-layoutlmv3, and the F1 score was 82% vs. 88%. The latter dataset uses segment-level positions.
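If a dataset only ships word-level boxes, one common way to approximate segment-level boxes is to take the union of the word boxes within each labeled segment. This is a hedged sketch of that idea, not the exact script used to prepare the funsd-layoutlmv3 dataset:

```python
def merge_boxes(word_boxes):
    """Union of word-level [x0, y0, x1, y1] boxes -> one segment-level box."""
    x0 = min(b[0] for b in word_boxes)
    y0 = min(b[1] for b in word_boxes)
    x1 = max(b[2] for b in word_boxes)
    y1 = max(b[3] for b in word_boxes)
    return [x0, y0, x1, y1]

# two adjacent word boxes on one form-field line (made-up coordinates)
segment_box = merge_boxes([[50, 40, 110, 60], [120, 42, 210, 58]])
# every word in the segment is then given segment_box instead of its own box
```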