huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Add LayoutLMProcessor #27826

Open gau-nernst opened 10 months ago

gau-nernst commented 10 months ago

Feature request

Add a processor for LayoutLM. I'm not sure why v2 and v3 have their respective processors, but the original v1 doesn't. It should be almost identical to its v2 and v3 counterparts (apply Tesseract OCR + call the tokenizer appropriately), except without returning the resized image (pixel_values), since LayoutLMv1 is text-only.

This would also simplify the document-question-answering pipeline, since right now the pipeline repeats the above logic for LayoutLM.
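To make the request concrete, here is a minimal, self-contained sketch of the logic such a processor would encapsulate, modeled on what `LayoutLMv2Processor` does (OCR the image, normalize the word boxes into LayoutLM's 0-1000 coordinate space, then tokenize). The class and helper names are hypothetical, and the OCR step and tokenizer are stubbed out so the sketch has no external dependencies; a real implementation would use pytesseract and `LayoutLMTokenizerFast`.

```python
def normalize_box(box, width, height):
    """Scale an (x0, y0, x1, y1) pixel box into LayoutLM's 0-1000 space."""
    x0, y0, x1, y1 = box
    return [
        int(1000 * x0 / width),
        int(1000 * y0 / height),
        int(1000 * x1 / width),
        int(1000 * y1 / height),
    ]

class LayoutLMProcessorSketch:
    """Hypothetical stand-in for the proposed LayoutLMProcessor."""

    def __init__(self, ocr_fn, tokenizer):
        self.ocr_fn = ocr_fn        # e.g. a pytesseract wrapper in practice
        self.tokenizer = tokenizer  # e.g. LayoutLMTokenizerFast in practice

    def __call__(self, image, size):
        width, height = size
        words, boxes = self.ocr_fn(image)
        boxes = [normalize_box(b, width, height) for b in boxes]
        # Unlike v2/v3, no pixel_values are returned: LayoutLMv1 is text-only.
        return self.tokenizer(words, boxes=boxes)

# Stubs standing in for OCR and the tokenizer, just to show the data flow.
def fake_ocr(image):
    return ["hello", "world"], [(10, 10, 50, 20), (60, 10, 120, 20)]

def fake_tokenizer(words, boxes=None):
    return {"input_ids": list(range(len(words))), "bbox": boxes}

processor = LayoutLMProcessorSketch(fake_ocr, fake_tokenizer)
encoding = processor(image=None, size=(200, 100))
print(encoding["bbox"])  # boxes rescaled into the 0-1000 space
```

This is exactly the preprocessing that the document-question-answering pipeline currently re-implements inline for LayoutLM, which is why moving it into a processor would simplify the pipeline.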

Motivation

Bring LayoutLM to feature parity with its v2 and v3 counterparts.

Your contribution

I can submit a PR to add LayoutLMProcessor. It should be almost identical to v2 and v3, so the task should be straightforward.

Updating the document-question-answering pipeline to use the new processor would be too complex for me, since I'm not familiar with that part of the codebase.

ArthurZucker commented 10 months ago

cc @amyeroberts and @NielsRogge. If LayoutLM is just not as good, we should point people to the newer models instead.

gau-nernst commented 10 months ago

There are several advantages to using LayoutLMv1:

ArthurZucker commented 10 months ago

Alright then! Feel free to open a PR if you have time

NielsRogge commented 8 months ago

Thanks @gau-nernst for opening this issue. Indeed, we only started defining processors with v2 and v3, but we could define one for v1 as well. Your PR already looks to be in a great state; let me know if you need any help.