philschmid / document-ai-transformers

MIT License
329 stars, 50 forks

Which is the correct bbox OCR level for LiLT: block level or word level? #9

Open aimlnerd opened 1 year ago

aimlnerd commented 1 year ago

I am using your nice tutorial to apply LiLT to Excel files converted to images; the text is in Dutch.

https://github.com/jpWang/LiLT/issues/28 In the above link, the author of LiLT mentions that the model is pretrained on "segment-level box".

During inference in your code, OCR is applied:

https://github.com/philschmid/document-ai-transformers/blob/main/training/lilt_funsd.ipynb

# change apply_ocr to True to use the ocr text for inference
processor.feature_extractor.apply_ocr = True

Questions

  1. Which kind of OCR is applied when processor.feature_extractor.apply_ocr = True: word-level boxes or "segment-level box"?
  2. How can I ensure the same "segment-level box" OCR is applied for both fine-tuning and inference?
  3. Any pointers on implementing the correct OCR level using pytesseract?
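
To make question 3 concrete, here is a minimal sketch of what I have in mind, not a confirmed recipe. It assumes that "segment-level" can be approximated by Tesseract text lines, and that the input is a dict shaped like the output of pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT) for a single-page image. The helper name segment_level_boxes is hypothetical. Every word receives the box of the line it belongs to, normalized to the 0-1000 range used by LayoutLM-style models.

```python
from collections import defaultdict


def segment_level_boxes(ocr, image_width, image_height):
    """Give every OCR word the bounding box of its text line.

    `ocr` is assumed to be a dict of parallel lists with the keys
    "text", "block_num", "par_num", "line_num", "left", "top",
    "width" and "height" (the layout of pytesseract's image_to_data
    dict output). Returns (words, boxes) in reading order, with boxes
    as [x0, y0, x1, y1] scaled to 0-1000.
    """
    # Collect word indices per line; dicts keep insertion order,
    # so the original reading order is preserved.
    segments = defaultdict(list)
    for i, text in enumerate(ocr["text"]):
        if not text.strip():
            continue  # skip page/block/line rows and empty detections
        key = (ocr["block_num"][i], ocr["par_num"][i], ocr["line_num"][i])
        segments[key].append(i)

    words, boxes = [], []
    for indices in segments.values():
        # Union of all word boxes on this line, in pixels.
        x0 = min(ocr["left"][i] for i in indices)
        y0 = min(ocr["top"][i] for i in indices)
        x1 = max(ocr["left"][i] + ocr["width"][i] for i in indices)
        y1 = max(ocr["top"][i] + ocr["height"][i] for i in indices)
        box = [
            1000 * x0 // image_width,
            1000 * y0 // image_height,
            1000 * x1 // image_width,
            1000 * y1 // image_height,
        ]
        # Assumption: each word on the line shares the line's box.
        for i in indices:
            words.append(ocr["text"][i])
            boxes.append(box)
    return words, boxes
```

If this is the right idea, I would presumably keep apply_ocr = False and pass the resulting words and boxes to the processor myself, for both fine-tuning and inference, so that the two stages see the same box granularity.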