Open behel33 opened 5 years ago
I'm having the same issue with text in tables, and with text that has black boxes around it.
@amitdo any ideas on how we can debug this layout issue?
@behel33 try to run with --psm 11 or 12, it yields more words.
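A minimal sketch for comparing how many words each mode yields on the same image. The helper names and the `page.png` path are placeholders of mine, not anything from this thread; it assumes the `tesseract` binary is on PATH and uses `stdout` as the output base so the recognized text is printed instead of written to a file. Mode 3 is the default, 11 is sparse text, 12 is sparse text with OSD.

```python
import shutil
import subprocess
import sys


def build_tesseract_command(image, output_base, psm):
    """Return the argument list for one Tesseract run with the given --psm."""
    return ["tesseract", str(image), str(output_base), "--psm", str(psm)]


def count_words_per_psm(image, psm_modes=(3, 11, 12)):
    """Run Tesseract once per mode and return {psm: number of words found}."""
    counts = {}
    for psm in psm_modes:
        # "stdout" as the output base makes Tesseract print the text
        # instead of writing <output_base>.txt
        cmd = build_tesseract_command(image, "stdout", psm)
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
        counts[psm] = len(result.stdout.split())
    return counts


if __name__ == "__main__":
    # Hypothetical default input file; pass your own image as the first argument.
    image_path = sys.argv[1] if len(sys.argv) > 1 else "page.png"
    if shutil.which("tesseract") is None:
        sys.exit("tesseract not found on PATH")
    for psm, words in count_words_per_psm(image_path).items():
        print(f"--psm {psm}: {words} words")
```

A higher word count is only a rough signal: the sparse-text modes can also pick up noise, so it is worth checking the actual output for the table cells that were missing.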
But it looks like Tesseract's page segmentation goes crazy when there is a mix of cells with dark and white backgrounds in the same table.
Tesseract's page segmentation has problems with text that is too close to a border and with mixed dark/white backgrounds. The page segmentation used by Tesseract is poor and dated. I have seen demos of ocropus segmentation that work better.
In this file, Tesseract (v4 through v5 beta/master) fails to detect some words. What can I do to avoid this?
The command is:
The missing words/lines are marked in red:
Original file:
PDF file with OCR: out.pdf