
tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)

https://tesseract-ocr.github.io/

Apache License 2.0


words missing #2730

Open behel33 opened 5 years ago

behel33 commented 5 years ago

In this file, Tesseract (v4 through v5 beta/master) fails to detect some words. What can I do to avoid this?

The command is:

>tesseract "MMS-Bureau valey-3.png" out -l fra hocr pdf txt
Tesseract Open Source OCR Engine v5.0.0-alpha-465-g445d with Leptonica
Detected 7 diacritics

The missing words/lines are marked in red: MMS-Bureau valey-3-missing

Original file:
MMS-Bureau valey-3

PDF file with OCR: out.pdf

veonua commented 4 years ago

I'm having the same issue with text inside tables, and with text that has black boxes around it.

@amitdo any ideas on how we can debug this layout issue?

veonua commented 4 years ago

@behel33 try running with --psm 11 or 12; it yields more words.

But it looks like Tesseract's page segmentation goes crazy when a table mixes cells with dark and white backgrounds.
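
For anyone who wants to compare the two modes side by side, here is a minimal sketch that builds the same command as in the original report with `--psm 11` (sparse text) and `--psm 12` (sparse text with OSD) added, then counts the words in each `.txt` output. The helper names `build_command` and `run_sparse_modes` and the `out-psm11` / `out-psm12` output names are made up for illustration. It assumes `tesseract` is on the PATH and the `fra` language data is installed.

```python
import shutil
import subprocess


def build_command(image, output_base, psm, lang="fra"):
    """Return the tesseract argument list for one page segmentation mode.

    Mirrors the command from the original report, with --psm added.
    (Hypothetical helper, not part of Tesseract.)
    """
    return [
        "tesseract", image, output_base,
        "-l", lang,
        "--psm", str(psm),
        "hocr", "pdf", "txt",
    ]


def run_sparse_modes(image, lang="fra", modes=(11, 12)):
    """Run tesseract once per mode and return {psm: word count of the .txt}.

    The count is None for every mode when tesseract is not found on the PATH.
    (Hypothetical helper; output names are an assumption.)
    """
    have_tesseract = shutil.which("tesseract") is not None
    results = {}
    for psm in modes:
        if not have_tesseract:
            results[psm] = None
            continue
        output_base = f"out-psm{psm}"
        subprocess.run(build_command(image, output_base, psm, lang), check=True)
        # tesseract writes <output_base>.txt when the "txt" config is given
        with open(output_base + ".txt", encoding="utf-8") as f:
            results[psm] = len(f.read().split())
    return results
```

Calling `run_sparse_modes("MMS-Bureau valey-3.png")` should give a rough word count per mode. A higher count only means more text was picked up, not that it was recognized correctly, so the hOCR or PDF output still needs a visual check.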

behel33 commented 4 years ago

Tesseract's page segmentation has problems with text that is too close to a border and with dark/white backgrounds. The page segmentation used by Tesseract is poor and outdated. I have seen demos of OCRopus segmentation that perform better.