Closed: vistalba closed this issue 3 years ago
This image is what Tesseract OCR sees before it attempts OCR (except for the small area hidden by the rectangle; there was text there that a human could read, but Tesseract could not). Tesseract has a known issue with reading dark text on bright backgrounds, among other issues. In short, you bumped into https://github.com/tesseract-ocr/tesseract/issues/1990.
Use ocrmypdf --threshold to get an improved result which, as far as I can tell, works correctly. For best results, though, you should specify all the languages that appear in the file, not just deu+eng; it looks like this file may use another language too, even if you don't care much about that language.
Perhaps I should make --threshold the default behavior.
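As a rough illustration of the suggestion above, here is a minimal Python sketch that assembles such a command line. The file names scan.pdf and scan_ocr.pdf and both helper functions are hypothetical, made up for this example; only the --threshold switch, the -l language option, and the deu+eng language string come from the discussion or the ocrmypdf command line. The command is only executed if an ocrmypdf executable is found on PATH.

```python
import shutil
import subprocess


def build_ocrmypdf_command(input_pdf, output_pdf, languages, threshold=True):
    """Return the argument list for an ocrmypdf run (nothing is executed here)."""
    # Tesseract language codes are joined with '+', e.g. deu+eng.
    cmd = ["ocrmypdf", "-l", "+".join(languages)]
    if threshold:
        # --threshold is a bare switch: it takes no value.
        cmd.append("--threshold")
    cmd += [input_pdf, output_pdf]
    return cmd


def run_if_available(cmd):
    """Run the command only when the executable is on PATH; otherwise return None."""
    if shutil.which(cmd[0]) is None:
        return None
    return subprocess.run(cmd, check=False).returncode


# Hypothetical file names, used purely for illustration.
command = build_ocrmypdf_command("scan.pdf", "scan_ocr.pdf", ["deu", "eng"])
print(" ".join(command))
```

If the file contains a further language, its Tesseract code would be appended to the list passed as languages, assuming the matching language pack is installed.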
Thanks for the reply. Where is this --threshold parameter documented? I can't find it in the documentation and I do not know how to use or define it correctly.
If I understand you correctly, I should always select all languages that could appear in any of the input files, not just the ones I'm interested in?
--threshold has no arguments. It is documented in ocrmypdf --help although not in the general documentation.
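Since the switch is described only in the command-line help, one way to confirm that an installed version offers it is to search the help text. The sketch below is an assumption-laden example, not part of OCRmyPDF: both function names are hypothetical, and it simply runs ocrmypdf --help when the executable is on PATH.

```python
import re
import shutil
import subprocess


def help_mentions_flag(help_text, flag="--threshold"):
    """Return True if the flag appears as a whole option name in the help text."""
    # Reject matches that are part of a longer option such as --threshold-foo.
    pattern = r"(?<![\w-])" + re.escape(flag) + r"(?![\w-])"
    return re.search(pattern, help_text) is not None


def fetch_help_text(executable="ocrmypdf"):
    """Return the output of '<executable> --help', or None if it is not installed."""
    if shutil.which(executable) is None:
        return None
    result = subprocess.run(
        [executable, "--help"], capture_output=True, text=True, check=False
    )
    return result.stdout


text = fetch_help_text()
if text is None:
    print("ocrmypdf not found on PATH")
else:
    print("--threshold available:", help_mentions_flag(text))
```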
Describe the bug An older document was scanned as a PDF. OCRmyPDF doesn't find any text in it. I already tried some other PSM (page segmentation mode) values without luck. On other PDF files OCR works nearly perfectly. Maybe the problem is that this document uses a very old font which isn't recognized by OCR.
To Reproduce I use synOCR on my Synology NAS with the following settings:
Log output:
Example file Encrypted and anonymized example file: https://1drv.ms/u/s!Aoevp124L-bsmm04SyCSFZI-NrSA?e=Ycaf5y
Expected behavior Printed text should be recognized by OCR. Handwritten text in the table doesn't matter. Previously I used OmniPage ComDirect, which had no problems recognizing this text, but I want to get rid of this Windows tool.
System
OCRmyPDF version: ocrmypdf 11.4.0.post7+g4b8ccbe8.d20201222
Installed via pip or a Docker image? -> docker ocrmypdf:latest