ocrmypdf / OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
http://ocrmypdf.readthedocs.io/
Mozilla Public License 2.0

Problem with OCR on an old scan file #699

Closed vistalba closed 3 years ago

vistalba commented 3 years ago

Describe the bug An older document was scanned to PDF. OCRmyPDF doesn't find any text in it. I have already tried some other page segmentation modes (psm) without luck. On other PDF files OCR works nearly perfectly. Maybe the problem is that this document uses a very old font which isn't recognized by OCR.

To Reproduce I use synOCR on my Synology NAS with the following settings:

used image (created):     jbarlow83/ocrmypdf:latest (2020-12-22T09:46:25)
used ocr-parameter:       -srd -l deu+eng --output-type pdfa --image-dpi 300 --oversample 300 -v1

Log output:

              ➜ OCRmyPDF-LOG:
                    DEBUG ocrmypdf - ocrmypdf 11.4.0.post7+g4b8ccbe8.d20201222
                    DEBUG ocrmypdf.subprocess - Running: ['tesseract', '--list-langs']
                    DEBUG ocrmypdf.subprocess.tesseract - stdout/stderr = List of available languages (7):
                  chi_sim
                  deu
                  eng
                  fra
                  osd
                  por
                  spa

                    DEBUG ocrmypdf.subprocess - Running: ['tesseract', '--version']
                    DEBUG ocrmypdf.subprocess - Found tesseract 4.1.1
                    DEBUG ocrmypdf.subprocess - Running: ['tesseract', '-l', 'eng+deu', '--print-parameters', 'pdf']
                    DEBUG ocrmypdf.subprocess - Running: ['gs', '--version']
                    DEBUG ocrmypdf.subprocess - Found gs 9.50
                    DEBUG ocrmypdf.helpers - pikepdf mmap disabled
                     INFO ocrmypdf._validation - reading file from standard input
                  WARNING ocrmypdf._pipeline - Argument --image-dpi is being ignored because the input file is a PDF, not an image.
                    DEBUG ocrmypdf.helpers - os.symlink(/tmp/ocrmypdf.io.u6pklt9y/stdin, /tmp/ocrmypdf.io.u6pklt9y/origin.pdf)
                    DEBUG ocrmypdf.helpers - pikepdf mmap disabled
                    DEBUG ocrmypdf.builtin_plugins.tesseract_ocr - Using Tesseract OpenMP thread limit 3
                    DEBUG ocrmypdf.helpers - pikepdf mmap disabled
                    DEBUG ocrmypdf.subprocess -    1  Running: ['gs', '-dQUIET', '-dSAFER', '-dBATCH', '-dNOPAUSE', '-sDEVICE=jpeggray', '-dFirstPage=1', '-dLastPage=1', '-r400.000000x400.000000', '-o', '-', '-sstdout=%stderr', '-dAutoRotatePages=/None', '-f', '/tmp/ocrmypdf.io.u6pklt9y/origin.pdf']
                    DEBUG ocrmypdf._exec.ghostscript -    1  Rotating output by 0
                    DEBUG ocrmypdf.subprocess -    1  Running: ['tesseract', '-l', 'osd', '--psm', '0', '/tmp/ocrmypdf.io.u6pklt9y/000001_rasterize_preview.jpg', 'stdout']
                     INFO ocrmypdf._exec.tesseract -    1  [tesseract] Too few characters. Skipping this page
                     INFO ocrmypdf._exec.tesseract -    1  [tesseract] Too few characters. Skipping this page
                    ERROR ocrmypdf._exec.tesseract -    1  [tesseract] Error during processing.
                     INFO ocrmypdf._pipeline -    1  page is facing ⇧, confidence 0.00 - no change
                    DEBUG ocrmypdf._pipeline -    1  Rasterize with png16m, rotation 0
                    DEBUG ocrmypdf.subprocess -    1  Running: ['gs', '-dQUIET', '-dSAFER', '-dBATCH', '-dNOPAUSE', '-sDEVICE=png16m', '-dFirstPage=1', '-dLastPage=1', '-r400.000000x400.000000', '-o', '-', '-sstdout=%stderr', '-dAutoRotatePages=/None', '-f', '/tmp/ocrmypdf.io.u6pklt9y/origin.pdf']
                    DEBUG PIL.PngImagePlugin -    1  STREAM b'IHDR' 16 13
                    DEBUG PIL.PngImagePlugin -    1  STREAM b'iCCP' 41 2354
                    DEBUG PIL.PngImagePlugin -    1  iCCP profile name b'default_rgb.icc'
                    DEBUG PIL.PngImagePlugin -    1  Compression method 0
                    DEBUG PIL.PngImagePlugin -    1  STREAM b'pHYs' 2407 9
                    DEBUG PIL.PngImagePlugin -    1  STREAM b'tEXt' 2428 29
                    DEBUG PIL.PngImagePlugin -    1  STREAM b'IDAT' 2469 8192
                    DEBUG ocrmypdf._exec.ghostscript -    1  Rotating output by 0
                    DEBUG PIL.PngImagePlugin -    1  STREAM b'IHDR' 16 13
                    DEBUG PIL.PngImagePlugin -    1  STREAM b'pHYs' 41 9
                    DEBUG PIL.PngImagePlugin -    1  STREAM b'IDAT' 62 8192
                    DEBUG ocrmypdf._pipeline -    1  resolution (400, 400)
                    DEBUG ocrmypdf._pipeline -    1  convert
                    DEBUG PIL.PngImagePlugin -    1  STREAM b'IHDR' 16 13
                    DEBUG PIL.PngImagePlugin -    1  STREAM b'pHYs' 41 9
                    DEBUG PIL.PngImagePlugin -    1  STREAM b'IDAT' 62 8192
                    DEBUG root -    1  imgformat = PNG
                    DEBUG root -    1  input dpi = 400 x 400
                    DEBUG root -    1  rotation = 0°
                    DEBUG root -    1  input colorspace = RGB
                    DEBUG root -    1  width x height = 3304px x 4676px
                    DEBUG root -    1  read_images() embeds a PNG
                    DEBUG ocrmypdf._pipeline -    1  convert done
                    DEBUG ocrmypdf.subprocess -    1  Running: ['tesseract', '-l', 'deu+eng', '-c', 'textonly_pdf=1', PosixPath('/tmp/ocrmypdf.io.u6pklt9y/000001_ocr.png'), '/tmp/ocrmypdf.io.u6pklt9y/000001_ocr_tess', 'pdf', 'txt']
                    DEBUG ocrmypdf._graft -    1  Emplacement update
                    DEBUG ocrmypdf._graft -    1  Text rotation: (text, autorotate, content) -> text misalignment = (0, 0, 0) -> 0
                    DEBUG ocrmypdf._graft -    1  Grafting
                    DEBUG ocrmypdf._graft -    1  Page rotation: (content, auto) -> page = (0, 0) -> 0
                     INFO ocrmypdf._sync - Postprocessing...
                    DEBUG ocrmypdf.helpers - os.symlink(/tmp/ocrmypdf.io.u6pklt9y/graft_layers.pdf, /tmp/ocrmypdf.io.u6pklt9y/fix_docinfo.pdf)
                    DEBUG ocrmypdf.subprocess - Running: ['gs', '-dBATCH', '-dNOPAUSE', '-dSAFER', '-dCompatibilityLevel=1.6', '-sDEVICE=pdfwrite', '-dAutoRotatePages=/None', '-sColorConversionStrategy=RGB', '-dAutoFilterColorImages=true', '-dAutoFilterGrayImages=true', '-dJPEGQ=95', '-dPDFA=2', '-dPDFACompatibilityPolicy=1', '-o', '-', '-sstdout=%stderr', '/tmp/ocrmypdf.io.u6pklt9y/fix_docinfo.pdf', '/tmp/ocrmypdf.io.u6pklt9y/pdfa.ps']
                    DEBUG ocrmypdf.subprocess.gs - GPL Ghostscript 9.50 (2019-10-15)
                    DEBUG ocrmypdf.subprocess.gs - Copyright (C) 2019 Artifex Software, Inc.  All rights reserved.
                    DEBUG ocrmypdf.subprocess.gs - This software is supplied under the GNU AGPLv3 and comes with NO WARRANTY:
                    DEBUG ocrmypdf.subprocess.gs - see the file COPYING for details.
                    DEBUG ocrmypdf.subprocess.gs - Processing pages 1 through 1.
                    DEBUG ocrmypdf.subprocess.gs - Page 1
                    DEBUG ocrmypdf.subprocess.gs - GPL Ghostscript 9.50: Setting Overprint Mode to 1
                    DEBUG ocrmypdf.subprocess.gs - not permitted in PDF/A-2, overprint mode not set
                    DEBUG ocrmypdf.subprocess.gs - 
                    DEBUG ocrmypdf._exec.ghostscript - Ghostscript had to remove PDF 'overprinting' from the input file to complete PDF/A conversion. 
                    DEBUG ocrmypdf.optimize - Treating 20 as an optimization candidate
                    DEBUG ocrmypdf.optimize - XrefExt(xref=20, ext='.png')
                    DEBUG ocrmypdf.optimize - Optimizable images: JPEGs: 0 PNGs: 1
                    DEBUG ocrmypdf.optimize - Treating 20 as an optimization candidate
                    DEBUG ocrmypdf.optimize - Optimizable images: JBIG2 groups: (0,)
                     INFO ocrmypdf.optimize - Optimize ratio: 1.00 savings: 0.0%
                    DEBUG ocrmypdf.helpers - os.symlink(/tmp/ocrmypdf.io.u6pklt9y/optimize.opt.pdf, /tmp/ocrmypdf.io.u6pklt9y/optimize.pdf)
                    DEBUG ocrmypdf._pipeline - /tmp/ocrmypdf.io.u6pklt9y/optimize.pdf -> -
                     INFO ocrmypdf._sync - Output sent to stdout
              ← OCRmyPDF-LOG-END
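In a verbose log like the one above, the WARNING and ERROR entries are easy to miss among the DEBUG lines. The following is a minimal sketch, assuming the `LEVEL logger - message` layout seen in this log; the helper name `important_lines` is made up for illustration and is not part of OCRmyPDF or synOCR.

```python
import re

# Matches lines shaped like: "   WARNING ocrmypdf._pipeline - message text"
LOG_LINE = re.compile(
    r"^\s*(DEBUG|INFO|WARNING|ERROR|CRITICAL)\s+(\S+)\s+-\s+(.*)$"
)


def important_lines(log_text, levels=("WARNING", "ERROR", "CRITICAL")):
    """Return (level, logger, message) tuples for the requested levels.

    important_lines is a hypothetical helper written for this example.
    """
    hits = []
    for line in log_text.splitlines():
        match = LOG_LINE.match(line)
        if match and match.group(1) in levels:
            hits.append((match.group(1), match.group(2), match.group(3).strip()))
    return hits
```

Applied to this log, such a filter would surface the `--image-dpi` warning and the Tesseract error during orientation detection, which are the two entries most relevant to the report.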

Example file Encrypted and anonymized example file: https://1drv.ms/u/s!Aoevp124L-bsmm04SyCSFZI-NrSA?e=Ycaf5y

Expected behavior Printed text should be recognized by OCR. The handwritten text in the table doesn't matter. Previously I used OmniPage ComDirect, which has no problem recognizing this text, but I want to get rid of this Windows tool.

System

jbarlow83 commented 3 years ago

This image is what Tesseract OCR sees before it attempts OCR, except for the small area hidden by the rectangle. (There was text there that a human could read, but Tesseract could not.) Tesseract has a known issue with reading dark text on bright backgrounds, among other issues. In short, you bumped into https://github.com/tesseract-ocr/tesseract/issues/1990.

Screen Shot 2020-12-25 at 00 08 48

Use ocrmypdf --threshold to get an improved result, which as far as I can tell works correctly. For best results, though, you should specify all the languages, not just deu+eng; it looks like this file may use another language too, even if you don't care much about that language.

Perhaps I should make --threshold default behavior.
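As a rough sketch of the suggestion above, the snippet below assembles an `ocrmypdf` command line that keeps the reporter's options and adds `--threshold`. The file names `scan.pdf` and `scan_ocr.pdf` and the helper `build_command` are placeholders, not something from the report. `--image-dpi` is left out because the log shows it is ignored for PDF input. Any additional language codes have to be among those Tesseract lists as installed.

```python
import shlex


def build_command(input_pdf, output_pdf, languages=("deu", "eng")):
    """Assemble an ocrmypdf command line with --threshold enabled.

    build_command and the file name arguments are illustrative only.
    """
    return [
        "ocrmypdf",
        "--skip-text",       # -s in the original settings
        "--rotate-pages",    # -r
        "--deskew",          # -d
        "-l", "+".join(languages),
        "--output-type", "pdfa",
        "--oversample", "300",
        "--threshold",       # a plain switch, takes no argument
        input_pdf,
        output_pdf,
    ]


# Print the command for inspection instead of running it.
print(shlex.join(build_command("scan.pdf", "scan_ocr.pdf")))
```

To actually run it, the list can be passed to `subprocess.run(..., check=True)` on a machine where `ocrmypdf` is installed.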

vistalba commented 3 years ago

Thanks for the reply. Where is this parameter --threshold documented? I can't find it in the documentation and I do not know how to use/define it correctly.

If I understand you correctly, I should always select all languages that could appear in any of the input files, not just the ones I'm interested in?

jbarlow83 commented 3 years ago

--threshold has no arguments. It is documented in ocrmypdf --help although not in the general documentation.
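One way to look up an option that only appears in the command-line help is to capture the help output and pick out the lines that mention it. This is a minimal sketch: `option_help` and `ocrmypdf_help` are hypothetical helper names, and the exact wording of the help text depends on the installed OCRmyPDF version.

```python
import subprocess


def option_help(help_text, option):
    """Return the stripped lines of help_text that mention the option."""
    return [line.strip() for line in help_text.splitlines() if option in line]


def ocrmypdf_help():
    """Capture the output of `ocrmypdf --help` (needs ocrmypdf on PATH)."""
    result = subprocess.run(
        ["ocrmypdf", "--help"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

With OCRmyPDF installed, `print("\n".join(option_help(ocrmypdf_help(), "--threshold")))` would show the matching lines. Only the lines containing the option name are returned, so a description that wraps onto following lines will be cut short.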
