ocrmypdf / OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
http://ocrmypdf.readthedocs.io/
Mozilla Public License 2.0

OCRmyPDF terminates with an error #979

Closed geimist closed 2 years ago

geimist commented 2 years ago

Describe the bug

The input file is not processed. OCRmyPDF terminates with the error "SubprocessOutputError: Ghostscript PDF/A rendering failed".

To Reproduce

This error occurs after metadata has been written to a PDF using PyPDF2. If the user then wants OCRmyPDF to process the file again, the run fails with this error.

The error may not be in OCRmyPDF itself, but I don't know how to avoid it.

Example file ocr_test_fehler.pdf

System

OCRmyPDF parameter:

-dcf -l deu --author John Doe

Error message:

  DEBUG ocrmypdf - ocrmypdf 13.4.7.post19+gd8753dc7.d20220613
  DEBUG ocrmypdf.subprocess - Running: ['tesseract', '--list-langs']
  DEBUG ocrmypdf.subprocess.tesseract - stdout/stderr = List of available languages (7):
chi_sim
deu
eng
fra
osd
por
spa

  DEBUG ocrmypdf.subprocess - Running: ['unpaper', '--version']
  DEBUG ocrmypdf.subprocess - Found unpaper 6.1
  DEBUG ocrmypdf.subprocess - Running: ['tesseract', '--version']
  DEBUG ocrmypdf.subprocess - Found tesseract 4.1.1
  DEBUG ocrmypdf.subprocess - Running: ['gs', '--version']
  DEBUG ocrmypdf.subprocess - Found gs 9.55.0
   INFO ocrmypdf._validation - reading file from standard input
  DEBUG ocrmypdf.helpers - os.symlink(/tmp/ocrmypdf.io.9hl8_hur/stdin, /tmp/ocrmypdf.io.9hl8_hur/origin.pdf)
  DEBUG ocrmypdf.builtin_plugins.tesseract_ocr - Using Tesseract OpenMP thread limit 3
   INFO ocrmypdf._pipeline -    1  page already has text! - rasterizing text and running OCR anyway
  DEBUG ocrmypdf._pipeline -    1  Rasterize with png16m, rotation 0
  DEBUG ocrmypdf.subprocess -    1  Running: ['gs', '-dQUIET', '-dSAFER', '-dBATCH', '-dNOPAUSE', '-dInterpolateControl=-1', '-sDEVICE=png16m', '-dFirstPage=1', '-dLastPage=1', '-r400.000000x400.000000', '-o', '-', '-sstdout=%stderr', '-dAutoRotatePages=/None', '-f', '/tmp/ocrmypdf.io.9hl8_hur/origin.pdf']
  DEBUG PIL.PngImagePlugin -    1  STREAM b'IHDR' 16 13
  DEBUG PIL.PngImagePlugin -    1  STREAM b'iCCP' 41 2354
  DEBUG PIL.PngImagePlugin -    1  iCCP profile name b'default_rgb.icc'
  DEBUG PIL.PngImagePlugin -    1  Compression method 0
  DEBUG PIL.PngImagePlugin -    1  STREAM b'pHYs' 2407 9
  DEBUG PIL.PngImagePlugin -    1  STREAM b'tEXt' 2428 31
  DEBUG PIL.PngImagePlugin -    1  STREAM b'IDAT' 2471 8192
  DEBUG ocrmypdf._exec.ghostscript -    1  Rotating output by 0
  DEBUG ocrmypdf.subprocess -    1  Running: ['tesseract', '-l', 'deu', '--psm', '2', '/tmp/ocrmypdf.io.9hl8_hur/000001_rasterize.png', 'stdout']
  DEBUG PIL.PngImagePlugin -    1  STREAM b'IHDR' 16 13
  DEBUG PIL.PngImagePlugin -    1  STREAM b'iCCP' 41 2350
  DEBUG PIL.PngImagePlugin -    1  iCCP profile name b'ICC Profile'
  DEBUG PIL.PngImagePlugin -    1  Compression method 0
  DEBUG PIL.PngImagePlugin -    1  STREAM b'pHYs' 2403 9
  DEBUG PIL.PngImagePlugin -    1  STREAM b'IDAT' 2424 65536
  DEBUG ocrmypdf.subprocess -    1  Running: ['tesseract', '-l', 'deu', '--psm', '2', '/tmp/ocrmypdf.io.9hl8_hur/000001_rasterize.png', 'stdout']
  DEBUG PIL.PngImagePlugin -    1  STREAM b'IHDR' 16 13
  DEBUG PIL.PngImagePlugin -    1  STREAM b'iCCP' 41 2350
  DEBUG PIL.PngImagePlugin -    1  iCCP profile name b'ICC Profile'
  DEBUG PIL.PngImagePlugin -    1  Compression method 0
  DEBUG PIL.PngImagePlugin -    1  STREAM b'pHYs' 2403 9
  DEBUG PIL.PngImagePlugin -    1  STREAM b'IDAT' 2424 65536
  DEBUG PIL.PngImagePlugin -    1  STREAM b'IHDR' 16 13
  DEBUG PIL.PngImagePlugin -    1  STREAM b'iCCP' 41 2350
  DEBUG PIL.PngImagePlugin -    1  iCCP profile name b'ICC Profile'
  DEBUG PIL.PngImagePlugin -    1  Compression method 0
  DEBUG PIL.PngImagePlugin -    1  STREAM b'pHYs' 2403 9
  DEBUG PIL.PngImagePlugin -    1  STREAM b'IDAT' 2424 65536
  DEBUG ocrmypdf.subprocess -    1  Running: ['unpaper', '-v', '--dpi', '400.0', '--layout', 'none', '--mask-scan-size', '100', '--no-border-align', '--no-mask-center', '--no-grayfilter', '--no-blackfilter', '--no-deskew', '/tmp/tmpr5xe_17c/input.pnm', '/tmp/tmpr5xe_17c/output.ppm']
  DEBUG ocrmypdf.subprocess.unpaper -    1  stdout/stderr = [ppm_pipe @ 0x563696cfe380] Stream #0: not enough frames to estimate rate; consider increasing probesize
[image2 @ 0x563696d04740] Using AVStream.codec to pass codec parameters to muxers is deprecated, use AVStream.codecpar instead.
[image2 @ 0x563696d04740] Encoder did not produce proper pts, making some up.
unpaper 6.1
License GPLv2: GNU GPL version 2.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

-------------------------------------------------------------------------------
Processing sheet #1: /tmp/tmpr5xe_17c/input.pnm -> /tmp/tmpr5xe_17c/output.ppm
input-file for sheet 1: /tmp/tmpr5xe_17c/input.pnm
output-file for sheet 1: /tmp/tmpr5xe_17c/output.ppm
sheet size: 3400x4400
...
noise-filter ... deleted 0 clusters.
blur-filter... deleted 0 pixels.
writing output.

  DEBUG PIL.PngImagePlugin -    1  STREAM b'IHDR' 16 13
  DEBUG PIL.PngImagePlugin -    1  STREAM b'pHYs' 41 9
  DEBUG PIL.PngImagePlugin -    1  STREAM b'IDAT' 62 65536
  DEBUG ocrmypdf._pipeline -    1  resolution (399.9992, 399.9992)
  DEBUG ocrmypdf._pipeline -    1  convert
  DEBUG PIL.PngImagePlugin -    1  STREAM b'IHDR' 16 13
  DEBUG PIL.PngImagePlugin -    1  STREAM b'iCCP' 41 2350
  DEBUG PIL.PngImagePlugin -    1  iCCP profile name b'ICC Profile'
  DEBUG PIL.PngImagePlugin -    1  Compression method 0
  DEBUG PIL.PngImagePlugin -    1  STREAM b'pHYs' 2403 9
  DEBUG PIL.PngImagePlugin -    1  STREAM b'IDAT' 2424 65536
  DEBUG img2pdf -    1  PIL format = PNG
  DEBUG img2pdf -    1  imgformat = PNG
  DEBUG img2pdf -    1  input dpi = 400 x 400
  DEBUG img2pdf -    1  rotation = 0°
  DEBUG img2pdf -    1  input colorspace = RGB
  DEBUG img2pdf -    1  width x height = 3400px x 4400px
  DEBUG img2pdf -    1  read_images() embeds a PNG
  DEBUG ocrmypdf._pipeline -    1  convert done
  DEBUG ocrmypdf.subprocess -    1  Running: ['tesseract', '-l', 'deu', '-c', 'textonly_pdf=1', '/tmp/ocrmypdf.io.9hl8_hur/000001_ocr.png', '/tmp/ocrmypdf.io.9hl8_hur/000001_ocr_tess', 'pdf', 'txt']
  DEBUG ocrmypdf._graft -    1  Emplacement update
  DEBUG ocrmypdf._graft -    1  Text rotation: (text, autorotate, content) -> text misalignment = (0, 0, 0) -> 0
  DEBUG ocrmypdf._graft -    1  Grafting
  DEBUG ocrmypdf._graft -    1  Page rotation: (content, auto) -> page = (0, 0) -> 0
   INFO ocrmypdf._sync - Postprocessing...
  DEBUG ocrmypdf.subprocess - Running: ['gs', '-dBATCH', '-dNOPAUSE', '-dSAFER', '-dCompatibilityLevel=1.6', '-sDEVICE=pdfwrite', '-dAutoRotatePages=/None', '-sColorConversionStrategy=LeaveColorUnchanged', '-dAutoFilterColorImages=true', '-dAutoFilterGrayImages=true', '-dJPEGQ=95', '-dPDFA=2', '-dPDFACompatibilityPolicy=1', '-o', '-', '-sstdout=%stderr', '/tmp/ocrmypdf.io.9hl8_hur/fix_docinfo.pdf', '/tmp/ocrmypdf.io.9hl8_hur/pdfa.ps']
  DEBUG ocrmypdf.subprocess.gs - GPL Ghostscript 9.55.0 (2021-09-27)
  DEBUG ocrmypdf.subprocess.gs - Copyright (C) 2021 Artifex Software, Inc.  All rights reserved.
  DEBUG ocrmypdf.subprocess.gs - This software is supplied under the GNU AGPLv3 and comes with NO WARRANTY:
  DEBUG ocrmypdf.subprocess.gs - see the file COPYING for details.
  DEBUG ocrmypdf.subprocess.gs - Error: /typecheck in --runpdf--
  DEBUG ocrmypdf.subprocess.gs - Operand stack:
  DEBUG ocrmypdf.subprocess.gs - --dict:6/6(L)--   ()   --nostringval--   --nostringval--
  DEBUG ocrmypdf.subprocess.gs - Execution stack:
  DEBUG ocrmypdf.subprocess.gs - %interp_exit   .runexec2   --nostringval--   runpdf   --nostringval--   2   %stopped_push   --nostringval--   runpdf   runpdf   false   1   %stopped_push   1990   1   3   %oparray_pop   1989   1   3   %oparray_pop   1977   1   3   %oparray_pop   1978   1   3   %oparray_pop   runpdf   runpdf
  DEBUG ocrmypdf.subprocess.gs - Dictionary stack:
  DEBUG ocrmypdf.subprocess.gs - --dict:772/1123(ro)(G)--   --dict:1/20(G)--   --dict:80/200(L)--   --dict:80/200(L)--   --dict:134/256(ro)(G)--   --dict:324/325(ro)(G)--   --dict:31/32(L)--
  DEBUG ocrmypdf.subprocess.gs - Current allocation mode is local
  DEBUG ocrmypdf.subprocess.gs - GPL Ghostscript 9.55.0: Unrecoverable error, exit code 1
  ERROR ocrmypdf._exec.ghostscript - GPL Ghostscript 9.55.0 (2021-09-27)
Copyright (C) 2021 Artifex Software, Inc.  All rights reserved.
This software is supplied under the GNU AGPLv3 and comes with NO WARRANTY:
see the file COPYING for details.
Error: /typecheck in --runpdf--
Operand stack:
   --dict:6/6(L)--   ()   --nostringval--   --nostringval--
Execution stack:
   %interp_exit   .runexec2   --nostringval--   runpdf   --nostringval--   2   %stopped_push   --nostringval--   runpdf   runpdf   false   1   %stopped_push   1990   1   3   %oparray_pop   1989   1   3   %oparray_pop   1977   1   3   %oparray_pop   1978   1   3   %oparray_pop   runpdf   runpdf
Dictionary stack:
   --dict:772/1123(ro)(G)--   --dict:1/20(G)--   --dict:80/200(L)--   --dict:80/200(L)--   --dict:134/256(ro)(G)--   --dict:324/325(ro)(G)--   --dict:31/32(L)--
Current allocation mode is local
GPL Ghostscript 9.55.0: Unrecoverable error, exit code 1

  ERROR ocrmypdf._sync - ExitCodeException
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ocrmypdf/_exec/ghostscript.py", line 251, in generate_pdfa
    p = run_polling_stderr(
  File "/usr/local/lib/python3.10/dist-packages/ocrmypdf/subprocess/__init__.py", line 105, in run_polling_stderr
    raise CalledProcessError(proc.returncode, args, output=None, stderr=stderr)
subprocess.CalledProcessError: Command '['gs', '-dBATCH', '-dNOPAUSE', '-dSAFER', '-dCompatibilityLevel=1.6', '-sDEVICE=pdfwrite', '-dAutoRotatePages=/None', '-sColorConversionStrategy=LeaveColorUnchanged', '-dAutoFilterColorImages=true', '-dAutoFilterGrayImages=true', '-dJPEGQ=95', '-dPDFA=2', '-dPDFACompatibilityPolicy=1', '-o', '-', '-sstdout=%stderr', '/tmp/ocrmypdf.io.9hl8_hur/fix_docinfo.pdf', '/tmp/ocrmypdf.io.9hl8_hur/pdfa.ps']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ocrmypdf/_sync.py", line 393, in run_pipeline
    optimize_messages = exec_concurrent(context, executor)
  File "/usr/local/lib/python3.10/dist-packages/ocrmypdf/_sync.py", line 309, in exec_concurrent
    pdf, messages = post_process(pdf, context, executor)
  File "/usr/local/lib/python3.10/dist-packages/ocrmypdf/_sync.py", line 239, in post_process
    pdf_out = convert_to_pdfa(pdf_out, ps_stub_out, context)
  File "/usr/local/lib/python3.10/dist-packages/ocrmypdf/_pipeline.py", line 736, in convert_to_pdfa
    context.plugin_manager.hook.generate_pdfa(
  File "/usr/local/lib/python3.10/dist-packages/pluggy/_hooks.py", line 265, in __call__
    return self._hookexec(self.name, self.get_hookimpls(), kwargs, firstresult)
  File "/usr/local/lib/python3.10/dist-packages/pluggy/_manager.py", line 80, in _hookexec
    return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
  File "/usr/local/lib/python3.10/dist-packages/pluggy/_callers.py", line 60, in _multicall
    return outcome.get_result()
  File "/usr/local/lib/python3.10/dist-packages/pluggy/_result.py", line 60, in get_result
    raise ex[1].with_traceback(ex[2])
  File "/usr/local/lib/python3.10/dist-packages/pluggy/_callers.py", line 39, in _multicall
    res = hook_impl.function(*args)
  File "/usr/local/lib/python3.10/dist-packages/ocrmypdf/builtin_plugins/ghostscript.py", line 93, in generate_pdfa
    ghostscript.generate_pdfa(
  File "/usr/local/lib/python3.10/dist-packages/ocrmypdf/_exec/ghostscript.py", line 265, in generate_pdfa
    raise SubprocessOutputError('Ghostscript PDF/A rendering failed') from e
ocrmypdf.exceptions.SubprocessOutputError: Ghostscript PDF/A rendering failed

jbarlow83 commented 2 years ago

You can use Ghostscript to fix this particular error that PyPDF2 created:

gs -q -sDEVICE=pdfwrite -o issue979_gs.pdf issue979.pdf
ocrmypdf -dcf issue979_gs.pdf _.pdf
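
For batch use, the two commands above could be wrapped in a small script. This is only a minimal sketch and not part of OCRmyPDF: the function names `gs_rewrite_command`, `ocr_command`, and `repair_then_ocr` are hypothetical, and it assumes `gs` and `ocrmypdf` are both on `PATH`. It uses only the flags already shown above.

```python
import subprocess
import tempfile
from pathlib import Path


def gs_rewrite_command(src, dst):
    """Build the Ghostscript call that rewrites src into a fresh PDF at dst."""
    return ["gs", "-q", "-sDEVICE=pdfwrite", "-o", str(dst), str(src)]


def ocr_command(src, dst):
    """Build the OCRmyPDF call with the same -dcf flags used above."""
    return ["ocrmypdf", "-dcf", str(src), str(dst)]


def repair_then_ocr(src, dst):
    """Rewrite src with Ghostscript, then run OCRmyPDF on the rewritten copy."""
    with tempfile.TemporaryDirectory() as tmp:
        repaired = Path(tmp) / "repaired.pdf"
        # check=True raises CalledProcessError if either tool exits non-zero.
        subprocess.run(gs_rewrite_command(src, repaired), check=True)
        subprocess.run(ocr_command(repaired, dst), check=True)
```

Calling `repair_then_ocr("issue979.pdf", "_.pdf")` would then correspond to the two shell commands above, with the intermediate file kept in a temporary directory instead of `issue979_gs.pdf`.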

At this point I won't take any further action because there's an easy workaround. I did not investigate the cause, but poppler pdfinfo complains of:

Syntax Error: Can't get Fields array<0a>

In short, this PDF is not well-formed, and Ghostscript and OCRmyPDF are "within their rights" to reject it as noncompliant.

If this type of error starts appearing more often, I'll investigate further and decide whether it needs to be reported to PyPDF2 or Ghostscript, and whether OCRmyPDF should implement a shim to detect and resolve the issue in advance.
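
A user-side pre-flight check in the same spirit might look like the following. This is a hypothetical sketch, not something OCRmyPDF implements: it assumes poppler's `pdfinfo` is installed and that it reports problems on stderr as lines beginning with `Syntax Error`, as in the message quoted above. The function names are invented for illustration.

```python
import subprocess


def syntax_errors(stderr_text):
    """Return the lines of pdfinfo's stderr that report a syntax error."""
    return [
        line.strip()
        for line in stderr_text.splitlines()
        if line.strip().startswith("Syntax Error")
    ]


def pdf_looks_malformed(path):
    """Run pdfinfo on path and report whether it printed any syntax errors."""
    result = subprocess.run(
        ["pdfinfo", str(path)],
        capture_output=True,
        text=True,
        check=False,  # pdfinfo may exit non-zero on damaged files
    )
    return bool(syntax_errors(result.stderr))
```

A file flagged by `pdf_looks_malformed` could be passed through the Ghostscript rewrite before OCRmyPDF is run on it. Whether every such file actually needs that step is an assumption this sketch does not verify.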

geimist commented 2 years ago

The current PyPDF2 version, 2.3.1, works fine. Thank you for your support.