ocrmypdf / OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
http://ocrmypdf.readthedocs.io/
Mozilla Public License 2.0

[Bug]: Ghostscript PDF/A rendering failed #1267

Closed davide125 closed 7 months ago

davide125 commented 8 months ago

Describe the bug

When ingesting https://cdn-data.motu.com/manuals/usb-c-audio/M_Series_User_Guide.pdf in paperless-ngx, ocrmypdf fails because gs exits with status 1.

Steps to reproduce

[2024-02-29 17:45:42,618] [DEBUG] [paperless.parsing.tesseract] Calling OCRmyPDF with args: {'input_file': PosixPath('/tmp/paperless/paperless-ngxq78cjss6/M_Series_User_Guide.pdf'), 'output_file': PosixPath('/tmp/paperless/paperless-f5nbuh0f/archive.pdf'), 'use_threads': True, 'jobs': 12, 'language': 'eng', 'output_type': 'pdfa', 'progress_bar': False, 'color_conversion_strategy': 'RGB', 'skip_text': True, 'clean': True, 'deskew': True, 'rotate_pages': True, 'rotate_pages_threshold': 12.0, 'sidecar': PosixPath('/tmp/paperless/paperless-f5nbuh0f/sidecar.txt')}

Files

https://cdn-data.motu.com/manuals/usb-c-audio/M_Series_User_Guide.pdf

How did you download and install the software?

Docker container

OCRmyPDF version

15.4

Relevant log output

[2024-02-29 17:45:42,432] [INFO] [paperless.consumer] Consuming M_Series_User_Guide.pdf
[2024-02-29 17:45:42,434] [DEBUG] [paperless.consumer] Detected mime type: application/pdf
[2024-02-29 17:45:42,440] [DEBUG] [paperless.consumer] Parser: RasterisedDocumentParser
[2024-02-29 17:45:42,443] [DEBUG] [paperless.consumer] Parsing M_Series_User_Guide.pdf...
[2024-02-29 17:45:42,618] [DEBUG] [paperless.parsing.tesseract] Calling OCRmyPDF with args: {'input_file': PosixPath('/tmp/paperless/paperless-ngxq78cjss6/M_Series_User_Guide.pdf'), 'output_file': PosixPath('/tmp/paperless/paperless-f5nbuh0f/archive.pdf'), 'use_threads': True, 'jobs': 12, 'language': 'eng', 'output_type': 'pdfa', 'progress_bar': False, 'color_conversion_strategy': 'RGB', 'skip_text': True, 'clean': True, 'deskew': True, 'rotate_pages': True, 'rotate_pages_threshold': 12.0, 'sidecar': PosixPath('/tmp/paperless/paperless-f5nbuh0f/sidecar.txt')}
[2024-02-29 17:45:43,344] [WARNING] [paperless.parsing.tesseract] Ghostscript PDF/A rendering failed, consider setting PAPERLESS_OCR_USER_ARGS: '{"continue_on_soft_render_error": true}'
[2024-02-29 17:45:43,347] [ERROR] [paperless.consumer] Error occurred while consuming document M_Series_User_Guide.pdf: SubprocessOutputError: Ghostscript PDF/A rendering failed. See logs for more information.
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/ocrmypdf/_exec/ghostscript.py", line 269, in generate_pdfa
    p = run_polling_stderr(
        ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/ocrmypdf/subprocess/__init__.py", line 115, in run_polling_stderr
    raise CalledProcessError(proc.returncode, args, output=None, stderr=stderr)
subprocess.CalledProcessError: Command '['gs', '-dBATCH', '-dNOPAUSE', '-dSAFER', '-dCompatibilityLevel=1.6', '-sDEVICE=pdfwrite', '-dAutoRotatePages=/None', '-sColorConversionStrategy=RGB', '-dPDFSTOPONERROR', '-dAutoFilterColorImages=true', '-dAutoFilterGrayImages=true', '-dJPEGQ=95', '-dPDFA=2', '-dPDFACompatibilityPolicy=1', '-o', '-', '-sstdout=%stderr', '/tmp/ocrmypdf.io.8vntmd4i/fix_docinfo.pdf', '/tmp/ocrmypdf.io.8vntmd4i/pdfa.ps']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 363, in parse
    ocrmypdf.ocr(**args)
  File "/usr/local/lib/python3.11/site-packages/ocrmypdf/api.py", line 375, in ocr
    return run_pipeline(options=options, plugin_manager=plugin_manager)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/ocrmypdf/_pipelines/ocr.py", line 225, in run_pipeline
    return _run_pipeline(options, plugin_manager)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/ocrmypdf/_pipelines/ocr.py", line 192, in _run_pipeline
    optimize_messages = exec_concurrent(context, executor)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/ocrmypdf/_pipelines/ocr.py", line 148, in exec_concurrent
    pdf, messages = postprocess(pdf, context, executor)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/ocrmypdf/_pipelines/_common.py", line 420, in postprocess
    pdf_out = convert_to_pdfa(pdf_out, ps_stub_out, context)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/ocrmypdf/_pipeline.py", line 814, in convert_to_pdfa
    context.plugin_manager.hook.generate_pdfa(
  File "/usr/local/lib/python3.11/site-packages/pluggy/_hooks.py", line 501, in __call__
    return self._hookexec(self.name, self._hookimpls.copy(), kwargs, firstresult)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/pluggy/_manager.py", line 119, in _hookexec
    return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/pluggy/_callers.py", line 138, in _multicall
    raise exception.with_traceback(exception.__traceback__)
  File "/usr/local/lib/python3.11/site-packages/pluggy/_callers.py", line 102, in _multicall
    res = hook_impl.function(*args)
          ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/ocrmypdf/builtin_plugins/ghostscript.py", line 128, in generate_pdfa
    ghostscript.generate_pdfa(
  File "/usr/local/lib/python3.11/site-packages/ocrmypdf/_exec/ghostscript.py", line 283, in generate_pdfa
    raise SubprocessOutputError('Ghostscript PDF/A rendering failed') from e
ocrmypdf.exceptions.SubprocessOutputError: Ghostscript PDF/A rendering failed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/asgiref/sync.py", line 349, in main_wrap
    raise exc_info[1]
  File "/usr/src/paperless/src/documents/consumer.py", line 516, in try_consume_file
    document_parser.parse(self.working_copy, mime_type, self.filename)
  File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 386, in parse
    raise ParseError(
documents.parsers.ParseError: SubprocessOutputError: Ghostscript PDF/A rendering failed. See logs for more information.
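
The WARNING line in the log above suggests setting PAPERLESS_OCR_USER_ARGS to a JSON object. As a minimal sketch, that value could be generated rather than typed by hand, which avoids quoting mistakes. `build_user_args` is a hypothetical helper written for this illustration (it is not part of paperless-ngx or OCRmyPDF), and the only option shown is the one named in the warning.

```python
import json


def build_user_args(**options):
    """Serialize keyword options into a JSON object string.

    Hypothetical helper for illustration only; the option name used
    below is copied from the paperless-ngx warning in the log.
    """
    return json.dumps(options, sort_keys=True)


value = build_user_args(continue_on_soft_render_error=True)

# Print in a form that can be pasted into an environment file.
print(f"PAPERLESS_OCR_USER_ARGS='{value}'")
```

As the option name suggests, this only asks for processing to continue past the render error; it does not change which Ghostscript version is in use.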
jbarlow83 commented 7 months ago

Using ocrmypdf 16 + Ghostscript 10.02.1, this file can be processed without issue.

It's very likely the error is due to the use of Ghostscript 10.00.0 through 10.02.0, which all contain serious regressions that corrupt PDFs. ocrmypdf 15.4 does not know about these issues.
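
To check whether a given environment falls in the range mentioned above, a small script along these lines may help. It is only a sketch: the affected range (10.00.0 through 10.02.0, inclusive) is taken from the comment above, the function names are made up for this example, and it assumes the `gs` executable is on PATH and that `gs --version` prints a bare version string such as `10.02.1`.

```python
import re
import subprocess

# Range taken from the comment above: 10.00.0 through 10.02.0, inclusive.
AFFECTED_MIN = (10, 0, 0)
AFFECTED_MAX = (10, 2, 0)


def parse_gs_version(text):
    """Turn a version string like '10.02.1' into a tuple like (10, 2, 1)."""
    match = re.match(r"\s*(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"unrecognized Ghostscript version: {text!r}")
    major, minor, patch = match.groups()
    # A missing patch component is treated as 0.
    return int(major), int(minor), int(patch or 0)


def is_affected(version_text):
    """Return True if the version lies inside the affected range."""
    return AFFECTED_MIN <= parse_gs_version(version_text) <= AFFECTED_MAX


def installed_gs_version(executable="gs"):
    """Ask the local Ghostscript binary for its version string."""
    result = subprocess.run(
        [executable, "--version"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    try:
        version = installed_gs_version()
    except (OSError, subprocess.CalledProcessError) as exc:
        print(f"Could not run gs: {exc}")
    else:
        status = "inside" if is_affected(version) else "outside"
        print(f"Ghostscript {version} is {status} the affected range")
```

When paperless-ngx runs in Docker, as in this report, the check would need to run inside the container, since that is where the relevant `gs` binary lives.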