ocrmypdf / OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
http://ocrmypdf.readthedocs.io/
Mozilla Public License 2.0

[Bug]: There was an error in an annotation | Setting Overprint Mode to 1 not permitted in PDF/A-2, overprint mode not set #1414

Closed tsoernes closed 2 days ago

tsoernes commented 3 days ago

Describe the bug

No output pdf is produced. The output log is cut off due to its length

...

GPL Ghostscript 10.02.1: Setting Overprint Mode to 1
 not permitted in PDF/A-2, overprint mode not set

The following errors were encountered at least once while processing this file:
        There was an error in an annotation

 This file had errors that were repaired or ignored.                                                                                                           ghostscript.py:294

 The file was produced by:                                                                                                                                     ghostscript.py:294

 >>>> PDFlib+PDI 9.1.2p1 (PHP7/Linux-x86_64) <<<<                                                                                                              ghostscript.py:294

 Please notify the author of the software that produced this                                                                                                   ghostscript.py:294

 file that it does not conform to Adobe's published PDF                                                                                                        ghostscript.py:294

 specification.                                                                                                                                                ghostscript.py:294

ColorConversionNeededError: The input PDF has an unusual color space. Use                                                                                          _common.py:261
--color-conversion-strategy to convert to a common color space
such as RGB, or use --output-type pdf to skip PDF/A conversion
and retain the original color space.

Steps to reproduce

1. Run `ocrmypdf --skip-text input.pdf output.pdf`

Fedora 40
ghostscript.x86_64                                   10.02.1-12.fc40

Files

The PDF is confidential

How did you download and install the software?

Linux package manager (apt, dnf, etc.)

OCRmyPDF version

15.4.3

Relevant log output

GPL Ghostscript 10.02.1: Setting Overprint Mode to 1
 not permitted in PDF/A-2, overprint mode not set

Page 199
Page 200
GPL Ghostscript 10.02.1: PDFA doesn't allow images with Interpolate true.
...
GPL Ghostscript 10.02.1: PDFA doesn't allow images with Interpolate true.
Page 212
Page 213
Loading font Arial-BoldMT (or substitute) from /usr/share/ghostscript/Resource/Font/NimbusSans-Bold
GPL Ghostscript 10.02.1: PDFA doesn't allow images with Interpolate true.
...
Page 219
GPL Ghostscript 10.02.1: Setting Overprint Mode to 1
...
GPL Ghostscript 10.02.1: Setting Overprint Mode to 1
 not permitted in PDF/A-2, overprint mode not set

Loading font Arial-BoldMT (or substitute) from /usr/share/ghostscript/Resource/Font/NimbusSans-Bold
Loading font Arial-BoldItalicMT (or substitute) from /usr/share/ghostscript/Resource/Font/NimbusSans-BoldItalic
Page 221
GPL Ghostscript 10.02.1: Setting Overprint Mode to 1
 not permitted in PDF/A-2, overprint mode not set

....
GPL Ghostscript 10.02.1: Setting Overprint Mode to 1
 not permitted in PDF/A-2, overprint mode not set

The following errors were encountered at least once while processing this file:
        There was an error in an annotation

 This file had errors that were repaired or ignored.                                                                                                           ghostscript.py:294

 The file was produced by:                                                                                                                                     ghostscript.py:294

 >>>> PDFlib+PDI 9.1.2p1 (PHP7/Linux-x86_64) <<<<                                                                                                              ghostscript.py:294

 Please notify the author of the software that produced this                                                                                                   ghostscript.py:294

 file that it does not conform to Adobe's published PDF                                                                                                        ghostscript.py:294

 specification.                                                                                                                                                ghostscript.py:294

ColorConversionNeededError: The input PDF has an unusual color space. Use                                                                                          _common.py:261
--color-conversion-strategy to convert to a common color space
such as RGB, or use --output-type pdf to skip PDF/A conversion
and retain the original color space.

jbarlow83 commented 2 days ago

Can't reproduce or investigate without a test file.

tsoernes commented 2 days ago

I tried processing one page at a time to see whether I could isolate the error to a specific page that I might be able to share with you.

Surprisingly, the error is not thrown when processing one page at a time.

This throws an error:

    ocrmypdf.ocr(pdf_path2, out_path, skip_text=True, color_conversion_strategy="RGB")

While OCRing one page at a time works:

    from pathlib import Path
    import subprocess
    import tempfile

    import ocrmypdf
    from pikepdf import Pdf  # assumption: the original snippet did not show where Pdf came from

    # pdf_path2 (the input PDF) and pdf_pages_dir (where extracted pages go)
    # are defined elsewhere and are not shown here.

    def extract_pages(path: Path | str, pages: int | str, output_dir: Path | str | None = None) -> Path:
        """
        Extract the given page, pages, or page range from a PDF file to a new file.
        Return the Path of the extracted PDF file.

        pages: e.g. '6' or '6-10'
        """
        path = Path(path)
        pages = str(pages)
        if not path.exists():
            raise FileNotFoundError(path)
        if output_dir:
            output_dir = Path(output_dir)
        else:
            output_dir = path.parent
        output_path = output_dir / (path.stem + f"_pages_{pages}.pdf")
        # Reuse a previously extracted file instead of running qpdf again
        if output_path.exists():
            return output_path
        # "." tells qpdf to take the pages from the primary input file
        cmd = ["qpdf", str(path), "--pages", ".", pages, "--", str(output_path)]
        subprocess.run(cmd, check=True)
        return output_path

    def test():
        # Count the pages; qpdf page numbers are 1-based
        with Pdf.open(pdf_path2) as pdf:
            page_count = len(pdf.pages)
        for page_number in range(1, page_count + 1):
            page_path = extract_pages(pdf_path2, page_number, pdf_pages_dir)
            out_path = Path(tempfile.gettempdir()) / page_path.name
            print("OCRing", page_path)
            ocrmypdf.ocr(page_path, out_path, skip_text=True, color_conversion_strategy="RGB")

jbarlow83 commented 2 days ago

Your "page at a time" code, by using qpdf, is correcting some of the errors in the input file before processing the page.

qpdf can't fix the overprint, interpolate true, or font substitution issues.
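
If qpdf's rewrite is what makes the single-page files work, the same repair can be tried on the whole document before OCR. The following is only a sketch and has not been run against the reporter's file. The function names `qpdf_rewrite_command` and `repair_then_ocr` are hypothetical, and the sketch assumes `qpdf` is on the PATH and `ocrmypdf` is installed. As noted above, this would not be expected to remove the overprint, interpolate, or font substitution messages.

```python
from pathlib import Path
import subprocess


def qpdf_rewrite_command(src: Path, dst: Path) -> list:
    """Build the qpdf command that rewrites src to dst, keeping every page."""
    return ["qpdf", str(src), str(dst)]


def repair_then_ocr(src: Path, repaired: Path, out: Path) -> None:
    """Rewrite the whole PDF with qpdf, then OCR the rewritten copy."""
    # Imported here so the command builder can be used without ocrmypdf installed
    import ocrmypdf

    result = subprocess.run(qpdf_rewrite_command(src, repaired))
    # qpdf exits with 3 when it wrote the output but emitted warnings, which is
    # plausible for a damaged input, so only other non-zero codes are fatal here.
    if result.returncode not in (0, 3):
        raise RuntimeError(f"qpdf failed with exit code {result.returncode}")
    # Same options as the failing call earlier in the thread
    ocrmypdf.ocr(repaired, out, skip_text=True, color_conversion_strategy="RGB")
```

It could be called as `repair_then_ocr(Path("input.pdf"), Path("input_repaired.pdf"), Path("output.pdf"))`, where the file names are placeholders. If the error still occurs on the rewritten file, the problem is not something qpdf's rewrite corrects.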