Closed tsoernes closed 2 days ago
Can't reproduce or investigate without a test file.
I tried processing one page at a time to see whether I could isolate the error to a specific page, which I might then be able to share with you.
Surprisingly, the error is not thrown when processing one page at a time.
This throws an error:
ocrmypdf.ocr(pdf_path2, out_path, skip_text=True, color_conversion_strategy="RGB")
While OCRing one page at a time works:
from pathlib import Path
import subprocess
import tempfile

import ocrmypdf
import pikepdf


def extract_pages(path: Path | str, pages: int | str, output_dir: Path | str | None = None) -> Path:
    """
    Extract the given page, pages, or page range from a PDF file to a new file.
    Return the Path of the extracted PDF file.
    pages: e.g. '6' or '6-10'
    """
    path = Path(path)
    pages = str(pages)
    if not path.exists():
        raise FileNotFoundError(path)
    output_dir = Path(output_dir) if output_dir else path.parent
    output_path = output_dir / f"{path.stem}_pages_{pages}.pdf"
    # Reuse a previously extracted file instead of calling qpdf again.
    if output_path.exists():
        return output_path
    cmd = ["qpdf", str(path), "--pages", ".", pages, "--", str(output_path)]
    subprocess.run(cmd, check=True)
    return output_path


def test():
    # pdf_path2 and pdf_pages_dir are defined elsewhere in my script.
    with pikepdf.open(pdf_path2) as pdf:
        page_count = len(pdf.pages)
    # qpdf page numbers are 1-based.
    for page_number in range(1, page_count + 1):
        page_path = extract_pages(pdf_path2, page_number, pdf_pages_dir)
        out_path = Path(tempfile.gettempdir()) / page_path.name
        print("OCRing", page_path)
        ocrmypdf.ocr(page_path, out_path, skip_text=True, color_conversion_strategy="RGB")
Your "page at a time" code, by using qpdf, is correcting some of the errors in the input file before processing the page.
qpdf can't fix the overprint, interpolate true, or font substitution issues.
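If qpdf's rewrite is what makes the single-page runs succeed, one thing to try is rewriting the whole file with qpdf first and then running OCR on the result. The following is only a sketch under that assumption: the helper names (`build_qpdf_rewrite_cmd`, `rewrite_with_qpdf`, `ocr_after_rewrite`) and the `_qpdf.pdf` suffix are made up for illustration, and as noted above this will not help with the overprint, interpolate, or font substitution issues.

```python
from pathlib import Path
import subprocess


def build_qpdf_rewrite_cmd(src, dst):
    """Return the qpdf command line that rewrites src into dst (hypothetical helper)."""
    return ["qpdf", str(src), str(dst)]


def rewrite_with_qpdf(src, dst):
    """Rewrite the whole PDF with qpdf so that it gets a chance to repair the file."""
    result = subprocess.run(build_qpdf_rewrite_cmd(src, dst))
    # qpdf exits with 0 on success and 3 when it wrote the output but emitted
    # warnings; anything else is treated as a failure here.
    if result.returncode not in (0, 3):
        raise subprocess.CalledProcessError(result.returncode, result.args)
    return Path(dst)


def ocr_after_rewrite(src, out):
    """Rewrite src with qpdf, then OCR the rewritten copy with the same options as above."""
    import ocrmypdf  # imported here so the helpers above can be used without it

    src = Path(src)
    repaired = src.with_name(src.stem + "_qpdf.pdf")  # assumed naming scheme
    rewrite_with_qpdf(src, repaired)
    return ocrmypdf.ocr(repaired, out, skip_text=True, color_conversion_strategy="RGB")
```

If the whole-file run still fails after the rewrite, that would suggest the problem lies in something qpdf leaves untouched rather than in the file structure it repairs.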
Describe the bug
No output PDF is produced. The output log is cut off due to its length.
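Since the log is too long to capture from the terminal, it may be easier to write it to a file when calling OCRmyPDF from Python. This is a minimal sketch using only the standard `logging` module; it assumes OCRmyPDF's log records are emitted under loggers whose names start with `ocrmypdf`, and `log_to_file` is a hypothetical helper, not part of OCRmyPDF.

```python
import logging


def log_to_file(log_path, logger_name="ocrmypdf"):
    """Attach a file handler to the named logger and return the handler."""
    logger = logging.getLogger(logger_name)
    # Capture everything, including debug output, so that nothing is cut off.
    logger.setLevel(logging.DEBUG)
    handler = logging.FileHandler(log_path, mode="w", encoding="utf-8")
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    return handler
```

Calling `log_to_file("ocrmypdf.log")` before `ocrmypdf.ocr(...)` should then leave the complete log in `ocrmypdf.log`, which can be trimmed or redacted before being attached to the issue.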
Steps to reproduce
Files
The PDF is confidential
How did you download and install the software?
Linux package manager (apt, dnf, etc.)
OCRmyPDF version
15.4.3
Relevant log output