ocrmypdf / OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
http://ocrmypdf.readthedocs.io/
Mozilla Public License 2.0

Output file images are corrupted #1345

Open robmclear opened 4 days ago

robmclear commented 4 days ago

Discussed in https://github.com/ocrmypdf/OCRmyPDF/discussions/1340

Originally posted by **robmclear** June 23, 2024

Hi,

Apologies if this is the wrong place to put this, I wasn't sure if I should post here or put it into the Issues area.

I am running a simple bash script to automate processing new PDF files:

`ocrmypdf -f -q --optimize 1 --output-type pdf "$1" "$1"`

The script has worked great for a long time, but now new PDF files have their image information corrupted on output.

I'm attaching a test file pre- and post-processing, along with the output of OCRmyPDF when I run the command manually. I'll add a zip archive of the OCRmyPDF files generated using the -k option.

Thanks in advance for any help troubleshooting this.

[ocrmypdf.io.ptlo7ek6.zip](https://github.com/user-attachments/files/15945920/ocrmypdf.io.ptlo7ek6.zip)
[Terminal Output.txt](https://github.com/user-attachments/files/15945921/Terminal.Output.txt)
[test7.pdf](https://github.com/user-attachments/files/15945922/test7.pdf)
[test7processed.pdf](https://github.com/user-attachments/files/15945923/test7processed.pdf)
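
For readers following along, the one-line script quoted in the post could be wrapped roughly as below. This is only a sketch under assumptions: the function name `ocr_in_place` and the `OCRMYPDF_BIN` override variable are hypothetical (they are not part of OCRmyPDF or of the poster's script); the flags are copied unchanged from the post.

```shell
#!/usr/bin/env bash
# Sketch of a wrapper around the command quoted in the post.
# ocr_in_place and OCRMYPDF_BIN are hypothetical names used for illustration;
# OCRMYPDF_BIN only lets the binary be swapped out (e.g. for a dry run).

ocr_in_place() {
    local pdf="$1"

    # Refuse to run on anything that is not an existing regular file.
    if [ ! -f "$pdf" ]; then
        echo "not a file: $pdf" >&2
        return 1
    fi

    # Same flags as in the post; input and output are the same path,
    # so the original file is replaced by the processed one.
    "${OCRMYPDF_BIN:-ocrmypdf}" -f -q --optimize 1 --output-type pdf "$pdf" "$pdf"
}
```

Calling `ocr_in_place "$1"` at the end of the script would reproduce the behavior described in the post. Since input and output share one path, keeping a copy of the original file is the only way to compare pre- and post-processing results afterwards.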
jbarlow83 commented 3 days ago

I'm not sure what's going on, but it seems to me that the input file is corrupt, or at least generated in a way that most open source PDF viewers (Ghostscript, Evince, pdftoppm) cannot display correctly. This is what they "see":

[Image: 1345 pdftoppm-1]

pdf.js, Foxit and Chromium display a cropped image, which I think is what was expected.

I'd say that the output is corrupt because the input doesn't display correctly on several targets.
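
One way to check how an input file renders in pdftoppm before running OCR is to rasterize its first page to a PNG and look at it. The sketch below is an assumption about how that might be done, not the exact command used for the screenshot above; `render_first_page` and `PDFTOPPM_BIN` are hypothetical names, and the 72 dpi resolution is an arbitrary choice.

```shell
#!/usr/bin/env bash
# Sketch: render page 1 of a PDF with pdftoppm for visual inspection.
# render_first_page and PDFTOPPM_BIN are hypothetical names for illustration.

render_first_page() {
    local pdf="$1"      # input PDF to inspect
    local prefix="$2"   # output path without extension

    if [ ! -f "$pdf" ]; then
        echo "not a file: $pdf" >&2
        return 1
    fi

    # -f 1 -l 1 limits rendering to the first page; -singlefile writes
    # "$prefix.png" instead of appending a page number to the name.
    "${PDFTOPPM_BIN:-pdftoppm}" -png -r 72 -f 1 -l 1 -singlefile "$pdf" "$prefix"
}
```

For example, `render_first_page test7.pdf test7-preview` should leave a `test7-preview.png` to compare against what pdf.js or Chromium shows for the same file.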