ocrmypdf / OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
Mozilla Public License 2.0
12.78k stars 935 forks

Output file images are corrupted #1345

Open robmclear opened 4 days ago

robmclear commented 4 days ago

Discussed in https://github.com/ocrmypdf/OCRmyPDF/discussions/1340

Originally posted by **robmclear** June 23, 2024

Hi,

Apologies if this is the wrong place to put this, I wasn't sure if I should post here or put it into the Issues area.

I am running a simple bash script to automate processing new PDF files:

`ocrmypdf -f -q --optimize 1 --output-type pdf "$1" "$1"`

The script has worked great for a long time, but now new PDF files have their image information corrupted on output. I'm attaching a test file pre and post-processing, along with the output of OCRmyPDF when I run the command manually. I'll add a zip archive of the OCRmyPDF files generated using the -k option.

Thanks in advance for any help troubleshooting this.

[ocrmypdf.io.ptlo7ek6.zip](https://github.com/user-attachments/files/15945920/ocrmypdf.io.ptlo7ek6.zip)
[Terminal Output.txt](https://github.com/user-attachments/files/15945921/Terminal.Output.txt)
[test7.pdf](https://github.com/user-attachments/files/15945922/test7.pdf)
[test7processed.pdf](https://github.com/user-attachments/files/15945923/test7processed.pdf)

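Because the command in the post writes its output over its own input, a bad run replaces the only copy of the file. As a minimal sketch (not something proposed in the thread), the same command could be wrapped so that the original is overwritten only when `ocrmypdf` exits successfully. The function name `ocr_keep_original` and the `.orig` backup suffix are hypothetical; the `ocrmypdf` flags are exactly those from the post.

```shell
#!/usr/bin/env bash
# Sketch only: OCR into a scratch file and replace the original only if
# ocrmypdf exits successfully. ocr_keep_original and the ".orig" backup name
# are hypothetical; the ocrmypdf flags are those from the command in the post.
ocr_keep_original() {
    local input="$1" tmpdir status
    tmpdir="$(mktemp -d)" || return 1
    if ocrmypdf -f -q --optimize 1 --output-type pdf "$input" "$tmpdir/out.pdf"; then
        # Keep an untouched copy of the input before overwriting it.
        cp -p -- "$input" "$input.orig" && mv -- "$tmpdir/out.pdf" "$input"
        status=$?
    else
        # Preserve ocrmypdf's exit code; the input file is left as it was.
        status=$?
    fi
    rm -rf -- "$tmpdir"
    return "$status"
}
```

In a script like the one described, this would be called as `ocr_keep_original "$1"`. Note that a zero exit code only means OCRmyPDF finished without an error; it does not guarantee the images in the output render correctly, so the `.orig` copy is what allows recovery in a case like this one.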
jbarlow83 commented 3 days ago

I'm not sure what's going on, but it seems to me that the input file is corrupt, or at least generated in a way that most open source PDF viewers (Ghostscript, Evince, pdftoppm) cannot display. This is what they "see":

[Image: 1345 pdftoppm-1]

pdf.js, Foxit and Chromium display a cropped image, which I think is what was expected.

I'd say that the output is corrupt because the input doesn't display correctly on several targets.
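The comment does not say which commands produced the rendering above. As a rough sketch of how such a comparison could be reproduced, the first page of the input can be rendered with both pdftoppm and Ghostscript and the two PNGs inspected by eye. The function name `render_first_page`, the 72 dpi resolution, and the output file names are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: render page 1 of a PDF with two open source renderers so the
# results can be compared by eye. render_first_page and the output names are
# hypothetical; the thread does not say which exact commands were used.
render_first_page() {
    local input="$1" prefix="$2"
    # Poppler: writes "<prefix>-pdftoppm.png" (-singlefile omits the page suffix).
    pdftoppm -png -r 72 -f 1 -singlefile "$input" "$prefix-pdftoppm"
    # Ghostscript: writes "<prefix>-gs.png".
    gs -q -dNOPAUSE -dBATCH -sDEVICE=png16m -r72 \
       -dFirstPage=1 -dLastPage=1 \
       -sOutputFile="$prefix-gs.png" "$input"
}
```

For example, `render_first_page test7.pdf test7` would write `test7-pdftoppm.png` and `test7-gs.png`. If both differ from what pdf.js or Chromium show for the same file, that points at the input rather than at OCRmyPDF.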