Closed user1823 closed 3 months ago
The four JPEG images that appear in the warnings seem to be corrupt, but in a way that is correctable. OCRmyPDF is reporting a real issue.
If processed with --optimize 3, OCRmyPDF will reconstruct them, fixing the corruption and producing no errors.
I won't change this, because decisions about this type of error should be handled on a case-by-case basis.
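For readers who want to try the workaround described above, here is a minimal sketch of the invocation. The helper name `build_ocrmypdf_command` and the file names `in.pdf` and `out.pdf` are illustrative assumptions; only the `--optimize` option comes from the thread.

```python
import shlex


def build_ocrmypdf_command(input_pdf, output_pdf, optimize=3):
    """Build the argv list for an OCRmyPDF run at a given optimization level.

    The helper name is hypothetical; only `--optimize` comes from the thread.
    """
    if optimize not in (0, 1, 2, 3):
        raise ValueError("--optimize expects a level from 0 to 3")
    return ["ocrmypdf", "--optimize", str(optimize), input_pdf, output_pdf]


def format_command(argv):
    """Render the argv list as a copy-pasteable shell command."""
    return shlex.join(argv)
```

Passing the list to `subprocess.run(..., check=True)` would execute it, provided OCRmyPDF is installed and on the PATH. As noted in the thread, the repair is a side effect of the optimizer, so this may not suit every file.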
> If processed with --optimize 3, OCRmyPDF will reconstruct them, fixing the corruption and producing no errors.
Thanks. But how would a user know this?
I think ocrmypdf should tell the user to try using --optimize 3 or try re-writing the file using Ghostscript before using ocrmypdf, which also fixes the issue.
By the way, wouldn't it be better for ocrmypdf to do what GS does to the corrupted images when the input file is re-written with GS?
I am not familiar with the technical details, but it seems that the GS approach would be better than making the user use --optimize 3, considering that the documentation says "enables more aggressive optimizations and targets lower image quality." for --optimize 3.
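The Ghostscript re-write mentioned above could look roughly like the sketch below. The thread does not give the exact Ghostscript command, so the `pdfwrite` invocation, the helper names, and the file names are assumptions; on Windows the binary is usually named differently (for example `gswin64c`), hence the `gs_binary` parameter.

```python
import shutil
import subprocess


def build_gs_rewrite_command(input_pdf, rewritten_pdf, gs_binary="gs"):
    """Build a Ghostscript command that re-writes a PDF via the pdfwrite device.

    `-o` sets the output file and implies batch, no-pause operation.
    """
    return [gs_binary, "-o", rewritten_pdf, "-sDEVICE=pdfwrite", input_pdf]


def rewrite_then_ocr(input_pdf, rewritten_pdf, output_pdf, gs_binary="gs"):
    """Re-write the PDF with Ghostscript, then run OCRmyPDF on the result."""
    for tool in (gs_binary, "ocrmypdf"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} was not found on PATH")
    # Step 1: let Ghostscript regenerate the file, including its image streams.
    subprocess.run(
        build_gs_rewrite_command(input_pdf, rewritten_pdf, gs_binary), check=True
    )
    # Step 2: OCR the regenerated file with default OCRmyPDF settings.
    subprocess.run(["ocrmypdf", rewritten_pdf, output_pdf], check=True)
```

Calling `rewrite_then_ocr("in.pdf", "rewritten.pdf", "out.pdf")` would run both steps in sequence and leaves the original file untouched.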
It's not planned behavior for optimize to fix this issue; it just happens to work as a side effect. I realize it may not be suitable for all cases.
If other users report the same sort of issue you see or if there's a consistent source of these files from somewhere (e.g. if you can tell me that saving a file with setting X in Acrobat DC 2024 always produces this error) then I could see adding special behavior to detect and fix. But it could be just a one-off PDF produced by buggy software from many years ago.
I use Paperless-ngx, which uses OCRmyPDF for OCR. I get the same error when scanning pages with my Canon LiDE 220 scanner. After OCR of these pages, I get an empty page in Paperless.
[2024-09-14 15:56:51,505] [ERROR] [ocrmypdf.helpers] WARNING: /tmp/paperless/paperless-apebr9c4/archive.pdf (offset 5272): error decoding stream data for object 12 0: Not a JPEG file: starts with 0x78 0x01
[2024-09-14 15:56:51,506] [WARNING] [ocrmypdf.helpers] WARNING: /tmp/paperless/paperless-apebr9c4/archive.pdf (offset 5272): stream will be re-processed without filtering to avoid data loss
[2024-09-14 15:56:51,507] [WARNING] [ocrmypdf._pipelines._common] Output file: The generated PDF is INVALID
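A note on the bytes in that log: a JPEG stream begins with the marker 0xFF 0xD8, whereas 0x78 0x01 is a valid zlib header. One possible reading, inferred from the log and not confirmed anywhere in this thread, is that the image stream is declared as JPEG but actually holds zlib-compressed data. The sketch below shows how the leading bytes of a stream could be classified; the function name `sniff_stream` is hypothetical.

```python
def sniff_stream(data: bytes) -> str:
    """Guess the encoding of a raw stream from its first two bytes.

    Returns "jpeg", "zlib", or "unknown". This is a heuristic only.
    """
    if len(data) < 2:
        return "unknown"
    # JPEG files start with the Start Of Image marker FF D8.
    if data[0] == 0xFF and data[1] == 0xD8:
        return "jpeg"
    # zlib header (RFC 1950): the low nibble of the first byte is 8 (deflate),
    # the high nibble (window size) is at most 7, and the two header bytes
    # read as a big-endian integer are a multiple of 31.
    cmf, flg = data[0], data[1]
    if (cmf & 0x0F) == 8 and (cmf >> 4) <= 7 and ((cmf << 8) | flg) % 31 == 0:
        return "zlib"
    return "unknown"
```

With the bytes from the log, `sniff_stream(bytes([0x78, 0x01]))` returns `"zlib"`, while a stream starting with `FF D8` returns `"jpeg"`.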
Describe the bug
The generated PDF file has black coloured boxes in place of the images.
Steps to reproduce
Files
in.pdf (Same as in https://github.com/ocrmypdf/OCRmyPDF/issues/1361)
How did you download and install the software?
PyPI (pip, poetry, pipx, etc.)
OCRmyPDF version
16.4.3
Relevant log output