ocrmypdf / OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
http://ocrmypdf.readthedocs.io/
Mozilla Public License 2.0

[Bug]: The generated PDF is INVALID #1367

Closed user1823 closed 3 months ago

user1823 commented 3 months ago

Describe the bug

The generated PDF file has black coloured boxes in place of the images.

Steps to reproduce

1. Run ocrmypdf -v1 --output-type pdf --max-image-mpixels 1000 --tesseract-downsample-above 3508 --redo-ocr in.pdf ocr.pdf
2. Open ocr.pdf
3. See that the generated PDF file has black coloured boxes in place of the images.

Files

in.pdf (Same as in https://github.com/ocrmypdf/OCRmyPDF/issues/1361)

How did you download and install the software?

PyPI (pip, poetry, pipx, etc.)

OCRmyPDF version

16.4.3

Relevant log output

ocrmypdf 16.4.3                                                                                           __main__.py:59
Running: ['C:\\Program Files\\Tesseract-OCR\\tesseract.EXE', '--version']                                __init__.py:133
Found tesseract 5.3.4.20240503                                                                           __init__.py:343
Running: ['C:\\Program Files\\Tesseract-OCR\\tesseract.EXE', '--version']                                __init__.py:133
Running: ['C:\\Program Files\\Tesseract-OCR\\tesseract.EXE', '--version']                                __init__.py:133
Running: ['C:\\Program Files\\gs\\gs10.03.1\\bin\\gswin64c.EXE', '--version']                            __init__.py:133
Found gs 10.3.1                                                                                          __init__.py:343
Running: ['C:\\Program Files\\gs\\gs10.03.1\\bin\\gswin64c.EXE', '--version']                            __init__.py:133
Running: ['C:\\Program Files\\Tesseract-OCR\\tesseract.EXE', '--list-langs']                             __init__.py:133
stdout/stderr = List of available languages in "C:\Program Files\Tesseract-OCR/tessdata/" (2):            __init__.py:73
eng
osd

No language specified; assuming --language eng                                                         _validation.py:54
pikepdf mmap enabled                                                                                      helpers.py:328
Gathering info with 1 thread workers                                                                         info.py:800
pikepdf mmap enabled                                                                                      helpers.py:328
Scanning contents     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1 0:00:00
Using Tesseract OpenMP thread limit 3                                                               tesseract_ocr.py:199
pikepdf mmap enabled                                                                                      helpers.py:328
    1 redoing OCR                                                                                       _pipeline.py:327
    1 Rasterize with png16m, rotation 0                                                                 _pipeline.py:539
    1 Weighted average image DPI is 175.4, max DPI is 600.0. The discrepancy may indicate a high detail _pipeline.py:477
region on this page, but could also indicate a problem with the input PDF file. Page image will be
rendered at 400.0 DPI.
    1 Running: ['C:\\Program Files\\gs\\gs10.03.1\\bin\\gswin64c.EXE', '-dQUIET', '-dSAFER', '-dBATCH',  __init__.py:133
'-dNOPAUSE', '-dInterpolateControl=-1', '-sDEVICE=png16m', '-dFirstPage=1', '-dLastPage=1',
'-r400.000000x400.000000', '-dPDFSTOPONERROR', '-o', '-', '-sstdout=%stderr', '-dAutoRotatePages=/None',
'-f', 'C:\\Users\\User\\AppData\\Local\\Temp\\ocrmypdf.io.sjdcj1hj\\origin.pdf']
    1 Rotating output by 0                                                                            ghostscript.py:149
    1 resolution (399.9992, 399.9992)                                                                   _pipeline.py:618
    1 blanking (1098.0533594444444, 192.10619402777775, 1739.7076316888886, 250.43941069444463)         _pipeline.py:642
    1 blanking (1354.719512777778, 192.10619402777775, 1938.0116795244444, 250.43941069444463)          _pipeline.py:642
    1 blanking (1573.4412975555556, 192.10619402777775, 2157.0334637022224, 250.43941069444463)         _pipeline.py:642
    1 blanking (1845.4407535555556, 192.10619402777775, 2369.63526072, 250.43941069444463)              _pipeline.py:642
    1 blanking (2044.8292436666666, 192.10619402777775, 2765.05446988, 250.43941069444463)              _pipeline.py:642
    1 blanking (3003.3273266666665, 200.20895560000008, 3186.65362668, 261.3199444888887)               _pipeline.py:642
    1 blanking (57.333218666666944, 450.2862332222221, 821.4983570000003, 519.7305387777778)            _pipeline.py:642
    1 blanking (2604.550346444444, 475.0972947111122, 3064.9527589711106, 528.874964933334)             _pipeline.py:642
    1 blanking (2863.383162111111, 478.7083986000007, 3165.314780468889, 532.6527351555569)             _pipeline.py:642
    1 blanking (2293.1620803333335, 362.6030752555562, 3440.4375635555552, 534.8249530333342)           _pipeline.py:642
    1 blanking (55.38877811111126, 657.3108191722231, 2505.039417688922, 756.3106211722225)             _pipeline.py:642
    1 blanking (1729.9965399999999, 821.5327129500006, 2784.563319751111, 1011.6434438388897)           _pipeline.py:642
    1 blanking (54.0554474444447, 827.088257394445, 1547.524677166678, 1028.9211870611111)              _pipeline.py:642
    1 blanking (46.38879611111131, 1189.0792000777774, 620.0543154444447, 1504.3091251722217)           _pipeline.py:642
    1 blanking (1729.7743182222223, 1279.1984642833336, 2502.800549944445, 1569.6423278388897)          _pipeline.py:642
    1 blanking (1952.7183167777778, 1889.8833540222226, 1996.996006, 1912.1055318000008)                _pipeline.py:642
    1 blanking (2178.0511994444446, 1885.2166966888894, 2245.606619888889, 1935.4388184666673)          _pipeline.py:642
    1 blanking (2066.0514234444445, 1883.883366022223, 2120.9402025555555, 1936.772149133334)           _pipeline.py:642
    1 blanking (1991.7182387777777, 1946.8110179444452, 2367.9952639999997, 1996.8720289333337)         _pipeline.py:642
    1 blanking (1928.051699444444, 2000.4942439111114, 1999.6071118888885, 2022.7164216888896)          _pipeline.py:642
    1 blanking (2046.1070188888884, 2000.4942439111114, 2164.440115555555, 2022.7164216888896)          _pipeline.py:642
    1 blanking (47.33323866666687, 1741.7530947277774, 691.5225058411113, 2029.363630616666)            _pipeline.py:642
    1 blanking (1731.3298706666665, 2105.3634786166667, 2480.470594604444, 2289.4186660611113)          _pipeline.py:642
    1 blanking (45.99990800000024, 2493.362702616666, 867.865486487778, 2785.528784949999)              _pipeline.py:642
    1 blanking (1733.996532, 2835.862017616667, 2863.6092727699997, 3040.139386838889)                  _pipeline.py:642
    1 blanking (47.33323866666688, 3248.5278589499994, 861.0738334044444, 3647.0270619499997)           _pipeline.py:642
    1 blanking (1735.3298626666663, 3387.3609146166673, 3123.8848633288885, 3680.3047731722227)         _pipeline.py:642
    1 blanking (52.666561333333334, 4001.415242061111, 882.3526797355555, 4241.692539283334)            _pipeline.py:642
    1 blanking (64.66653733333334, 4311.145178155555, 167.55522044444444, 4436.922704377777)            _pipeline.py:642
    1 Resizing image to fit image dimensions limit                                                        imageops.py:56
    1 Rescaled image to (2479, 3508) pixels and (300, 300) dpi                                           imageops.py:151
    1 Running: ['C:\\Program Files\\Tesseract-OCR\\tesseract.EXE', '-l', 'eng',                          __init__.py:133
'C:\\Users\\User\\AppData\\Local\\Temp\\ocrmypdf.io.sjdcj1hj\\000001_ocr.png',
'C:\\Users\\User\\AppData\\Local\\Temp\\ocrmypdf.io.sjdcj1hj\\000001_ocr_hocr', 'hocr', 'txt']
    1 pikepdf.Matrix(0.18, 0, 0, -0.18, 0, 631.44)                                                          _hocr.py:203
    1 pikepdf.Matrix(1, 0, 0, 1, 0, 3508)                                                                   _hocr.py:323
    1 Text rotation: (text, autorotate, content) -> text misalignment = (0, 0, 0) -> 0                     _graft.py:140
    1 Grafting                                                                                             _graft.py:251
    1 Grafting with ctm pikepdf.Matrix(1.33414, 0, 0, 1.33352, 0, -5.68434e-14)                            _graft.py:294
    1 Page rotation: (content, auto) -> page = (0, 0) -> 0                                                 _graft.py:165
OCR                   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1 0:00:00
Postprocessing...                                                                                             ocr.py:144
Running: ['C:\\Program Files\\Tesseract-OCR\\tesseract.EXE', '--version']                                __init__.py:133
Linearizing           ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 100/100 0:00:00
xref 91: skipping image because it is an SMask                                                           optimize.py:280
xref 61: treating as an optimization candidate                                                           optimize.py:282
xref 59: treating as an optimization candidate                                                           optimize.py:282
xref 55: treating as an optimization candidate                                                           optimize.py:282
xref 58: treating as an optimization candidate                                                           optimize.py:282
xref 60: treating as an optimization candidate                                                           optimize.py:282
xref 63: treating as an optimization candidate                                                           optimize.py:282
xref 57: treating as an optimization candidate                                                           optimize.py:282
xref 64: treating as an optimization candidate                                                           optimize.py:282
Recursing into Form XObject /OCR-abobtG7FVsyUsYKR26nmAA in page 0                                        optimize.py:265
xref 62: treating as an optimization candidate                                                           optimize.py:282
xref 56: treating as an optimization candidate                                                           optimize.py:282
xref 64: found image compressed as /FlateDecode /DCTDecode, marked for JPEG optimization                  optimize.py:98
XrefExt(xref=64, ext='.png')                                                                             optimize.py:347
xref 55: found image compressed as /FlateDecode /DCTDecode, marked for JPEG optimization                  optimize.py:98
XrefExt(xref=55, ext='.png')                                                                             optimize.py:347
XrefExt(xref=56, ext='.png')                                                                             optimize.py:347
XrefExt(xref=57, ext='.png')                                                                             optimize.py:347
XrefExt(xref=58, ext='.png')                                                                             optimize.py:347
xref 59: found image compressed as /FlateDecode /DCTDecode, marked for JPEG optimization                  optimize.py:98
XrefExt(xref=59, ext='.png')                                                                             optimize.py:347
xref 60: found image compressed as /FlateDecode /DCTDecode, marked for JPEG optimization                  optimize.py:98
XrefExt(xref=60, ext='.png')                                                                             optimize.py:347
XrefExt(xref=61, ext='.png')                                                                             optimize.py:347
XrefExt(xref=62, ext='.png')                                                                             optimize.py:347
xref 63: found image compressed as /FlateDecode /DCTDecode, marked for JPEG optimization                  optimize.py:98
XrefExt(xref=63, ext='.png')                                                                             optimize.py:347
Optimizable images: JPEGs: 0 PNGs: 10                                                                    optimize.py:352
Recompressing JPEGs   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% 0/0 -:--:--
xref 91: skipping image because it is an SMask                                                           optimize.py:280
xref 61: treating as an optimization candidate                                                           optimize.py:282
xref 59: treating as an optimization candidate                                                           optimize.py:282
xref 55: treating as an optimization candidate                                                           optimize.py:282
xref 58: treating as an optimization candidate                                                           optimize.py:282
xref 60: treating as an optimization candidate                                                           optimize.py:282
xref 63: treating as an optimization candidate                                                           optimize.py:282
xref 57: treating as an optimization candidate                                                           optimize.py:282
xref 64: treating as an optimization candidate                                                           optimize.py:282
Recursing into Form XObject /OCR-abobtG7FVsyUsYKR26nmAA in page 0                                        optimize.py:265
xref 62: treating as an optimization candidate                                                           optimize.py:282
xref 56: treating as an optimization candidate                                                           optimize.py:282
xref 64: found image compressed as /FlateDecode /DCTDecode, marked for JPEG optimization                  optimize.py:98
xref 64: marking this JPEG as deflatable                                                                 optimize.py:547
xref 55: found image compressed as /FlateDecode /DCTDecode, marked for JPEG optimization                  optimize.py:98
xref 55: marking this JPEG as deflatable                                                                 optimize.py:547
xref 56: marking this JPEG as deflatable                                                                 optimize.py:547
xref 57: marking this JPEG as deflatable                                                                 optimize.py:547
xref 58: marking this JPEG as deflatable                                                                 optimize.py:547
xref 59: found image compressed as /FlateDecode /DCTDecode, marked for JPEG optimization                  optimize.py:98
xref 59: marking this JPEG as deflatable                                                                 optimize.py:547
xref 60: found image compressed as /FlateDecode /DCTDecode, marked for JPEG optimization                  optimize.py:98
xref 60: marking this JPEG as deflatable                                                                 optimize.py:547
xref 61: marking this JPEG as deflatable                                                                 optimize.py:547
xref 62: marking this JPEG as deflatable                                                                 optimize.py:547
xref 63: found image compressed as /FlateDecode /DCTDecode, marked for JPEG optimization                  optimize.py:98
xref 63: marking this JPEG as deflatable                                                                 optimize.py:547
Deflating JPEGs       ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 10/10 0:00:00
xref 91: skipping image because it is an SMask                                                           optimize.py:280
xref 61: treating as an optimization candidate                                                           optimize.py:282
xref 59: treating as an optimization candidate                                                           optimize.py:282
xref 55: treating as an optimization candidate                                                           optimize.py:282
xref 58: treating as an optimization candidate                                                           optimize.py:282
xref 60: treating as an optimization candidate                                                           optimize.py:282
xref 63: treating as an optimization candidate                                                           optimize.py:282
xref 57: treating as an optimization candidate                                                           optimize.py:282
xref 64: treating as an optimization candidate                                                           optimize.py:282
Recursing into Form XObject /OCR-abobtG7FVsyUsYKR26nmAA in page 0                                        optimize.py:265
xref 62: treating as an optimization candidate                                                           optimize.py:282
xref 56: treating as an optimization candidate                                                           optimize.py:282
xref 64: found image compressed as /FlateDecode /DCTDecode, marked for JPEG optimization                  optimize.py:98
xref 55: found image compressed as /FlateDecode /DCTDecode, marked for JPEG optimization                  optimize.py:98
xref 56: found image compressed as /FlateDecode /DCTDecode, marked for JPEG optimization                  optimize.py:98
xref 57: found image compressed as /FlateDecode /DCTDecode, marked for JPEG optimization                  optimize.py:98
xref 58: found image compressed as /FlateDecode /DCTDecode, marked for JPEG optimization                  optimize.py:98
xref 59: found image compressed as /FlateDecode /DCTDecode, marked for JPEG optimization                  optimize.py:98
xref 60: found image compressed as /FlateDecode /DCTDecode, marked for JPEG optimization                  optimize.py:98
xref 61: found image compressed as /FlateDecode /DCTDecode, marked for JPEG optimization                  optimize.py:98
xref 62: found image compressed as /FlateDecode /DCTDecode, marked for JPEG optimization                  optimize.py:98
xref 63: found image compressed as /FlateDecode /DCTDecode, marked for JPEG optimization                  optimize.py:98
Optimizable images: JBIG2 groups: 0                                                                      optimize.py:363
JBIG2                 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% 0/0 -:--:--
Running: ['C:\\jbig2enc-0.29\\jbig2.EXE', '--version']                                                   __init__.py:133
Running: ['C:\\pngquant\\pngquant.EXE', '--version']                                                     __init__.py:133
Image optimization ratio: 1.02 savings: 2.1%                                                            _pipeline.py:989
Total file size ratio: 1.03 savings: 3.0%                                                               _pipeline.py:992
C:\Users\User\AppData\Local\Temp\ocrmypdf.io.sjdcj1hj\optimize.pdf -> ocr.pdf                         _pipeline.py:1064
WARNING: ocr.pdf (offset 197554): error decoding stream data for object 59 0: Not a JPEG file: starts     helpers.py:277
with 0x48 0x89
WARNING: ocr.pdf (offset 197554): stream will be re-processed without filtering to avoid data loss        helpers.py:285
WARNING: ocr.pdf (offset 260552): error decoding stream data for object 60 0: Not a JPEG file: starts     helpers.py:277
with 0x48 0x89
WARNING: ocr.pdf (offset 260552): stream will be re-processed without filtering to avoid data loss        helpers.py:285
WARNING: ocr.pdf (offset 468726): error decoding stream data for object 63 0: Not a JPEG file: starts     helpers.py:277
with 0x48 0x89
WARNING: ocr.pdf (offset 468726): stream will be re-processed without filtering to avoid data loss        helpers.py:285
WARNING: ocr.pdf (offset 522538): error decoding stream data for object 64 0: Not a JPEG file: starts     helpers.py:277
with 0x48 0x89
WARNING: ocr.pdf (offset 522538): stream will be re-processed without filtering to avoid data loss        helpers.py:285
Output file: The generated PDF is INVALID

jbarlow83 commented 3 months ago

The four JPEG images that appear in the warnings seem to be corrupt, but in a way that is correctable. OCRmyPDF is reporting a real issue.
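
To illustrate what the warnings are reporting: a JPEG stream begins with the bytes 0xFF 0xD8, whereas the warnings above say the affected objects start with 0x48 0x89 (a later report in this thread shows 0x78 0x01). Both of those pairs pass the zlib header check from RFC 1950, which would fit the "/FlateDecode /DCTDecode" filter chain shown in the log. The sketch below only classifies a stream by its first two bytes; the function name is hypothetical, and this is one reading of the log, not a confirmed diagnosis of the file.

```python
def classify_stream_header(data: bytes) -> str:
    """Guess what kind of data a stream starts with, from its first two bytes."""
    if len(data) < 2:
        return "unknown"
    first, second = data[0], data[1]
    # JPEG data begins with the SOI marker 0xFF 0xD8.
    if first == 0xFF and second == 0xD8:
        return "jpeg"
    # zlib header (RFC 1950): the low nibble of the first byte is 8 (deflate),
    # the high nibble (window size exponent) is at most 7, and the two bytes
    # read as a big-endian 16-bit number are a multiple of 31.
    is_deflate = (first & 0x0F) == 8
    window_ok = (first >> 4) <= 7
    checksum_ok = ((first << 8) | second) % 31 == 0
    if is_deflate and window_ok and checksum_ok:
        return "zlib"
    return "unknown"


if __name__ == "__main__":
    # Header pairs quoted in the warnings, plus a real JPEG header for contrast.
    for header in (b"\x48\x89", b"\x78\x01", b"\xff\xd8"):
        print(header.hex(" "), "->", classify_stream_header(header))
```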

If processed with --optimize 3 OCRmyPDF will reconstruct them, fixing the corruption and producing no errors.

I won't change this, because decisions about this type of error should be handled on a case-by-case basis.
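
For reference, the workaround amounts to adding --optimize 3 to the command from the report. A minimal sketch of scripting that from Python follows; the helper names are hypothetical, the other flags are copied from the reproduction steps, and it assumes the ocrmypdf executable is on PATH.

```python
import subprocess


def build_ocrmypdf_command(input_pdf: str, output_pdf: str, optimize: int = 3) -> list[str]:
    """Build the command line from the report with --optimize added."""
    return [
        "ocrmypdf",
        "-v1",
        "--output-type", "pdf",
        "--max-image-mpixels", "1000",
        "--tesseract-downsample-above", "3508",
        "--redo-ocr",
        "--optimize", str(optimize),
        input_pdf,
        output_pdf,
    ]


def run_ocrmypdf(input_pdf: str, output_pdf: str) -> int:
    """Run the command and return its exit code (requires ocrmypdf to be installed)."""
    completed = subprocess.run(build_ocrmypdf_command(input_pdf, output_pdf))
    return completed.returncode


if __name__ == "__main__":
    print(" ".join(build_ocrmypdf_command("in.pdf", "ocr.pdf")))
```

Note that --optimize 3 also enables lossy optimizations, so the output images may differ in quality from the input.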

user1823 commented 3 months ago

> If processed with --optimize 3 OCRmyPDF will reconstruct them, fixing the corruption and producing no errors.

Thanks. But how would a user know this?

I think ocrmypdf should tell the user to try --optimize 3, or to rewrite the file with Ghostscript before running ocrmypdf, which also fixes the issue.

By the way, won't it be better for ocrmypdf to do what GS is doing to the corrupted images when the input file is re-written with GS?

I am not familiar with the technical details but it seems that the GS approach would be better than making the user use --optimize 3 considering that the documentation says "enables more aggressive optimizations and targets lower image quality." for --optimize 3.
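
The exact Ghostscript command used for the rewrite is not given in this thread. One plausible form, sketched below, sends the input through the pdfwrite device before running ocrmypdf on the result. The helper name and file names are hypothetical, and the executable is typically gs on Linux and macOS or gswin64c on Windows.

```python
import subprocess


def build_gs_rewrite_command(
    input_pdf: str, output_pdf: str, gs_executable: str = "gs"
) -> list[str]:
    """Build a Ghostscript command that regenerates a PDF with the pdfwrite device."""
    return [
        gs_executable,
        "-o", output_pdf,        # -o sets the output file and implies -dBATCH -dNOPAUSE
        "-sDEVICE=pdfwrite",
        input_pdf,
    ]


def rewrite_pdf(input_pdf: str, output_pdf: str, gs_executable: str = "gs") -> None:
    """Run Ghostscript; raises CalledProcessError if it exits with an error."""
    subprocess.run(
        build_gs_rewrite_command(input_pdf, output_pdf, gs_executable), check=True
    )


if __name__ == "__main__":
    print(" ".join(build_gs_rewrite_command("in.pdf", "rewritten.pdf")))
```

Because pdfwrite regenerates the file rather than copying it, image encoding in the rewritten PDF may differ from the original.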

jbarlow83 commented 3 months ago

It's not planned behavior for optimize to fix this issue; it just happens to work as a side effect. I realize it may not be suitable for all cases.

If other users report the same sort of issue you see or if there's a consistent source of these files from somewhere (e.g. if you can tell me that saving a file with setting X in Acrobat DC 2024 always produces this error) then I could see adding special behavior to detect and fix. But it could be just a one-off PDF produced by buggy software from many years ago.

BigDi commented 2 months ago

I use Paperless-ngx, which uses OCRmyPDF for OCR. I get the same error when scanning pages with my Canon LiDE 220 scanner. After OCR, these pages show up as empty pages in Paperless.

[2024-09-14 15:56:51,505] [ERROR] [ocrmypdf.helpers] WARNING: /tmp/paperless/paperless-apebr9c4/archive.pdf (offset 5272): error decoding stream data for object 12 0: Not a JPEG file: starts with 0x78 0x01
[2024-09-14 15:56:51,506] [WARNING] [ocrmypdf.helpers] WARNING: /tmp/paperless/paperless-apebr9c4/archive.pdf (offset 5272): stream will be re-processed without filtering to avoid data loss
[2024-09-14 15:56:51,507] [WARNING] [ocrmypdf._pipelines._common] Output file: The generated PDF is INVALID
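
When comparing reports like this one, it can help to pull the affected object numbers and leading bytes out of the log automatically. The sketch below is a hypothetical helper, not part of OCRmyPDF. It assumes each warning sits on a single line, as in the Paperless log above; the wrapped console output in the original report would need to be rejoined first.

```python
import re

# Matches the "Not a JPEG file" warning as it appears on one line of the log.
BAD_JPEG_WARNING = re.compile(
    r"\(offset (\d+)\): error decoding stream data for object (\d+) (\d+): "
    r"Not a JPEG file: starts with (0x[0-9a-fA-F]{2}) (0x[0-9a-fA-F]{2})"
)


def find_bad_jpeg_streams(log_text: str) -> list[dict]:
    """Return offset, object number, generation and header bytes for each warning."""
    results = []
    for match in BAD_JPEG_WARNING.finditer(log_text):
        offset, obj, gen, byte0, byte1 = match.groups()
        results.append(
            {
                "offset": int(offset),
                "object": int(obj),
                "generation": int(gen),
                "header": bytes([int(byte0, 16), int(byte1, 16)]),
            }
        )
    return results


if __name__ == "__main__":
    sample = (
        "WARNING: /tmp/paperless/paperless-apebr9c4/archive.pdf (offset 5272): "
        "error decoding stream data for object 12 0: "
        "Not a JPEG file: starts with 0x78 0x01"
    )
    for entry in find_bad_jpeg_streams(sample):
        print(entry)
```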