ocrmypdf / OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
http://ocrmypdf.readthedocs.io/
Mozilla Public License 2.0

[Bug]: "Corrupt JPEG data: premature end of data segment" with some files #1269

Closed: macdeport closed this issue 4 months ago

macdeport commented 8 months ago

Describe the bug

With some PDF files, the message "Corrupt JPEG data: premature end of data segment" is printed at the end of the run. However, the files produced by OCRmyPDF are perfectly usable.

Steps to reproduce

Run ocrmypdf -l fra --user-words dictionary-fr.txt-alias --pdf-renderer hocr --output-type pdf -O2 --jbig2-lossy --skip-text --sidecar bid.txt -v 1 bid$.pdf bid.pdf

Files

bid$.pdf

How did you download and install the software?

PyPI (pip, poetry, pipx, etc.)

OCRmyPDF version

16.1.1

Relevant log output

$ ocrmypdf -l fra --user-words dictionary-fr.txt-alias --pdf-renderer hocr --output-type pdf -O2 --jbig2-lossy --skip-text --sidecar bid.txt -v 1 bid$.pdf bid.pdf
ocrmypdf 16.1.1                                                                                                       __main__.py:59
Running: ['tesseract', '--version']                                                                                  __init__.py:133
Found tesseract 5.3.3                                                                                                __init__.py:342
Running: ['tesseract', '--version']                                                                                  __init__.py:133
Running: ['pngquant', '--version']                                                                                   __init__.py:133
Found pngquant 2.18.0                                                                                                __init__.py:342
Running: ['jbig2', '--version']                                                                                      __init__.py:133
Found jbig2 0.28                                                                                                     __init__.py:342
Running: ['gs', '--version']                                                                                         __init__.py:133
Found gs 10.2.1                                                                                                      __init__.py:342
Running: ['gs', '--version']                                                                                         __init__.py:133
Running: ['tesseract', '--list-langs']                                                                               __init__.py:133
stdout/stderr = List of available languages in "/opt/local/share/tessdata/" (3):                                      __init__.py:73
eng                                                                                                                                 
fra                                                                                                                                 
osd                                                                                                                                 

pikepdf mmap enabled                                                                                                  helpers.py:326
os.symlink(bid$.pdf, /var/folders/ps/z7flxvdj3b97p9_07lknl6dc0000gn/T/ocrmypdf.io.h0iazvq6/origin)                    helpers.py:179
os.symlink(/var/folders/ps/z7flxvdj3b97p9_07lknl6dc0000gn/T/ocrmypdf.io.h0iazvq6/origin,                              helpers.py:179
/var/folders/ps/z7flxvdj3b97p9_07lknl6dc0000gn/T/ocrmypdf.io.h0iazvq6/origin.pdf)                                                   
Gathering info with 1 thread workers                                                                                     info.py:772
pikepdf mmap enabled                                                                                                  helpers.py:326
Scanning contents     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1 0:00:00
Using Tesseract OpenMP thread limit 3                                                                           tesseract_ocr.py:183
pikepdf mmap enabled                                                                                                  helpers.py:326
    1 skipping all processing on this page                                                                          _pipeline.py:319
    1 Text rotation: (text, autorotate, content) -> text misalignment = (0, 0, 0) -> 0                                 _graft.py:140
    1 Page rotation: (content, auto) -> page = (0, 0) -> 0                                                             _graft.py:165
OCR                   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1 0:00:00
/var/folders/ps/z7flxvdj3b97p9_07lknl6dc0000gn/T/ocrmypdf.io.h0iazvq6/sidecar.txt -> bid.txt                       _pipeline.py:1051
Postprocessing...                                                                                                         ocr.py:146
Running: ['tesseract', '--version']                                                                                  __init__.py:133
xref 491: skipping image because it is an SMask                                                                      optimize.py:277
xref 297: treating as an optimization candidate                                                                      optimize.py:279
xref 490: skipping image because it is an SMask                                                                      optimize.py:277
xref 296: treating as an optimization candidate                                                                      optimize.py:279
xref 492: skipping image because it is an SMask                                                                      optimize.py:277
xref 298: treating as an optimization candidate                                                                      optimize.py:279
xref 299: treating as an optimization candidate                                                                      optimize.py:279
XrefExt(xref=298, ext='.jpg')                                                                                        optimize.py:344
Optimizable images: JPEGs: 1 PNGs: 0                                                                                 optimize.py:349
Recompressing JPEGs   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1 0:00:00
xref 491: skipping image because it is an SMask                                                                      optimize.py:277
xref 297: treating as an optimization candidate                                                                      optimize.py:279
xref 490: skipping image because it is an SMask                                                                      optimize.py:277
xref 296: treating as an optimization candidate                                                                      optimize.py:279
xref 492: skipping image because it is an SMask                                                                      optimize.py:277
xref 298: treating as an optimization candidate                                                                      optimize.py:279
xref 299: treating as an optimization candidate                                                                      optimize.py:279
xref 298: marking this JPEG as deflatable                                                                            optimize.py:544
Deflating JPEGs       ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1 0:00:00
PNGs                  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% 0/0 -:--:--
xref 491: skipping image because it is an SMask                                                                      optimize.py:277
xref 297: treating as an optimization candidate                                                                      optimize.py:279
xref 490: skipping image because it is an SMask                                                                      optimize.py:277
xref 296: treating as an optimization candidate                                                                      optimize.py:279
xref 492: skipping image because it is an SMask                                                                      optimize.py:277
xref 298: treating as an optimization candidate                                                                      optimize.py:279
xref 299: treating as an optimization candidate                                                                      optimize.py:279
xref 298: found image compressed as /FlateDecode /DCTDecode, marked for JPEG optimization                             optimize.py:97
Optimizable images: JBIG2 groups: 0                                                                                  optimize.py:360
JBIG2                 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% 0/0 -:--:--
os.symlink(/var/folders/ps/z7flxvdj3b97p9_07lknl6dc0000gn/T/ocrmypdf.io.h0iazvq6/optimize.opt.pdf,                    helpers.py:179
/var/folders/ps/z7flxvdj3b97p9_07lknl6dc0000gn/T/ocrmypdf.io.h0iazvq6/optimize.pdf)                                                 
Running: ['jbig2', '--version']                                                                                      __init__.py:133
Running: ['pngquant', '--version']                                                                                   __init__.py:133
Image optimization ratio: 1.24 savings: 19.4%                                                                       _pipeline.py:976
Total file size ratio: 1.67 savings: 40.1%                                                                          _pipeline.py:979
/var/folders/ps/z7flxvdj3b97p9_07lknl6dc0000gn/T/ocrmypdf.io.h0iazvq6/optimize.pdf -> bid.pdf                      _pipeline.py:1051
Corrupt JPEG data: premature end of data segment
$

jbarlow83 commented 4 months ago

I could not reproduce this with current versions. I also extracted the JPEG embedded in the output file, and it appears to be well-formed according to the djpeg application.

Perhaps libjpeg needs to be upgraded on your machine?
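
For readers who want a quick first check before reaching for djpeg, here is a minimal Python sketch. It assumes the JPEG has already been extracted from the PDF into a bytes object. The function name `jpeg_marker_check` is hypothetical and not part of OCRmyPDF or libjpeg. It only tests whether the stream begins with the Start Of Image marker and ends with the End Of Image marker, so a stream that passes can still be damaged internally. It is not a substitute for a real decoder.

```python
import sys

SOI = b"\xff\xd8"  # JPEG Start Of Image marker
EOI = b"\xff\xd9"  # JPEG End Of Image marker


def jpeg_marker_check(data: bytes) -> dict:
    """Report whether a JPEG byte stream has the expected first and last markers.

    Hypothetical helper for illustration only: it does not decode the image,
    so it cannot detect corruption inside the entropy-coded data.
    """
    return {
        "starts_with_soi": data.startswith(SOI),
        "ends_with_eoi": data.endswith(EOI),
    }


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python check_jpeg.py extracted_image.jpg
    with open(sys.argv[1], "rb") as f:
        print(jpeg_marker_check(f.read()))
```

A missing End Of Image marker is one plausible reason a decoder would report that the data ended early, but this check alone does not show which image or which library emitted the warning.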

macdeport commented 2 months ago

The problem persists despite this new configuration:

Python 3.11.9 / ocrmypdf 16.4.3 / pikepdf 9.2.0 / pypdf 4.3.1
jbig2 0.28 / gs 10.03.1 / pngquant 3.0.3
tesseract 5.3.3
 leptonica-1.84.1
  libgif 5.2.2 : libjpeg 8d (libjpeg-turbo 2.1.5.1) : libpng 1.6.43 : libtiff 4.6.0 : zlib 1.3.1 : libwebp 1.4.0 : libopenjp2 2.5.2
 Found NEON
 Found libarchive 3.7.4 zlib/1.3.1 liblzma/5.4.6 bz2lib/1.0.8 liblz4/1.9.4 libzstd/1.5.6
 Found libcurl/8.9.1 OpenSSL/3.3.1 zlib/1.3.1 brotli/1.1.0 zstd/1.5.6 libidn2/2.3.7 libpsl/0.21.5 nghttp2/1.62.1

python311 @3.11.9_0+lto+optimizations (active)
ocrmypdf @16.4.3_0+python311 (active)
py311-pikepdf @9.2.0_0 (active)
py311-pypdf @4.3.1_0 (active)
---
jbig2dec @0.20_0 (active)
ghostscript @10.03.1_0+x11 (active)
pngquant @3.0.3_0 (active)
tesseract @5.3.3_2 (active)

macdeport commented 2 months ago

The problem still occurs with this newer configuration:

Python 3.11.9 / ocrmypdf 16.5.0 / pikepdf 9.2.1 / pypdf 4.3.1
jbig2 0.28 / gs 10.03.1 / pngquant 3.0.3 / tesseract 5.4.1
 leptonica-1.84.1
  libgif 5.2.2 : libjpeg 8d (libjpeg-turbo 2.1.5.1) : libpng 1.6.43 : libtiff 4.6.0 : zlib 1.3.1 : libwebp 1.4.0 : libopenjp2 2.5.2
 Found NEON
 Found libarchive 3.7.4 zlib/1.3.1 liblzma/5.4.6 bz2lib/1.0.8 liblz4/1.9.4 libzstd/1.5.6
 Found libcurl/8.9.1 OpenSSL/3.3.1 zlib/1.3.1 brotli/1.1.0 zstd/1.5.6 libidn2/2.3.7 libpsl/0.21.5 nghttp2/1.63.0

python311 @3.11.9_0+lto+optimizations (active)
ocrmypdf @16.5.0_0+python311 (active)
py311-pikepdf @9.2.1_0 (active)
py311-pypdf @4.3.1_0 (active)
---
jbig2dec @0.20_0 (active)
ghostscript @10.03.1_0+x11 (active)
pngquant @3.0.3_0 (active)
tesseract @5.4.1_2 (active)