py-pdf / pypdf

A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files
https://pypdf.readthedocs.io/en/latest/

BUG: Extracted JPEG data seems to end prematurely #2266

Closed michelcrypt4d4mus closed 6 months ago

michelcrypt4d4mus commented 1 year ago

I'm not 100% sure this is a PyPDF issue, but I suspect it is a regression introduced since 3.14.0: this never used to happen in my application and now it happens quite frequently, even though both the calling code and the PyTesseract package are unchanged. There's at least a small chance the problem is in the underlying Tesseract binary instead.

Environment

$ python -m platform
3.11.5

$ python -c "import pypdf;print(pypdf._debug_versions)"
3.16.4

Code + PDF

The code is here, in particular these lines where a PIL.Image object is extracted from the PDF:

for image_number, image in enumerate(page.images, start=1):
    image_obj = Image.open(io.BytesIO(image.data))

produce a PIL.Image object that is passed to PyTesseract here:

text = pytesseract.image_to_string(image)

PyTesseract then fails with this:

TesseractError: (1, 'Corrupt JPEG data: premature end of data segment Error in pixReadStreamJpeg: read error at scanline 2206; nwarn = 1 Error in pixReadStreamJpeg: bad data Error in pixReadStream: jpeg: no pix returned Error in pixRead: pix not read Error during processing.')
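
To check whether the bytes are already cut short before they ever reach Tesseract, a rough heuristic is to look for the JPEG end-of-image marker (FF D9) at the end of each extracted stream. The sketch below is only that, a heuristic: the helper names (`looks_truncated_jpeg`, `suspect_images`, `scan_pdf`) are made up for illustration and are not part of pypdf, and a stream can be damaged in ways this does not detect.

```python
from typing import Iterable, Iterator, List, Tuple

SOI = b"\xff\xd8"  # JPEG start-of-image marker
EOI = b"\xff\xd9"  # JPEG end-of-image marker


def looks_truncated_jpeg(data: bytes) -> bool:
    """Heuristic: True if data starts like a JPEG but does not end with EOI."""
    if not data.startswith(SOI):
        return False  # not a JPEG, so this check does not apply
    # Tolerate trailing padding (NUL bytes or whitespace) after the marker.
    return not data.rstrip(b"\x00\r\n\t ").endswith(EOI)


def suspect_images(images: Iterable[Tuple[str, bytes]]) -> Iterator[str]:
    """Yield the labels of (label, data) pairs that look truncated."""
    for label, data in images:
        if looks_truncated_jpeg(data):
            yield label


def scan_pdf(pdf_path: str) -> List[str]:
    """Hypothetical helper: label every suspicious image in a PDF."""
    # Imported lazily so the helpers above only need the standard library.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    pairs = (
        (f"page {page_number}: {image.name}", image.data)
        for page_number, page in enumerate(reader.pages, start=1)
        for image in page.images
    )
    return list(suspect_images(pairs))
```

Calling `scan_pdf("some.pdf")` (the path is a placeholder) should list the page number and image name of each stream that lacks the end marker, which would narrow down which images to hand to Tesseract.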

PDF file

You can use the PDF file in tests. FTX Claim SC30 01072023101624File595287144.pdf

Traceback

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /Users/uzor/workspace/clown_sort/clown_sort/files/image_file.py:123 in extract_text     │
│                                                                                                  │
│   120 │   │   text = None                                                                        │
│   121 │   │                                                                                      │
│   122 │   │   try:                                                                               │
│ ❱ 123 │   │   │   text = pytesseract.image_to_string(image)                                      │
│   124 │   │   except pytesseract.pytesseract.TesseractError as e:                                │
│   125 │   │   │   console.print_exception()                                                      │
│   126 │   │   │   console.print(warning_text(f"Tesseract OCR failure '{image_name}'! No OCR te   │
│                                                                                                  │
│ /Users/uzor/Library/Caches/pypoetry/virtualenvs/clown-sort-BrYcfkKs-py3.11/lib/python3. │
│ 11/site-packages/pytesseract/pytesseract.py:423 in image_to_string                               │
│                                                                                                  │
│   420 │   """                                                                                    │
│   421 │   args = [image, 'txt', lang, config, nice, timeout]                                     │
│   422 │                                                                                          │
│ ❱ 423 │   return {                                                                               │
│   424 │   │   Output.BYTES: lambda: run_and_get_output(*(args + [True])),                        │
│   425 │   │   Output.DICT: lambda: {'text': run_and_get_output(*args)},                          │
│   426 │   │   Output.STRING: lambda: run_and_get_output(*args),                                  │
│                                                                                                  │
│ /Users/uzor/Library/Caches/pypoetry/virtualenvs/clown-sort-BrYcfkKs-py3.11/lib/python3. │
│ 11/site-packages/pytesseract/pytesseract.py:426 in <lambda>                                      │
│                                                                                                  │
│   423 │   return {                                                                               │
│   424 │   │   Output.BYTES: lambda: run_and_get_output(*(args + [True])),                        │
│   425 │   │   Output.DICT: lambda: {'text': run_and_get_output(*args)},                          │
│ ❱ 426 │   │   Output.STRING: lambda: run_and_get_output(*args),                                  │
│   427 │   }[output_type]()                                                                       │
│   428                                                                                            │
│   429                                                                                            │
│                                                                                                  │
│ /Users/uzor/Library/Caches/pypoetry/virtualenvs/clown-sort-BrYcfkKs-py3.11/lib/python3. │
│ 11/site-packages/pytesseract/pytesseract.py:288 in run_and_get_output                            │
│                                                                                                  │
│   285 │   │   │   'timeout': timeout,                                                            │
│   286 │   │   }                                                                                  │
│   287 │   │                                                                                      │
│ ❱ 288 │   │   run_tesseract(**kwargs)                                                            │
│   289 │   │   filename = f"{kwargs['output_filename_base']}{extsep}{extension}"                  │
│   290 │   │   with open(filename, 'rb') as output_file:                                          │
│   291 │   │   │   if return_bytes:                                                               │
│                                                                                                  │
│ /Users/uzor/Library/Caches/pypoetry/virtualenvs/clown-sort-BrYcfkKs-py3.11/lib/python3. │
│ 11/site-packages/pytesseract/pytesseract.py:264 in run_tesseract                                 │
│                                                                                                  │
│   261 │                                                                                          │
│   262 │   with timeout_manager(proc, timeout) as error_string:                                   │
│   263 │   │   if proc.returncode:                                                                │
│ ❱ 264 │   │   │   raise TesseractError(proc.returncode, get_errors(error_string))                │
│   265                                                                                            │
│   266                                                                                            │
│   267 def run_and_get_output(                                                                    │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
TesseractError: (1, 'Corrupt JPEG data: premature end of data segment Error in pixReadStreamJpeg: read error at scanline 2206; nwarn = 1 Error in pixReadStreamJpeg: bad data Error in pixReadStream: jpeg: no pix returned Error in pixRead: pix not read Error during processing.')

michelcrypt4d4mus commented 1 year ago

When I downgrade to 3.14.0 this issue goes away, so I think it can be confirmed as a regression. Here's a file that fails in 3.16.4 but works fine in 3.14.0 (also usable for tests): FTX Claim Skybridge Capital 30062023113350File971325116.pdf
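
One way to pin down which images differ between the two versions is to fingerprint every extracted image under each version and compare the results. This is a minimal sketch with hypothetical helper names (`digest_images`, `changed_images`, `digest_pdf`), none of which belong to pypdf. A differing digest only shows that the extracted bytes changed, not that the newer output is corrupt.

```python
import hashlib
from typing import Dict, Iterable, List, Tuple


def digest_images(images: Iterable[Tuple[str, bytes]]) -> Dict[str, str]:
    """Map each image label to '<byte length>:<sha256 hex digest>'."""
    return {
        label: f"{len(data)}:{hashlib.sha256(data).hexdigest()}"
        for label, data in images
    }


def changed_images(old: Dict[str, str], new: Dict[str, str]) -> List[str]:
    """Return the sorted labels whose fingerprint differs or is missing on one side."""
    labels = old.keys() | new.keys()
    return sorted(label for label in labels if old.get(label) != new.get(label))


def digest_pdf(pdf_path: str) -> Dict[str, str]:
    """Hypothetical helper: fingerprint every image pypdf extracts from a PDF."""
    # Imported lazily so the helpers above only need the standard library.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    return digest_images(
        (f"page {page_number}: {image.name}", image.data)
        for page_number, page in enumerate(reader.pages, start=1)
        for image in page.images
    )
```

The idea would be to run `digest_pdf` on the same file in one virtualenv with 3.14.0 and another with 3.16.4, save each result (for example with `json.dump`), and pass both dictionaries to `changed_images` to get the list of images worth inspecting.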

pubpub-zz commented 7 months ago

Images from FTX Claim SC30 01072023101624File595287144.pdf: iss2266a_images.zip

pubpub-zz commented 7 months ago

Images from FTX.Claim.Skybridge.Capital.30062023113350File971325116.pdf: iss2266b_images.zip

pubpub-zz commented 7 months ago

@michelcrypt4d4mus Can you please indicate the exact images that used to fail? Checking every image is too time-consuming.