py-pdf / pypdf

A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files
https://pypdf.readthedocs.io/en/latest/

CMYK image with filter_type equal to flate_decode return "not enough image data" error #2321

Closed jianfan123 closed 11 months ago

jianfan123 commented 11 months ago

I am trying to extract the images from this PDF file. The image on page 6 fails with a "not enough image data" error, while the images on page 9 and page 11 get extracted.

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Linux-4.18.0-477.27.1.el8_8.x86_64-x86_64-with-glibc2.2.5

$ python -c "import pypdf;print(pypdf._debug_versions)"
# pypdf==3.17.1, crypt_provider=('local_crypt_fallback', '0.0.0'), PIL=10.0.0

## Code + PDF
[Addressing_Adversarial_Attacks.pdf](https://github.com/py-pdf/pypdf/files/13501846/Addressing_Adversarial_Attacks.pdf)

```python
from pypdf import PdfReader

doc = PdfReader("./Addressing_Adversarial_Attacks.pdf")
for page_idx, page in enumerate(doc.pages):
    count = 0  # per-page image counter, used as a file name prefix
    for image_file_object in page.images:
        # Write each extracted image to the current directory.
        with open(str(count) + image_file_object.name, "wb") as fp:
            fp.write(image_file_object.data)
        count += 1
```

Share here the PDF file(s) that cause the issue. The smaller they are, the better. Let us know if we may add them to our tests!
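The extraction loop above stops at the first image that cannot be decoded, because the exception is raised while iterating over `page.images`. A minimal sketch of a more tolerant loop follows, assuming the image collection supports `len()` and integer indexing (the traceback in this thread shows indexed access via `__getitem__`). The helper name `save_images_tolerantly` and its `prefix` parameter are hypothetical and not part of pypdf.

```python
from pathlib import Path
from typing import Any, List, Sequence, Tuple


def save_images_tolerantly(
    images: Sequence[Any], out_dir: Path, prefix: str = ""
) -> Tuple[List[Path], List[Tuple[int, str]]]:
    """Write every decodable image; collect (index, error) for the rest.

    Hypothetical helper: `images` is assumed to yield objects with
    `.name` and `.data` attributes, as in the loop above.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    saved: List[Path] = []
    failed: List[Tuple[int, str]] = []
    # Index explicitly instead of iterating: an exception raised inside
    # the collection's iterator would end the loop at the first bad image.
    for index in range(len(images)):
        try:
            image = images[index]
        except ValueError as exc:  # e.g. "not enough image data"
            failed.append((index, str(exc)))
            continue
        target = out_dir / f"{prefix}{index}_{image.name}"
        target.write_bytes(image.data)
        saved.append(target)
    return saved, failed
```

Inside the page loop this might be called as `save_images_tolerantly(page.images, Path("out"), prefix=f"p{page_idx}_")`. The page prefix also keeps images from different pages from overwriting each other.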

Traceback

This is the complete traceback I see:

# TODO: Your traceback goes here (if applicable)

stefan6419846 commented 11 months ago

Please provide the complete traceback in the corresponding field.

jianfan123 commented 11 months ago

Traceback (most recent call last):
  File "", line 1, in
  File "/opt/envs/torch/lib64/python3.8/site-packages/pypdf/_page.py", line 2717, in __iter__
    yield self[i]
  File "/opt/envs/torch/lib64/python3.8/site-packages/pypdf/_page.py", line 2713, in __getitem__
    return self.get_function(lst[index])
  File "/opt/envs/torch/lib64/python3.8/site-packages/pypdf/_page.py", line 547, in _get_image
    imgd = _xobj_to_image(cast(DictionaryObject, xobjs[id]))
  File "/opt/envs/torch/lib64/python3.8/site-packages/pypdf/filters.py", line 781, in _xobj_to_image
    img, image_format, extension, invert_color = _handle_flate(
  File "/opt/envs/torch/lib64/python3.8/site-packages/pypdf/_xobj_image_helpers.py", line 163, in _handle_flate
    img = Image.frombytes(mode, size, data)
  File "/opt/envs/torch/lib64/python3.8/site-packages/PIL/Image.py", line 2951, in frombytes
    im.frombytes(data, decoder_name, args)
  File "/opt/envs/torch/lib64/python3.8/site-packages/PIL/Image.py", line 804, in frombytes
    raise ValueError(msg)
ValueError: not enough image data
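The last frames show `Image.frombytes(mode, size, data)` rejecting the buffer. Pillow raises "not enough image data" when `data` is shorter than the mode and size require. A rough sketch of that size check follows, assuming 8-bit channels and only the modes listed. The names `BYTES_PER_PIXEL`, `expected_raw_size` and `has_enough_data` are made up for illustration and are not pypdf or Pillow API. The sketch does not claim to identify the actual cause in this PDF.

```python
from typing import Tuple

# Bytes per pixel for a few Pillow modes with 8-bit channels
# (an assumed subset, for illustration only).
BYTES_PER_PIXEL = {"L": 1, "RGB": 3, "RGBA": 4, "CMYK": 4}


def expected_raw_size(mode: str, size: Tuple[int, int]) -> int:
    """Return the number of raw bytes an image of this mode and size needs."""
    width, height = size
    return width * height * BYTES_PER_PIXEL[mode]


def has_enough_data(mode: str, size: Tuple[int, int], data: bytes) -> bool:
    """Return True if `data` is long enough for the given mode and size."""
    return len(data) >= expected_raw_size(mode, size)
```

For example, a 2x2 CMYK image needs 16 bytes, so a 12-byte buffer (enough for 2x2 RGB) fails the check: `has_enough_data("CMYK", (2, 2), bytes(12))` is `False`.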

pubpub-zz commented 11 months ago

Extracted images for test: p5 and p10.

jianfan123 commented 11 months ago

Thank you!