Open AnzhiZhang opened 1 week ago
Thanks for the report. There is no real need to do any replacements here. The following code is sufficient:
>>> from pypdf import PdfReader
>>> reader = PdfReader('example.pdf')
>>> for page in reader.pages:
...     for image in page.images:
...         image.image.save(image.name)
...
>>>
Some quick tests suggest that neither MuPDF nor poppler (through pdfimages) can currently extract the image correctly either.
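To compare what different tools or versions extract, it can help to report the Pillow mode of each image instead of only saving it. The following is a minimal sketch, not part of pypdf: `count_image_modes` is a hypothetical helper, and the stand-in objects exist only so the example runs without a PDF file.

```python
from collections import Counter
from types import SimpleNamespace


def count_image_modes(reader):
    """Count the Pillow mode (e.g. "RGB", "CMYK") of every extracted image.

    `reader` only needs a `.pages` iterable whose items expose `.images`,
    which is the shape the loop above relies on.
    """
    modes = Counter()
    for page in reader.pages:
        for image in page.images:
            # image.image is expected to be a PIL image; .mode names its color mode.
            modes[image.image.mode] += 1
    return modes


# Stand-in objects (not pypdf classes) so the sketch runs without a PDF.
fake_image = SimpleNamespace(name="Im0.jpg", image=SimpleNamespace(mode="CMYK"))
fake_reader = SimpleNamespace(pages=[SimpleNamespace(images=[fake_image])])
print(count_image_modes(fake_reader))
```

With pypdf installed, passing `PdfReader('example.pdf')` in place of `fake_reader` should work the same way, assuming Pillow is available so that `image.image` is populated.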
When page.images is used to read images, the colors are incorrect. However, when the image is replaced, pypdf calls the same function to read it again, and this time the image is in the correct color space. I explain this in more detail in the Issue Analysis section below.
Environment
Which environment were you using when you encountered the problem?
Code + PDF
This is a minimal, complete example that shows the issue:
Share here the PDF file(s) that cause the issue. The smaller they are, the better. Let us know if we may add them to our tests!
example.pdf
I am personally fine with adding it to the tests. However, it is modified from http://paper.people.com.cn/rmrb/images/2024-10/28/03/rmrb2024102803.pdf and may have copyright issues. It would be better to create a new PDF file with a CMYK image, if that reproduces the issue.
Traceback
This is the complete traceback I see:
Issue Analysis
page.images calls the PageObject._get_image() function in the _page.py file. img.replace() also ends up calling the same _get_image() function, twice, inside ImageFile.replace() via reader.pages[0].images[0].
https://github.com/py-pdf/pypdf/blob/98aa9742e757ec428e1953ba7f47c6d7c44b331a/pypdf/_page.py#L632-L669
https://github.com/py-pdf/pypdf/blob/98aa9742e757ec428e1953ba7f47c6d7c44b331a/pypdf/_page.py#L398-L401
By editing the _get_image() function:
Here is the new output:
One decode output is printed when reading page.images, and two are printed when replacing. This is the cause of the issue: the image is decoded incorrectly when it is read.
https://github.com/py-pdf/pypdf/blob/98aa9742e757ec428e1953ba7f47c6d7c44b331a/pypdf/_page.py#L658
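The same kind of instrumentation can be done without editing the library source, by wrapping the method with a call counter. This is a minimal sketch of that debugging technique only: `count_calls` and `StandInPage` are hypothetical names, and the stand-in class is not pypdf's PageObject.

```python
import functools


def count_calls(func):
    """Wrap func so that each call increments wrapper.calls."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        return func(*args, **kwargs)

    wrapper.calls = 0
    return wrapper


class StandInPage:
    """Hypothetical stand-in for a page class; not pypdf's PageObject."""

    def _get_image(self, name):
        return f"decoded {name}"


# Patch the method on the class, as one might do with a real library class.
StandInPage._get_image = count_calls(StandInPage._get_image)

page = StandInPage()
page._get_image("Im0")
page._get_image("Im0")
print(StandInPage._get_image.calls)
```

Applied to the real method named in this issue, the counter would be expected to show one call for a plain read and more for a replace, matching the printed output described above.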
Now I would like to draw your attention to the _xobj_to_image() function in filters.py:
https://github.com/py-pdf/pypdf/blob/98aa9742e757ec428e1953ba7f47c6d7c44b331a/pypdf/filters.py#L793
The incorrect decoding produces an image with the wrong color space.
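To illustrate why a decoding mismatch is so visible with CMYK data, here is a small sketch using the naive CMYK to RGB formula. It is not pypdf's or Pillow's conversion, and it assumes, purely for illustration, that the mismatch is a channel inversion; the analysis above does not establish which mismatch actually occurs.

```python
def cmyk_to_rgb(c, m, y, k):
    """Naive CMYK -> RGB for 8-bit channels (0 = no ink, 255 = full ink)."""
    return tuple(round(255 * (1 - ch / 255) * (1 - k / 255)) for ch in (c, m, y))


def invert(channels):
    """Invert 8-bit channels (assumed mismatch, for illustration only)."""
    return tuple(255 - ch for ch in channels)


pure_cyan = (255, 0, 0, 0)
print(cmyk_to_rgb(*pure_cyan))          # channels interpreted as stored
print(cmyk_to_rgb(*invert(pure_cyan)))  # same bytes, inverted interpretation
```

Under this assumption the same pixel bytes render as cyan in one case and black in the other, which is why reading the image through two code paths can give visibly different colors.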