Inverted colors when extracting CMYK image

When images are read through page.images, the colors come out inverted. However, when an image is replaced, pypdf calls the same function to read the image again, and that time the image is in the correct color space. More details are in the Issue Analysis section below.

[Images: origin vs. output]

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Windows-11-10.0.22631-SP0

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==5.1.0, crypt_provider=('cryptography', '43.0.0'), PIL=11.0.0

Code + PDF

This is a minimal, complete example that shows the issue:

from pypdf import PdfWriter

def replace(filename):
    writer = PdfWriter(clone_from=filename)

    for page in writer.pages:
        for img in page.images:
            img.replace(img.image)

    filename = filename.replace(".pdf", "_out.pdf")
    with open(filename, "wb") as f:
        writer.write(f)

replace("example.pdf")

Share here the PDF file(s) that cause the issue. The smaller they are, the better. Let us know if we may add them to our tests!

example.pdf

I am personally fine with adding it to the tests. However, it is modified from http://paper.people.com.cn/rmrb/images/2024-10/28/03/rmrb2024102803.pdf and may have copyright issues. It would be better to create a new PDF file with a CMYK image, provided it reproduces the issue.

Traceback

This is the complete traceback I see:

***/python.exe ***/test.py

Process finished with exit code 0

Issue Analysis

page.images calls the PageObject._get_image() function in _page.py. img.replace() also ends up calling the same _get_image() function, twice, inside ImageFile.replace() through reader.pages[0].images[0].

https://github.com/py-pdf/pypdf/blob/98aa9742e757ec428e1953ba7f47c6d7c44b331a/pypdf/_page.py#L632-L669 https://github.com/py-pdf/pypdf/blob/98aa9742e757ec428e1953ba7f47c6d7c44b331a/pypdf/_page.py#L398-L401

I edited the _get_image() function to print the /Decode entry before the image is built:

a = cast(DictionaryObject, xobjs[id])
print(a.get("/Decode"))
imgd = _xobj_to_image(a)

Here is the new output:

***/python.exe ***/test.py
[0, 1, 0, 1, 0, 1, 0, 1]
[1, 0, 1, 0, 1, 0, 1, 0]
[1, 0, 1, 0, 1, 0, 1, 0]

Process finished with exit code 0

The first /Decode value is printed when reading page.images, and the other two are printed while replacing. This is the cause of the issue: the /Decode array is wrong when the image is first read.
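To illustrate why the two arrays in the log matter, here is a minimal sketch of how a /Decode array maps raw samples, following the linear interpolation described in the PDF specification. The helper name apply_decode and the sample pixel are my own assumptions for illustration; this is not pypdf code.

```python
def apply_decode(samples, decode, bits_per_component=8):
    """Map the raw samples of one pixel through a PDF /Decode array.

    Hypothetical helper: each component i is interpolated linearly
    between decode[2*i] (Dmin) and decode[2*i + 1] (Dmax).
    """
    max_sample = (1 << bits_per_component) - 1
    out = []
    for i, sample in enumerate(samples):
        d_min, d_max = decode[2 * i], decode[2 * i + 1]
        out.append(d_min + sample * (d_max - d_min) / max_sample)
    return out


# One made-up raw CMYK pixel (8 bits per component).
pixel = (255, 0, 0, 0)

# The two arrays that appear in the log above.
normal = apply_decode(pixel, [0, 1, 0, 1, 0, 1, 0, 1])
inverted = apply_decode(pixel, [1, 0, 1, 0, 1, 0, 1, 0])

print(normal)    # [1.0, 0.0, 0.0, 0.0]
print(inverted)  # [0.0, 1.0, 1.0, 1.0]
```

Swapping each pair from [0, 1] to [1, 0] flips every component, so an image decoded with the wrong array would be expected to show inverted colors.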

https://github.com/py-pdf/pypdf/blob/98aa9742e757ec428e1953ba7f47c6d7c44b331a/pypdf/_page.py#L658

Now I would like to draw your attention to the _xobj_to_image() function in filters.py:

https://github.com/py-pdf/pypdf/blob/98aa9742e757ec428e1953ba7f47c6d7c44b331a/pypdf/filters.py#L793

The wrong decode array causes the image to be returned with the wrong color space.
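Until this is fixed, a possible stopgap on the caller side is to invert the samples of the extracted image manually. The sketch below is not a fix for pypdf and the helper name invert_cmyk_samples is hypothetical; it assumes raw interleaved CMYK data with 8 bits per component.

```python
def invert_cmyk_samples(raw: bytes) -> bytes:
    """Invert every 8-bit sample in a raw CMYK buffer (hypothetical helper)."""
    return bytes(255 - b for b in raw)


# Two made-up CMYK pixels, 4 bytes each.
raw_pixels = bytes([255, 0, 0, 0, 10, 20, 30, 40])
fixed = invert_cmyk_samples(raw_pixels)

print(list(fixed))  # [0, 255, 255, 255, 245, 235, 225, 215]
```

With Pillow, the raw buffer could come from Image.tobytes() on a CMYK image and be turned back into an image with Image.frombytes("CMYK", size, data), but I have not verified this against the attached file.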
