Modifying the pixel values of images in pdfs

djr2015 commented 1 week ago

I am trying to modify the pixel values of images in pdfs and write those modifications back without changing anything else (PIL image mode, width/height, etc.) about the images other than the pixel values.

I access images like so:

from pikepdf import PdfImage

# Access the XObject corresponding to an image --> image_object

# Get the PIL image from the XObject
pdf_image = PdfImage(image_object)
pil_image = pdf_image.as_pil_image()

# then modify pil_image --> modified_pil_image

From the pikepdf documentation and from other sources online, I've found 2 ways of writing the modified images back to the pdf:

a)

image_object.write(zlib.compress(modified_pil_image.tobytes()), filter=Name('/FlateDecode'))

This way works consistently across pretty much all types of images I've encountered so far, but has the disadvantage of blowing up the pdf size.

b)

# Get image filter (e.g. '/FlateDecode')
filt = str(image_object.Filter)

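# Map image filter to a PIL image format (e.g. 'PNG')
extensions_dict = {
    '/JPXDecode': 'JPEG2000',
    '/FlateDecode': 'PNG',   # NOTE: a PNG file is not the same as a raw Flate-compressed stream
    '/DCTDecode': 'JPEG',
    '/LZWDecode': 'JPEG',    # NOTE: LZW is a lossless filter unrelated to JPEG
}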

with io.BytesIO() as imgByteArr:

    # Save the image to a BytesIO object
    modified_pil_image.save(
        imgByteArr,                      # in-memory bytes buffer
        format=extensions_dict[filt],    # e.g. PNG
    )

    # Get the BytesIO buffer and write that back to the XObject
    image_object.write(imgByteArr.getvalue(), filter=Name(filt))

This way seems to write the modified image back to the pdf well in most cases and doesn't blow up the pdf size, but this sometimes results in corrupted/absent images when the XObject has certain properties. I haven't managed to find a golden rule that determines whether or not the image written back by b) will be corrupted/absent. It's failed on some RGBA images, on some images with SMasks with a Matte field, and on some images with ICC profiles though I don't see an obvious link between these failure cases.

I'm wondering if there's some way of making b) generalize better than my implementation, i.e. writing images back to pdfs while preserving their original format/encoding and without blowing up the pdf size?

jbarlow83 commented 1 week ago

Acrobat and Foxit both have a feature to edit images embedded in a PDF, but it is not hard to break them if you look for corner cases. So even with their relatively huge commercial budgets, this is not easy to get right, which is why pikepdf doesn't try to do full round trips.

Generally speaking, PDF can specify images that do not map to any common image format, so it is more a matter of checking for the specific conditions under which an image does map to a common format. In particular, anything with a color space that calls for Separation or DeviceN will not map, but you can also run into weirdness such as an image whose SMask or Mask doesn't match the dimensions of the underlying image, or a low-resolution RGB image whose mask scales it up to high resolution. PDF also lets you do things that don't make a lot of sense, like DCT-encoded RGB images (a normal JPEG is converted to YCbCr and then encoded with DCT, which gives better compression and color fidelity).
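One concrete case where a PDF image only partly maps to a common format: a /FlateDecode stream holds zlib-compressed pixel samples, not a PNG file, even though both use Deflate internally. The stdlib-only sketch below illustrates the difference. The helper name `is_raw_flate` and the toy pixel data are made up for illustration, and real streams may also use /DecodeParms predictors, which this sketch ignores.

```python
import zlib

# Every PNG file starts with this 8-byte signature.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def is_raw_flate(data: bytes) -> bool:
    """Return True if `data` decompresses as a plain zlib stream."""
    try:
        zlib.decompress(data)
    except zlib.error:
        return False
    return True


# Four red RGB pixels as raw samples, compressed the way method a) does it.
raw_pixels = bytes([255, 0, 0] * 4)
flate_stream = zlib.compress(raw_pixels)

# The first bytes of a PNG file: the signature followed by chunk data.
png_like = PNG_SIGNATURE + b"\x00\x00\x00\rIHDR"

print(is_raw_flate(flate_stream))  # the raw-sample stream is valid zlib data
print(is_raw_flate(png_like))      # PNG file bytes are not
```

If this is what happens in approach b), a viewer asked to apply /FlateDecode to PNG file bytes would have nothing valid to decompress, which may explain some of the corrupted or absent images.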

You definitely cannot write back an RGBA image - the alpha channel needs to be converted to an SMask.
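As a rough illustration of that conversion, the sketch below separates interleaved 8-bit RGBA samples into an RGB plane and an alpha plane using only the standard library. The function name `split_rgba` is hypothetical. Attaching the result to the PDF (an RGB image stream plus a separate /DeviceGray image stream referenced from the main image's /SMask entry) is not shown here.

```python
def split_rgba(rgba: bytes) -> tuple[bytes, bytes]:
    """Split interleaved 8-bit RGBA samples into (rgb, alpha) byte strings."""
    if len(rgba) % 4 != 0:
        raise ValueError("expected 4 bytes per pixel")
    # Every 4th byte, starting at index 3, is an alpha sample.
    alpha = rgba[3::4]
    # Remove the alpha bytes to leave interleaved RGB samples.
    rgb = bytearray(rgba)
    del rgb[3::4]
    return bytes(rgb), alpha


# Two pixels: opaque red and half-transparent blue.
pixels = bytes([255, 0, 0, 255, 0, 0, 255, 128])
rgb, alpha = split_rgba(pixels)
```

With a PIL image, `modified_pil_image.tobytes()` on an RGBA-mode image should yield samples in this interleaved layout, though that is worth verifying for the image modes you actually encounter.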

There are some further complications with Form XObjects and images that are used multiple times.

If the image editor sends back an image with an embedded color profile, which PNG/JPG/TIFF can all do, you should extract the profile and save it as an ICCProfile.

img2pdf has excellent image to PDF creation code on supported images and decent handling of unsupported images. A decent strategy would actually be to pass your modified image to img2pdf, then open the PDF it creates and use Page.as_form_xobject() to inject it back.

djr2015 commented 1 week ago

Thanks @jbarlow83. I'm exploring the strategy of creating Form XObjects from the modified images via Page.as_form_xobject().

My understanding is that this strategy would involve replacing the old image's XObject with the new one I create, as opposed to overwriting its content stream via image_object.write().

So far the only way I've found to replace an XObject is via its parent page's Resources dictionary. However this dictionary indexes the image according to something like /Im0 which isn't a unique identifier unlike image_object.objgen[0]. Can you recommend/is there currently a way of updating XObjects by their object number directly?

jbarlow83 commented 1 week ago

The page's content stream calls for rendering XObjects by name. It does not care whether the XObject is an image, a form, or something else. For example, it will say /Im0 Do. If you replace the key, page.Resources.XObject['/Im0'] = other_object, the new object will be rendered there. The object number (in this particular case) does not matter.
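To see which names a page actually invokes, you can look for `Do` operators in its decoded content stream. The sketch below is a deliberately simplified, stdlib-only scan. The function name `xobject_names_used` is made up, and the regular expression is not a real PDF tokenizer (it can be fooled by inline image data or string contents), so for real documents pikepdf's `parse_content_stream` is the safer route.

```python
import re

# A PDF name token followed by whitespace and the Do operator.
_DO_PATTERN = re.compile(rb"(/[^\s/\[\]()<>{}%]+)\s+Do(?=\s|$)")


def xobject_names_used(content: bytes) -> list[str]:
    """Return the XObject names invoked via `Do`, in order of appearance."""
    return [m.decode("latin-1") for m in _DO_PATTERN.findall(content)]


# Toy content stream: draw an image, then a form XObject.
content = b"q 100 0 0 100 0 0 cm /Im0 Do Q\nq /Fm1 Do Q"
names = xobject_names_used(content)
```

Each returned name is a key to look up in (or replace within) the page's XObject resources.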

jbarlow83 commented 1 week ago

You can use Object.replace_object if you really need to maintain the same object ID, but in this particular case that is not necessary.

pikepdf / pikepdf

Modifying the pixel values of images in pdfs #623