py-pdf / pypdf

A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files
https://pypdf.readthedocs.io/en/latest/

Non-standard annotations are not deleted with remove_annotations() #2438

Closed whitesnakeftw closed 8 months ago

whitesnakeftw commented 9 months ago

Explanation

In this PDF, PdfWriter.remove_annotations() doesn't succeed in removing the highlighting because it is apparently stored as an image ('/Subtype': '/Image', '/Type': '/XObject').

PdfWriter.remove_images(to_delete=ImageType.XOBJECT_IMAGES) succeeds of course, but it also removes the actual images. What distinguishes the two is the /SMask attribute in the highlighting. Now, I can easily fix the problem by running a regex that removes everything that's between "obj" and "endobj" when /SMask is found and then repairing the resulting PDF:

33 0 obj 
<<
/Width 1800
/BitsPerComponent 8
/SMask 89 0 R
/Height 2542
/Subtype /Image
/Length 13726800
/Type /XObject
/ColorSpace /DeviceRGB
>>
stream
....
endstream 
endobj

But I can't find a way to get pypdf to remove just the objects that have /SMask. It would be nice if we could remove all objects that have a particular ImageAttributes. Or maybe make PdfWriter.remove_annotations(subtypes=None) also remove all objects that have /SMask (I have no idea if /SMask is also used for something else though).
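The distinction described above (image XObjects that carry an /SMask entry) can at least be inspected with plain dictionary access. What follows is a minimal sketch, not an existing pypdf feature: the helper name find_smask_images is hypothetical, it only looks at /Resources stored directly on the page (inherited resources are ignored), and it relies on pypdf page and resource objects behaving like ordinary dictionaries for `in` checks and `[]` lookups.

```python
def find_smask_images(page):
    """Return the resource names of image XObjects on `page` that have /SMask.

    `page` is any dict-like page object; a pypdf PageObject is assumed to
    work the same way for `in` checks and `[]` lookups.
    """
    if "/Resources" not in page:
        return []
    resources = page["/Resources"]
    if "/XObject" not in resources:
        return []
    xobjects = resources["/XObject"]

    names = []
    for name in xobjects:
        obj = xobjects[name]
        subtype = obj["/Subtype"] if "/Subtype" in obj else None
        # Keep only images (not form XObjects) that reference a soft mask
        if subtype == "/Image" and "/SMask" in obj:
            names.append(name)
    return names
```

With pypdf this could be called as find_smask_images(page) for each page in PdfReader("input.pdf").pages to see which names would be affected before deleting or replacing anything.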

Code Example

Possibly something like this:

from pypdf import PdfWriter

writer = PdfWriter()
writer.remove_images(to_delete=ImageAttributes.S_MASK)

MartinThoma commented 9 months ago

Would https://github.com/py-pdf/pypdf/pull/1831 solve your issue?

whitesnakeftw commented 9 months ago

Would #1831 solve your issue?

It doesn't seem to work. I used @MrTomRod's _utils.py and _writer.py (I also had to import logger_error to make them compatible with current pypdf) and ran:

writer = PdfWriter()
writer.clone_document_from_reader(reader)
writer.remove_annotations(subtypes=None)
writer.remove_annotations(annotation_filter_function="\SMask")

But the highlighting is still there. I'm guessing that's because annotation_filter_function is a further filter on top of subtypes, meaning we're only filtering inside /Annots, but in my document there are no /Annots at all.

Given my limited understanding of the code, I'm not sure if defining my own function, like the example @MrTomRod provided, would make a difference:

def is_google_link(
    page: DictionaryObject,
    annotation: ArrayObject,
    obj: DictionaryObject
) -> bool:
    try:
        uri = obj['/A']['/URI']
        return uri.startswith('https://google.com/')
    except KeyError:
        return False

but I'm guessing probably not. I believe an option similar to annotation_filter_function should possibly be implemented inside remove_objects_from_page().
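The guess above, that the document has no /Annots entries for remove_annotations() to act on, can be checked directly. This is a minimal sketch under assumptions: annotation_subtypes is a hypothetical helper name, and the page is treated as a dict-like object whose /Annots entries may be indirect references exposing get_object(), as pypdf's are.

```python
def annotation_subtypes(page):
    """Return the /Subtype of every annotation listed in the page's /Annots.

    An empty list means there is nothing for an annotation-based filter
    to remove on this page.
    """
    if "/Annots" not in page:
        return []
    subtypes = []
    for annot in page["/Annots"]:
        # Array entries may be indirect references; resolve them if possible
        obj = annot.get_object() if hasattr(annot, "get_object") else annot
        subtypes.append(obj["/Subtype"] if "/Subtype" in obj else None)
    return subtypes
```

If this returns an empty list for every page of the reader, the highlighting is not stored as an annotation, and no annotation filter would reach it.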

pubpub-zz commented 9 months ago

From what I've seen, this file does not contain annotations but just some drawings. There is no global solution I could imagine to remove them. The proposed "/SMask" approach does not seem like a good idea either. The only solution I would propose would be to loop through the resources and delete the images "FXX1".

whitesnakeftw commented 9 months ago

From what I've seen, this file does not contain annotations but just some drawings. There is no global solution I could imagine to remove them. The proposed "/SMask" approach does not seem like a good idea either. The only solution I would propose would be to loop through the resources and delete the images "FXX1".

I think I'm able to loop through the resources, but what would a deletion command for that look like with the current code?

Wouldn't it be necessary to have an ImageType.FXX1 defined, and specific code to handle it inside clean_forms?

whitesnakeftw commented 9 months ago

I managed to delete the unwanted objects like this:

from pypdf import PdfReader, PdfWriter

def remove_specific_xobjects(page, name_fragment):
    # Drop every XObject whose resource name contains name_fragment
    if '/XObject' in page['/Resources']:
        xobjects = page['/Resources']['/XObject']
        specific_xobjects = [key for key in xobjects.keys() if name_fragment in key]
        for key in specific_xobjects:
            del xobjects[key]  # Remove the identified XObjects

reader = PdfReader("input.pdf")
writer = PdfWriter()

# Iterate through each page of the PDF
for page in reader.pages:
    remove_specific_xobjects(page, '/FXX1')  # Remove XObjects containing "/FXX1" from the page's resources
    writer.add_page(page)

# Write the modified PDF to a new file
writer.write('output.pdf')

but it's of course a rough way of doing it because it produces a damaged PDF that needs to be repaired (used Ghostscript to rebuild it). It would be nice if something like this could be implemented in the writer in a proper way.
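One plausible reason for the damage, stated here as an assumption rather than something confirmed in this thread, is that the page content streams still paint the removed names with the Do operator after the resource entries are gone. The sketch below is a rough heuristic for spotting such leftovers: dangling_xobject_refs is a hypothetical helper that scans already-decoded content stream bytes with a regular expression, so it can be fooled by similar byte sequences inside string literals or inline image data.

```python
import re

# Heuristic: a PDF name token followed by the Do (paint XObject) operator
_DO_PATTERN = re.compile(rb"(/[^\s/\[\]<>(){}%]+)\s+Do\b")

def dangling_xobject_refs(content, resource_names):
    """Return XObject names painted in `content` but absent from resources.

    `content` is the decoded content stream as bytes; `resource_names`
    is an iterable of names such as "/Im0" still present in /XObject.
    """
    used = {name.decode("latin-1") for name in _DO_PATTERN.findall(content)}
    return sorted(used - set(resource_names))
```

With pypdf, the decoded bytes could presumably be obtained from page.get_contents().get_data() and the remaining names from the keys of page['/Resources']['/XObject']. A non-empty result after the deletion would support the assumption above.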

pubpub-zz commented 9 months ago

I agree, my proposal was not good because of the damage it causes. Here is a new proposal:

from pypdf import PdfWriter
from PIL import Image

w = PdfWriter(clone_from="laradiceuncompressed.pdf")
for p in w.pages:
    for i in p.images:
        if "FXX" in i.name:
            i.replace(Image.new("RGBA",(1,1)))

w.write("output.pdf")

this one should be good

whitesnakeftw commented 9 months ago

@pubpub-zz Looks neat! The only problem I have with this is that compression doesn't seem efficient when writing the output file. It reduces the size to about 1/4 of the starting size (193 MiB to 49 MiB), but that's still a lot compared to the original compressed file with the FXX1 images (1.17 MiB) or to what Ghostscript produces after my remove_specific_xobjects() function (364 KiB).

I also tried to add

p.compress_content_streams(level=9)

at the end of the outer for loop, but it didn't seem to make a difference.

I realize this might have nothing to do with the original issue, so feel free to close this. Thank you. :)

Edit: if I choose the original compressed PDF as input file, the output is just 683 KiB, so pypdf behaves correctly in that scenario. The original PDF was first uncompressed with pdftk, which generated the 193 MiB file, so of course it would be pdftk's job to recompress it again, not pypdf's. Sorry for the hassle.

stefan6419846 commented 8 months ago

@pubpub-zz Is there something we should document/implement here or do you consider this resolved?

pubpub-zz commented 8 months ago

The annotation is not actually an annotation. It is just some painting over the text. I don't think any documentation is required. I did not see the edit. We can close it 😀