Closed whitesnakeftw closed 8 months ago
Would https://github.com/py-pdf/pypdf/pull/1831 solve your issue?
It doesn't seem to work. I used @MrTomRod 's _utils.py and _writer.py (also had to import logger_error
to make it compatible with current pypdf) and ran:
writer = PdfWriter()
writer.clone_document_from_reader(reader)
writer.remove_annotations(subtypes=None)
writer.remove_annotations(annotation_filter_function="\SMask")
But the highlighting is still there. I'm guessing that's because annotation_filter_function is a further filter on top of subtypes, meaning we're only filtering inside /Annots, but in my document there are no /Annots at all.
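A quick way to confirm that guess is to count the /Annots entries per page. The following is only a minimal sketch: annotation_counts is a hypothetical helper, not part of pypdf, and the demo pages are plain dicts standing in for real pages. It assumes pypdf pages behave like dictionaries, so reader.pages could be passed instead.

```python
from typing import Any, Iterable, List, Mapping


def annotation_counts(pages: Iterable[Mapping[str, Any]]) -> List[int]:
    """Return the number of /Annots entries on each page (0 if the key is absent)."""
    counts = []
    for page in pages:
        # Membership test plus indexing works for plain dicts and,
        # by assumption, for dictionary-like pypdf page objects.
        counts.append(len(page["/Annots"]) if "/Annots" in page else 0)
    return counts


# Stand-in pages for illustration only (hypothetical data).
demo_pages = [
    {"/Type": "/Page"},
    {"/Type": "/Page", "/Annots": [{"/Subtype": "/Highlight"}]},
]
print(annotation_counts(demo_pages))  # prints [0, 1]
```

If every count is 0, no annotation filter can match anything, whatever function is passed.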
Given my limited understanding of the code, I'm not sure whether defining my own function, like the example @MrTomRod provided, would make a difference:
def is_google_link(
    page: DictionaryObject,
    annotation: ArrayObject,
    obj: DictionaryObject
) -> bool:
    try:
        uri = obj['/A']['/URI']
        return uri.startswith('https://google.com/')
    except KeyError:
        return False
but I'm guessing probably not. I believe an option similar to annotation_filter_function should possibly be implemented inside remove_objects_from_page().
From what I've seen, this file does not contain annotations, just some drawings. I can't think of a global solution to remove them. The proposed "/SMask" filter does not seem like a good idea either. The only solution I would propose is to loop through the resources and delete the images "FXX1".
I think I'm able to loop through the resources, but what would a deletion command for that look like with the current code? Wouldn't it be necessary to have an ImageType.FXX1 defined, and specific code to handle it inside clean_forms?
I managed to delete the unwanted objects like this:
from pypdf import PdfReader, PdfWriter

def remove_specific_xobjects(page, name_fragment):
    # Delete every XObject whose resource name contains name_fragment
    if '/XObject' in page['/Resources']:
        xobjects = page['/Resources']['/XObject']
        specific_xobjects = [key for key in xobjects.keys() if name_fragment in key]
        for key in specific_xobjects:
            del xobjects[key]  # Remove the identified XObjects

reader = PdfReader("input.pdf")
writer = PdfWriter()
# Iterate through each page of the PDF
for page in reader.pages:
    remove_specific_xobjects(page, '/FXX1')  # Remove XObjects containing "/FXX1" from the page's resources
    writer.add_page(page)
# Write the modified PDF to a new file
writer.write('output.pdf')
but it's of course a rough way of doing it because it produces a damaged PDF that needs to be repaired (used Ghostscript to rebuild it). It would be nice if something like this could be implemented in the writer in a proper way.
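One plausible reason for the damage (an assumption, not something confirmed in this thread) is that the page content stream still contains "Do" operators that paint the deleted XObjects. The sketch below shows the filtering idea on a list of (operands, operator) pairs. drop_xobject_draws is a hypothetical helper and demo_ops is made-up data; applying it to a real page would mean reading and writing back the page's content stream operations, which is not shown here.

```python
from typing import Any, List, Sequence, Tuple

Operation = Tuple[Sequence[Any], bytes]  # (operands, operator)


def drop_xobject_draws(operations: List[Operation], name_fragment: str) -> List[Operation]:
    """Return the operations without the Do calls that paint a matching XObject."""
    kept = []
    for operands, operator in operations:
        # "Do" paints the XObject named by its single operand.
        if operator == b"Do" and operands and name_fragment in str(operands[0]):
            continue
        kept.append((operands, operator))
    return kept


# Hypothetical operations for illustration only.
demo_ops = [
    (["/FXX1"], b"Do"),
    (["/Im0"], b"Do"),
    ([], b"Q"),
]
cleaned = drop_xobject_draws(demo_ops, "/FXX")
print(cleaned)
```

Only the "/FXX1" paint call is dropped; the other image and the remaining operators are kept.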
I agree, my proposal was not good because of the damage. Here is a new proposal:
from pypdf import PdfWriter
from PIL import Image

w = PdfWriter(clone_from="laradiceuncompressed.pdf")
for p in w.pages:
    for i in p.images:
        if "FXX" in i.name:
            i.replace(Image.new("RGBA",(1,1)))
w.write("output.pdf")
this one should be good
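The replacement above selects images by name ("FXX"). When the names are not known in advance, it may help to first list the image XObjects that carry an /SMask entry, keeping in mind the caution earlier in this thread that /SMask is not a reliable global criterion. This is a rough sketch for inspection only: image_names_with_smask is a hypothetical helper, the demo page is a plain dict, and it assumes /Resources sits directly on the page rather than being inherited from a parent node.

```python
from typing import Any, List, Mapping


def image_names_with_smask(page: Mapping[str, Any]) -> List[str]:
    """List the names of image XObjects on the page that have an /SMask entry."""
    if "/Resources" not in page:
        return []
    resources = page["/Resources"]
    if "/XObject" not in resources:
        return []
    xobjects = resources["/XObject"]
    names = []
    for name in xobjects:
        obj = xobjects[name]
        is_image = "/Subtype" in obj and obj["/Subtype"] == "/Image"
        if is_image and "/SMask" in obj:
            names.append(name)
    return names


# Stand-in page for illustration only (hypothetical data).
demo_page = {
    "/Resources": {
        "/XObject": {
            "/FXX1": {"/Subtype": "/Image", "/SMask": {}},
            "/Im0": {"/Subtype": "/Image"},
        }
    }
}
print(image_names_with_smask(demo_page))  # prints ['/FXX1']
```

The resulting names can then be compared by eye with the ones matched by the "FXX" check before replacing anything.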
@pubpub-zz Looks neat! The only problem I have with this is that compression doesn't seem efficient when writing the output file. It reduces the size to about 1/4 of the starting size (193 MiB to 49 MiB), but that's still a lot compared to the original compressed file with the FXX1 images (1.17 MiB) or to what Ghostscript produces after my remove_specific_xobjects() function (364 KiB).
I also tried adding p.compress_content_streams(level=9) at the end of the outer for loop, but it didn't seem to make a difference.
I realize this might have nothing to do with the original issue, so feel free to close this. Thank you. :)
Edit: if I choose the original compressed PDF as input file, the output is just 683 KiB, so pypdf behaves correctly in that scenario. The original PDF was first uncompressed with pdftk, which generated the 193 MiB file, so of course it would be pdftk's job to recompress it again, not pypdf's. Sorry for the hassle.
@pubpub-zz Is there something we should document/implement here or do you consider this resolved?
The annotation is not actually an annotation. It is just some painting over the text. I don't think any documentation is required. I did not see the edit. We can close it 😀
Explanation
In this PDF, PdfWriter.remove_annotations() doesn't succeed in removing the highlighting because apparently it is stored as an image ('/Subtype': '/Image', '/Type': '/XObject'). PdfWriter.remove_images(to_delete=ImageType.XOBJECT_IMAGES) succeeds of course, but it also removes the actual images. What distinguishes the two is the /SMask attribute in the highlighting. Now, I can easily fix the problem by running a regex that removes everything that's between "obj" and "endobj" when /SMask is found and then repairing the resulting PDF:
But I can't find a way to get pypdf to remove just the objects that have /SMask. It would be nice if we could remove all objects that have a particular ImageAttributes. Or maybe make PdfWriter.remove_annotations(subtypes=None) also remove all objects that have /SMask (I have no idea if /SMask is also used for something else though).
Code Example
Possibly something like this: