pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0

How to remove a *UNUSED* font from a PDF file using pyMuPdf? #313

Closed bschollnick closed 5 years ago

bschollnick commented 5 years ago

Howdy...

There are examples in pyMuPDF (e.g. https://github.com/pymupdf/PyMuPDF/wiki/How-to-Extract-Fonts-from-a-PDF) that list the fonts in a PDF file.

But I'm in an odd situation where a small subset of PDF files were somehow created in such a way that they request the Asian font pack to be installed. But these documents do NOT use the fonts.

So, is there a method where we could use pyMuPDF to remove the font declaration in these PDF files?

I've manually edited the file, and it appears that simply removing this object / stream:

13 0 obj <</BaseFont /#82l#82r#96#BE#92#A9 /CIDSystemInfo <</Ordering (Japan1) /Registry (Adobe) /Supplement 2>> /DW 1000 /FontDescriptor 14 0 R /Subtype /CIDFontType2 /Type /Font /W [231 389 500 631 631 500] /WinCharSet 128>> endobj

is enough to resolve the problem, which reinforces my belief that the font package is not being used.

Any advice, or pointers would be welcome.

[Screenshot: Document Properties dialog]

bschollnick commented 5 years ago

I forgot to mention that I have attempted to use doc.write(garbage=3, deflate=1, clean=1), with no luck.
I also tried FileOptimizer64, with no luck either.

bschollnick commented 5 years ago

Using getXrefStream, I am able to find the graphics in the file.
But 90% of the XrefStreams are "empty", only returning None, and I can't identify the fonts from the streams that are being returned. (I had hoped that I could just update the stream with the font, and turn it into an "empty" stream.)

bschollnick commented 5 years ago

Okay, am I missing an antithesis to "insertFont"? Eg. removeFont?

It would be fairly simple to just run removeFont on each page, and then save the file...

bschollnick commented 5 years ago

Document._deleteObject(xref) looks promising... Now to try to find the font's xref number.

bschollnick commented 5 years ago

Alright.

Self solved. Using a mixture of the Font Listing Demo/example.

For anyone else wanting to do this:

  1. Load the document (e.g. doc = fitz.open("<filename>"))
  2. Load the first page (e.g. page = doc.loadPage(0))
  3. Assumption: the font is on all pages of the document. If not, you will need to step through the document until you find a page that has the font. In our case, it was present on all pages. Deleting it on the first page appears to remove it from all pages.
  4. Get the font list from the page (e.g. fonts = page.getFontList())
  5. Iterate through the font list, checking element 5 of each entry for the font signature (e.g. '90ms-RKSJ-H')
  6. If the signature is found, delete the object from the PDF (e.g. doc._deleteObject(xref), where xref is element 0 of the entry from step 5)
  7. If anything was deleted, save the PDF again, either under the original filename or under a different one.
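
The steps above can be sketched as a small function. This is only an illustration, not an official recipe: the function name delete_fonts_by_encoding is made up, and it assumes a document object that offers the (old-style) getPageFontList and _deleteObject methods used in this thread.

```python
def delete_fonts_by_encoding(doc, encodings, pno=0):
    """Delete font objects listed on page `pno` whose element 5 matches.

    Assumption: `doc` is an open PyMuPDF document exposing the method
    names used in this thread (getPageFontList, _deleteObject).
    Returns the list of xref numbers that were deleted.
    """
    wanted = {name.upper() for name in encodings}
    deleted = []
    for entry in doc.getPageFontList(pno):  # step 4: fonts of the page
        xref, signature = entry[0], entry[5]
        if signature.upper() in wanted:     # step 5: compare element 5
            doc._deleteObject(xref)         # step 6: delete the font object
            deleted.append(xref)
    return deleted
```

After opening a file with fitz.open, one would call delete_fonts_by_encoding(doc, ['90ms-RKSJ-H']) and save the document only if the returned list is not empty (step 7).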

I need to compliment you again on pyMuPDF. So far it has always been able to solve all our problems. Sometimes that has required quite a bit of digging, but it's a fantastic package.

JorjMcKie commented 5 years ago

Hi there, you were faster than I! Great job! And thanks for the kind feedback ... Did the binary font files also get deleted with your approach?

Just want to make you aware that PDF object definitions are now returned in a format that is easier to interpret (line breaks, indentations, fixed number of spaces between tokens and around brackets, ...):

>>> xref=doc[15].xref  # xref of page 15
>>> print(doc._getXrefString(xref))
<<
  /Type /Page
  /Contents 655 0 R
  /Resources 653 0 R
  /MediaBox [ 0 0 612 792 ]
  /Parent 636 0 R
>>

bschollnick commented 5 years ago

They were not embedded fonts, which was the problem. Acrobat was refusing to show the PDF, until the font pack was installed, and the Citrix Server team didn't want to install the Font pack on the Citrix cloud.

bschollnick commented 5 years ago

Okay, it looks like there might be an issue caused by deleting the xrefs? Or have I implemented it improperly?

import fitz


def get_all_fonts(pdf_doc=None, filename=None):
    """
    pdf_doc - the pyMuPDF document
    filename - opened with fitz.open() if no document is given
    """
    if pdf_doc is None:
        if filename is None:
            return None
        pdf_doc = fitz.open(filename)

    font_list = []
    for pgno in range(pdf_doc.pageCount):
        for entry in pdf_doc.getPageFontList(pgno):
            # element 5 of each entry is the signature we match on
            if entry[5].upper() not in font_list:
                font_list.append(entry[5].upper())
    return font_list


def remove_font(pdf_doc=None, filename=None, font_name_list=None):
    r"""
    pdf_doc - the pyMuPDF document
    font_name_list - a list of strings to match for element 5

    example:

    >>> import pdf_utilities
    >>> filename = r"C:\15383_01-23-2019_g2documentupld_Insurance_Card.pdf"
    >>> import fitz
    >>> doc = fitz.open(filename)
    >>> font_list = ['90ms-RKSJ-V', '90ms-RKSJ-H']
    >>> doc = pdf_utilities.remove_font(doc, font_name_list=font_list)
    >>> doc.save("test2.pdf")
    """
    if pdf_doc is None:
        if filename is None:
            return None
        pdf_doc = fitz.open(filename)

    # compare case-insensitively; None means "no names given"
    font_name_list = [x.upper() for x in (font_name_list or [])]
    for pgno in range(pdf_doc.pageCount):
        for entry in pdf_doc.getPageFontList(pgno):
            if entry[5].upper() in font_name_list:
                pdf_doc._deleteObject(entry[0])  # entry[0] is the font's xref
    return pdf_doc

If I use remove_font to remove the fonts from a document, as in the example in the docstring above, and then call doc.getPageFontList(0) (or load the document after it's been saved), I receive:

>>> doc = fitz.open("Test2.pdf")
>>> doc.getPageFontList(0)
warning: not a font dict (12 0 R)
warning: not a font dict (13 0 R)
[]

So I'm unclear on how to resolve that issue... Any suggestions or advice would be appreciated...

JorjMcKie commented 5 years ago

Bad internet – again (5th time just today ☹). So submitting an e-mail in the hope it will get through some time …

Oh no, you did well - just not enough :-)

I am amazed, btw, that you see these messages. They should be suppressed and stored away in an area accessible via fitz.TOOLS.fitz_stderr.

The reason is that the PDF page definition still references the font's xref number. Example from an arbitrary PDF:

>>> page = doc[0]
>>> print(doc._getXrefString(page.xref))
<<
  /Contents 40 0 R
  /Type /Page
  /MediaBox [ 0 0 595.32 841.92 ]
  /Rotate 0
  /Parent 12 0 R
  /Resources <<
    /ExtGState <<
      /R7 26 0 R
    >>
    /Font <<  % the following will not auto-disappear if one of the font xrefs is deleted
      /R8 27 0 R
      /R10 21 0 R
      /R12 24 0 R
      /R14 15 0 R
      /R17 4 0 R
      /R20 30 0 R
      /R23 7 0 R
      /R27 20 0 R
    >>
    /ProcSet [ /PDF /Text ]
  >>
  /Annots [ 55 0 R ]
>>

You could update this object after deleting the line that references the deleted font xref, e.g.

page_obj_lines = doc._getXrefString(page.xref).splitlines()  # read page obj as a list of lines
new_lines = []  # receives the lines without reference to the font
for l in page_obj_lines:
    if " %i 0 R" % font_xref in l:  # skip line with the font reference
        continue
    new_lines.append(l)
new_page_obj = "\n".join(new_lines)
doc._updateObject(page.xref, new_page_obj)

Jorj

bschollnick commented 5 years ago

Hmm... Okay, I am following you. The font references are still in place, but they now effectively point at nothing... I'm really surprised that Adobe Acrobat isn't complaining when opening the files... (But it has a habit of silently fixing documents that have subtle and non-fatal errors.)

But I must be missing something, since I'm not seeing the font in the page.xref object?

>>> import fitz
>>> filename = r"C:\Users\bschollnick.URMC-SH\Desktop\15383_01-23-2019_test.pdf"
>>> doc = fitz.open(filename)
>>> doc.getPageFontList(0)
[[12, 'n/a', 'Type0', '?l?r????', 'F10', '90ms-RKSJ-H'], [13, 'n/a', 'Type0', '?l?r????', 'F11', '90ms-RKSJ-V']]
>>> page = doc[0]
>>> print(doc._getXrefString(page.xref))
<</Type/Page/MediaBox[0 0 237 131]/Rotate 8 0 R/Contents 6 0 R/Resources 7 0 R/Parent 4 0 R>>
>>> page2 = doc[1]
>>> print(doc._getXrefString(page2.xref))
<</Type/Page/MediaBox[0 0 237 131]/Rotate 19 0 R/Contents 17 0 R/Resources 18 0 R/Parent 4 0 R>>

The fonts are in xref 12 and 13, and I'm not seeing those in either of the page objects?

Now, exploring, if I examine the xrefs directly, not at the page level, I can find the references..

eg.

>>> doc._getXrefString(7)
'<</Font<</F10 12 0 R/F11 13 0 R>>/XObject<</Im0 9 0 R>>/ProcSet[/PDF/Text/ImageB]>>'

I'm not seeing an obvious way to update the doc level xrefs?

But after experimenting, this appears to work.

font_name_list = [x.upper() for x in font_name_list]
for pgno in range(0, pdf_doc.pageCount):
    page = pdf_doc.loadPage(pgno)
    listings = page.getFontList()
    for entry in listings:
        if entry[5].upper() in font_name_list:
            pdf_doc._deleteObject(entry[0])
            page._cleanContents()

Calling page._cleanContents() seems to be removing the reference. Is there something that I am missing that could backfire on me with using _cleanContents()?

- Benjamin

JorjMcKie commented 5 years ago

Hm ... I was afraid of seeing this.

You were unfortunate enough to run into a PDF feature: in PDF, the value of a dictionary key (/Resources is one such key) does not have to be stored directly in that dictionary. The dictionary may instead hold an indirect reference to another object, identified by its xref number. So you have /Resources 7 0 R. PDF resources are dictionaries themselves, and the respective value has been put in xref 7. There you will find the font reference you want to delete.

And you may be even more unfortunate: object 7 may in turn make use of this indirect reference feature, and you might find that the value of the /Font key there (again a PDF dictionary) points to yet another xref ... :-(
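
To illustrate the indirect reference feature, here is a rough sketch that inspects an object's source text (as returned by doc._getXrefString) and reports whether a key's value is an indirect reference. The helper name indirect_target is made up for this example, and it assumes generation number 0, as in all objects shown in this thread. It uses plain string matching, not a real PDF parser.

```python
import re


def indirect_target(obj_source, key):
    """Return the xref number if `key`'s value is written as 'N 0 R'.

    Returns None if the key is missing or its value is stored inline.
    Assumption: generation number 0; text matching only, no PDF parsing.
    """
    pattern = r"/" + re.escape(key) + r"\s+(\d+)\s+0\s+R(?![A-Za-z0-9])"
    match = re.search(pattern, obj_source)
    return int(match.group(1)) if match else None


# Object sources copied from the outputs shown earlier in this thread.
page_obj = ("<</Type/Page/MediaBox[0 0 237 131]/Rotate 8 0 R"
            "/Contents 6 0 R/Resources 7 0 R/Parent 4 0 R>>")
res_obj = ("<</Font<</F10 12 0 R/F11 13 0 R>>/XObject<</Im0 9 0 R>>"
           "/ProcSet[/PDF/Text/ImageB]>>")

print(indirect_target(page_obj, "Resources"))  # resources live in another object
print(indirect_target(res_obj, "Font"))        # /Font is stored inline here
```

For the page object the helper reports xref 7, and for object 7 it reports None, because the /Font dictionary is written inline there. Following such references in a loop until the result is None leads to the object that actually holds the font entries.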

JorjMcKie commented 5 years ago

Maybe in your case it is faster to just scan over all XREFs of the document and delete the occurrence of the unwanted font. Could work like so:

xreflen = doc._getXrefLength()  # number of all xrefs
for xref in range(1, xreflen):  # do not use xref 0!
    indicator = False  # set to true if font xref is referenced
    new_obj_lines = []
    obj_lines = doc._getXrefString(xref).splitlines()  # get xref source split in lines
    for line in obj_lines:
        if " %i 0 R" % font_xref in line:
            indicator = True  # update this object!
            continue
        new_obj_lines.append(line)

    if indicator is True:  # must update this object
        new_obj = "\n".join(new_obj_lines)
        doc._updateObject(xref, new_obj)

doc.save ...

This should spare you a lot of complexity I suppose. The duration is probably a bit longer because all xrefs are scanned ...
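
One caveat with the line-based filter: it relies on each dictionary entry sitting on its own line. If the object source comes back in compact single-line form, as in the doc._getXrefString(7) output earlier in this thread, dropping the matching line would drop the whole dictionary. A possible alternative is to remove only the "/Name N 0 R" pair itself. The following is a sketch under those assumptions (the helper name strip_references is made up, generation number 0 is assumed, and it works by text matching rather than real PDF parsing):

```python
import re


def strip_references(obj_source, xref):
    """Remove every '/Name <xref> 0 R' pair from a PDF object source.

    Returns (new_source, number_of_pairs_removed).
    Assumption: the reference is a dictionary value with generation 0.
    """
    pattern = re.compile(
        rf"/[^\s/<>\[\]()]+\s+{int(xref)}\s+0\s+R(?![A-Za-z0-9])\s*"
    )
    return pattern.subn("", obj_source)


# Compact resources object as shown earlier in this thread.
resources = ("<</Font<</F10 12 0 R/F11 13 0 R>>/XObject<</Im0 9 0 R>>"
             "/ProcSet[/PDF/Text/ImageB]>>")
new_source, removed = strip_references(resources, 12)
print(removed)
print(new_source)
```

Inside the loop over all xrefs, an object would then be rewritten with doc._updateObject(xref, new_source) only when the returned count is greater than zero.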

bschollnick commented 5 years ago

What about the page._cleanContents() workaround that I used? It seems to be working without any issue, and seems to be fairly clean.

font_name_list = [x.upper() for x in font_name_list]
for pgno in range(0, pdf_doc.pageCount):
    page = pdf_doc.loadPage(pgno)
    listings = page.getFontList()
    for entry in listings:
        if entry[5].upper() in font_name_list:
            pdf_doc._deleteObject(entry[0])
            page._cleanContents()

I don't mind scanning all the xrefs as you indicated, I'm just curious if there is some reason we shouldn't go with the cleanContents functionality, it seems to be resolving the issue without the complexity... (Of course, that complexity is in the library, and not my code)

JorjMcKie commented 5 years ago

No, perfect if cleanContents() works. To be honest, I am a little surprised that it does so seamlessly ... But as per its description it should do exactly this type of job. It is also based directly on MuPDF C routines, so it deserves confidence. BTW the same happens if you use the CLI command mutool clean -c ..., perhaps combined with garbage collection like mutool clean -cggg .... But packing everything into one and the same Python script is of course more elegant and compact.

I have seen cleanContents() produce unreadable PDF pages on rare occasions, that is where my reservation comes from I guess. But if it works - perfect!

Keep my workaround as a backup if you want.

JorjMcKie commented 5 years ago

Some users also value incremental saves a lot and consequently dislike large update deltas. In that case, too, my code snippet may be helpful because it only changes what is required, whereas cleanContents() changes the page object and always rewrites, or rather combines, the page's /Contents object(s), thus creating a larger delta.

JorjMcKie commented 5 years ago

Maybe you want to write a little recipe and put it on the Wiki? I would also include your contribution in the next release of the documentation ...

bschollnick commented 5 years ago

Sure I'll take a look at writing it up... Be glad to help...