pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0
5.52k stars 517 forks

misc #595

Closed harveyspecter09 closed 4 years ago

harveyspecter09 commented 4 years ago

Can we embed new fonts in an existing PDF using this package?

JorjMcKie commented 4 years ago

Sure. By using one of the text insertion methods, you also reference a font one way or another. That font will then be automatically embedded in the PDF upon save.

harveyspecter09 commented 4 years ago

@JorjMcKie thanks for your quick response. To be precise, below are my concerns. I would appreciate your solution at your earliest convenience.

  1. I will send a real-time PDF file as my input, which has both embedded and unembedded fonts.
     Expected Output 1: Can we have all fonts embedded?
     Expected Output 2: If there is, say, an Arial Narrow-Bold Type 1 font, can we change it to an Arial Narrow-Bold TrueType font?
     Expected Output 3: Can we change a specific paragraph from one font to another?

JorjMcKie commented 4 years ago

  1. The only unembedded fonts supported by PyMuPDF are the so-called Base-14 fonts: Courier, Helvetica and Times-Roman, each in the weights normal, italic, bold and bold-italic (12 in total), plus ZapfDingbats and Symbol, totalling 14. All other fonts are always embedded. You can also use text insertion capabilities that use embeddable versions of these Base-14 fonts. This must then be done using the fitz.TextWriter class.
  2. & 3. No. Although PyMuPDF has a number of low-level PDF features that would probably lead to a solution close to this, you should be aware that text positioning in each PDF is highly dependent on minute glyph metrics, which do not lend themselves to a simple replacement.

JorjMcKie commented 4 years ago

For topic 3 there exists some sort of an "approximation" (!) when you use redaction annotations:

  1. Locate the rectangle of that paragraph.
  2. Extract the text of that paragraph using page.getText("dict"), which delivers all properties of each text piece ("span"). Subselect the text pieces contained in that paragraph.
  3. Add a redaction annotation comprising the paragraph's rectangle, and then execute page.apply_redactions(). This will remove the paragraph (or, rather everything inside its rectangle).
  4. Re-insert the paragraph's text from those data saved before, this time using the desired font. This will position each text piece where it has been before ... but in general it may not have the same length.
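
The four steps above can be sketched roughly as follows. This is only an illustration under some assumptions: it uses a recent PyMuPDF in which the extraction method is spelled page.get_text("dict") rather than page.getText("dict"), the helper names (srgb_to_rgb, spans_in_rect, replace_paragraph_font) are made up for this example, and "cour" (the built-in Courier) merely stands in for whatever replacement font is wanted. Text widths will generally differ after the change, as noted in step 4.

```python
def srgb_to_rgb(color):
    """Convert a span's integer sRGB color to an (r, g, b) tuple of floats in 0..1."""
    return ((color >> 16) & 255) / 255, ((color >> 8) & 255) / 255, (color & 255) / 255


def spans_in_rect(page_dict, rect):
    """Step 2: collect the spans whose bbox lies fully inside rect = (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = rect
    found = []
    for block in page_dict.get("blocks", []):
        for line in block.get("lines", []):  # image blocks carry no "lines"
            for span in line.get("spans", []):
                sx0, sy0, sx1, sy1 = span["bbox"]
                if sx0 >= x0 and sy0 >= y0 and sx1 <= x1 and sy1 <= y1:
                    found.append(span)
    return found


def replace_paragraph_font(page, rect, fontname="cour"):
    """Steps 2-4: save the spans, redact the rectangle, re-insert the text."""
    import fitz  # PyMuPDF, imported lazily so the helpers above work without it

    rect = fitz.Rect(rect)  # step 1: the paragraph rectangle is supplied by the caller
    spans = spans_in_rect(page.get_text("dict"), tuple(rect))
    page.add_redact_annot(rect)
    page.apply_redactions()  # removes everything inside the rectangle
    for span in spans:
        page.insert_text(
            span["origin"],  # original starting point of the span's baseline
            span["text"],
            fontname=fontname,
            fontsize=span["size"],
            color=srgb_to_rgb(span["color"]),
        )
    return len(spans)
```

A call might look like replace_paragraph_font(doc[0], (50, 100, 400, 200)) followed by doc.save("out.pdf"), where doc and the rectangle coordinates are placeholders.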

JorjMcKie commented 4 years ago

The approach in the previous post can of course also be used for the full page:

  1. Extract the text using page.getText("dict").
  2. Write the extracted text to the corresponding page of the output, using replacement fonts to your liking. The dictionary extracted in step 1 looks like this. Special care would be required for images of course ... and some more special cases. The whole approach practically amounts to rebuilding the PDF ...

You would probably need a table upon which you would base font replacement decisions.
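
Such a table could be as simple as a dictionary keyed by the font name found in each span. The sketch below is an assumption about how it might look, not part of PyMuPDF: FONT_REPLACEMENTS and replacement_for are hypothetical names, and the mapping itself is an arbitrary example. The values are PyMuPDF's short codes for its built-in Base-14 fonts ("cour", "cobo", "helv", "hebo").

```python
# Hypothetical replacement table: source font name -> PyMuPDF built-in font code.
FONT_REPLACEMENTS = {
    "Helvetica": "cour",       # Helvetica      -> Courier
    "Helvetica-Bold": "cobo",  # Helvetica-Bold -> Courier-Bold
    "Courier": "helv",         # Courier        -> Helvetica
    "Courier-Bold": "hebo",    # Courier-Bold   -> Helvetica-Bold
}


def strip_subset_prefix(font_name):
    """Drop a subset tag such as 'ABCDEF+' that embedded subset fonts carry."""
    prefix = font_name[:6]
    if len(font_name) > 7 and font_name[6] == "+" and prefix.isalpha() and prefix.isupper():
        return font_name[7:]
    return font_name


def replacement_for(font_name, default="helv"):
    """Look up the replacement font, falling back to a default for unknown names."""
    return FONT_REPLACEMENTS.get(strip_subset_prefix(font_name), default)
```

The result of replacement_for(span["font"]) could then be passed as the fontname when the span's text is written to the output page.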

JorjMcKie commented 4 years ago

Here is a quick draft of something that reads a PDF and writes a new PDF with the following features:

repl-fonts.zip

Use it as a starting point. The following aspects are not (yet) covered:

It can probably be extended to arrive at a pretty good approximation of your intentions.

harveyspecter09 commented 4 years ago

@JorjMcKie I appreciate your feedback, thanks a lot.

harveyspecter09 commented 4 years ago

Hi @JorjMcKie, I hope you are doing well today. I am a learner just out of college.

I have tried to build a piece of code with your fitz package. Perhaps you can take a look; I would be grateful for your suggestions.

Input: PDF (Helvetica, Helvetica-Bold)
Expected Output: PDF (Courier-Bold, Helvetica-Bold)
Actual Output: PDF (['Font Type: Type0, Font Name: Courier-Bold, Encoding: Identity-H', 'Font Type: Type0, Font Name: (null), Encoding: Identity-H']), which is also missing some characters compared to the input.

Is there any way you can help me successfully fetch all data, including images and drawings?

Scenario: Given a PDF, read the current font embedding and convert its encoding from one format to another. Using your awesome package, I am able to read the current font embedding, but I am unsure how to change encodings. Any leads on implementing this?

Scenario: Given a PDF, raise exceptions for irregular font encoding (custom encoding) or irregular font embedding (unembedded fonts) for further processing.
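
One possible shape for such a check is sketched below; it is not taken from the attached script. It assumes the font entries come from PyMuPDF's page.get_fonts(), whose tuples start with (xref, ext, type, basefont, name, encoding), and it treats an ext of "n/a" as "not embedded or not extractable", which also covers cases such as Type 3 fonts. FontCheckError, STANDARD_ENCODINGS and check_fonts are hypothetical names, and the set of encodings regarded as regular is only an example.

```python
class FontCheckError(Exception):
    """Raised when a page references fonts that need further processing."""


# Assumption: encodings regarded as "regular" for this check; adjust as needed.
STANDARD_ENCODINGS = {
    "WinAnsiEncoding",
    "MacRomanEncoding",
    "MacExpertEncoding",
    "StandardEncoding",
    "Identity-H",
    "Identity-V",
}


def check_fonts(font_entries):
    """Raise FontCheckError if any font looks unembedded or custom-encoded."""
    problems = []
    for entry in font_entries:
        # Only the first six fields are used; longer tuples are tolerated.
        xref, ext, ftype, basefont, name, encoding = entry[:6]
        if ext == "n/a":
            problems.append(f"{basefont} (xref {xref}): not embedded or not extractable")
        if encoding and encoding not in STANDARD_ENCODINGS:
            problems.append(f"{basefont} (xref {xref}): custom encoding {encoding!r}")
    if problems:
        raise FontCheckError("; ".join(problems))
```

It could be run per page, for example check_fonts(page.get_fonts()) inside a loop over the document, with the exception caught wherever the further processing happens.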

embeddscript.zip

Thanks in advance.