tobwen commented 2 years ago

Is it possible to use a font that is already embedded in the PDF? The character to be replaced (space) is already subsetted.

Right now, I'm manually changing the dict of an unpacked PDF and sanitizing it afterwards... Doing it in PyMuPDF would save a lot of time.

JorjMcKie commented 2 years ago

You mean that you do not have the un-subsetted version of the font?

If so, the question cannot be answered definitively, because it depends on how the font subset has been built: if the glyphs were renumbered in the process, it may not produce usable results. In any case, you of course cannot use characters other than those contained in the subset.

Otherwise - and you seem to refer to the font replacement utilities - yes this is possible using these scripts: provide the full font version of the respective font. What I am confused about: what do you mean by "space is already replaced"?

tobwen commented 2 years ago

long story

When writing a document in Word and exporting it to PDF, the application always places useless spaces (control characters) in font Arial, e.g.

[
    ( 5, 'ttf', 'TrueType', 'ABCXYZ+MyriadPro-Regular',  'F0', 'WinAnsiEncoding', 0),
    ( 8, 'ttf', 'TrueType', 'ABCXYZ+MinionPro-Semibold', 'F1', 'WinAnsiEncoding', 0),
    (16, 'ttf', 'TrueType', 'ABCXYZ+ArialMT',            'F2', 'WinAnsiEncoding', 0)
]

This can't be turned off. To get rid of Arial, I'm using this hack.

hack

In the example, I'd replace /F2 with /F1 in the decoded PDF, sanitize the file and Arial is gone without a layout change. This works very well, since Myriad and Minion have the glyph space (32) already embedded/subsetted.

the problem

It seems like font-replacement wants the replacement font to be installed in the system. When I set it to this, the script complains that it cannot find the font:

[
    {
        "oldfont":[
            "ArialMT"
        ],
        "newfont":"MyriadPro-Regular",
        "info":"2 glyphs, size 8600, serifed, subset font"
    },
    {
        "oldfont":[
            "MinionPro-Semibold"
        ],
        "newfont":"keep",
        "info":"26 glyphs, size 11480, serifed, subset font"
    },
    {
        "oldfont":[
            "MyriadPro-Regular"
        ],
        "newfont":"keep",
        "info":"18 glyphs, size 7128, serifed, subset font"
    }
]

JorjMcKie commented 2 years ago

Ah got you now. I have an even better solution for this special situation. In my case (a MS Word document exported to PDF), the useless font chosen was not Arial, but Helvetica:

from pprint import pprint

import fitz

doc = fitz.open("v110-changes.pdf")
page = doc[0]
pprint(page.get_fonts())
# Output:
# [(12, 'ttf', 'TrueType', 'FNUUTH+Calibri-Bold', 'R8', ''),
#  (13, 'ttf', 'TrueType', 'DOKBTG+Calibri', 'R10', ''),
#  (14, 'ttf', 'TrueType', 'NOHSJV+Calibri-Light', 'R12', ''),
#  (15, 'ttf', 'TrueType', 'NZNDCL+CourierNewPSMT', 'R14', ''),
#  (16, 'ttf', 'Type0', 'MNCSJY+SymbolMT', 'R17', 'Identity-H'),
#  (17, 'cff', 'Type1', 'UAEUYH+Helvetica', 'R20', 'WinAnsiEncoding'),
#  (18, 'ttf', 'Type0', 'ECPLRU+Calibri', 'R23', 'Identity-H'),
#  (19, 'ttf', 'Type0', 'TONAYT+CourierNewPSMT', 'R27', 'Identity-H')]
# ---------------------------------------------------------------------------
# the effect of the following: all properties of xref 13 (= font Calibri)
# are copied over to xref 17 (Helvetica), so Calibri is used where there
# was Helvetica before.
# ---------------------------------------------------------------------------
doc.xref_copy(13, 17)
doc.ez_save("x.pdf")  # saving with cleanup renumbers the xrefs, see below
doc.close()

doc = fitz.open("x.pdf")
page = doc[0]
pprint(page.get_fonts())
# Output:
# [(11, 'ttf', 'TrueType', 'FNUUTH+Calibri-Bold', 'R8', ''),
#  (12, 'ttf', 'TrueType', 'DOKBTG+Calibri', 'R10', ''), # also used below under the name R20!
#  (13, 'ttf', 'TrueType', 'NOHSJV+Calibri-Light', 'R12', ''),
#  (14, 'ttf', 'TrueType', 'NZNDCL+CourierNewPSMT', 'R14', ''),
#  (15, 'ttf', 'Type0', 'MNCSJY+SymbolMT', 'R17', 'Identity-H'),
#  (12, 'ttf', 'TrueType', 'DOKBTG+Calibri', 'R20', ''), # <=== used instead of Helvetica
#  (16, 'ttf', 'Type0', 'ECPLRU+Calibri', 'R23', 'Identity-H'),
#  (17, 'ttf', 'Type0', 'TONAYT+CourierNewPSMT', 'R27', 'Identity-H')]
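
The two xref numbers passed to doc.xref_copy() can also be looked up from the page.get_fonts() result instead of being read off by eye. A minimal sketch, assuming the tuple layout shown above (xref first, font name fourth); base_name and find_font_xref are hypothetical helper names, not part of PyMuPDF:

```python
import re

# A subset tag is six uppercase letters followed by "+", e.g. "DOKBTG+Calibri".
SUBSET_TAG = re.compile(r"^[A-Z]{6}\+")


def base_name(fontname):
    """Return the font name without its subset tag (hypothetical helper)."""
    return SUBSET_TAG.sub("", fontname)


def find_font_xref(fonts, name, fonttype=None):
    """Return the xref of the first font whose base name equals `name`.

    `fonts` is a list of tuples shaped like the page.get_fonts() output:
    (xref, ext, type, fontname, refname, encoding, ...).
    `fonttype` optionally restricts the match, e.g. to "TrueType".
    """
    for item in fonts:
        xref, _ext, ftype, fontname = item[:4]
        if base_name(fontname) == name and fonttype in (None, ftype):
            return xref
    raise ValueError(f"no font named {name!r} in this list")


# Three entries copied from the listing above.
fonts = [
    (13, "ttf", "TrueType", "DOKBTG+Calibri", "R10", ""),
    (17, "cff", "Type1", "UAEUYH+Helvetica", "R20", "WinAnsiEncoding"),
    (18, "ttf", "Type0", "ECPLRU+Calibri", "R23", "Identity-H"),
]
source = find_font_xref(fonts, "Calibri", fonttype="TrueType")
target = find_font_xref(fonts, "Helvetica")
print(source, target)  # prints: 13 17
# With an open document, the copy itself would then be:
# doc.xref_copy(source, target)
```

The fonttype filter is there because the same base font can appear more than once with different types, as Calibri does in this listing.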

JorjMcKie commented 2 years ago

BTW the font replacement utilities accept the following as replacement candidates:

fonts provided as file names - whether installed in the OS or not
or one of the font name codes available as Base-14 fonts, or any code added by package pymupdf-fonts.

So if you had provided the font file of either Minion or Myriad (and maybe also replaced the respective font with it), it would have worked too. But the other solution is more elegant - and guaranteed to reduce the file size.

tobwen commented 2 years ago

I have an even better solution for this special situation.

Works perfectly. Thanks a lot. It's a pity that we have to fix MS output that way :)

I'll play around with PyMuPDF to make it more granular: Sometimes this "special character" is in a heading, in which case the heading font, not the body text font, should be used. I think I need to step through the text, find the unwanted font, check the font surrounding it and replace it.

For example:

<head><font="Minion">2</font><font="Arial">%20</font><font ="Minion">Heading</font></head>
<p><font="Myriad">1.</font><font="Arial">%20</font><font ="Myriad">blah blah blah</font></p>
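
The "check the font surrounding it" step could look roughly like the sketch below. It only decides which font each unwanted span should take; it works on plain (fontname, text) pairs and does not touch a PDF. The function name choose_replacements and the rule "prefer the font before, else the font after" are assumptions made for illustration:

```python
def choose_replacements(spans, unwanted):
    """Map the index of each unwanted span to a neighbouring font name.

    `spans` is a list of (fontname, text) pairs in reading order for one
    line. The nearest preceding span with a different font wins; if there
    is none, the nearest following one is used. Spans without any usable
    neighbour are left out of the result.
    """
    result = {}
    for i, (font, _text) in enumerate(spans):
        if font != unwanted:
            continue
        before = [f for f, _ in reversed(spans[:i]) if f != unwanted]
        after = [f for f, _ in spans[i + 1:] if f != unwanted]
        neighbours = before[:1] or after[:1]
        if neighbours:
            result[i] = neighbours[0]
    return result


# The two lines from the example above, as (fontname, text) pairs.
heading = [("Minion", "2"), ("Arial", " "), ("Minion", "Heading")]
body = [("Myriad", "1."), ("Arial", " "), ("Myriad", "blah blah blah")]
print(choose_replacements(heading, "Arial"))  # prints: {1: 'Minion'}
print(choose_replacements(body, "Arial"))     # prints: {1: 'Myriad'}
```

Applying such a per-occurrence decision is the harder part: doc.xref_copy() swaps the font object wherever it is referenced, so a heading-versus-body distinction would need changes at the level of the individual text operations, which this sketch does not attempt.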

So if you had provided the font file of either Minion or Myriad (and maybe also replaced the respective font with it), it would have worked too.

Those are non-free fonts and the system isn't licensed to work with the font files.

JorjMcKie commented 2 years ago

that we have to fix MS output that way

Indeed. My example is even worse: the subset fonts seem to be created page-wise and not document-wide. This also introduces unneeded size, because some base data exists in every font - which then will exist multiple times.

I am not sure however whether OpenOffice / LibreOffice might do a better job with their PDF export ...

To find out the number of characters making use of a specific font, you can walk through the "rawdict" output of page.get_text(). To get hold of the complete font name (including subset identifier), call fitz.TOOLS.set_subset_fontnames(True) before you do that. Store the info in a dict with the font name as key and the set of characters using it as the value. Any font only using space can then probably be replaced by any of the other fonts also containing space in its usage set.
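
A minimal sketch of that bookkeeping follows. It assumes the usual "rawdict" layout (blocks, then lines, then spans, each span carrying a "font" name and a "chars" list whose items have a "c" key) and runs on a hand-built stand-in so that it works without a PDF; chars_by_font and space_only_fonts are hypothetical names:

```python
from collections import defaultdict


def chars_by_font(rawdict, usage=None):
    """Collect, per font name, the set of characters drawn with that font.

    `rawdict` is a dict shaped like the output of page.get_text("rawdict").
    Pass the same `usage` dict again to accumulate over several pages.
    """
    usage = defaultdict(set) if usage is None else usage
    for block in rawdict.get("blocks", []):
        for line in block.get("lines", []):  # image blocks have no "lines"
            for span in line["spans"]:
                usage[span["font"]].update(ch["c"] for ch in span["chars"])
    return usage


def space_only_fonts(usage):
    """Return the fonts whose usage set contains nothing but the space."""
    return sorted(font for font, chars in usage.items() if chars == {" "})


# Hand-built stand-in for one page's rawdict output (assumed shape).
sample = {"blocks": [{"type": 0, "lines": [{"spans": [
    {"font": "ABCXYZ+MyriadPro-Regular", "chars": [{"c": "1"}, {"c": "."}]},
    {"font": "ABCXYZ+ArialMT", "chars": [{"c": " "}]},
    {"font": "ABCXYZ+MyriadPro-Regular", "chars": [{"c": "b"}, {"c": " "}]},
]}]}]}

usage = chars_by_font(sample)
print(space_only_fonts(usage))  # prints: ['ABCXYZ+ArialMT']
```

With a real document the loop would presumably be `for page in doc: chars_by_font(page.get_text("rawdict"), usage)`, after enabling the full subset font names as described above.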

tobwen commented 2 years ago

This also introduces unneeded size, because some base data exists in every font - which then will exist multiple times.

Fortunately, this doesn't happen in my documents - don't ask me why. I believe that only Acrobat can combine the subsets.

I am not sure however whether OpenOffice / LibreOffice might do a better job with their PDF export ...

Whenever I use LO for PDF jobs, I'm totally surprised at how well it works. You can throw just about any SVG or PDF into Writer, export it to PDF and it's still vector art with selectable text, etc. - I didn't check color-management (CMYK-PDF in RGB export), but I think they've got a solution for this.

Any font only using space can then probably be replaced by any of the other fonts also containing space in its usage set.

Yeah, I definitely need to play around with it. I mean, for document viewing, there's nothing wrong with having Myriad between Minions, but my inner Mr. Monk goes crazy about this :)

If you want, you can close this, but I'm unsure if it's "completed" or "not planned" :?

JorjMcKie commented 2 years ago

I believe that only Acrobat can combine the subsets.

Somewhat related: PyMuPDF can produce subset fonts across the whole PDF via doc.subset_fonts().

close this, but I'm unsure if it's "completed" or "not planned" :?

Well, the font replacement scripts must be regarded as "demo" / "example" solutions - not as an official part of the PyMuPDF package. They are intended to be used as a starting point for your own solution to some problem. This is true for all scripts in this repo, which is why we won't fix issues here, but only accept PRs from anyone who finds a way to improve something.

tobwen commented 2 years ago

Somewhat related: PyMuPDF can produce subset fonts across the whole PDF via doc.subset_fonts().

But probably not for "accidents" like the output you mentioned. With mixed name, type and encoding?

 (12, 'ttf', 'TrueType', 'DOKBTG+Calibri', 'R10', ''),
 (16, 'ttf', 'Type0', 'ECPLRU+Calibri', 'R23', 'Identity-H'),

By the way, I'm available on IRC, bridged from mupdf's discord. Maybe you're around.

JorjMcKie commented 2 years ago

But probably not for "accidents" like the output you mentioned. With mixed name, type and encoding?

No, I was a bit too laconic. PyMuPDF does not merge different font subsets. And making subsets is restricted to fonts,

for which package fonttools supports it,
that are not already subsetted

My discontent with MS Word's approach focusses on their apparent per-page approach in that respect - instead of building a subset based on a font's usage within the whole document.

To combine two different subsets - even if each was built from the same base font - would be something between tricky and impossible. The main reason being that you can choose to renumber the base font's glyphs - or not - in that process. The approach used in doc.subset_fonts() obviously only works, because it builds subsets while keeping each glyph's original number within the base font. Two different subsets could presumably only be merged, if the above is true for both of them.
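
The renumbering argument can be illustrated with a toy model in which a subset is nothing more than a mapping from glyph number to glyph name. Real font files contain far more than this, and all glyph numbers below are invented; merge_subsets is a hypothetical helper, not something PyMuPDF or fonttools provides:

```python
def merge_subsets(a, b):
    """Merge two {glyph_number: glyph_name} maps.

    Returns the merged map, or None if the same glyph number stands for
    different glyphs in the two subsets.
    """
    merged = dict(a)
    for gid, name in b.items():
        if merged.setdefault(gid, name) != name:
            return None  # conflict: one number, two different glyphs
    return merged


# Both subsets keep the glyph numbers of the (imaginary) base font.
kept_a = {3: "space", 36: "A"}
kept_b = {3: "space", 37: "B"}

# Both subsets were renumbered compactly when they were built.
renumbered_a = {1: "space", 2: "A"}
renumbered_b = {1: "space", 2: "B"}

print(merge_subsets(kept_a, kept_b))
# prints: {3: 'space', 36: 'A', 37: 'B'}
print(merge_subsets(renumbered_a, renumbered_b))
# prints: None
```

In the renumbered case, glyph number 2 means "A" in one subset and "B" in the other, so text that refers to glyph 2 could not be served by a merged font without rewriting that text as well.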

tobwen commented 2 years ago

Hmm. I'm thinking about removing the embedded font (Acrobat can do that) and re-embedding it using PyMuPDF. But I bet that won't be so easy either...

JorjMcKie commented 2 years ago

The title "font replacement" for my scripts is actually misleading: They do not technically do that. Instead they rewrite original text using the chosen replacing font - unconditionally. Which means,

if the original text was hidden, it now no longer is
if the original text was shown under conditions controlled by Optional Content objects, then this no longer is the case.
...

Actually, real font replacement takes place only with using that doc.xref_copy trick.

pymupdf / PyMuPDF-Utilities

[font-replacement] replace with a font, which is already embedded #48
