pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0

Subsetting fonts in some PDFs messes up horizontal glyph placement #1081

Closed brechtm closed 3 years ago

brechtm commented 3 years ago

Describe the bug (mandatory)


Subsetting fonts using subset_fonts() in PDFs produced by rinohtype changes the horizontal placement of individual glyphs. An example PDF with embedded fonts is google_fonts.pdf.

To Reproduce (mandatory)

import fitz
d = fitz.open('google_fonts.pdf')
d.subset_fonts()
d.save('gf_subset.pdf', garbage=3, deflate=True)
$ python subset.py
Subset built for 'Asap-Bold'.
Subset built for 'NotoSerif'.
Subset built for 'NotoSerif-Italic'.
Subset built for 'NotoSerif-Bold'.
Subset built for 'RobotoMono-Regular'.
Subset built for 'RobotoMono-Medium'.

Screenshots (optional)

image

Your configuration (mandatory)

3.9.5 (default, May 26 2021, 10:31:42)
[Clang 12.0.0 (clang-1200.0.32.29)]
 darwin

PyMuPDF 1.18.14: Python bindings for the MuPDF 1.18.0 library.
Version date: 2021-06-01 08:11:38.
Built for Python 3.9 on darwin (64-bit).
Collecting pymupdf
  Downloading PyMuPDF-1.18.14-cp39-cp39-macosx_10_9_x86_64.whl (5.6 MB)
     |████████████████████████████████| 5.6 MB 3.6 MB/s
Installing collected packages: pymupdf
Successfully installed pymupdf-1.18.14
JorjMcKie commented 3 years ago

Confirmed. This error is font-dependent, i.e. it will happen for some fonts only - haven't yet detected any common patterns when this fails. The problem happens because of wrong or missing character width information stored with the subsetted font.

You may have followed the relevant code: I am building font subsets by deleting unused glyphs from the font, but keeping the original glyph id numbers. What I obviously also need to do is copy over the character width dictionary (/W) from the original to the subsetted font.
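To illustrate why a missing /W entry shifts glyphs horizontally, here is a minimal sketch in plain Python. It assumes the /W array has already been parsed into Python lists, that glyph ids are used directly as CIDs, and that a viewer falls back to the PDF default width of 1000 when no entry exists. The function names and the sample widths are made up for illustration; this is not PyMuPDF's code.

```python
def expand_widths(w_array):
    """Expand a parsed CIDFont /W array into a {cid: width} dictionary."""
    widths = {}
    i = 0
    while i < len(w_array):
        first = w_array[i]
        if isinstance(w_array[i + 1], list):
            # Form "c [w1 w2 ...]": consecutive CIDs starting at c.
            for offset, width in enumerate(w_array[i + 1]):
                widths[first + offset] = width
            i += 2
        else:
            # Form "c_first c_last w": one width for the whole range.
            last, width = w_array[i + 1], w_array[i + 2]
            for cid in range(first, last + 1):
                widths[cid] = width
            i += 3
    return widths


def advance(cids, widths, font_size, default_width=1000):
    """Horizontal advance of a glyph run; widths are in 1/1000 text units."""
    return sum(widths.get(cid, default_width) for cid in cids) * font_size / 1000


# Hypothetical widths for a handful of glyph ids.
widths = expand_widths([3, [540, 851], 10, 12, 600])
with_w = advance([3, 4], widths, font_size=10)   # uses the real widths
without_w = advance([3, 4], {}, font_size=10)    # falls back to the default
```

With the widths present the two glyphs advance by 13.91 units; without them the default yields 20.0, which is the kind of drift visible in the screenshot.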

JorjMcKie commented 3 years ago

You can download a pre-version fixing this from here: https://github.com/pymupdf/PyMuPDF-wheels/tree/osx

JorjMcKie commented 3 years ago

This is the PDF subsetted with the mentioned corrections: google_fonts_subsetted.pdf

brechtm commented 3 years ago

Wow! Many thanks for the quick fix.

As a test I've diffed that PDF with the original, and it's indeed identical.

JorjMcKie commented 3 years ago

As an aside: are you considering using PyMuPDF as the PDF output vehicle for rinohtype?

JorjMcKie commented 3 years ago

Here is a script using PyMuPDF to confirm equality of each single character, span, line and block position: diff-checker.zip

brechtm commented 3 years ago

Are you considering using PyMuPDF as the PDF output vehicle for rinohtype?

No, sorry. 😄 I want to keep dependencies to a minimum and pure-Python. It's just that rinohtype doesn't support font subsetting (yet), so I was looking for a tool to do this post rendering.

Here is a script using PyMuPDF to confirm equality of each single character, span, line and block position: diff-checker.zip

Thanks! That might come in handy in the future.

brechtm commented 3 years ago

Update: I just noticed that small caps are missing from the subsetted PDF.

JorjMcKie commented 3 years ago

small caps are missing from the subsetted PDF

Interesting point! That may be a major gap of mine. May I ask for another example file?

Thanks! That might come in handy in the future.

My example was grotesquely complicated. Really sufficient would be just one statement:

assert doc1[0].get_text("dict") == doc2[0].get_text("dict")
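If the plain assert fails, it does not say where the two pages differ. A small helper along these lines could walk the two nested structures and report the first mismatch, allowing a tolerance for coordinates. It is a sketch in plain Python; `first_difference` and the tolerance value are my own choices here, not part of PyMuPDF.

```python
import math


def first_difference(a, b, tol=1e-3, path="root"):
    """Return the path of the first differing item, or None if none differs."""
    if isinstance(a, dict) and isinstance(b, dict):
        if a.keys() != b.keys():
            return path
        for key in a:
            diff = first_difference(a[key], b[key], tol, f"{path}[{key!r}]")
            if diff:
                return diff
        return None
    if isinstance(a, (list, tuple)) and isinstance(b, (list, tuple)):
        if len(a) != len(b):
            return path
        for i, (x, y) in enumerate(zip(a, b)):
            diff = first_difference(x, y, tol, f"{path}[{i}]")
            if diff:
                return diff
        return None
    numeric = (int, float)
    if isinstance(a, numeric) and isinstance(b, numeric):
        # Coordinates may differ by rounding noise; compare with a tolerance.
        return None if math.isclose(a, b, abs_tol=tol) else path
    return None if a == b else path


# Intended use (requires the two documents to be open):
# print(first_difference(doc1[0].get_text("dict"), doc2[0].get_text("dict")))
```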

brechtm commented 3 years ago

Here's a PDF with some small capitals: google_fonts.pdf

JorjMcKie commented 3 years ago

Thanks for the example ... but it does work (in this case). small-caps-subset.pdf

brechtm commented 3 years ago

Sorry about that! I was in a rush and didn't check with that particular file (I noticed it in a file that I can't share here). This one does show the issue: google_fonts.pdf

One difference between this and the last google_fonts.pdf is that this one has a font with TrueType outlines, where the last had CFF outlines.

JorjMcKie commented 3 years ago

This one is interesting. As you may have noticed, the heart of the subsetting logic is determining the subset of used unicodes per font. To this end, I extract all text and store the detected unicodes in a list, which I then hand over to fontTools.

In the current case however, text extraction fails for the small caps (a base library incapability). When this happens, I receive the error unicode 0xfffd for each unrecognized character. As a consequence, the subsetted output has nothing to show at the respective character positions.

So what I need to do is exclude the font from subsetting when encountering error unicodes in text extraction.
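The exclusion rule can be sketched roughly as follows. The input shape (a list of `(font_name, text)` pairs) and both function names are assumptions for illustration; the real code works on PyMuPDF's text extraction output.

```python
from collections import defaultdict

REPLACEMENT = 0xFFFD  # error unicode returned for unrecognized characters


def unicodes_per_font(spans):
    """Collect the set of used code points per font from (font, text) pairs."""
    used = defaultdict(set)
    for font, text in spans:
        used[font].update(ord(ch) for ch in text)
    return dict(used)


def split_subsettable(used):
    """Separate fonts that are safe to subset from those to leave untouched."""
    ok = {font: cps for font, cps in used.items() if REPLACEMENT not in cps}
    skipped = sorted(font for font, cps in used.items() if REPLACEMENT in cps)
    return ok, skipped


# Hypothetical extraction result mirroring the small caps case.
spans = [
    ("PlayfairDisplay-Regular", "S\ufffd\ufffd\ufffd\ufffd C\ufffd"),
    ("Asap-Bold", "Title"),
]
ok, skipped = split_subsettable(unicodes_per_font(spans))
```

Here only 'Asap-Bold' ends up in `ok`, while 'PlayfairDisplay-Regular' lands in `skipped` and would be kept unsubsetted.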

JorjMcKie commented 3 years ago

Interestingly, MuPDF does recognize the small caps glyphs though. It just does not reflect them as unicodes in text extraction:

<fill_text colorspace="DeviceRGB" color="0 0 0" transform="1 0 0 -1 85.03937 138.37537">
    <span font="PlayfairDisplay-Regular" wmode="0" bidi="0" trm="10 0 0 10">
        <g unicode="S" glyph="S" x="40.827764" y="-24.15" adv=".54"/>
        <g unicode="�" glyph="m.smcp" x="46.227766" y="-24.15" adv=".851"/>
        <g unicode="�" glyph="a.smcp" x="54.737764" y="-24.15" adv=".616"/>
        <g unicode="�" glyph="l.smcp" x="60.897764" y="-24.15" adv=".573"/>
        <g unicode="�" glyph="l.smcp" x="66.62776" y="-24.15" adv=".573"/>
        <g unicode=" " glyph="space" x="72.357769" y="-24.15" adv=".249"/>
        <g unicode="C" glyph="C" x="75.287769" y="-24.15" adv=".688"/>
        <g unicode="�" glyph="a.smcp" x="82.16776" y="-24.15" adv=".616"/>
        <g unicode="�" glyph="p.smcp" x="88.32777" y="-24.15" adv=".566"/>
        <g unicode="�" glyph="i.smcp" x="93.94777" y="-24.15" adv=".339"/>
        <g unicode="�" glyph="t.smcp" x="97.33777" y="-24.15" adv=".631"/>
        <g unicode="�" glyph="a.smcp" x="102.997768" y="-24.15" adv=".616"/>
        <g unicode="�" glyph="l.smcp" x="109.15777" y="-24.15" adv=".573"/>
        <g unicode="�" glyph="s.smcp" x="114.88777" y="-24.15" adv=".535"/>
    </span>

In the above, the glyph names are shown correctly: "a.smcp" = "'a' small caps". So in a future version I might take the above output instead of my text extraction for shaping the input to fontTools (which can also digest glyph names).
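As a rough sketch of that idea, the trace output could be parsed with the standard library to collect glyph names per font. The sample below is a shortened copy of the trace above, with a closing `</fill_text>` tag added so that it parses; `glyphs_per_font` is a hypothetical helper, not an existing PyMuPDF function.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Shortened trace excerpt; "\ufffd" is the error unicode shown above.
TRACE = """<fill_text colorspace="DeviceRGB" color="0 0 0" transform="1 0 0 -1 85.03937 138.37537">
    <span font="PlayfairDisplay-Regular" wmode="0" bidi="0" trm="10 0 0 10">
        <g unicode="S" glyph="S" x="40.827764" y="-24.15" adv=".54"/>
        <g unicode="\ufffd" glyph="m.smcp" x="46.227766" y="-24.15" adv=".851"/>
        <g unicode="\ufffd" glyph="a.smcp" x="54.737764" y="-24.15" adv=".616"/>
    </span>
</fill_text>"""


def glyphs_per_font(trace_xml):
    """Map each font name to the set of glyph names used in the trace."""
    used = defaultdict(set)
    root = ET.fromstring(trace_xml)
    for span in root.iter("span"):
        for g in span.iter("g"):
            used[span.get("font")].add(g.get("glyph"))
    return dict(used)


used_glyphs = glyphs_per_font(TRACE)
```

The glyph names survive even where the `unicode` attribute is the error character, which is what makes this output usable for subsetting.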

JorjMcKie commented 3 years ago

It was a no-brainer change: small-caps-subset2.pdf. The corresponding output log looks like this:

>>> doc=fitz.open("google-fonts-sc2.pdf")
>>> doc.subset_fonts()
Subset built for 'Asap-Bold'.
Cannot subset 'PlayfairDisplay-Regular'.
Subset built for 'PlayfairDisplay-Italic'.
Subset built for 'PlayfairDisplay-Bold'.
Subset built for 'RobotoMono-Regular'.
Subset built for 'RobotoMono-Medium'.
981244
>>> doc.ez_save("small-caps-subset2.pdf")
>>> 
brechtm commented 3 years ago

Hmm, rinohtype could be to blame here. Copying (clipboard) the 'Small Capitals' string from both PDFs (TTF and CFF) produces a proper string (LATIN LETTER SMALL CAPITAL unicodes) for the CFF version but gibberish (SUPPLEMENTARY_PRIVATE_USE_AREA unicodes) for the TTF version. I haven't looked into how this is specifically handled in the code.

It is probably safer to perform subsetting based on the glyphs, not the unicode character codes. Otherwise, I suspect you may be dependent on ~the ToUnicode mapping defined for the font, which could be dependent on the PDF producer~ which specific (unicode) character codes the font maps these glyphs to.

It would be interesting to analyze a PDF with (true) small capitals from these same fonts produced with another application:

JorjMcKie commented 3 years ago

I understand. But from a PyMuPDF perspective I am actually pleased to see a case for how to improve it. As I wrote: the glyph-id is detectable - just not with the means I am currently using: text extraction. I need to use / make a MuPDF device, which outputs glyph-ids as well. For the purpose of subsetting, I need not bother about text position, color, or other crap. Just the font and the set of glyphs used from it.

JorjMcKie commented 3 years ago

analyze a PDF with (true) small capitals

This is another missing feature in PyMuPDF: MuPDF does offer a text output option "use small caps if present in font". So for me our conversation is important ... 👍

JorjMcKie commented 3 years ago

I was able to find a solution: I now hand over glyph ids to fontTools. This addresses your issue: small-caps-subset2.pdf. I am currently building another set of pre-version wheels, so you can download your OSX version soon for more tests if you want.

JorjMcKie commented 3 years ago

Interesting to see that with the new approach the subsetted output is even smaller for the "well-behaved" PDF too. Not sure why this is so ...

JorjMcKie commented 3 years ago

OSX wheels are available now.

JorjMcKie commented 3 years ago

good luck!

brechtm commented 3 years ago

I'm still seeing small caps disappear in a document (that I can't share publicly) featuring the Segoe UI font. It takes some extra effort to create a small test file, so I haven't done that yet.

The wheels I'm using seem to indicate that it's not the latest version you created 4 days ago though:

PyMuPDF 1.18.15: Python bindings for the MuPDF 1.18.0 library.
Version date: 2021-06-07 11:05:35.

I disabled pip's caching using --no-cache-dir, but that didn't help...

JorjMcKie commented 3 years ago

I did not change the relevant part after what you've got. Except I am now making a dynamic decision for switching between unicode-based and glyph-based subsetting: standard is unicode, and I am switching to glyphs once I encounter an error unicode 0XFFFD (65533). I am not even sure if this is necessary, but I remember that the unicode <-> glyph-id relationship is not one-to-one in a font.
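To make the "not one-to-one" point concrete, here is a small sketch with an invented glyph order and cmap. Some glyphs are shared by several code points, and others (such as small caps variants, typically reached through a layout feature rather than the cmap) have no code point at all. Both helper names are hypothetical.

```python
def glyphs_without_unicode(glyph_order, cmap):
    """Glyphs that no code point maps to, in glyph order."""
    mapped = set(cmap.values())
    return [glyph for glyph in glyph_order if glyph not in mapped]


def shared_glyphs(cmap):
    """Glyphs that more than one code point maps to."""
    by_glyph = {}
    for codepoint, glyph in cmap.items():
        by_glyph.setdefault(glyph, []).append(codepoint)
    return {g: sorted(cps) for g, cps in by_glyph.items() if len(cps) > 1}


# Invented data: space and no-break space share one glyph,
# and "a.smcp" has no code point of its own.
glyph_order = [".notdef", "space", "a", "a.smcp"]
cmap = {0x20: "space", 0xA0: "space", 0x61: "a"}

unreachable = glyphs_without_unicode(glyph_order, cmap)
shared = shared_glyphs(cmap)
```

A unicode-based subset can never request a glyph listed in `unreachable`, which is why falling back to glyph ids helps in the small caps case.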

JorjMcKie commented 3 years ago

Don't know if it helps: Just built support of small caps in PyMuPDF and tried outputting Segoe UI with it on Windows. Result: None of the Segoe UI fonts on Windows 10 supports / has small caps ...

brechtm commented 3 years ago

I didn't have time to look into the issue further. I've sent you a PDF with Segoe UI small caps by email. You can use fonttools to dump the font to TTX and search for the smcp feature tag to see that it does include small caps.
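Instead of dumping to TTX, the feature tags can also be read from the GSUB table directly. The sketch below assumes a `fontTools.ttLib.TTFont` object; so that it runs without a font file, it is demonstrated on a stand-in object with the same attribute layout. `gsub_feature_tags` is a name I made up.

```python
from types import SimpleNamespace


def gsub_feature_tags(font):
    """Return the set of GSUB feature tags of a TTFont-like object."""
    if "GSUB" not in font:
        return set()
    feature_list = font["GSUB"].table.FeatureList
    if feature_list is None:
        return set()
    return {record.FeatureTag for record in feature_list.FeatureRecord}


# Stand-in mimicking the layout of TTFont["GSUB"]; with fontTools installed,
# one would pass TTFont("path/to/font.ttf") instead (path is a placeholder).
fake_font = {
    "GSUB": SimpleNamespace(
        table=SimpleNamespace(
            FeatureList=SimpleNamespace(
                FeatureRecord=[
                    SimpleNamespace(FeatureTag="liga"),
                    SimpleNamespace(FeatureTag="smcp"),
                ]
            )
        )
    )
}

has_small_caps = "smcp" in gsub_feature_tags(fake_font)
```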

JorjMcKie commented 3 years ago

I have generated new wheels. To test them, download your OSX wheel from here. They correctly subset the PDF you sent to my e-mail. For a reason unknown to me, subsetting Wingdings Regular does not work. This font is used with only one glyph in your PDF (id 21 = unicode 61490 / 0xF032), which I pass correctly to the fontTools subsetter. But this guy outputs a font subset with zero glyphs! After intercepting this condition, the produced output includes an unsubsetted Wingdings, so the file looks good.

brechtm commented 3 years ago

Forgot to follow up on this... I can confirm that the Segoe UI small caps subsetting now works correctly. The Wingdings glyph is missing, indeed.

The Wingdings font is different in that it uses Symbol encoding (see OpenType cmap table). I was able to make pyftsubset subset this font correctly by passing it the --symbol-cmap option. I think fonttools drops this encoding table by default to save some space.
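For reference, this is roughly how I assembled the pyftsubset call. The helper only builds the argument list; the file names are placeholders, and glyph id 21 is the one mentioned above. Treat it as a sketch of my workaround rather than a recommended recipe.

```python
def pyftsubset_args(font_path, out_path, glyph_ids, symbol_cmap=False):
    """Build the command-line arguments for a glyph-id based pyftsubset run."""
    args = [
        font_path,
        f"--output-file={out_path}",
        "--gids=" + ",".join(str(gid) for gid in sorted(glyph_ids)),
    ]
    if symbol_cmap:
        # Keep the Symbol cmap subtable that is otherwise dropped.
        args.append("--symbol-cmap")
    return args


# Placeholder file names; glyph id 21 corresponds to unicode 0xF032.
args = pyftsubset_args("wingding.ttf", "wingding-subset.ttf", {21}, symbol_cmap=True)

# To actually run it (needs fontTools installed):
# import subprocess
# subprocess.run(["pyftsubset", *args], check=True)
```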

JorjMcKie commented 3 years ago

"symbol-cmap"

That is a good hint though, thanks! I will try it out - the parameter does exist and may - hopefully - lead to success.

JorjMcKie commented 3 years ago

Fixed in new version finally.

will try it out - the parameter does exist and may - hopefully - lead to success.

btw: still does not work, but the mentioned workaround does.