Closed brechtm closed 3 years ago
Confirmed. This error is font-dependent, i.e. it will happen for some fonts only - haven't yet detected any common patterns when this fails. The problem happens because of wrong or missing character width information stored with the subsetted font.
You may have followed the relevant code: I am building font subsets by deleting unused glyphs from the font, but keeping the original glyph id numbers.
What I obviously also need to do is copy over the character width array (/W) from the original to the subsetted font.
You can download a pre-version fixing this from here: https://github.com/pymupdf/PyMuPDF-wheels/tree/osx
This is the PDF subsetted with the mentioned corrections: google_fonts_subsetted.pdf
Wow! Many thanks for the quick fix.
As a test I've diffed that PDF with the original, and it's indeed identical.
As an aside: Are you considering using PyMuPDF as the PDF output vehicle for rinohtype?
Here is a script using PyMuPDF to confirm equality of each single character, span, line and block position: diff-checker.zip
Are you considering using PyMuPDF as the PDF output vehicle for rinohtype?
No, sorry. 😄 I want to keep dependencies to a minimum and pure-Python. It's just that rinohtype doesn't support font subsetting (yet), so I was looking for a tool to do this post rendering.
Here is a script using PyMuPDF to confirm equality of each single character, span, line and block position: diff-checker.zip
Thanks! That might come in handy in the future.
Update: I just noticed that small caps are missing from the subsetted PDF.
small caps are missing from the subsetted PDF
Interesting point! That may be a major gap of mine. May I ask for another example file?
Thanks! That might come in handy in the future.
My example was grotesquely complicated. Really sufficient would be just one statement:
assert doc1[0].get_text("dict") == doc2[0].get_text("dict")
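The one-liner above only looks at the first page. A minimal sketch of the same check over all pages could look like the following. The names `first_text_mismatch` and `compare_files` are made up for illustration, and the sketch assumes (as was the case in this thread) that subsetting leaves the extracted text dict unchanged.

```python
def first_text_mismatch(doc1, doc2):
    """Return the index of the first page whose text dict differs, else None.

    Both arguments only need to support len() and indexing and yield
    page-like objects offering get_text("dict"), as PyMuPDF documents do.
    """
    if len(doc1) != len(doc2):
        # Report the first page number that exists in only one document.
        return min(len(doc1), len(doc2))
    for pno in range(len(doc1)):
        if doc1[pno].get_text("dict") != doc2[pno].get_text("dict"):
            return pno
    return None


def compare_files(path1, path2):
    """Open two PDFs with PyMuPDF and compare them page by page."""
    import fitz  # PyMuPDF; imported lazily so the helper above has no dependency

    with fitz.open(path1) as doc1, fitz.open(path2) as doc2:
        return first_text_mismatch(doc1, doc2)
```

A return value of `None` means that no page differs.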
Here's a PDF with some small capitals: google_fonts.pdf
Thanks for the example ... but it does work (in this case). small-caps-subset.pdf
Sorry about that! I was in a rush and didn't check with that particular file (I noticed it in a file that I can't share here). This one does show the issue: google_fonts.pdf
One difference between this and the last google_fonts.pdf is that this one has a font with TrueType outlines, where the last had CFF outlines.
This one is interesting. As you may have noticed, the heart of the subsetting logic is determining the subset of used unicodes per font. To this end, I extract all text and store the detected unicodes in a list, which I then hand over to fontTools.
In the current case however, text extraction fails for the small caps (a base library incapability). When this happens, I am receiving the error unicode 0xfffd for each unrecognized character. In consequence, the subsetted output has nothing to show for the respective character places.
So what I need to do is exclude the font from subsetting when I encounter error unicodes in text extraction.
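To illustrate the idea, here is a rough sketch (not the actual PyMuPDF code) that collects the unicodes per font from one page's text dict and sets aside every font for which extraction returned U+FFFD. It assumes the layout of `page.get_text("dict")` (blocks, lines, spans with "font" and "text"); the function names are hypothetical.

```python
REPLACEMENT_CHAR = 0xFFFD  # reported by text extraction for unrecognized characters


def unicodes_per_font(page_dict):
    """Map each font name to the set of code points used on one page.

    page_dict is assumed to look like page.get_text("dict"):
    blocks -> lines -> spans, each span carrying "font" and "text".
    """
    used = {}
    for block in page_dict.get("blocks", []):
        for line in block.get("lines", []):  # image blocks have no "lines"
            for span in line.get("spans", []):
                used.setdefault(span["font"], set()).update(map(ord, span["text"]))
    return used


def split_fonts(used):
    """Return (subsettable, excluded) font names, based on U+FFFD presence."""
    excluded = {font for font, cps in used.items() if REPLACEMENT_CHAR in cps}
    return set(used) - excluded, excluded
```

Fonts in the second set would be left unsubsetted, since their unicode list cannot be trusted.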
Interestingly, MuPDF does recognize the small caps glyphs though. It just does not reflect them as unicodes in text extraction:
<fill_text colorspace="DeviceRGB" color="0 0 0" transform="1 0 0 -1 85.03937 138.37537">
<span font="PlayfairDisplay-Regular" wmode="0" bidi="0" trm="10 0 0 10">
<g unicode="S" glyph="S" x="40.827764" y="-24.15" adv=".54"/>
<g unicode="�" glyph="m.smcp" x="46.227766" y="-24.15" adv=".851"/>
<g unicode="�" glyph="a.smcp" x="54.737764" y="-24.15" adv=".616"/>
<g unicode="�" glyph="l.smcp" x="60.897764" y="-24.15" adv=".573"/>
<g unicode="�" glyph="l.smcp" x="66.62776" y="-24.15" adv=".573"/>
<g unicode=" " glyph="space" x="72.357769" y="-24.15" adv=".249"/>
<g unicode="C" glyph="C" x="75.287769" y="-24.15" adv=".688"/>
<g unicode="�" glyph="a.smcp" x="82.16776" y="-24.15" adv=".616"/>
<g unicode="�" glyph="p.smcp" x="88.32777" y="-24.15" adv=".566"/>
<g unicode="�" glyph="i.smcp" x="93.94777" y="-24.15" adv=".339"/>
<g unicode="�" glyph="t.smcp" x="97.33777" y="-24.15" adv=".631"/>
<g unicode="�" glyph="a.smcp" x="102.997768" y="-24.15" adv=".616"/>
<g unicode="�" glyph="l.smcp" x="109.15777" y="-24.15" adv=".573"/>
<g unicode="�" glyph="s.smcp" x="114.88777" y="-24.15" adv=".535"/>
</span>
In the above, the glyph names are correctly shown "a.smcp" = "'a' small caps". So in a future version I might take the above output instead of my text extraction for shaping the input to fontTools (which can also digest glyph names).
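As a sketch of how such trace output could be turned into subsetter input, the following collects the glyph names per font with the standard library only. It assumes the element and attribute names shown above (`span`, `g`, `font`, `glyph`) and an XML fragment without a declaration; `glyphs_per_font` is a made-up name.

```python
import xml.etree.ElementTree as ET


def glyphs_per_font(trace_xml):
    """Collect the glyph names used per font from trace-style XML."""
    # Wrap in an artificial root so that several top-level elements are allowed.
    root = ET.fromstring("<trace>" + trace_xml + "</trace>")
    used = {}
    for span in root.iter("span"):
        names = used.setdefault(span.get("font"), set())
        for g in span.iter("g"):
            names.add(g.get("glyph"))
    return used


# Shortened, well-formed variant of the output quoted above.
sample = (
    '<fill_text colorspace="DeviceRGB" color="0 0 0">'
    '<span font="PlayfairDisplay-Regular" wmode="0" bidi="0" trm="10 0 0 10">'
    '<g unicode="S" glyph="S" x="40.8" y="-24.15" adv=".54"/>'
    '<g unicode="\ufffd" glyph="m.smcp" x="46.2" y="-24.15" adv=".851"/>'
    '<g unicode="\ufffd" glyph="a.smcp" x="54.7" y="-24.15" adv=".616"/>'
    "</span></fill_text>"
)
print(glyphs_per_font(sample))
```

The glyph names survive even where the unicode is U+FFFD, which is the point of using this output instead of text extraction.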
It was a no-brainer change: small-caps-subset2.pdf The corresponding output log looks like this:
>>> doc=fitz.open("google-fonts-sc2.pdf")
>>> doc.subset_fonts()
Subset built for 'Asap-Bold'.
Cannot subset 'PlayfairDisplay-Regular'.
Subset built for 'PlayfairDisplay-Italic'.
Subset built for 'PlayfairDisplay-Bold'.
Subset built for 'RobotoMono-Regular'.
Subset built for 'RobotoMono-Medium'.
981244
>>> doc.ez_save("small-caps-subset2.pdf")
>>>
Hmm, rinohtype could be to blame here. Copying (clipboard) the 'Small Capitals' string from both PDFs (TTF and CFF) produces a proper string (LATIN LETTER SMALL CAPITAL unicodes) for the CFF version but gibberish (SUPPLEMENTARY_PRIVATE_USE_AREA unicodes) for the TTF version. I haven't looked into how this is specifically handled in the code.
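For anyone who wants to repeat the clipboard check, a small sketch like this labels each pasted character. It uses only `unicodedata` from the standard library; `classify` is a made-up name and the sample string in the usage note is an assumption, not the actual clipboard content.

```python
import unicodedata


def classify(text):
    """Label each character as private use, by its Unicode name, or as unnamed."""
    report = []
    for ch in text:
        if unicodedata.category(ch) == "Co":  # general category Co = private use
            label = "PRIVATE USE"
        else:
            label = unicodedata.name(ch, "UNNAMED")
        report.append((f"U+{ord(ch):04X}", label))
    return report
```

For example, `classify("S\u1d0d")` labels the second character as LATIN LETTER SMALL CAPITAL M, while a code point such as U+F0001 is reported as private use.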
It is probably safer to perform subsetting based on the glyphs, not the unicode character codes. Otherwise, I suspect you may be dependent on ~the ToUnicode mapping defined for the font, which could be dependent on the PDF producer~ which specific (unicode) character codes the font maps these glyphs to.
It would be interesting to analyze a PDF with (true) small capitals from these same fonts produced with another application:
I understand. But from a PyMuPDF perspective I am actually pleased to see a case for how to improve it. As I wrote: the glyph-id is detectable - just not with the means I am currently using: text extraction. I need to use / make a MuPDF device which outputs glyph-ids as well. For the purpose of subsetting, I need not bother about text position, color, or other crap. Just the font and the set of glyphs used from it.
analyze a PDF with (true) small capitals
This is another missing feature in PyMuPDF: MuPDF does offer a text output option "use small caps if present in font". So for me our conversation is important ... 👍
I was able to find a solution: I now hand over glyph ids to fontTools. This addresses your issue: small-caps-subset2.pdf I am currently building another set of pre-version wheels, so you can download your OSX version soon for more tests if you want.
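A minimal sketch of glyph-id based subsetting with the fontTools Python API might look like this. It is not necessarily how PyMuPDF does it internally: the function names are hypothetical, and `retain_gids` is used on the assumption that the original glyph id numbers should be kept, as described earlier in this thread.

```python
def unique_gids(gids):
    """Return the glyph ids sorted and without duplicates."""
    return sorted(set(int(g) for g in gids))


def subset_by_gids(font_path, out_path, gids):
    """Write a subset of font_path that keeps only the given glyph ids."""
    from fontTools import subset  # third-party, imported lazily
    from fontTools.ttLib import TTFont

    options = subset.Options()
    options.retain_gids = True  # keep the original glyph id numbers
    font = TTFont(font_path)
    subsetter = subset.Subsetter(options=options)
    subsetter.populate(gids=unique_gids(gids))  # glyph ids instead of unicodes
    subsetter.subset(font)
    font.save(out_path)
```

Because the subsetter is fed glyph ids directly, it no longer depends on what text extraction reports as unicode.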
Interesting to see that with the new approach, the subsetted version of the "well-behaving" PDF is even smaller too. Not sure why this is so ...
OSX wheels are available now.
good luck!
I'm still seeing small caps disappear in a document (that I can't share publicly) featuring the Segoe UI font. It takes some extra effort to create a small test file, so I haven't done that yet.
The wheels I'm using seem to indicate that it's not the latest version you created 4 days ago though:
PyMuPDF 1.18.15: Python bindings for the MuPDF 1.18.0 library.
Version date: 2021-06-07 11:05:35.
I disabled pip's caching using --no-cache-dir, but that didn't help...
I have not changed the relevant part since the version you've got, except that I now decide dynamically between unicode-based and glyph-based subsetting: the default is unicode, and I switch to glyphs once I encounter an error unicode 0xFFFD (65533). I am not even sure if this is necessary, but I remember that the unicode <-> glyph-id relationship is not one-to-one in a font.
Don't know if it helps: Just built support of small caps in PyMuPDF and tried outputting Segoe UI with it on Windows. Result: None of the Segoe UI fonts on Windows 10 supports / has small caps ...
I didn't have time to look into the issue further. I've sent you a PDF with Segoe UI small caps by email. You can use fonttools to dump the font to TTX and search for the smcp feature tag to see that it does include small caps.
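The TTX search can also be scripted. The sketch below assumes the GSUB table was dumped first (for example with `ttx -t GSUB` on the font file) and that the dump has the usual `ttFont` / `GSUB` / `FeatureTag` layout; `gsub_feature_tags` is a made-up name.

```python
import xml.etree.ElementTree as ET


def gsub_feature_tags(ttx_xml):
    """Return the set of GSUB feature tags found in a TTX dump."""
    root = ET.fromstring(ttx_xml)  # root element is expected to be <ttFont>
    gsub = root.find("GSUB")
    if gsub is None:
        return set()  # font dump contains no GSUB table
    return {tag.get("value") for tag in gsub.iter("FeatureTag")}
```

If `"smcp"` is in the returned set, the font does carry a small caps feature.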
I have generated new wheels. To test them, download your OSX wheel from here. It correctly subsets the PDF you sent to my e-mail. For a reason unknown to me, subsetting Wingdings Regular does not work. This font is used with only one glyph in your PDF (id 21 = unicode 61490 / 0xF032), which I pass correctly to the fontTools subsetter. But this guy outputs a font subset with zero glyphs! After intercepting this condition, the produced output includes a non-subsetted Wingdings, so the file looks good.
Forgot to follow up on this... I can confirm that the Segoe UI small caps subsetting now works correctly. The Wingdings glyph is missing, indeed.
The Wingdings font is different in that it uses Symbol encoding (see OpenType cmap table). I was able to make pyftsubset subset this font correctly by passing it the --symbol-cmap option. I think fonttools drops this encoding table by default to save some space.
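For reference, a sketch of how such a pyftsubset call could be assembled from Python. The file names in the usage note are placeholders, `pyftsubset_args` is a made-up helper, and combining `--symbol-cmap` with `--gids` and `--retain-gids` is an assumption based on this thread rather than the exact command that was run.

```python
def pyftsubset_args(font_path, out_path, gids, symbol_cmap=False):
    """Build an argument list for the pyftsubset command-line tool."""
    args = [
        "pyftsubset",
        font_path,
        "--gids=" + ",".join(str(g) for g in sorted(set(gids))),
        "--retain-gids",  # keep the original glyph id numbers
        "--output-file=" + out_path,
    ]
    if symbol_cmap:
        args.append("--symbol-cmap")  # keep the Symbol-encoded cmap subtable
    return args
```

The list can be passed to `subprocess.run(..., check=True)`, e.g. `pyftsubset_args("wingding.ttf", "wingding-subset.ttf", [21], symbol_cmap=True)` for the single Wingdings glyph mentioned above.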
"symbol-cmap"
Is a good hint though, thanks! I will try it out - the parameter does exist and may - hopefully - lead to success.
Fixed in new version finally.
will try it out - the parameter does exist and may - hopefully - lead to success.
btw: still does not work, but the mentioned workaround does.
Describe the bug (mandatory)
Subsetting fonts using subset_fonts() in PDFs produced by rinohtype changes the horizontal placement of individual glyphs. An example PDF with embedded fonts is google_fonts.pdf.
To Reproduce (mandatory)
Screenshots (optional)
Your configuration (mandatory)