Switch to using fonttools

gmischler commented 2 years ago

Problem

Currently fpdf2 uses its own ttfonts.py module to read and process the TrueType family of font files. This obviously works for what we're doing so far, and with the most common types of such files. But there are several problems with that approach.

There's a significant variety of font file types out there that all sail under the general "TrueType" flag, but contain either different types of information or the same information stored in different ways. While we cover the most common cases, users can easily run into fonts that we are unable to support.
There can be a lot of different information to be found in a font file and we currently only support a small subset of that. I've seen several feature requests just in the last few months where the implementation would require access to more detailed information from fonts than we currently have.
Both the variety and the amount of information in font files are constantly expanding, as the standards are continuously enhanced.
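To illustrate the variety mentioned above: even the first four bytes of a font file can differ between files that are all loosely called "TrueType". The following is only a minimal sketch using the standard library, not the code from ttfonts.py; the function name and the hand-built test header are made up for this example.

```python
import struct

# Known sfnt "version" tags; files with any of these are commonly called TrueType/OpenType.
SFNT_FLAVORS = {
    b"\x00\x01\x00\x00": "TrueType outlines (glyf)",
    b"true": "TrueType outlines (legacy Apple tag)",
    b"OTTO": "OpenType with CFF outlines",
    b"ttcf": "TrueType/OpenType collection",
}


def read_table_directory(data: bytes):
    """Return (flavor, {tag: (offset, length)}) for a single-font sfnt file."""
    version = data[:4]
    flavor = SFNT_FLAVORS.get(version, "unknown")
    if version == b"ttcf" or flavor == "unknown":
        # Collections and unknown containers need different handling.
        return flavor, {}
    (num_tables,) = struct.unpack(">H", data[4:6])
    tables = {}
    for i in range(num_tables):
        # The 12-byte header is followed by 16-byte records: tag, checksum, offset, length.
        tag, _checksum, offset, length = struct.unpack_from(">4sIII", data, 12 + 16 * i)
        tables[tag.decode("latin-1")] = (offset, length)
    return flavor, tables


# Hand-built header with two table records, only to exercise the parser.
_header = b"OTTO" + struct.pack(">HHHH", 2, 32, 1, 0)
_records = struct.pack(">4sIII", b"CFF ", 0, 44, 100) + struct.pack(">4sIII", b"head", 0, 144, 54)
flavor, tables = read_table_directory(_header + _records)
```

Everything beyond this directory (the layout of each individual table) is where the real complexity lies, which is the part a dedicated library takes care of.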

Solution

Fortunately, other people have already dealt with those issues. There are libraries available that can be used to access the data in font files without having to worry about how it is actually stored and how that might change over time. In the Python world, fonttools seems to be the weapon of choice. According to its description, it is implemented in pure Python, and it seems to be under very active continuous development. Fonttools actually does a lot more than we need; what we would use is essentially just the fontTools.ttLib package.
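As a rough sketch of what access through fontTools.ttLib looks like: tables are addressed by their tag and expose their fields as attributes. The helper names below (open_font, describe_font) are invented for this example, and a stand-in object is used so the snippet runs without a font file or the library installed.

```python
from types import SimpleNamespace


def open_font(path):
    """Open a font file with fontTools (third-party: pip install fonttools)."""
    from fontTools import ttLib  # imported lazily so describe_font() has no hard dependency

    return ttLib.TTFont(path)


def describe_font(font):
    """Summarize a TTFont-like object that supports font[tag] and font.keys()."""
    return {
        "tables": sorted(font.keys()),
        "units_per_em": font["head"].unitsPerEm,
        "ascent": font["hhea"].ascent,
        "descent": font["hhea"].descent,
    }


# Stand-in with the same attribute names as the real tables, for illustration only.
fake_font = {
    "head": SimpleNamespace(unitsPerEm=2048),
    "hhea": SimpleNamespace(ascent=1900, descent=-500),
}
summary = describe_font(fake_font)
```

With a real file, describe_font(open_font("some_font.ttf")) would be the equivalent call; the path is a placeholder.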

Additional context

Issue #210, Discussion #411, and probably others, requesting the ability to position text vertically with more precision (in a variety of contexts). This requires access to more font metrics data than we currently use.
Issue #365, which notes that writing systems with mandatory ligatures don't render correctly (the same is true for contextual glyph selection, and possibly other features). Doing this right would require access to the substitution tables (there are several types) in supporting font files. I've had a very quick look at the file format details, and found that it would take a lot of time and effort to add that ourselves. With fonttools, at least the data retrieval would come right out of the box, even if we'd still need to figure out the actual text transformations based on that.
Issue #224 could probably be resolved by this.
Apparently this is not a new topic at all, and has been discussed for years and years, even in the old fpdf repository (can't find the links right now).
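To give an idea of what "data retrieval out of the box" could mean for the ligature case in #365, here is a hedged sketch that walks ligature substitutions. It assumes the object layout fontTools exposes for font["GSUB"].table (LookupList, Lookup, SubTable, and a ligatures mapping on ligature subtables); the function name is hypothetical, and the stand-in data is invented so the example runs without a font.

```python
from types import SimpleNamespace as NS


def iter_ligatures(gsub_table):
    """Yield (component_glyphs, ligature_glyph) pairs from ligature subtables.

    gsub_table is expected to look like font["GSUB"].table from fontTools.
    """
    for lookup in gsub_table.LookupList.Lookup:
        for subtable in lookup.SubTable:
            # Extension lookups wrap the real subtable in ExtSubTable.
            real = getattr(subtable, "ExtSubTable", subtable)
            ligatures = getattr(real, "ligatures", None)
            if not ligatures:
                continue  # not a ligature subtable
            for first_glyph, entries in ligatures.items():
                for entry in entries:
                    # Component holds the glyphs that follow the first one.
                    yield (first_glyph, *entry.Component), entry.LigGlyph


# Invented stand-in data mimicking the attribute names used above.
fake_gsub = NS(LookupList=NS(Lookup=[
    NS(LookupType=4, SubTable=[
        NS(ligatures={"f": [NS(Component=["i"], LigGlyph="f_i"),
                            NS(Component=["f", "i"], LigGlyph="f_f_i")]}),
    ]),
    NS(LookupType=1, SubTable=[NS(mapping={"a": "a.alt"})]),
]))
found = dict(iter_ligatures(fake_gsub))
```

This only lists the raw substitutions; deciding when and in which order to apply them to a run of text is the part that would still have to be worked out.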

Open questions

Are there any alternatives to fonttools out there that I've missed?
Given the apparently quite dynamic development over there, should we pin the dependency to a specific major version?

Lucas-C commented 2 years ago

Thank you for initiating this issue @gmischler! I planned to do the same following discussion #411.

Currently, the ttfonts.TTFontFile class contains all the font-parsing logic. Its usage is confined to a few places:

among its public methods, the only ones used in other parts of the code are .getMetrics() & .makeSubset()
several attributes are also directly read on TTFontFile instance objects: .ascent, .descent, .capHeight, .flags, .bbox, .defaultWidth...
all those usages are made inside FPDF.add_font & FPDF._putfonts_

I think a starting point could be to rewrite FPDF.add_font & FPDF._putfonts_ to use the fonttools lib, and then check that we can successfully generate a PDF with text.

Then, in a second phase, care should be taken to ensure backward compatibility and try to make all the existing text-related unit tests pass with minimum visual changes.

Contributions are welcome!

gmischler commented 2 years ago

and try to make all the existing text-related unit tests pass with minimum visual changes.

If the transition is done right (and assuming the current solution works correctly), I don't see why there should be any difference in the output at all.

RedShy commented 2 years ago

Contributions are welcome!

I would like to help with this, even though I'm new to fonts and to embedding fonts in PDFs, so I will probably make some mistakes along the way.

I think a starting point could be to rewrite FPDF.add_font & FPDF._putfonts_ to use the fonttools

I managed to use fonttools inside FPDF.add_font and drop ttfonts.py completely. I will open a draft PR for these changes.

I'm running into problems using fonttools inside FPDF._putfonts_. I see that there is the method ttf.makeSubset(), whose purpose I don't really understand, and then I see that ttfontstream is generated. From what I know, this is a sequence of bytes embedded in the PDF. I have some doubts here:

Why does ttfontstream have to be created, and why can't we embed the .ttf file directly in the PDF?
I observed that only the tables ("OS/2","cmap","cvt","fpgm","gasp","glyf","head","hhea","hmtx","loca","maxp","name","post","prep") are included in ttfontstream. The others are dropped, and I don't get why.

Where can I find more information to keep going? What I could do is try to assemble the ttfontstream byte sequence using only fonttools, but I'm not sure how to do that for now.

gmischler commented 2 years ago

Why does ttfontstream have to be created, and why can't we embed the .ttf file directly in the PDF?

Simply put: We only want to include the data that is actually needed to render the PDF. For 8-bit codepage based ttfs, this apparently results in the tables you list in the next point. For Unicode font files, the volume needs to get reduced further. Those can get arbitrarily large, dozens of megabytes are not uncommon. Because of that, only the glyphs that are actually used in the file are included. Each glyph gets a local index number for that purpose, which is usually different from its Unicode code point. I'm not sure if you need to worry about those details too much, though. As a first step, it should be enough to just find the currently used data through the new library. The level of abstraction between our ttfonts.TTFontFile and fonttools is likely to be quite different, so you'll have to do a little research to find the respective equivalent calls.
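The idea of giving each used glyph a small local index can be sketched in a few lines. This is an illustration under stated assumptions, not the logic of ttf.makeSubset(): the function names are invented, and starting at index 1 assumes that index 0 stays reserved for the missing-glyph entry. The second function shows how fonttools' own subsetter could be asked to reduce a font to the needed code points.

```python
def assign_local_ids(text, first_id=1):
    """Map each distinct code point in text to a small local index, in order of first use."""
    code_to_local = {}
    for ch in text:
        # setdefault leaves existing entries alone, so repeated characters keep their index.
        code_to_local.setdefault(ord(ch), first_id + len(code_to_local))
    return code_to_local


def subset_font(font, unicodes):
    """Reduce a fontTools TTFont in place to the glyphs needed for the given code points."""
    from fontTools import subset  # third-party, imported lazily

    subsetter = subset.Subsetter(subset.Options())
    subsetter.populate(unicodes=unicodes)
    subsetter.subset(font)
    return font


mapping = assign_local_ids("héllo")
```

The keys of the resulting mapping are exactly the code points that would be passed to the subsetter.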

I observed that only the tables ("OS/2","cmap","cvt","fpgm","gasp","glyf","head","hhea","hmtx","loca","maxp","name","post","prep") are included in ttfontstream. The others are dropped, and I don't get why.

TTF files can contain a large number of different tables, some of which are only used on a particular OS, or serve some other special purposes. Many of the "dropped" tables could be used to help select the right glyph (e.g. "GSUB"), or to position it optimally (e.g. kerning). These are tasks that need to happen when the PDF is created (if at all), so there's no benefit in including that data in the embedded font, since a PDF reader would have no use for it.

In fact, making it easier to access some of the other tables (e.g. "GSUB" for solving #365) is one of the primary purposes of using fonttools in the first place.
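The keep-or-drop decision described above could look roughly like this with a TTFont-like object, which supports keys() and del font[tag]. The function name is hypothetical and the keep list is simply the one quoted earlier in this thread, not a verified minimum.

```python
# Table tags from the list quoted above. Tags are always four characters,
# so "cvt" is stored as "cvt " (with a trailing space) in the font file.
KEEP_TABLES = frozenset({
    "OS/2", "cmap", "cvt ", "fpgm", "gasp", "glyf", "head",
    "hhea", "hmtx", "loca", "maxp", "name", "post", "prep",
})


def drop_unneeded_tables(font, keep=KEEP_TABLES):
    """Delete every table not in keep; return the sorted tags that were removed."""
    removed = sorted(
        tag for tag in font.keys()
        # fontTools lists a "GlyphOrder" pseudo-table that is not a real table.
        if tag not in keep and tag != "GlyphOrder"
    )
    for tag in removed:
        del font[tag]
    return removed


# A plain dict stands in for a TTFont here, for illustration only.
fake_font = {"head": object(), "glyf": object(), "GSUB": object(), "kern": object()}
removed = drop_unneeded_tables(fake_font)
```

Any data needed from the dropped tables (such as substitutions) would have to be read before they are deleted.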

Lucas-C commented 2 years ago

I managed to use fonttools inside FPDF.add_font and drop ttfonts.py completely, I will open a draft PR for these changes.

Good job @RedShy! Were you able to produce a PDF using a font coming from a .ttf file?

there is the method ttf.makeSubset(), whose purpose I don't really understand

It comes directly from the PHP original code: https://github.com/Setasign/tFPDF/blob/master/font/unifont/ttfonts.php#L494

I couldn't explain its role clearly...

@gmischler already provided an excellent answer. I don't have more useful information to share here... There is a lot of code exploration to do. Maybe the fonttools documentation would be helpful in understanding the roles of the tables: https://fonttools.readthedocs.io/en/latest/ttLib/index.html

gmischler commented 2 years ago

Where can I find more information to keep going?

The most comprehensive information I've found on the font file format and the meaning of the various tables is from Microsoft: OpenType Specification Version 1.9. Apple also has some information: TrueType Reference Manual.

RedShy commented 2 years ago

Thank you both! It’s really encouraging and motivating to receive such thorough answers!

We only want to include the data that is actually needed to render the PDF.

It makes sense and now it’s clearer!

As a first step, it should be enough to just find the currently used data through the new library.

I managed to do that inside FPDF.add_font(). I looked at every piece of data extracted with ttfonts.TTFontFile and searched for the equivalent using fonttools, then I ran the tests and they are all green. For example, this is the code I put inside FPDF.add_font() (I would still like to organize it better and make it more self-explanatory):

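# Imports needed by this snippet; ttffilename is the font path handled by FPDF.add_font()
from fontTools import ttLib
from fpdf.ttfonts import TTFontFile  # legacy parser, only used by the comparison asserts below

# font tools
ft = ttLib.TTFont(ttffilename)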

scale = 1000 / ft["head"].unitsPerEm
ascent = ft["hhea"].ascent * scale
descent = ft["hhea"].descent * scale
try:
    capHeight = ft["OS/2"].sCapHeight * scale
except AttributeError:
    capHeight = ascent
bbox = (
    f"[{ft['head'].xMin * scale:.0f} {ft['head'].yMin * scale:.0f}"
    f" {ft['head'].xMax * scale:.0f} {ft['head'].yMax * scale:.0f}]"
)
stemV = 50 + int(pow((ft["OS/2"].usWeightClass / 65), 2))
italicAngle = ft["post"].italicAngle
underlinePosition = ft["post"].underlinePosition * scale
underlineThickness = ft["post"].underlineThickness * scale

flags = 4
if ft["post"].isFixedPitch:
    flags |= 1
if ft["post"].italicAngle != 0:
    flags |= 64
if ft["OS/2"].usWeightClass >= 600:
    flags |= 262144

aw = ft["hmtx"].metrics[".notdef"][0]
defaultWidth = scale * aw

name = ft["name"].getBestFullName()

cmap = ft.getBestCmap()  # look up the best Unicode cmap once instead of on every iteration
charWidths = [len(cmap) - 1]
for char, glyph in cmap.items():
    if char in (0, 65535) or char >= 196608:
        continue

    aw = ft["hmtx"].metrics[glyph][0]

    if char >= len(charWidths):
        size = (((char + 1) // 1024) + 1) * 1024
        delta = size - len(charWidths)
        if delta > 0:
            charWidths += [defaultWidth] * delta

    w = round(scale * aw + 0.001) or 65535  # ROUND_HALF_UP
    charWidths[char] = w

ttf = TTFontFile()
ttf.getMetrics(ttffilename)

assert ascent == ttf.ascent
assert descent == ttf.descent
assert capHeight == ttf.capHeight
assert bbox == (
    f"[{ttf.bbox[0]:.0f} {ttf.bbox[1]:.0f}"
    f" {ttf.bbox[2]:.0f} {ttf.bbox[3]:.0f}]"
)
assert italicAngle == ttf.italicAngle
assert stemV == ttf.stemV
assert underlinePosition == ttf.underlinePosition
assert underlineThickness == ttf.underlineThickness
assert flags == ttf.flags
assert defaultWidth == ttf.defaultWidth
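The magic numbers in the snippet above are easier to follow with names. The values 1, 4, 64 and 262144 correspond to the FixedPitch, Symbolic, Italic and ForceBold bits of the PDF font descriptor flags, and the scale factor converts font units to the 1000-units-per-em space used in PDF. The helper names below are invented for this sketch; the logic mirrors the snippet rather than any final fpdf2 code.

```python
# Font descriptor flag bits from the PDF specification (bit n has the value 2**(n-1)).
FIXED_PITCH = 1 << 0   # 1
SYMBOLIC = 1 << 2      # 4
ITALIC = 1 << 6        # 64
FORCE_BOLD = 1 << 18   # 262144


def descriptor_flags(is_fixed_pitch, italic_angle, weight_class):
    """Combine the flag bits the same way the snippet above does."""
    flags = SYMBOLIC
    if is_fixed_pitch:
        flags |= FIXED_PITCH
    if italic_angle != 0:
        flags |= ITALIC
    if weight_class >= 600:
        flags |= FORCE_BOLD
    return flags


def stem_v(weight_class):
    """Heuristic from the snippet above: estimate StemV from OS/2 usWeightClass."""
    return 50 + int((weight_class / 65) ** 2)


def to_pdf_units(value, units_per_em):
    """Scale a value in font units to 1000 units per em."""
    return value * (1000 / units_per_em)


regular = descriptor_flags(False, 0, 400)
bold_italic_mono = descriptor_flags(True, -12, 700)
```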

After this, I wanted to do the same with FPDF._putfonts(). I see that the only data used there are ttfontstream and codeToGlyph, both of which are initialized inside ttf.makeSubset(). So my idea was to produce them using fonttools exactly the way they are currently made, but I was not able to do that. It's hard for me to read that part of the code and understand what's really going on.

For now, what I understand about ttfontstream is that it is basically a "cleaned" font file, embedded inside the PDF, that contains only the information relevant to rendering the font.

But how exactly does the font have to be "cleaned" (in order to create it with fonttools)?

Which tables have to stay, and which have to be dropped?
Do the kept tables have to be modified?
Are there other modifications to make besides working on the tables?
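These questions are not answered by the following sketch; it only illustrates the last step, turning an already reduced font object into bytes for embedding. It assumes a TTFont-like object whose save() method accepts a writable file object (as fontTools' TTFont.save() does) and uses Flate compression, with the uncompressed length returned because a PDF font file stream records it as /Length1. The function and stand-in class names are hypothetical, and this is not necessarily how ttfontstream is built in fpdf2.

```python
import io
import zlib


def font_to_pdf_stream(font):
    """Serialize a TTFont-like object and compress it for embedding.

    Returns (compressed_bytes, uncompressed_length).
    """
    buffer = io.BytesIO()
    font.save(buffer)  # write the font into memory instead of to disk
    raw = buffer.getvalue()
    return zlib.compress(raw), len(raw)


class FakeFont:
    """Stand-in that writes dummy bytes, so the sketch runs without fonttools."""

    def save(self, file):
        file.write(b"\x00\x01\x00\x00" + b"fake" * 100)


compressed, length1 = font_to_pdf_stream(FakeFont())
```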

gmischler commented 2 years ago

But how exactly does the font have to be "cleaned" (in order to create it with fonttools)?

"Use the source, luke!" :wink:

I guess there's no other way to figure it out than to step through the existing code, look where it gets its data from, and then replace that source with fonttools. Anyone else would have to go through the same steps to give you a better answer, and whoever originally wrote that code is probably not following the project anymore.

If it turns out to be too confusing for the direct approach, you could try to refactor the existing code first. Try to simplify it by farming out the code dealing with individual tables (or other data structures) to separate methods with descriptive names. In a second step, you can then transition one of those at a time. Such a refactoring might also help to simplify future modifications and extensions.
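A toy illustration of that two-step approach: first split the parser into one method per table, then replace a single method with a fonttools-backed version while the rest keeps using the old code path. All class and method names here are invented for the example and do not correspond to the real ttfonts.py.

```python
from types import SimpleNamespace


class LegacyFontReader:
    """Toy stand-in for a monolithic parser, split into one method per table."""

    def __init__(self, raw_tables):
        self.raw_tables = raw_tables  # {tag: already-decoded values}, hypothetical input

    def read_head(self):
        return {"units_per_em": self.raw_tables["head"]["unitsPerEm"]}

    def read_hhea(self):
        hhea = self.raw_tables["hhea"]
        return {"ascent": hhea["ascent"], "descent": hhea["descent"]}

    def get_metrics(self):
        metrics = {}
        for reader in (self.read_head, self.read_hhea):
            metrics.update(reader())
        return metrics


class MigratedFontReader(LegacyFontReader):
    """Overrides a single step; everything else still uses the legacy methods."""

    def __init__(self, raw_tables, font):
        super().__init__(raw_tables)
        self.font = font  # a fontTools TTFont, or anything exposing font["head"].unitsPerEm

    def read_head(self):
        return {"units_per_em": self.font["head"].unitsPerEm}


raw = {"head": {"unitsPerEm": 1000}, "hhea": {"ascent": 800, "descent": -200}}
fake_ttfont = {"head": SimpleNamespace(unitsPerEm=1000)}
legacy = LegacyFontReader(raw).get_metrics()
migrated = MigratedFontReader(raw, fake_ttfont).get_metrics()
```

Comparing the two results after each migrated method gives the same kind of safety net as the asserts in the add_font snippet earlier in this thread.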

RedShy commented 2 years ago

Loved the Star Wars reference! :grin:

I guess there's no other way to figure it out than to step through the existing code, look where it gets its data from, and then replace that source with fonttools.

Okay then. I'm a bit busy these days, but I will try to do it in the coming weeks.

If it turns out to be too confusing for the direct approach, you could try to refactor the existing code first.

Yes it's a good idea

Lucas-C commented 2 years ago

Hi @RedShy! I'd like to rekindle this ^^ Have you been blocked by anything that I could help with?

RedShy commented 2 years ago

Hi! Unfortunately I could not work much on this in the last few days. But I would still like to contribute! I hope to have more time to dedicate to it in the coming days.

Lucas-C commented 2 years ago

Given that this migration has been beautifully made by @RedShy in #477, do you think that we can close this @gmischler?

gmischler commented 2 years ago

Well, this task looks quite finished, so I guess we can declare it as such.

py-pdf / fpdf2

Switch to using fonttools #418