Closed gmischler closed 2 years ago
Thank you for initiating this issue @gmischler! I planned to do the same following discussion #411.
Currently, the ttfonts.TTFontFile class contains all the font-parsing logic. Its usage is very localized: the methods .getMetrics() & .makeSubset(), plus attributes of TTFontFile instance objects (.ascent, .descent, .capHeight, .flags, .bbox, .defaultWidth, ...), are used in FPDF.add_font & FPDF._putfonts.
I think a starting point could be to rewrite FPDF.add_font & FPDF._putfonts to use the fonttools lib, and then check that we can successfully generate a PDF with text.
Then, in a second phase, care should be taken to ensure backward compatibility and try to make all the existing text-related unit tests pass with minimum visual changes.
Contributions are welcome!
> and try to make all the existing text-related unit tests pass with minimum visual changes.
If the transition is done right (and assuming the current solution works correctly), I don't see why there should be any difference in the output at all.
> Contributions are welcome!
I would like to help with this, even though I'm new to fonts and to embedding fonts in PDFs, so I will probably make some mistakes along the way.
> I think a starting point could be to rewrite FPDF.add_font & FPDF._putfonts to use the fonttools
I managed to use fonttools inside FPDF.add_font and drop ttfonts.py completely. I will open a draft PR for these changes.
I'm running into problems using fonttools inside FPDF._putfonts. I see that there is the method ttf.makeSubset(), and I don't really understand what it does; after it runs, ttfontstream is generated.
From what I know, this is a sequence of bytes embedded in the PDF. Here I have some doubts:
- Why does ttfontstream have to be created, and why can't we embed the .ttf file directly in the PDF?
- I observed that only the tables ("OS/2","cmap","cvt","fpgm","gasp","glyf","head","hhea","hmtx","loca","maxp","name","post","prep") are included in ttfontstream; the others are dropped, and I don't get why.
- Where can I find more information to keep going?
What I could do is try to assemble the ttfontstream byte sequence using only fonttools, though I'm not sure how to do that for now.
> - Why does ttfontstream have to be created, and why can't we embed the .ttf file directly in the PDF?
Simply put: We only want to include the data that is actually needed to render the PDF.
For 8-bit codepage based ttfs, this apparently results in the tables you list in the next point.
For Unicode font files, the volume needs to get reduced further. Those can get arbitrarily large, dozens of megabytes are not uncommon. Because of that, only the glyphs that are actually used in the file are included. Each glyph gets a local index number for that purpose, which is usually different from its Unicode code point.
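To illustrate the local numbering described above, here is a minimal sketch. It is not the actual makeSubset() logic: the function name build_local_glyph_map is hypothetical, and leaving index 0 free for the missing-glyph entry is an assumption that follows the TrueType convention of glyph 0 being ".notdef".

```python
def build_local_glyph_map(texts):
    """Assign a small local index to every code point used in the document.

    Hypothetical helper for illustration only: index 0 is left free for the
    ".notdef" (missing glyph) entry, so real characters start at 1.
    """
    code_to_local = {}
    for text in texts:
        for char in text:
            code_point = ord(char)
            if code_point not in code_to_local:
                # First time this character is seen: give it the next index.
                code_to_local[code_point] = len(code_to_local) + 1
    return code_to_local


# Only 5 distinct characters are used, so only 5 glyphs would need embedding,
# however many glyphs the original font file contains.
mapping = build_local_glyph_map(["héllo", "hola"])
print(mapping)
```

The local indices stay small and dense even when the code points themselves are far apart, which is why they usually differ from the Unicode values.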
I'm not sure if you need to worry about those details too much, though. As a first step, it should be enough to just find the currently used data through the new library. The level of abstraction between our ttfonts.TTFontFile and fonttools is likely to be quite different, so you'll have to do a little research to find the respective equivalent calls.
> - I observed that only the tables ("OS/2","cmap","cvt","fpgm","gasp","glyf","head","hhea","hmtx","loca","maxp","name","post","prep") are included in ttfontstream, the others are dropped, don't get why.
TTF files can contain a large number of different tables, some of which are only used on a particular OS, or serve some other special purposes. Many of the "dropped" tables could be used to help select the right glyph (e.g. "GSUB"), or to position it optimally (e.g. kerning). These are tasks that need to happen when the file is created (if at all), so there's no benefit in including that data in there, since a PDF reader would have no use for it.
In fact, making it easier to access some of the other tables (e.g. "GSUB" for solving #365) is one of the primary purposes of using fonttools in the first place.
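The table filtering discussed above could be sketched roughly as follows. This is an illustration under stated assumptions, not the existing ttfonts.py code: drop_unneeded_tables and KEPT_TABLES are hypothetical names, and the function only assumes a mapping-like object that supports keys() and item deletion. A fontTools TTFont is assumed to behave that way, and its "GlyphOrder" pseudo-entry is skipped because it is not a real table.

```python
# Tables listed earlier in this thread as being kept in ttfontstream.
# Note: in real font files the tag is "cvt " (padded to four characters).
KEPT_TABLES = {
    "OS/2", "cmap", "cvt ", "fpgm", "gasp", "glyf", "head",
    "hhea", "hmtx", "loca", "maxp", "name", "post", "prep",
}


def drop_unneeded_tables(font, keep=KEPT_TABLES):
    """Delete every table whose tag is not in `keep`; return the dropped tags.

    `font` can be any mapping that supports keys() and `del font[tag]`.
    """
    dropped = []
    for tag in list(font.keys()):  # copy the keys: we delete while looping
        if tag == "GlyphOrder" or tag in keep:
            continue
        del font[tag]
        dropped.append(tag)
    return dropped


# Stand-in for a parsed font, so the sketch runs without a font file.
fake_font = {"GlyphOrder": None, "head": 1, "glyf": 2, "GSUB": 3, "kern": 4}
dropped_tags = drop_unneeded_tables(fake_font)
print(dropped_tags)
```

Keeping an explicit allow-list, rather than a list of tables to remove, means unknown or vendor-specific tables are dropped by default.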
> I managed to use fonttools inside FPDF.add_font and drop ttfonts.py completely, I will open a draft PR for these changes.
Good job @RedShy!
Were you able to produce a PDF using a font coming from a .ttf file?
> there is the method ttf.makeSubset() that I don't get what really does
It comes directly from the PHP original code: https://github.com/Setasign/tFPDF/blob/master/font/unifont/ttfonts.php#L494
I couldn't explain its role clearly...
@gmischler already provided an excellent answer. I don't have more useful information to share here...
There is a lot of code exploration to do.
Maybe the fonttools documentation would be helpful in understanding the roles of the tables:
https://fonttools.readthedocs.io/en/latest/ttLib/index.html
> Where I can find new information to keep going?
The most comprehensive information I've found on the font file format and the meaning of the various tables is from Microsoft: OpenType Specification Version 1.9. Apple also has some information: TrueType Reference Manual.
Thank you both! It’s really encouraging and motivating to receive such thorough answers!
> We only want to include the data that is actually needed to render the PDF.
It makes sense, and now it’s clearer!
> As a first step, it should be enough to just find the currently used data through the new library.
I managed to do that inside FPDF.add_font(). I looked at every piece of data extracted with ttfonts.TTFontFile and searched for the equivalent in fonttools, then I ran the tests and they are all green. For example, this is the code I put inside FPDF.add_font(). I would like to organize the code better and make it more self-explanatory.
# font tools (requires: from fontTools import ttLib)
ft = ttLib.TTFont(ttffilename)
# Font units are converted to the 1000-units-per-em space used in the PDF
scale = 1000 / ft["head"].unitsPerEm
ascent = ft["hhea"].ascent * scale
descent = ft["hhea"].descent * scale
try:
    capHeight = ft["OS/2"].sCapHeight * scale
except AttributeError:
    # Older "OS/2" table versions have no sCapHeight field
    capHeight = ascent
bbox = (
    f"[{ft['head'].xMin * scale:.0f} {ft['head'].yMin * scale:.0f}"
    f" {ft['head'].xMax * scale:.0f} {ft['head'].yMax * scale:.0f}]"
)
stemV = 50 + int(pow((ft["OS/2"].usWeightClass / 65), 2))
italicAngle = ft["post"].italicAngle
underlinePosition = ft["post"].underlinePosition * scale
underlineThickness = ft["post"].underlineThickness * scale
flags = 4
if ft["post"].isFixedPitch:
    flags |= 1
if ft["post"].italicAngle != 0:
    flags |= 64
if ft["OS/2"].usWeightClass >= 600:
    flags |= 262144
aw = ft["hmtx"].metrics[".notdef"][0]
defaultWidth = scale * aw
name = ft["name"].getBestFullName()
charWidths = [len(ft.getBestCmap().keys()) - 1]
for char in ft.getBestCmap().keys():
    if char in (0, 65535) or char >= 196608:
        continue
    glyph = ft.getBestCmap()[char]
    aw = ft["hmtx"].metrics[glyph][0]
    if char >= len(charWidths):
        # Grow the width table in blocks of 1024 entries
        size = (((char + 1) // 1024) + 1) * 1024
        delta = size - len(charWidths)
        if delta > 0:
            charWidths += [defaultWidth] * delta
    w = round(scale * aw + 0.001) or 65535  # ROUND_HALF_UP
    charWidths[char] = w

# Cross-check every value against the legacy parser
ttf = TTFontFile()
ttf.getMetrics(ttffilename)
assert ascent == ttf.ascent
assert descent == ttf.descent
assert capHeight == ttf.capHeight
assert bbox == (
    f"[{ttf.bbox[0]:.0f} {ttf.bbox[1]:.0f}"
    f" {ttf.bbox[2]:.0f} {ttf.bbox[3]:.0f}]"
)
assert italicAngle == ttf.italicAngle
assert stemV == ttf.stemV
assert underlinePosition == ttf.underlinePosition
assert underlineThickness == ttf.underlineThickness
assert flags == ttf.flags
assert defaultWidth == ttf.defaultWidth
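As one way to make the snippet above more self-explanatory, the magic numbers used for flags could be given names. This is only a sketch: the helper name font_descriptor_flags and the constant names are hypothetical, while the bit values themselves are the ones used in the snippet (they correspond to the FixedPitch, Symbolic, Italic and ForceBold bits of the PDF font descriptor Flags entry).

```python
# Bit values of the PDF font descriptor "Flags" entry used in the snippet.
FIXED_PITCH = 1 << 0   # 1
SYMBOLIC = 1 << 2      # 4
ITALIC = 1 << 6        # 64
FORCE_BOLD = 1 << 18   # 262144


def font_descriptor_flags(is_fixed_pitch, italic_angle, weight_class):
    """Compute the flags value the same way as the snippet above.

    The arguments are expected to come from ft["post"].isFixedPitch,
    ft["post"].italicAngle and ft["OS/2"].usWeightClass respectively.
    """
    flags = SYMBOLIC  # always set, matching "flags = 4" in the snippet
    if is_fixed_pitch:
        flags |= FIXED_PITCH
    if italic_angle != 0:
        flags |= ITALIC
    if weight_class >= 600:
        flags |= FORCE_BOLD
    return flags


regular = font_descriptor_flags(False, 0, 400)
bold_italic_mono = font_descriptor_flags(True, -12.0, 700)
print(regular, bold_italic_mono)
```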
After this, I wanted to do the same with FPDF._putfonts(). I see that the only data used there are ttfontstream and codeToGlyph, both of which are initialized inside ttf.makeSubset(). So my idea was to produce them with fonttools exactly the way they are currently made, but I was not able to do that. It's hard for me to read that part of the code and understand what's really going on.
For now, what I understand about ttfontstream is that it is basically a "cleaned" font file embedded inside the PDF, containing only the information relevant to rendering the font.
But how exactly does the font have to be "cleaned" (in order to create it with fonttools)?
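One possible direction, offered as a sketch rather than a drop-in replacement for ttf.makeSubset(): fonttools ships a fontTools.subset module that can reduce a font to the glyphs needed for a given text. The function names used_code_points and subset_font_bytes are hypothetical, the list of dropped tables is an assumption, and the resulting bytes are not guaranteed to match the current ttfontstream output.

```python
import io


def used_code_points(text):
    """Return the sorted, de-duplicated Unicode code points of `text`."""
    return sorted({ord(char) for char in text})


def subset_font_bytes(font_path, text):
    """Return the bytes of a font reduced to the glyphs needed for `text`."""
    # Imported lazily so used_code_points() stays usable without fontTools.
    from fontTools import subset
    from fontTools.ttLib import TTFont

    font = TTFont(font_path)
    options = subset.Options()
    # Assumption: layout tables are only useful while composing the text,
    # not for rendering the finished PDF, so they are dropped here.
    options.drop_tables += ["GSUB", "GPOS", "GDEF"]
    subsetter = subset.Subsetter(options=options)
    subsetter.populate(unicodes=used_code_points(text))
    subsetter.subset(font)

    # Serialize the reduced font into memory instead of a file on disk.
    buffer = io.BytesIO()
    font.save(buffer)
    return buffer.getvalue()
```

A call such as subset_font_bytes("SomeFont.ttf", "Hello") (the file name is a placeholder) would return a byte string that could then be compared, table by table, with what makeSubset() produces.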
> But how exactly the font has to be "cleaned" (in order to create it with fonttools)?
"Use the source, Luke!" :wink:
I guess there's no other way to figure it out than to step through the existing code, look where it gets its data from, and then replace that source with fonttools. Anyone else would have to go through the same steps to give you a better answer, and whoever originally wrote that code is probably not following the project anymore.
If it turns out to be too confusing for the direct approach, you could try to refactor the existing code first. Try to simplify it by farming out the code dealing with individual tables (or other data structures) to separate methods with descriptive names. In a second step, you can then transition one of those at a time. Such a refactoring might also help to simplify future modifications and extensions.
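A rough sketch of what such a per-table split might look like. The function names are hypothetical, and the table and attribute names (head.unitsPerEm, hhea.ascent, post.underlinePosition, ...) are the ones used in the add_font snippet earlier in this thread. The stand-in objects only exist so the example runs without a font file.

```python
from types import SimpleNamespace


def units_scale(font):
    """Factor converting font units to the 1000-units-per-em space."""
    return 1000 / font["head"].unitsPerEm


def vertical_metrics(font):
    """Scaled (ascent, descent) read from the "hhea" table."""
    scale = units_scale(font)
    return font["hhea"].ascent * scale, font["hhea"].descent * scale


def underline_metrics(font):
    """Scaled (position, thickness) of the underline from the "post" table."""
    scale = units_scale(font)
    post = font["post"]
    return post.underlinePosition * scale, post.underlineThickness * scale


# Stand-in tables mimicking the attribute access used above.
fake_font = {
    "head": SimpleNamespace(unitsPerEm=2048),
    "hhea": SimpleNamespace(ascent=1536, descent=-512),
    "post": SimpleNamespace(underlinePosition=-256, underlineThickness=128),
}
print(vertical_metrics(fake_font))
print(underline_metrics(fake_font))
```

With each table behind its own small function, one function at a time can be switched from the legacy parser to fonttools and checked in isolation.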
Loved the Star Wars reference! :grin:
> I guess there's no other way to figure it out than to step through the existing code, look where it gets its data from, and then replace that source with fonttools.
Okay then. I'm a bit busy these days, but I will try to do it in the following weeks.
> If it turns out to be too confusing for the direct approach, you could try to refactor the existing code first.
Yes, it's a good idea.
Hi @RedShy! I'd like to rekindle this ^^ Have you been blocked by anything that I could help with?
Hi! Unfortunately I could not work much on this in the last days. But still I would like to give my contribution! In the following days I hope I will have more time to dedicate
Given that this migration has been beautifully made by @RedShy in #477, do you think that we can close this, @gmischler?
Well, this task looks quite finished, so I guess we can declare it as such.
Problem
Currently fpdf2 uses its own ttfonts.py module to read and process the TrueType family of font files. This obviously works for what we're doing so far, and with the most common types of such files. But there are several problems with that approach.
Solution
Fortunately, other people have already dealt with those issues. There are libraries available that can be used to access the data in font files without having to worry about how it is actually stored and how that might change over time. In the Python world, Fonttools seems to be the weapon of choice. According to the description, it is implemented in pure Python, and it seems to be under very active continuous development. Fonttools actually does a lot more than what we need; what we would use is essentially just fontTools.ttLib.
Additional context
Open questions