silnrsi / font-charis

Fonts for languages and writing systems that use the Latin and Cyrillic scripts
https://software.sil.org/charis/
SIL Open Font License 1.1

Ligatures (fi, fl, etc.) do not print correctly on pdf files #6

Closed tempapy closed 2 years ago

tempapy commented 2 years ago

Hello. With Charis SIL, and also Gentium, whenever I convert a document containing these fonts to PDF, the text displays correctly, but some characters are, I believe, not encoded correctly: if you search inside the resulting PDF file (or copy and paste from it), many characters are missing, such as "fi", "fl", and probably more (the ligatures, I guess). A picture to highlight the problem (here it is the Firefox PDF viewer after a "Select All" command):

[Screenshot: Firefox PDF viewer after "Select All"]

I believe it has to do with the fonts themselves, as other fonts work and I reproduced this bug basically everywhere I tried: Firefox on Windows (print web page to PDF), Calibre (HTML or EPUB to PDF conversion), and Google Docs (PDF export).

Thank you for your help and for such nice fonts!

jvgaultney commented 2 years ago

I don't think this is a font problem. Those fonts have a built-in feature that replaces f+i and f+l with glyphs that are ligatures. Some apps (like InDesign) activate that feature automatically. Others, like Word, need you to explicitly activate it (Fonts/Advanced). I have tested both of those apps, and the PDFs they produce, with multiple PDF readers, and they do not show the problem you report.

It likely has to do with the process you're using to produce the PDFs, and is complicated as it sounds like you're trying to print a web page. A PDF is internally encoded as a set of glyphs - not characters - so f+i would be a single glyph (fi). Most PDF writers (but not all, esp. those that 'print' web pages) also include the original character stream. This enables PDF readers to access the original characters, not just the glyphs. But not all readers do that. Ideally you could select the 'fi' glyph in a PDF reader, then copy and paste into a text editor and you'd get f+i. Or if you search for 'f' or 'i' or 'fi' in the PDF it would highlight that ligature. If anything breaks down in that process you may have problems.

Where have you obtained the font? From our web site or Google Fonts or elsewhere? What app are you using the font in? What OS? Have you tried some other PDF viewer than the minimal one in Firefox? Do any of the fonts that do work for you have OpenType ligature features? If you give specific details I can try to look into it further.

tempapy commented 2 years ago

Thank you very much, you are right, I realized I only tried other fonts that do not have ligatures... So the problem is elsewhere: PDF producers and PDF readers. Sorry for incriminating your fonts!

Anyway, I'll add some context here for anyone who may look for answers to related issues in the future. Some tools export ligatures to PDF properly, such as Word. Others don't, the biggest culprit probably being Chromium, which is everywhere. As for PDF readers, some seem able to compensate for the problem: Chromium does, and Adobe Reader does too, but only partially (copy and paste works, but searching does not). The funny thing is that Chromium messes up the ligature when producing the PDF but then finds a way to parse it properly, which, taken to the extreme, means that Chromium-produced PDFs are locked inside a Chromium environment.

The only workaround I found is to disable ligatures altogether in the source HTML file, i.e. by using the CSS rule "font-variant-ligatures: none;" in its style sheet, which, for my use case, is acceptable.
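
For anyone applying this workaround in a scripted HTML-to-PDF pipeline, here is a minimal sketch in Python that injects the rule into an HTML string before conversion. The helper name `disable_ligatures` and the choice to apply the rule to `body` are assumptions made for illustration; only the `font-variant-ligatures: none` declaration comes from the comment above.

```python
import re

# The CSS property is inherited, so setting it on body covers the whole page.
LIGATURE_OFF_CSS = "<style>body { font-variant-ligatures: none; }</style>"


def disable_ligatures(html: str) -> str:
    """Return html with a style block that turns ligatures off.

    Hypothetical helper: inserts the rule just before </head> when that
    tag is present, otherwise prepends it to the document.
    """
    match = re.search(r"</head>", html, re.IGNORECASE)
    if match is None:
        return LIGATURE_OFF_CSS + html
    return html[:match.start()] + LIGATURE_OFF_CSS + html[match.start():]


page = "<html><head><title>Demo</title></head><body><p>final flight</p></body></html>"
print(disable_ligatures(page))
```

The modified HTML can then be handed to whichever converter is in use (browser print, Calibre, and so on).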

jvgaultney commented 2 years ago

That's all very helpful to know - thanks for sharing the details here. I'm sure others will appreciate your explanation and workaround.

kenmcd commented 2 years ago

This also depends on how the application created the PDF. Typically, OpenType-generated ligatures have no Unicode code point. When you copy/paste, the reader uses the ToUnicode table in the PDF to map the GIDs (glyph IDs) to Unicode character codes. When OpenType ligatures are used, applications often do not provide proper Unicode codes: the fi glyph could be encoded as a space or some other odd character, and that is what you see when you copy/paste. Some applications will include the codes for both characters (f and i). In that case copy/paste works, search works, and screen readers work. If you provide the PDF, we can see what is actually there. Simply displaying the PDF is not an issue, as PDF readers just show the shapes.
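
To make the lookup concrete, here is a rough Python sketch of the mapping step. It is a simplified model, not real PDF parsing: the two dictionaries stand in for ToUnicode tables written by two different hypothetical producers, and the glyph IDs for n and d are made up. Only GID 739 for the f_i ligature comes from this thread.

```python
# Glyph IDs: 739 is the Charis SIL f_i ligature mentioned in this thread;
# the other two values are invented for the example.
GID_F_I = 739
GID_N, GID_D = 81, 71

# The word "find" as stored in the PDF once the ligature feature is applied.
glyph_run = [GID_F_I, GID_N, GID_D]

# A producer that maps the ligature glyph back to both characters.
good_to_unicode = {GID_F_I: "fi", GID_N: "n", GID_D: "d"}

# A producer that has no code point for the ligature and writes a space.
bad_to_unicode = {GID_F_I: " ", GID_N: "n", GID_D: "d"}


def extract_text(glyphs, to_unicode):
    """Simulate copy/paste: look up each glyph ID in the table.

    Glyphs with no entry come out as U+FFFD (replacement character).
    """
    return "".join(to_unicode.get(gid, "\ufffd") for gid in glyphs)


print(repr(extract_text(glyph_run, good_to_unicode)))  # 'find'
print(repr(extract_text(glyph_run, bad_to_unicode)))   # ' nd'
```

Both runs would draw the same shapes on screen; only the extracted text differs, which is why the problem shows up in search and copy/paste but not in display.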

Also note that some applications, such as Word and LibreOffice, have auto-correct features which will replace certain characters with old legacy Unicode ligature characters. For example, the separate characters f and i will be auto-corrected to the single fi ligature character (U+FB01). When that character is output to PDF, it does get encoded properly in the ToUnicode table because it does have a Unicode code point.

Charis SIL has two different glyphs for the fi ligature. The OpenType f_i ligature has no Unicode code point, and is GID 739. The old legacy fi ligature character is code point U+FB01, and GID 192. Either could end up in your PDF depending on the application and its settings. Some applications' search features are smart enough to see the fi (U+FB01) character as separate f and i characters, so their search works.
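
One way a search feature could treat the legacy ligature character as separate letters is Unicode compatibility normalization. The sketch below uses Python's standard `unicodedata` module; whether any particular PDF reader actually works this way is an assumption, and the helper name `fold_ligatures` is made up for the example. Note that NFKC also folds other compatibility characters, not only ligatures.

```python
import unicodedata


def fold_ligatures(text: str) -> str:
    """Apply NFKC normalization, which decomposes U+FB01 into 'f' + 'i'
    and U+FB02 into 'f' + 'l' (among other compatibility mappings)."""
    return unicodedata.normalize("NFKC", text)


# Text as it might be copied from a PDF that used the legacy characters.
extracted = "\ufb01nal \ufb02ight"

print("fi" in extracted)                  # False
print("fi" in fold_ligatures(extracted))  # True
print(fold_ligatures(extracted))          # final flight
```

This only helps with the U+FB01-style characters. An OpenType ligature glyph that was mapped to a space or another wrong code in the PDF cannot be recovered this way.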

I have only seen one example where the application coded the OpenType and character ligatures in a manner where everything works - cut/paste, import/open for text edit, search, and screen reader. Most of the time you will have issues with ligatures in PDFs at some point if you try to do anything other than just view/read it.

The easiest workaround, as you have discovered, is to disable ligatures.

tempapy commented 2 years ago

Thank you very much for clarifying all that. Don't bother with my PDF files, from all your information I understand they must have OpenType ligatures, and I also understand that satisfying ligatures in a PDF are a lost cause anyway!